## Tiny Intro Toolbox Thread

**category:** code [glöplog]

now this is irrelevant to this discussion, but please get your scale right.

a single core of a sandy bridge i7 at 3.4GHz can do 8 single-precision adds and 8 single-precision muls per clock (or half that for DP) via AVX, which gives 54.4 GFLOPS/s per core, which makes about 217.6 single precision GFLOPS/s for a quad-core (theoretical). in practice they can utilize over 90% of this over sustained periods for the right kernels (e.g. the BLAS SGEMM kernel, cf. figure 3 in http://ft.ornl.gov/~dol/papers/cf12_llano.pdf). that's a fast modern cpu, but it's a year old and there's higher-end versions with 6 cores.

modern high-end GPUs have theoretical peaks of about ~3100 GFLOPS/s (GeForce GTX 680) or ~3800 GFLOPS/s (Radeon HD 7970). the best GPU pure-compute kernels i know for the same task (sgemm=matrix multiply which is all multiply-adds with a favorable ratio of arithmetic to memory accesses, so this is not particularly biased towards general-purpose CPUs) max out at slightly below 75% compute utilization - cf. for example http://www.dcs.warwick.ac.uk/~sdh/pmbs10/pmbs10/Workshop_Programme_files/fastgemm.pdf (this is last-gen GPUs though).

put all this together and you get a ratio of GPU flops (actual) over CPU flops (actual) of about 14x. note this is comparing the best available high-end GPUs (though still single-GPU solutions) vs. a very good quad-core CPU (not the best you can get).
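The arithmetic in this post can be checked in a few lines. This is only a sketch that plugs in the figures quoted above (clock, flops per clock, utilization percentages, GPU peaks); nothing here is a fresh measurement, and the variable names are made up for illustration. With these inputs the best "actual over actual" ratio comes out around 14.5x, and the best "GPU peak over CPU actual" ratio stays under 20x.

```python
# Back-of-the-envelope check of the numbers quoted in the post.
# Every input is a figure from the post itself, not a measurement.

CLOCK_GHZ = 3.4            # Sandy Bridge i7 clock
FLOPS_PER_CLOCK = 8 + 8    # 8 SP adds + 8 SP muls per clock via AVX
CORES = 4
CPU_UTILIZATION = 0.90     # "over 90%" for SGEMM, taken as exactly 90% here
GPU_UTILIZATION = 0.75     # "slightly below 75%", taken as exactly 75% here

cpu_peak_core = CLOCK_GHZ * FLOPS_PER_CLOCK   # GFLOPS per core
cpu_peak = cpu_peak_core * CORES              # GFLOPS for the quad-core
cpu_actual = cpu_peak * CPU_UTILIZATION       # sustained GFLOPS

gpu_peaks = {"GeForce GTX 680": 3100.0, "Radeon HD 7970": 3800.0}

print(f"CPU peak: {cpu_peak_core:.1f} per core, {cpu_peak:.1f} quad-core")
for name, peak in gpu_peaks.items():
    actual_ratio = peak * GPU_UTILIZATION / cpu_actual  # GPU actual / CPU actual
    peak_ratio = peak / cpu_actual                      # GPU peak / CPU actual
    print(f"{name}: actual/actual {actual_ratio:.1f}x, peak/actual {peak_ratio:.1f}x")
```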

even comparing theoretical max GPU flops to actual CPU flops, you get less than 20x. that's a *very* respectable speedup from going to GPU, but 3 to 5 orders of magnitude it ain't.

anyway, back to the original subject: if you're running in dosbox, it's gonna be slow. but the good thing is you really don't have that many pixels to render. :)

ryg, you are perfectly right about "this is irrelevant", in so many different ways :)

First, what does AVX have to do with the tiny intro toolbox thread... With huge opcodes and poor math support, even simple stuff like a single sin/cos calculation would take more bytes than a typical tiny intro.

Second, I'm happy about pumped up cpus with useless gflops for add-mul operations. But modern intros are on the higher level than pocket calculator math. Back to the topic of raymarching, who needs adds and muls? I, for one, prefer sin/cos as building blocks which have nice gradient properties for distance fields. Such math functions take hundreds of cycles on cpus which gives another few orders of magnitude slowdowns compared to gpus.

And it would be interesting to see source code of AVX raymarching example in action, really :)

digimind: wow, you really have no idea how *any* of this actually works, do you? nevermind.

avx raymarching/raytracing: http://ispc.github.com/perf.html.

To get some idea, I visited your link:

Basic raytracing of some columns and arcs runs at 0.15 fps (serial version) or 3 fps (3+ GHz, AVX, 4 cores). It's a triangle set, but with hierarchical optimizations. Anyway, it's nowhere near hundreds of fps for gpu raytracing scenes of similar complexity.

Volume raymarching of some smoke:

Extremely low resolution 48x64x48 I was ashamed to benchmark.

Low resolution 192x256x192 takes minutes per frame and I don't want to convert this to fps.

Medium resolution data is missing because 192 voxels is already considered "highres" by intel.

wow, i'm always wondering about new Digimind's production)

Is fcomi supported in DOSBox?

Hundreds of fps / 3 fps = 10,000 speed up!

100x-10000x speedup was indicated for fp calculations (depending on instruction), not for the final fps which can be smaller due to overhead and obvious amortization.

It's relatively easy to build a toolbox for accurate benchmarking of calculations (with proper dependency chains) to see for example sin() speedup of 6000x or sqrt() speedup of only 2300x, etc.
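For what "proper dependency chains" means in such a benchmark, here is a minimal sketch. The helper names are hypothetical, and since this runs in CPython the interpreter overhead dominates the timings, so the absolute numbers say little about the hardware; only the loop structure is the point. In the chained loop each call consumes the previous result, so calls cannot overlap and the loop measures latency. In the independent loop every call gets the same input, which on real hardware would let calls overlap and measure throughput instead.

```python
import math
import time

def time_dependent_chain(fn, x0, n):
    """Latency-style loop: each call's input is the previous call's output."""
    x = x0
    start = time.perf_counter()
    for _ in range(n):
        x = fn(x)
    elapsed = time.perf_counter() - start
    return elapsed / n, x

def time_independent(fn, x0, n):
    """Throughput-style loop: every call uses the same input.
    Results are accumulated so that no call is simply discarded."""
    acc = 0.0
    start = time.perf_counter()
    for _ in range(n):
        acc += fn(x0)
    elapsed = time.perf_counter() - start
    return elapsed / n, acc

N = 100_000
for name, fn in (("sin", math.sin), ("sqrt", math.sqrt)):
    chained, _ = time_dependent_chain(fn, 1.0, N)
    independent, _ = time_independent(fn, 1.0, N)
    print(f"{name}: chained {chained * 1e9:.0f} ns/call, "
          f"independent {independent * 1e9:.0f} ns/call")
```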

Are there any nice tools for palette testing / generation or whatever?

I read something that optimus did something in that direction.

btw: no fcomi in DOSBox.

Just write your own thingie displaying a bar with all colors and play around creating a palette you need (= But if Optimus did something like that I'd grab it too :D
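A "thingie" like that could look roughly like the sketch below. It assumes a 256-entry palette with 6-bit channels (0..63, the range the VGA DAC takes), built by linear interpolation between a few hand-picked color stops; the function names and the fire-style example stops are made up for illustration. The stops are assumed to have distinct indices and to cover index 0 and the last index.

```python
def build_palette(stops, size=256):
    """Linearly interpolate between (index, (r, g, b)) stops.
    Assumes distinct stop indices that cover 0 and size - 1."""
    stops = sorted(stops)
    palette = []
    for i in range(size):
        # Find the pair of stops that brackets index i.
        for (i0, c0), (i1, c1) in zip(stops, stops[1:]):
            if i0 <= i <= i1:
                t = (i - i0) / (i1 - i0)
                palette.append(tuple(round(a + (b - a) * t)
                                     for a, b in zip(c0, c1)))
                break
    return palette

def write_bar_ppm(palette, path, height=32):
    """Write a bar with one pixel column per palette entry as a binary PPM."""
    row = bytearray()
    for r, g, b in palette:
        # Scale the 6-bit channels up to 8 bits for viewing.
        row += bytes((r * 255 // 63, g * 255 // 63, b * 255 // 63))
    with open(path, "wb") as f:
        f.write(b"P6\n%d %d\n255\n" % (len(palette), height))
        f.write(bytes(row) * height)

# Example stops: black -> red -> yellow -> white.
fire = build_palette([
    (0, (0, 0, 0)),
    (85, (63, 0, 0)),
    (170, (63, 63, 0)),
    (255, (63, 63, 63)),
])
print(fire[0], fire[85], fire[170], fire[255])
```

Calling `write_bar_ppm(fire, "fire.ppm")` dumps the bar to a file that most image viewers can open; tweak the stops and rerun until the ramp looks right.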

My current idea needs too many fpu instructions and that's total pain... but this thing has to be squished to 256b. I guess it will take a while - maybe some parts have to be done fixed point... I could definitely use some proper vector instructions. Also already thought about a very hacky "vm style" approach to my problem - but 256b is really fucking tiny :D

what parts are too big? how big is your code right now?

I reached 290 bytes and that was without the "core" code. I guess I'll need ~360 bytes or something like that and then I'll see what can be killed or turned into something smaller.

The code is already quite heavily size optimized (imho). One of the major problems is that I have quite a few floating point compares - and fcomp fstsw ax sahf is not very tiny - maybe turning that stuff into 16 bit fixed point wouldn't be such a bad idea. On the other hand - there are some things where having the fpu wouldn't be that bad - cos/sin and I still have to find something to get rid of the acos.

And yes - I'm obviously converting a sphere tracing shader.
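To illustrate the 16 bit fixed point idea: once values live in integer registers, a compare is a plain integer compare followed by a conditional jump, with no fcomp / fstsw ax / sahf sequence in between. The sketch below emulates a signed 8.8 format in Python. The 8.8 split and all helper names are assumptions for illustration; the post does not say which format would be used.

```python
FRAC_BITS = 8  # assumed 8.8 format: 8 integer bits, 8 fractional bits

def wrap16(v):
    """Emulate a signed 16-bit register (two's complement wraparound)."""
    return ((v + 0x8000) & 0xFFFF) - 0x8000

def to_fixed(x):
    """Convert a float to signed 8.8 fixed point."""
    return wrap16(int(round(x * (1 << FRAC_BITS))))

def from_fixed(v):
    """Convert signed 8.8 fixed point back to a float."""
    return v / (1 << FRAC_BITS)

def fixed_mul(a, b):
    """Multiply two 8.8 values: take the wide product, shift back down."""
    return wrap16((a * b) >> FRAC_BITS)

# A distance-vs-threshold test becomes an ordinary integer compare.
dist = to_fixed(0.25)
eps = to_fixed(0.5)
print(dist, eps, dist < eps)
print(from_fixed(fixed_mul(to_fixed(1.5), to_fixed(2.0))))
```

The price is range and precision: 8.8 only covers roughly -128 to +128 in steps of 1/256, and products overflow quickly, so which parts can move to fixed point depends on the value ranges in the shader.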

y u sphere tracing?

man up and brute-force that shit! :) just do fixed-step ray marching. worked just fine for lattice, 11 years ago (at 32 iterations per pixel).
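For comparison, here is a rough sketch of both approaches against the same distance function. The scene (a unit sphere at z = 3), the step length, and all names are assumptions for illustration and have nothing to do with how lattice was actually written; only the 32 iterations per pixel come from the post. Fixed-step marching only needs an inside/outside test per step, while sphere tracing needs a real distance bound and a unit-length ray direction.

```python
import math

def scene(x, y, z):
    """Signed distance to a unit sphere centered at (0, 0, 3)."""
    return math.sqrt(x * x + y * y + (z - 3.0) ** 2) - 1.0

def march_fixed(dx, dy, dz, steps=32, step_len=0.15):
    """Fixed-step marching: advance by a constant amount and
    return the first iteration that lands inside, or -1 on a miss."""
    t = 0.0
    for i in range(steps):
        t += step_len
        if scene(dx * t, dy * t, dz * t) < 0.0:
            return i
    return -1

def march_sphere(dx, dy, dz, steps=32, eps=0.01, far=10.0):
    """Sphere tracing: advance by the distance bound at each point."""
    t = 0.0
    for i in range(steps):
        d = scene(dx * t, dy * t, dz * t)
        if d < eps:
            return i
        t += d
        if t > far:
            return -1
    return -1

def render(march, width=40, height=20):
    """Shoot one ray per character cell through a screen at z = 1."""
    rows = []
    for py in range(height):
        row = ""
        for px in range(width):
            x = (px + 0.5) / width * 2.0 - 1.0
            y = (py + 0.5) / height * 2.0 - 1.0
            inv = 1.0 / math.sqrt(x * x + y * y + 1.0)  # normalize direction
            row += "#" if march(x * inv, y * inv, inv) >= 0 else "."
        rows.append(row)
    return rows

print("\n".join(render(march_fixed)))
```

The fixed-step loop can skip over thin features or grazing hits shorter than one step, which is the trade-off for the smaller inner loop.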

actually... you've got quite a point there and fuck correct perspective - seems I over-engineered that thing a bit.

Don't delete this version. Include it later as bonus in your release (=

las: that's the fun part about <=256b. there's just not enough space to get fancy in any way. :)

[this thread is educational, thank you las for opening]

las: you've probably already done this, but you NEED to check out how things like the rotation are done in Spongy/TBC. Really smart use of pusha/popa and using the stack to iterate through register values.

**Quote:**

You don't have permission to access /~stubbe/tbc/tbc_-_spongy_final.zip on this server.

If you have a working download link... :)

unfortunately I don't, but I have the intro and some "nice" form of the disassembly lurking somewhere on my hd..

aka, get on skype :)

:p I'd be curious to see that disassembly of Spongy too. To see how far off I was with JSpongy.

:)

spongy can also be found in the HardCode archive.

**Quote:**

spongy can also be found in the HardCode archive.

or on da polish forum. the filename is the same as in the dead link o_O