Tiny Intro Toolbox Thread

category: code [glöplog]

now this is irrelevant to this discussion, but please get your scale right.

a single core of a sandy bridge i7 at 3.4GHz can do 8 single-precision adds and 8 single-precision muls per clock (or half that for DP) via AVX, which gives 54.4 GFLOPS per core, or about 217.6 single-precision GFLOPS for a quad-core (theoretical). in practice they can sustain over 90% of this for the right kernels (e.g. the BLAS SGEMM kernel, cf. figure 3 in http://ft.ornl.gov/~dol/papers/cf12_llano.pdf). that's a fast modern cpu, but it's a year old and there are higher-end versions with 6 cores.

modern high-end GPUs have theoretical peaks of about 3100 GFLOPS (GeForce GTX 680) or 3800 GFLOPS (Radeon HD 7970). the best GPU pure-compute kernels i know for the same task max out at slightly below 75% compute utilization (sgemm = matrix multiply, which is all multiply-adds with a favorable ratio of arithmetic to memory accesses, so this is not particularly biased towards general-purpose CPUs) - cf. for example http://www.dcs.warwick.ac.uk/~sdh/pmbs10/pmbs10/Workshop_Programme_files/fastgemm.pdf (this is last-gen GPUs though).

put all this together and you get a ratio of GPU flops (actual) over CPU flops (actual) of about 14x. note this is comparing the best available high-end GPUs (though still single-GPU solutions) vs. a very good quad-core CPU (not the best you can get).

even comparing theoretical max GPU flops to actual CPU flops, you get less than 20x. that's a very respectable speedup from going to GPU, but 3 to 5 orders of magnitude it ain't.
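
[editor's sketch] The arithmetic behind these ratios can be checked in a few lines. The clock, per-clock throughput, and utilization figures are the ones quoted in this post; using the Radeon HD 7970 peak for the GPU side is an assumption made here, since that is the figure that lands near the quoted 14x. The variable names are purely illustrative.

```python
# Back-of-the-envelope check of the FLOPS figures quoted in the post.
CLOCK_GHZ = 3.4          # sandy bridge i7 clock
FLOPS_PER_CLOCK = 8 + 8  # 8 SP adds + 8 SP muls per clock via AVX
CORES = 4

cpu_peak_core = CLOCK_GHZ * FLOPS_PER_CLOCK  # GFLOPS per core
cpu_peak = cpu_peak_core * CORES             # GFLOPS for the quad-core
cpu_actual = cpu_peak * 0.90                 # ~90% sustained on SGEMM

gpu_peak = 3800.0                            # assumed: Radeon HD 7970 peak
gpu_actual = gpu_peak * 0.75                 # ~75% utilization on SGEMM

ratio_actual = gpu_actual / cpu_actual           # actual GPU vs. actual CPU
ratio_peak_vs_actual = gpu_peak / cpu_actual     # theoretical GPU vs. actual CPU

print(f"cpu peak: {cpu_peak_core:.1f} per core, {cpu_peak:.1f} total")
print(f"gpu/cpu (actual/actual): {ratio_actual:.1f}x")
print(f"gpu/cpu (peak/actual):   {ratio_peak_vs_actual:.1f}x")
```

With these inputs the actual-over-actual ratio falls between 14x and 15x, and the peak-over-actual ratio stays just under 20x. Plugging in the GTX 680 peak instead gives a smaller ratio.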

anyway, back to the original subject: if you're running in dosbox, it's gonna be slow. but the good thing is you really don't have that many pixels to render. :)

added on the 2012-05-18 07:25:57 by ryg

ryg, you are perfectly right about "this is irrelevant", in so many different ways :)

First, what does AVX have to do with a tiny intro toolbox thread? With huge opcodes and poor math support, even simple stuff like a single sin/cos calculation would take more bytes than a typical tiny intro.

Second, I'm happy about pumped up cpus with useless gflops for add-mul operations. But modern intros are on a higher level than pocket calculator math. Back to the topic of raymarching, who needs adds and muls? I, for one, prefer sin/cos as building blocks, which have nice gradient properties for distance fields. Such math functions take hundreds of cycles on cpus, which adds another few orders of magnitude of slowdown compared to gpus.

And it would be interesting to see the source code of an AVX raymarching example in action, really :)

added on the 2012-05-18 15:44:17 by Digimind

digimind: wow, you really have no idea how any of this actually works, do you? nevermind.

avx raymarching/raytracing: http://ispc.github.com/perf.html.

added on the 2012-05-18 17:02:26 by ryg

To get some idea, I visited your link:

Basic raytracing of some columns and arcs runs at 0.15 fps (serial version) or 3 fps (3+ GHz, AVX, 4 cores). It's a triangle set, but with hierarchical optimizations. Anyway, it's nowhere near hundreds of fps for gpu raytracing scenes of similar complexity.

Volume raymarching of some smoke:
Extremely low resolution 48x64x48 I was ashamed to benchmark.
Low resolution 192x256x192 takes minutes per frame and I don't want to convert this to fps.
Medium resolution data is missing because 192 voxels is already considered "highres" by intel.

added on the 2012-05-18 19:06:04 by Digimind

wow, i'm always wondering about Digimind's new production)

added on the 2012-05-19 05:31:33 by Android Barker

Is fcomi supported in DOSBox?

added on the 2012-05-19 15:31:48 by las

Hundreds of fps / 3 fps = 10,000 speed up!

added on the 2012-05-19 16:18:17 by texel

100x-10000x speedup was indicated for fp calculations (depending on instruction), not for the final fps which can be smaller due to overhead and obvious amortization.
It's relatively easy to build a toolbox for accurate benchmarking of calculations (with proper dependency chains) to see for example sin() speedup of 6000x or sqrt() speedup of only 2300x, etc.

added on the 2012-05-19 17:04:58 by Digimind

Are there any nice tools for palette testing / generation or whatever?
I read somewhere that optimus did something in that direction.

btw: no fcomi in DOSBox.

added on the 2012-05-19 20:22:39 by las

Just write your own thingie displaying a bar with all colors and play around creating a palette you need (= But if Optimus did something like that I'd grab it too :D
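
[editor's sketch] A "bar with all colors" tool along those lines could start from something like this. It only builds the data: a 256-entry palette interpolated between a few color stops, with 6-bit channels (0..63) as the VGA DAC uses, plus one scanline of palette indices for a 320-pixel-wide test bar. The function names and the fire-like ramp are made up for illustration, and actually drawing the bar is left to whatever display code you have.

```python
def make_palette(stops, size=256):
    """Linearly interpolate between colour stops.

    Returns `size` (r, g, b) tuples with 6-bit channels (0..63).
    """
    palette = []
    segments = len(stops) - 1
    for i in range(size):
        t = i * segments / (size - 1)    # position along the ramp, 0..segments
        seg = min(int(t), segments - 1)  # which pair of stops we are between
        f = t - seg                      # fraction within that segment
        a, b = stops[seg], stops[seg + 1]
        palette.append(tuple(round(a[c] + (b[c] - a[c]) * f) for c in range(3)))
    return palette


def bar_row(width=320, size=256):
    """One scanline of the test bar: a palette index for each x position."""
    return [x * size // width for x in range(width)]


# Hypothetical fire-like ramp: black -> red -> yellow -> white.
FIRE = [(0, 0, 0), (63, 0, 0), (63, 63, 0), (63, 63, 63)]
palette = make_palette(FIRE)
row = bar_row()
```

Tweaking the stops and redrawing the bar is then the whole workflow. In a real tiny intro the palette would more likely come from a few instructions in a loop than from a table.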

added on the 2012-05-19 20:38:45 by sensenstahl

My current idea needs too many fpu instructions and that's total pain... but this thing has to be squished to 256b. I guess it will take a while - maybe some parts have to be done fixed point... I could definitely use some proper vector instructions. Also already thought about a very hacky "vm style" approach to my problem - but 256b is really fucking tiny :D

added on the 2012-05-19 20:56:10 by las

what parts are too big? how big is your code right now?

added on the 2012-05-19 20:58:23 by ryg

I reached 290 bytes and that was without the "core" code. I guess I'll need ~360 bytes or something like that and then I'll see what can be killed or turned into something smaller.
The code is already quite heavily size optimized (imho). One of the major problems is that I have quite a few floating point compares - and fcomp fstsw ax sahf is not very tiny - maybe turning that stuff into 16 bit fixed point wouldn't be such a bad idea. On the other hand - there are some things where having the fpu wouldn't be that bad - cos/sin, and I still have to find something to get rid of the acos.
And yes - I'm obviously converting a sphere tracing shader.
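
[editor's sketch] The 16-bit fixed-point idea from this post, modelled in a few lines. The 8.8 split (8 integer bits, 8 fractional bits) is an assumption chosen for illustration, as the post does not say which format is intended. The point is that once values are plain integers, a compare is an ordinary signed integer compare with no detour through the FPU status word.

```python
FRAC_BITS = 8          # assumed 8.8 format: 8 integer bits, 8 fractional bits
ONE = 1 << FRAC_BITS   # fixed-point representation of 1.0


def to_fixed(x):
    """Convert a float to signed 16-bit 8.8 fixed point (wraps like a 16-bit register)."""
    v = int(round(x * ONE)) & 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def fixed_mul(a, b):
    """Multiply two 8.8 values: take the wide product, then shift the extra fraction bits away."""
    return (a * b) >> FRAC_BITS


def fixed_less(a, b):
    """A compare is just a signed integer compare (cmp + conditional jump in asm terms)."""
    return a < b
```

For example, `to_fixed(1.5)` is 384, and `fixed_mul(to_fixed(1.5), to_fixed(2.0))` equals `to_fixed(3.0)`. The trade-off is range and precision: 8.8 only covers roughly -128 to 128 in steps of 1/256, and sin/cos/acos have no direct equivalent.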

added on the 2012-05-19 21:25:02 by las

y u sphere tracing?

man up and brute-force that shit! :) just do fixed-step ray marching. worked just fine for lattice, 11 years ago (at 32 iterations per pixel).
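
[editor's sketch] For readers who have not met the two techniques, here they are side by side against a unit sphere. This is not the code of lattice or of las's shader. The scene, step length, and epsilon are assumptions picked for illustration, and only the 32-iteration budget comes from the post. Both functions return the iteration index at which the ray hit (usable as a cheap shade value) or None on a miss.

```python
import math


def sphere_dist(p, radius=1.0):
    """Signed distance from point p to a sphere centred at the origin."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius


def march_fixed(origin, direction, steps=32, step_len=0.125):
    """Fixed-step marching: advance by a constant step, stop at the first sample inside."""
    t = 0.0
    for i in range(steps):
        p = tuple(o + d * t for o, d in zip(origin, direction))
        if sphere_dist(p) < 0.0:   # only the sign is needed, not the distance
            return i
        t += step_len
    return None


def march_sphere(origin, direction, steps=32, eps=1e-3):
    """Sphere tracing: advance by the distance estimate, stop when close enough."""
    t = 0.0
    for i in range(steps):
        p = tuple(o + d * t for o, d in zip(origin, direction))
        d = sphere_dist(p)
        if d < eps:
            return i
        t += d
    return None


hit_fixed = march_fixed((0.0, 0.0, -3.0), (0.0, 0.0, 1.0))
hit_sphere = march_sphere((0.0, 0.0, -3.0), (0.0, 0.0, 1.0))
```

Sphere tracing converges in far fewer iterations, but it needs a real distance function. The fixed-step loop only needs an inside/outside test, which is usually much cheaper to encode when every byte counts.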

added on the 2012-05-19 21:35:14 by ryg

actually... you've got quite a point there and fuck correct perspective - seems I over-engineered that thing a bit.

added on the 2012-05-19 22:01:47 by las

Don't delete this version. Include it later as bonus in your release (=

added on the 2012-05-19 22:03:13 by sensenstahl

las: that's the fun part about <=256b. there's just not enough space to get fancy in any way. :)

added on the 2012-05-20 02:06:10 by ryg

[this thread is educational, thank you las for opening]

added on the 2012-05-20 02:43:20 by neu / metoikos

las: you've probably already done this, but you NEED to check out how things like the rotation are done in Spongy/TBC. Really smart use of pusha/popa and using the stack to iterate through register values.

added on the 2012-05-20 10:13:41 by ferris