optimization techniques of the rich and famous
category: code [glöplog]
@sesse yeah, that is true, but in this case I think he said he reads them through a uint16 ptr directly; it is likely an 8-bit bitmap/buffer with the src pixels already stored adjacently. So it saves a bit more than just the one load (e.g. an OR and a shift, or however the baseline extracted the individual pixels and then combined them after the LUTs).
Btw, Optimus, it might be worth trying to read through a uint32 ptr (if pixels are adjacent in this way), extract two 16-bit values and then do two LUT lookups. As for 10x improvements, it depends. For oldskool, check out fatmap2.txt; tiled textures are something many missed doing (e.g. Quake(!!)) but they help, and it's exactly the same principle as the voxel. It doesn't have to be too fancy though, the key is: traverse stuff in a way that stays within the cache line. So if you move a bit in u or v, you want to hit the same cache line. My rasterizer got a nice speedup from tiled texturing.
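To make that concrete, here is a rough C sketch of the "one wide load, two LUT lookups" idea. It assumes a little-endian CPU, 4-byte-aligned source data and a 64K-entry lut[] indexed by a pair of adjacent 8-bit source pixels; the names are made up for illustration:

#include <stdint.h>

/* remap four adjacent 8-bit source pixels into two output pixels,
   using one 32-bit load instead of two 16-bit loads */
void remap_four(const uint8_t *src, uint8_t *dst, const uint8_t *lut)
{
    uint32_t quad = *(const uint32_t *)src;   /* fetches src[0..3] in one go  */
    dst[0] = lut[quad & 0xFFFFu];             /* pixels 0+1 index the LUT     */
    dst[1] = lut[quad >> 16];                 /* pixels 2+3 index the LUT     */
}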
@bitl what was your workflow for optimizing your mindblowing 486 demo?
comment out half your code and your demo will run two times faster!
In addition to the tricks and methods mentioned in this thread, there are older diskmag articles (Imphobia had some, but also Hugi and others). And there should also be some good old texts floating around on the web. I remember some about affine texture mapping, or tight texture-mapping inner loops (for example about adding 2 values in parallel in a 32-bit word on Amiga, SIMD-style) by Azure.
Here are some links:
https://www.hugi.scene.org/online/coding/hugi%20se%204%20-%20index%20sorted%20by%20topic.htm
https://github.com/othieno/GPBB
https://amycoders.org/
ah those are great links dresdenboy! thanks!!
I also pack 2 values in 32 bits and do a single add in the inner loop of my rasterizer. You must leave gaps between the two and clear the gap bit if it accumulates, but with carefully placed ones and zeros you can also combine this to get swizzling, and get the ptr calculation for free too (on PPC we have the rather non-RISC rlwimi/rlwnm instructions, which help a lot for this).
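For illustration, a minimal C sketch of such a packed-interpolant inner loop, with an assumed layout: U as 8.8 fixed point in bits 0..15, a guard bit at bit 16, V as 7.8 fixed point in bits 17..31, and a 256x128 tiling texture. The swizzling / free pointer calculation part is left out:

#include <stdint.h>

/* step U and V with a single add per pixel; the guard bit keeps U's
   carry from leaking into V */
void textured_span(uint8_t *dst, const uint8_t *texture, int len,
                   uint32_t uv, uint32_t duv)   /* both packed in the same layout */
{
    for (int i = 0; i < len; i++) {
        uint32_t u = (uv >> 8) & 0xFFu;         /* integer part of U */
        uint32_t v =  uv >> 25;                 /* integer part of V */
        dst[i] = texture[(v << 8) | u];

        uv += duv;                              /* advance both interpolants at once */
        uv &= ~(1u << 16);                      /* clear the guard bit each step */
    }
}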
Are you thinking of fatmap2.txt by MRI from Doomsday, perhaps? (but is x86) https://www.multi.fi/~mbc/sources/fatmap2.txt
Quote:
@bitl what was your workflow for optimizing your mindblowing 486 demo?
@noglin: Thank you for the kind words about the demo.
My workflow looks something like this:
- coming up with the concept of the effect in my head
- writing prototype algorithm in Pascal, just to see if the idea works at all.
(estimating performance in DOSBox by setting the number of cycles to roughly match the hardware I’m targeting. Yes, DOSBox won’t reflect the real performance exactly, but it’s enough to get a first idea and understand whether the concept is feasible.)
- long thinking and experimenting, until the architecture of the algorithm and the visual result seem good enough to move on to the next step
- rewriting critical parts in assembler, using my hands-on experience to take on the challenge of beating the compiler
- experimenting on real hardware: testing different ideas, writing multiple code variants and checking which one gives the best FPS. Down to testing a single instruction or different execution orders. Sometimes completely counterintuitive things turn out to be faster.
- rewriting less critical but still important parts in assembler, trying to squeeze out another 1–2 FPS
- the code starts haunting me in dreams, I try to compile and run it in my sleep, then wake up and realize it doesn’t work that way. Seems like I need to take a break and switch to another demo effect.
- after some time I come back to this code and find a way to make it faster. And since I gained a few extra FPS, I start thinking what more I can add to the visuals.
- Every fifth iteration, I smack my forehead with my palm and say "Damn it, I’m so stupid! This could have been done much simpler!"
- and so on, many times :)
P.S. Yes, of course, many effects in Demoded are not groundbreaking - algorithms like 2D bump mapping or sinus plasma have been around for ages. But I usually don’t study other people’s code (at least because I don’t enjoy digging into someone else’s sources). I prefer to reinvent the effect, implement it the way I understand its concept, and along the way make it as fast as possible while adding some personal embellishments.
So it’s quite possible that I’m doing some things in a really clumsy, less rational way than it was done in the old demos.
Regarding the details, it’s mostly what’s already been said in this thread.
Operations on four 8-bit numbers using a single 32-bit instruction, bit rotation shuffles, etc. - it’s like solving a puzzle in the minimum number of steps :)
Bitwise tricks, fixed-point arithmetic. And just little flashes of insight. That’s what makes coding in assembly worth it in the first place.
Splitting the render into “tiles”: for example, the pseudo-3D spheres in Demoded are rendered in 16x16-pixel blocks, which makes cache hits more even regardless of the angle the map is scanned from (otherwise FPS jumps from 20 to 60 depending on the sphere’s rotation). A sketch of this kind of tiled addressing follows right after this post.
Loop unrolling. I don’t know about Pentiums, but it still works on a 486.
Modifying the inner loop code at runtime (before entering the loop): for example, inserting numeric values directly into instructions in memory. This trick helps a lot, since registers are always limited, and accessing a variable in memory isn’t always convenient and is usually slower.
(The main thing after such a modification is to do a formal jump, to reset the CPU's code prefetch queue.)
But, as I see it, the decisive factor is still design ingenuity. I often watch various demos on Amiga, Atari, C64, etc., spot a clever effect that probably doesn’t take much effort (at least on PC) and I envy it: damn, that’s just awesome!
P.S.: Sorry for the long post :)
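As promised above, a hypothetical C sketch of the tiled layout idea (the same principle as in fatmap2.txt and the 16x16 blocks mentioned in the post): the map is stored as 16x16 blocks instead of row by row, so a small step in either u or v tends to stay inside the same 256-byte block and therefore the same few cache lines. Sizes and names are my own:

#include <stdint.h>

#define TEX_W         256              /* map width in pixels    */
#define TILE          16               /* 16x16 tile = 256 bytes */
#define TILES_PER_ROW (TEX_W / TILE)

/* linear (u,v) -> offset into the tiled layout */
static uint32_t tiled_offset(uint32_t u, uint32_t v)
{
    uint32_t tile    = (v / TILE) * TILES_PER_ROW + (u / TILE);
    uint32_t in_tile = (v % TILE) * TILE + (u % TILE);
    return tile * (TILE * TILE) + in_tile;
}

/* one-time conversion of a linear map into the tiled layout */
void tile_map(const uint8_t *linear, uint8_t *tiled, uint32_t height)
{
    for (uint32_t v = 0; v < height; v++)
        for (uint32_t u = 0; u < TEX_W; u++)
            tiled[tiled_offset(u, v)] = linear[v * TEX_W + u];
}

In the inner loop the divisions and modulos become shifts and masks, of course; the point is only the memory layout.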
Quote:
Bitwise tricks, fixed-point arithmetic. And just little flashes of insight. That’s what makes coding in assembly worth it in the first place.
Yup. The reason why assembler works isn't that the compiler sucks (because it doesn't, it's far better than most programmers on compiling C). It is because you can do fundamentally different things in assembler sometimes. Also, you can beat it on size if that's important.
thx bitl, nice workflow! loved the long post! "the code starts haunting me in dreams,..." :D
"Operations on four 8-bit numbers using a single 32-bit instruction"
hmmm... 1x 8bit, and 3x 7bit, fine, but how would you do it with 4x 8-bit without polluting the other vars with the "carry" if/when they overflow?
"Operations on four 8-bit numbers using a single 32-bit instruction"
hmmm... 1x 8bit, and 3x 7bit, fine, but how would you do it with 4x 8-bit without polluting the other vars with the "carry" if/when they overflow?
Interesting, that 32-bit trick of interpolating both U/V in one add for texture mapping. I considered it before but never seriously tried it, as I barely had the space to keep nearby bits from overflowing unless I changed the fractional bits. And then I'd have to do extra work to retrieve both separately for texture indexing, unless everything is set up so that the final 32-bit value maps well, or needs only a small shift. Might try it in the future.
But for 4 8bit pixels at once I used it in other cases pretty well.
Like if I do a plasma where I usually add 3 sine LUTs, my sine ranges are low (like 0-63) so the overall add doesn't overflow into the next bytes. Then instead of isin[x] I read a 32-bit value at &isin[x], so it's like I pick x, x+1, x+2, x+3 at once and add them together with the others.
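A minimal C sketch of that packed plasma add, under the stated assumption that each sine LUT holds values 0..63 (so three byte-wise adds stay below 256 and no carry crosses a byte boundary); table name, sizes and the wraparound handling are my own:

#include <stdint.h>

/* isin[] must be pre-filled with a sine scaled to 0..63; the 3 extra entries
   repeat isin[0..2] so the 32-bit overread near the end of the table is safe */
static uint8_t isin[256 + 3];

/* four plasma pixels per iteration: one 32-bit read of the LUT grabs
   isin[p], isin[p+1], isin[p+2], isin[p+3] at once */
void plasma_row(uint8_t *dst, int width, int p1, int p2, int p3)
{
    for (int x = 0; x < width; x += 4) {
        uint32_t s1 = *(const uint32_t *)&isin[(p1 + x) & 0xFF];
        uint32_t s2 = *(const uint32_t *)&isin[(p2 + x) & 0xFF];
        uint32_t s3 = *(const uint32_t *)&isin[(p3 + x) & 0xFF];
        *(uint32_t *)(dst + x) = s1 + s2 + s3;   /* per-byte sums stay <= 189 */
    }
}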
Or with blur: for a 4-neighbour-pixel blur I fill my buffer with values no bigger than 0-63. But then, if you add 4 values and shift right by 2, the lower bits might pollute the next byte cells, so I also AND with 00111111 00111111 00111111 00111111.
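A small C sketch of that masked 4-pixel blur, assuming buffer values are kept in 0..63 so the four byte-wise sums never exceed 255 and no carry crosses a byte boundary; pointer casts and edge handling are glossed over and the names are mine:

#include <stdint.h>

/* blur one row: dst[x] = (up + down + left + right) / 4, four pixels per iteration */
void blur_row(const uint8_t *src, uint8_t *dst, int width, int pitch)
{
    for (int x = 0; x < width; x += 4) {     /* caller guarantees valid neighbours */
        uint32_t up    = *(const uint32_t *)(src + x - pitch);
        uint32_t down  = *(const uint32_t *)(src + x + pitch);
        uint32_t left  = *(const uint32_t *)(src + x - 1);   /* unaligned, fine on x86 */
        uint32_t right = *(const uint32_t *)(src + x + 1);

        uint32_t sum = up + down + left + right;   /* per-byte sums stay <= 252 */

        /* shift right by 2 to divide, then mask off the bits that leaked in
           from the byte above, keeping each result in 0..63 again */
        *(uint32_t *)(dst + x) = (sum >> 2) & 0x3F3F3F3Fu;
    }
}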
And for every other effect where the trick is useful, it can be designed around the effect's value ranges. I was reading Hacker's Delight once and found an ABS on 4 bytes. I remember the water ripples were adding four pixels but also subtracting, and I either had to clamp to zero or take the abs to avoid overflowing into noise, so I had never considered this method before. But then I did this 32-bit add for 4 pixels at once, plus an AND to not pollute bits, plus the Hacker's Delight ABS trick, and I had a 320x200 water ripple on a 386 go from a lowly 7 fps to 21 fps! I used it later in our Pentium demo for a 512x256 water buffer (really the lighter part of that final water ripple scene :)
Quote:
hmmm... 1x 8bit, and 3x 7bit, fine, but how would you do it with 4x 8-bit without polluting the other vars with the "carry" if/when they overflow?
Well, Optimus has basically already said it all.
Yeah, such tricks aren’t always appropriate, it’s a matter of compromise. Either our effect allows overflow with carry and it doesn’t break the visuals, or we choose value ranges that will never cause overflow. That’s the simplest solution :)
However…
Addition with saturation (the one you’re describing) is covered in Hugi:
https://www.hugi.scene.org/online/coding/hugi%2018%20-%20cobrad.htm
https://hugi.scene.org/online/hugi18/cosatadd.htm
Subtraction with saturation down to zero, for values up to 128, I also know, but that idea belongs to wbcbz7, and it’s not my place to reveal it :)
And of course, on Pentium with MMX/SSE, saturation tricks can be done with a single instruction ;)
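For reference, a small portable C sketch of the kind of SWAR saturated byte-add those Hugi articles describe for plain integer code (the originals are x86 assembly; this exact formulation and the name are my own):

#include <stdint.h>

/* add four unsigned bytes packed in a 32-bit word, saturating each byte at 255 */
static uint32_t add4_saturate(uint32_t a, uint32_t b)
{
    /* per-byte sum without letting carries spill into the neighbouring byte */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    uint32_t sum = low ^ ((a ^ b) & 0x80808080u);

    /* carry out of bit 7 of a byte means that byte overflowed */
    uint32_t carry = ((a & b) | ((a | b) & ~sum)) & 0x80808080u;

    /* turn each carry bit into a 0xFF mask and clamp those bytes */
    return sum | ((carry >> 7) * 0xFFu);
}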