optimization techniques of the rich and famous
category: code [glöplog]
@sesse yeah, that is true, but in this case I think he said he reads them through a uint16 ptr directly; it is likely an 8-bit bitmap/buffer with the src pixels already stored adjacently. So it saves a bit more than just the one load (e.g. an OR and a shift, or however the baseline extracted the individual pixels and then combined them after the LUTs).
Btw, Optimus, it might be worth trying to read through a uint32 ptr (if pixels are adjacent in this way), extract two 16-bit values and then do two LUT lookups. As for 10x improvements, it depends. For oldskool, check out fatmap2.txt; tiled textures are something many missed doing (e.g. Quake(!!)) but they help, and it's exactly the same principle as the voxel. It doesn't have to be too fancy though, the key is: traverse stuff in a way that stays within the cache line. So if you move a bit in u or v, you want to hit the same cache line. My rasterizer got a nice speedup from tiled texturing.
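To make that concrete, here is a rough C sketch of the "one wide load, two LUT lookups" idea. It assumes a little-endian CPU, 4-byte-aligned source data and a 64K-entry lut[] indexed by a pair of adjacent 8-bit source pixels; the names are made up for illustration:

#include <stdint.h>

/* remap four adjacent 8-bit source pixels into two output pixels,
   using one 32-bit load instead of two 16-bit loads */
void remap_four(const uint8_t *src, uint8_t *dst, const uint8_t *lut)
{
    uint32_t quad = *(const uint32_t *)src;   /* fetches src[0..3] in one go  */
    dst[0] = lut[quad & 0xFFFFu];             /* pixels 0+1 index the LUT     */
    dst[1] = lut[quad >> 16];                 /* pixels 2+3 index the LUT     */
}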
@bitl what was your workflow for optimizing your mindblowing 486 demo?
comment out half your code and your demo will run two times faster!
In addition to the tricks and methods mentioned in this thread, there are older diskmag articles (Imphobia had some, but also Hugi and others). And there should also be some good old texts floating around on the web. I remember some about affine texture mapping, or tight texture-mapping inner loops (for example about adding 2 values in parallel in a 32-bit word on Amiga, SIMD-style) by Azure.
Here are some links:
https://www.hugi.scene.org/online/coding/hugi%20se%204%20-%20index%20sorted%20by%20topic.htm
https://github.com/othieno/GPBB
https://amycoders.org/
ah those are great links dresdenboy! thanks!!
I also pack 2 values in 32 bits and do a single add in the inner loop of my rasterizer. You must leave gaps between the two and clear the gap bit if it accumulates, but with carefully placed ones and zeros you can also combine this to get swizzling, and get the ptr calculation for free too (on PPC we have the rather non-RISC rlwimi/rlwnm instructions, which help a lot for this).
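For illustration, a minimal C sketch of such a packed-interpolant inner loop, with an assumed layout: U as 8.8 fixed point in bits 0..15, a guard bit at bit 16, V as 7.8 fixed point in bits 17..31, and a 256x128 tiling texture. The swizzling / free pointer calculation part is left out:

#include <stdint.h>

/* step U and V with a single add per pixel; the guard bit keeps U's
   carry from leaking into V */
void textured_span(uint8_t *dst, const uint8_t *texture, int len,
                   uint32_t uv, uint32_t duv)   /* both packed in the same layout */
{
    for (int i = 0; i < len; i++) {
        uint32_t u = (uv >> 8) & 0xFFu;         /* integer part of U */
        uint32_t v =  uv >> 25;                 /* integer part of V */
        dst[i] = texture[(v << 8) | u];

        uv += duv;                              /* advance both interpolants at once */
        uv &= ~(1u << 16);                      /* clear the guard bit each step */
    }
}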
Are you thinking of fatmap2.txt by MRI from Doomsday, perhaps? (but is x86) https://www.multi.fi/~mbc/sources/fatmap2.txt
Quote:
@bitl what was your workflow for optimizing your mindblowing 486 demo?
@noglin: Thank you for the kind words about the demo.
My workflow looks something like this:
- coming up with the concept of the effect in my head
- writing prototype algorithm in Pascal, just to see if the idea works at all.
(estimating performance in DOSBox by setting the number of cycles to roughly match the hardware I’m targeting. Yes, DOSBox won’t reflect the real performance exactly, but it’s enough to get a first idea and understand whether the concept is feasible.)
- long thinking and experimenting, until the architecture of the algorithm and the visual result seem good enough to move on to the next step
- rewriting critical parts in assembler, using my hands-on experience to take on the challenge of beating the compiler
- experimenting on real hardware: testing different ideas, writing multiple code variants and checking which one gives the best FPS. Down to testing a single instruction or different execution orders. Sometimes completely counterintuitive things turn out to be faster.
- rewriting less critical but still important parts in assembler, trying to squeeze out another 1–2 FPS
- the code starts haunting me in dreams, I try to compile and run it in my sleep, then wake up and realize it doesn’t work that way. Seems like I need to take a break and switch to another demo effect.
- after some time I come back to this code and find a way to make it faster. And since I gained a few extra FPS, I start thinking what more I can add to the visuals.
- Every fifth iteration, I smack my forehead with my palm and say "Damn it, I’m so stupid! This could have been done much simpler!"
- and so on, many times :)
P.S. Yes, of course, many effects in Demoded are not groundbreaking - algorithms like 2D bump mapping or sinus plasma have been around for ages. But I usually don’t study other people’s code (at least because I don’t enjoy digging into someone else’s sources). I prefer to reinvent the effect, implement it the way I understand its concept, and along the way make it as fast as possible while adding some personal embellishments.
So it’s quite possible that I’m doing some things in a really clumsy, less rational way than it was done in the old demos.
Regarding the details, it’s mostly what’s already been said in this thread.
Operations on four 8-bit numbers using a single 32-bit instruction, bit rotation shuffles, etc. - it’s like solving a puzzle in the minimum number of steps :)
Bitwise tricks, fixed-point arithmetic. And just little flashes of insight. That’s what makes coding in assembly worth it in the first place.
Splitting the render into “tiles”: for example, the pseudo-3D spheres in Demoded are rendered in 16x16-pixel blocks, which makes cache hits more even regardless of the angle the map is scanned from (otherwise FPS jumps from 20 to 60 depending on the sphere’s rotation). A sketch of this kind of tiled addressing follows right after this post.
Loop unrolling. I don’t know about Pentiums, but it still works on a 486.
Modifying the inner loop code at runtime (before entering the loop): for example, inserting numeric values directly into instructions in memory. This trick helps a lot, since registers are always limited, and accessing a variable in memory isn’t always convenient and is usually slower.
(The main thing after such a modification is to do a formal jump, to reset the CPU's code prefetch queue.)
But, as I see it, the decisive factor is still design ingenuity. I often watch various demos on Amiga, Atari, C64, etc., spot a clever effect that probably doesn’t take much effort (at least on PC) and I envy it: damn, that’s just awesome!
P.S.: Sorry for the long post :)
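As promised above, a hypothetical C sketch of the tiled layout idea (the same principle as in fatmap2.txt and the 16x16 blocks mentioned in the post): the map is stored as 16x16 blocks instead of row by row, so a small step in either u or v tends to stay inside the same 256-byte block and therefore the same few cache lines. Sizes and names are my own:

#include <stdint.h>

#define TEX_W         256              /* map width in pixels    */
#define TILE          16               /* 16x16 tile = 256 bytes */
#define TILES_PER_ROW (TEX_W / TILE)

/* linear (u,v) -> offset into the tiled layout */
static uint32_t tiled_offset(uint32_t u, uint32_t v)
{
    uint32_t tile    = (v / TILE) * TILES_PER_ROW + (u / TILE);
    uint32_t in_tile = (v % TILE) * TILE + (u % TILE);
    return tile * (TILE * TILE) + in_tile;
}

/* one-time conversion of a linear map into the tiled layout */
void tile_map(const uint8_t *linear, uint8_t *tiled, uint32_t height)
{
    for (uint32_t v = 0; v < height; v++)
        for (uint32_t u = 0; u < TEX_W; u++)
            tiled[tiled_offset(u, v)] = linear[v * TEX_W + u];
}

In the inner loop the divisions and modulos become shifts and masks, of course; the point is only the memory layout.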
Quote:
Bitwise tricks, fixed-point arithmetic. And just little flashes of insight. That’s what makes coding in assembly worth it in the first place.
Yup. The reason why assembler works isn't that the compiler sucks (because it doesn't, it's far better than most programmers on compiling C). It is because you can do fundamentally different things in assembler sometimes. Also, you can beat it on size if that's important.
thx bitl, nice workflow! loved the long post! "the code starts haunting me in dreams,..." :D
"Operations on four 8-bit numbers using a single 32-bit instruction"
hmmm... 1x 8bit, and 3x 7bit, fine, but how would you do it with 4x 8-bit without polluting the other vars with the "carry" if/when they overflow?
"Operations on four 8-bit numbers using a single 32-bit instruction"
hmmm... 1x 8bit, and 3x 7bit, fine, but how would you do it with 4x 8-bit without polluting the other vars with the "carry" if/when they overflow?
Interesting, that 32-bit trick of interpolating both U/V in one add for texture mapping. I considered it before but never seriously tried it, as I barely had the space to keep nearby bits from overflowing unless I changed the fractional bits. And then I'd have to do extra work to retrieve both separately for texture indexing, unless everything is set up so that the final 32-bit value maps well, or needs only a small shift. Might try it in the future.
But for 4 8bit pixels at once I used it in other cases pretty well.
Like if I do a plasma where I usually add 3 sine LUTs, my sine ranges are low (like 0-63) so the overall add doesn't overflow into the next bytes. Then instead of isin[x] I read a 32-bit value at &isin[x], so it's like I pick x, x+1, x+2, x+3 at once and add them together with the others.
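A minimal C sketch of that packed plasma add, under the stated assumption that each sine LUT holds values 0..63 (so three byte-wise adds stay below 256 and no carry crosses a byte boundary); table name, sizes and the wraparound handling are my own:

#include <stdint.h>

/* isin[] must be pre-filled with a sine scaled to 0..63; the 3 extra entries
   repeat isin[0..2] so the 32-bit overread near the end of the table is safe */
static uint8_t isin[256 + 3];

/* four plasma pixels per iteration: one 32-bit read of the LUT grabs
   isin[p], isin[p+1], isin[p+2], isin[p+3] at once */
void plasma_row(uint8_t *dst, int width, int p1, int p2, int p3)
{
    for (int x = 0; x < width; x += 4) {
        uint32_t s1 = *(const uint32_t *)&isin[(p1 + x) & 0xFF];
        uint32_t s2 = *(const uint32_t *)&isin[(p2 + x) & 0xFF];
        uint32_t s3 = *(const uint32_t *)&isin[(p3 + x) & 0xFF];
        *(uint32_t *)(dst + x) = s1 + s2 + s3;   /* per-byte sums stay <= 189 */
    }
}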
Or with blur: for a 4-neighbour-pixel blur I fill my buffer with values no bigger than 0-63. But then, if you add 4 values and shift right by 2, the lower bits might pollute the next byte cells, so I also AND with 00111111 00111111 00111111 00111111.
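A small C sketch of that masked 4-pixel blur, assuming buffer values are kept in 0..63 so the four byte-wise sums never exceed 255 and no carry crosses a byte boundary; pointer casts and edge handling are glossed over and the names are mine:

#include <stdint.h>

/* blur one row: dst[x] = (up + down + left + right) / 4, four pixels per iteration */
void blur_row(const uint8_t *src, uint8_t *dst, int width, int pitch)
{
    for (int x = 0; x < width; x += 4) {     /* caller guarantees valid neighbours */
        uint32_t up    = *(const uint32_t *)(src + x - pitch);
        uint32_t down  = *(const uint32_t *)(src + x + pitch);
        uint32_t left  = *(const uint32_t *)(src + x - 1);   /* unaligned, fine on x86 */
        uint32_t right = *(const uint32_t *)(src + x + 1);

        uint32_t sum = up + down + left + right;   /* per-byte sums stay <= 252 */

        /* shift right by 2 to divide, then mask off the bits that leaked in
           from the byte above, keeping each result in 0..63 again */
        *(uint32_t *)(dst + x) = (sum >> 2) & 0x3F3F3F3Fu;
    }
}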
And for every other effect where the trick is useful, it can be designed around the effect's value ranges. I was reading Hacker's Delight once and found an ABS on 4 bytes. I remember the water ripples were adding four pixels but also subtracting, and I either had to clamp to zero or take the abs to avoid overflowing into noise, so I had never considered this method before. But then I did this 32-bit add for 4 pixels at once, plus an AND to not pollute bits, plus the Hacker's Delight ABS trick, and I had a 320x200 water ripple on a 386 go from a lowly 7 fps to 21 fps! I used it later in our Pentium demo for a 512x256 water buffer (really the lighter part of that final water ripple scene :)
Quote:
hmmm... 1x 8bit, and 3x 7bit, fine, but how would you do it with 4x 8-bit without polluting the other vars with the "carry" if/when they overflow?
Well, Optimus has basically already said it all.
Yeah, such tricks aren’t always appropriate, it’s a matter of compromise. Either our effect allows overflow with carry and it doesn’t break the visuals, or we choose value ranges that will never cause overflow. That’s the simplest solution :)
However…
Addition with saturation (the one you’re describing) is covered in Hugi:
https://www.hugi.scene.org/online/coding/hugi%2018%20-%20cobrad.htm
https://hugi.scene.org/online/hugi18/cosatadd.htm
Subtraction with saturation down to zero, for values up to 128, I also know, but that idea belongs to wbcbz7, and it’s not my place to reveal it :)
And of course, on Pentium with MMX/SSE, saturation tricks can be done with a single instruction ;)
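For reference, a small portable C sketch of the kind of SWAR saturated byte-add those Hugi articles describe for plain integer code (the originals are x86 assembly; this exact formulation and the name are my own):

#include <stdint.h>

/* add four unsigned bytes packed in a 32-bit word, saturating each byte at 255 */
static uint32_t add4_saturate(uint32_t a, uint32_t b)
{
    /* per-byte sum without letting carries spill into the neighbouring byte */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    uint32_t sum = low ^ ((a ^ b) & 0x80808080u);

    /* carry out of bit 7 of a byte means that byte overflowed */
    uint32_t carry = ((a & b) | ((a | b) & ~sum)) & 0x80808080u;

    /* turn each carry bit into a 0xFF mask and clamp those bytes */
    return sum | ((carry >> 7) * 0xFFu);
}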