optimization techniques of the rich and famous
category: code [glöplog]
Hi! I want to make a thread about optimizations. Platform independent. Tell us your tips. Here's mine:
Do things in chunks. If your machine has a cache, it's way better to keep things in that cache. For example, in my NES emulator I bang out the whole background first and then bang out the sprites over it. There's overdraw where the sprites cover the background, but it's worth it.
When it comes to big reads or writes, do them in the architecture's native word size.
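For example, something like this minimal C sketch (assuming the buffer is 4-byte aligned and its size is a multiple of four; names are just for illustration):
Code:
#include <stdint.h>
#include <stddef.h>

/* Copy a buffer in 32-bit words instead of byte by byte.
   Assumes dst/src are 4-byte aligned and nbytes is a multiple of 4. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    size_t n = nbytes / 4;             /* number of 32-bit words */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];               /* one word-sized load and store per step */
}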
What are your tips!!!!?!?!?
- Speed, size/memory consumption, and simplicity work against each other. Find the best way to measure and balance each of them.
- When prototyping, simple, repeatable and robust is better than fast.
- If computation is slow, make a table! If memory is slow, compute it! (See the sketch after this list.)
- Don't be too general. Do only the thing you need.
- Get familiar with the hardware. The faster way on one platform may be slow on another.
- Optimization is less about code and more about data: size, format, layout, transforms, caching, hardware requirements.
- Iterate fast: find ways to try out ideas quickly.
- Find out how others solved a similar problem to get more ideas.
- How do things work under the hood and why? Learn how to read compiler output for different platforms.
- Code that's customized for your particular problem is better optimized than generic code.
- Don't feel attached to your code. It's OK to delete it if you find something better.
- Simple algorithms and data structures are better than fancy ones when data is small. The break-even point is higher than you think.
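One way the table-vs-compute trade-off above plays out, as a rough C sketch (the reciprocal table and its size are just an illustration):
Code:
#include <stdint.h>

/* "If computation is slow, make a table": precompute 65536/i once so a
   perspective divide becomes a lookup plus a multiply. */
#define RECIP_MAX 4096
static uint32_t recip_lut[RECIP_MAX];

void init_recip_lut(void)
{
    recip_lut[0] = 0;                          /* guard the divide-by-zero slot */
    for (uint32_t i = 1; i < RECIP_MAX; i++)
        recip_lut[i] = (1u << 16) / i;         /* reciprocal in 16.16 fixed point */
}

/* "If memory is slow, compute it": on a machine with a fast divider and a
   slow bus, the plain x / z may win instead. Measure both. */
static inline uint32_t div_by_z(uint32_t x, uint32_t z)   /* requires z < RECIP_MAX */
{
    return (uint32_t)(((uint64_t)x * recip_lut[z]) >> 16);
}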
"Don't feel attached to your code". Totally! just because you worked on something a long time doesn't make it good.
- Try everything you can think of; anything is allowed.
- Learn as much boolean logic as you can first.
- Learn the CPU’s instruction set and how it behaves.
- Keep a table of instructions with cycle counts and sizes.
- If you have a puzzle mindset, use it.
- Try alternate techniques; measure and keep the fastest or smallest code.
- Prototype different designs in a higher-level language, then optimize in assembly; keep the fastest or smallest.
- Once the inner loops are optimized, try to reduce the number of loops used.
I recently wrote down (for an article) some notes on speed optimization in code execution...
Specifically in the context of oldskool democoding on PC/DOS:
- Make your routines as self-contained and monolithic as possible. You don’t want to be jumping around between various external procedures and functions inside a loop.
- Minimize memory accesses; in inner loops, try to rely only on CPU registers for intermediate calculations.
- Design your data structures so that your algorithm reads memory linearly. This way you make better use of the CPU cache.
- Look for ways to reduce computations inside the inner loop; think about what can be moved one level higher. Anything that can be calculated outside the loop should be moved out of it!
- Use partial loop unrolling. This saves cycles both on jumps and on register counter operations, and at the same time makes code prefetching more effective.
- Minimize branching inside inner loops.
- Read from and write to memory not in single bytes, but in “words” (16-bit) or “double words” (32-bit).
- Minimize multiplications and divisions. Replace them with bitwise tricks where possible. Use fixed-point arithmetic instead of floating-point; see the sketch after this list. (If you dig deeper into the subject, there are whole books about it, e.g. Hacker's Delight, but be careful, that's heavy stuff!)
- Use lookup tables with precomputed values.
- Apply self-modifying code.
- Write narrow, non-scalable code. Hardcoding is your friend :)
The less flexible and configurable your code is, the better you can optimize it for a specific case.
- And of course: ingenuity, trickery, and fooling the viewer.
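To make the fixed-point tip concrete, a rough C sketch of 16.16 arithmetic (type and macro names are just illustrative):
Code:
#include <stdint.h>

/* 16.16 fixed point: 16 integer bits, 16 fractional bits. */
typedef int32_t fix16;

#define FIX_ONE        (1 << 16)
#define INT_TO_FIX(x)  ((fix16)((x) << 16))
#define FIX_TO_INT(x)  ((int)((x) >> 16))

/* Multiply: widen to 64 bits, then shift the extra 16 fraction bits away. */
static inline fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

/* Division or multiplication by a power of two is just a shift, which is
   exactly the kind of bitwise replacement the tip above refers to. */
static inline fix16 fix_half(fix16 x) { return x >> 1; }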
I only have one technique:
- Measure.
The better (i.e., relevant, accurate, fast-to-get-an-answer) your measurements are, the better you will be able to optimize. Conversely, do crap on faith without benchmarks and profiles, and you'll just end up flailing around.
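In that spirit, the crudest possible measurement harness as a C sketch (effect_frame() here is just a stand-in for whatever you're timing; a real profiler is better when you have one):
Code:
#include <stdio.h>
#include <time.h>

static volatile int sink;                 /* keeps the dummy work from being optimized out */

static void effect_frame(void)            /* stand-in for the routine under test */
{
    for (int i = 0; i < 100000; i++)
        sink += i;
}

int main(void)
{
    const int iters = 1000;               /* run many times so clock resolution stops mattering */
    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        effect_frame();
    clock_t t1 = clock();
    printf("%.3f ms per frame\n",
           1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC / iters);
    return 0;
}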
- Turn off dynamic voltage and frequency scaling
(for more stable measurements that don't depend on your device's temperature or battery level, if applicable)
Since the first Pentiums, CPU-wise, cache efficiency is what determines speed. Only that; forget anything else. The speed factor between an algorithm whose data is adapted to the cache and one whose data is not is *around 50*. Sometimes more instructions and heavier code lead to more speed if it pleases the cache.
So:
- Align your structs to your CPU's cache-line size.
- For each pass of an algorithm, each data vector should be only read or only written.
- If you fully overwrite some region and your CPU has the dreaded "declare the cache line loaded when it wasn't and just fill it with zeros in L1" instruction (dcbz on PowerPC, for example), use it.
- Use the GPU, not the CPU, and forget the previous instructions.
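A quick C11 sketch of the struct-alignment point above (64 bytes is an assumption; check your target's actual line size, e.g. 32 bytes on a classic Pentium):
Code:
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size for a modern x86; adjust for your target */

typedef struct {
    alignas(CACHE_LINE) float pos[3];   /* alignas on the first member forces the     */
    float vel[3];                       /* whole struct to cache-line alignment, and  */
    uint32_t color;                     /* the compiler pads sizeof up to a full line */
} particle;

_Static_assert(sizeof(particle) == CACHE_LINE, "particle should occupy exactly one cache line");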
WRITE EVERYTHING IN 100% ASM!
There are much bigger gains in strategic optimization (approach and architecture) than in optimizing code. Of course you can also optimize code, if you enjoy this time-consuming, fiddly kind of stuff. I tend to write difficult algorithms in Lua, and when everything seems to be in place and exactly to the point, I just write it down again in assembly like a human compiler. This step is fun and easy and can be done with a beer in hand. More often than not things then just work, no further debugging required.
My trick is to brush my teeth while showering
Optimization for the demoscene specifically becomes a kind of state of mind, or knowledge gained by experience. A lot of the suggestions here are general knowledge, but sometimes experience contradicts them in specific cases.
Like, I always knew the golden rule that cache matters now, and that the huge LUTs you did in the past aren't good anymore. But I keep finding counterexamples where LUTs still give me more. Maybe the LUTs were small, or I was reading the data sequentially so I couldn't see the real penalty, who knows. But in my experience, when I designed an effect and wasn't getting enough speed, I switched to alternatives I was afraid to use, or that would be frowned upon if someone saw my code, because "we all know memory is slow now" (I remember an infographic on Twitter showing memory being like 50x slower, possibly even slower than a trigonometric floating-point operation, which I think is an exaggeration unless you somehow disable all caches?).
p.s. In our Pentium demo I switched from ORing bits together for 16bpp, to an 8-bit buffer with a pseudo 16-bit palette, to finally something akin to mapping two 8-bit pixel values through a 256k table of every 32-bit pixel-pair combination (2 pixels at once, but a huge table that should be thrashing the cache). And I ended up with 2x the speed, which is counterintuitive. I thought I was doing the dumbest thing ever, knowing cache matters and this is a huge table, especially for a Pentium. But after testing I was like "wow, that's actually quite a bit faster". Was my code simply horrible before and that's it? Or maybe cache does matter, but not as a matter of blind faith?
Anyway, my take is "premature assumptions are the root of all evil": just test various things, measure performance on different architectures, and make up your mind what you like best. Maybe some clever thing doesn't help as much as you thought, or some part isn't that important for performance, while in other cases some dumb thing somehow saves the day.
- Make a prototype in an environment which allows for easy experimentation and debugging. For example C++ and OpenGL immediate mode or SDL software surface with dead-simple access so you can draw boxes and lines etc. whenever you want.
- Determine bounds on variables and rely on them. This allows you to avoid bounds-checking and reason about edge cases.
- Determine error bounds and necessary precision, so that optimization opportunities can be found where precision can be lowered.
- Understand the target hardware as best as possible - your solutions are expressed as software, but the problems are defined in the domain of hardware.
- Make sure that your decisions are based on either measurements or on first-principles analysis.
- The way that your data is stored and organized (in memory but also on-disk) matters a lot more than how you use it.
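To make that last point concrete, a rough sketch of the classic array-of-structs vs struct-of-arrays choice (field names are invented); a pass that only touches positions pulls far fewer cache lines with the second layout:
Code:
#include <stddef.h>

/* Array of structs: updating only x and y still drags vx, vy, life and
   color through the cache, because they share the same cache lines. */
struct particle_aos { float x, y, vx, vy, life; unsigned color; };

/* Struct of arrays: the same pass now streams through dense float arrays. */
struct particles_soa {
    float *x, *y, *vx, *vy, *life;
    unsigned *color;
    size_t count;
};

void integrate(struct particles_soa *p, float dt)
{
    for (size_t i = 0; i < p->count; i++) {
        p->x[i] += p->vx[i] * dt;
        p->y[i] += p->vy[i] * dt;
    }
}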
Quote:
p.s. In our Pentium demo I switched from ORing bits together for 16bpp, to an 8-bit buffer with a pseudo 16-bit palette, to finally something akin to mapping two 8-bit pixel values through a 256k table of every 32-bit pixel-pair combination (2 pixels at once, but a huge table that should be thrashing the cache).
That “should” is doing a lot of heavy lifting. Most Pentiums have 512 kB of L2 cache, so half your table fits nicely in there assuming you mean 256k entries of 32 bits each; and most likely, not all of it is equally hot. (The Pentium's performance counters are model-specific and were never properly documented, so unfortunately it's not as easy to just measure the hit rate directly on the target CPU. A newer CPU or a simulator may work OK-ish as a proxy.)
Also, note that Pentium is now a 30+ year old design, so what you read about modern CPUs doesn't always really apply…
Quote:
Was my code simply horrible before and that's it? Or maybe cache does matter, but not as a matter of blind faith?
Bad coders have just one trick and follow it blindly. (For some reason, false sharing seems to be a popular holy cow.) Good coders will know that some things are more important than others, but none totally dominate in all situations.
Quote:
And of course: ingenuity, trickery, and fooling the viewer.
Fooling the viewer…. Yes! More specifics on this one please!
On cache: look up exactly how many ways (N) your target's cache is, and which bits of an address are used for the cache index. Then lay out your data so that you get as many hits per cache-line load as possible, and avoid having more than N cached buffers mapping to the same index, or you get cache thrashing.
Math: forward differencing (a sketch below), and computing only every Nth point and interpolating between them. Reparametrize: if you have a 2D or 3D LUT, see if some parameterization or symmetry can reduce or approximate it to 1D or 2D.
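A small C sketch of the forward-differencing part (the quadratic is just an example): instead of computing a*x*x + b*x + c per point, keep running first and second differences and only add inside the loop.
Code:
#include <stddef.h>

/* Fill out[i] = a*i*i + b*i + c for i = 0..n-1 using only additions in the loop. */
void eval_quadratic(float *out, size_t n, float a, float b, float c)
{
    float f  = c;               /* f(0)                                        */
    float d1 = a + b;           /* first difference: f(1) - f(0)               */
    const float d2 = 2.0f * a;  /* second difference, constant for a quadratic */
    for (size_t i = 0; i < n; i++) {
        out[i] = f;
        f  += d1;               /* step the value */
        d1 += d2;               /* step the slope */
    }
}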
A good order of things for me on oldskool PPC has been: try things out on a powerful machine, set each pixel using whatever expensive math and sampling, then replace that with forward differencing and interpolation, then find a way to use LUTs for the math, then do tiled indexing to be more cache friendly, then go to asm, and measure and compare various ideas and trade-offs.
A random well-known trick: overlap your cos and sin tables (better cache usage).
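A sketch of the overlapped table as I understand the trick: since cos(x) = sin(x + pi/2), one table covering 1.25 periods serves both lookups, so they share cache lines (the table size is just an example).
Code:
#include <math.h>
#include <stdint.h>

#define STEPS 1024                           /* entries per full turn               */
static float trig_lut[STEPS + STEPS / 4];    /* 1.25 periods: sin and cos share it  */

void init_trig_lut(void)
{
    for (int i = 0; i < STEPS + STEPS / 4; i++)
        trig_lut[i] = sinf(i * 2.0f * 3.14159265f / STEPS);
}

/* Angles are in 1/1024ths of a turn; cos is just sin offset by a quarter turn. */
static inline float lut_sin(unsigned a) { return trig_lut[a & (STEPS - 1)]; }
static inline float lut_cos(unsigned a) { return trig_lut[(a & (STEPS - 1)) + STEPS / 4]; }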
Optimus, nice one. The 256 KB table was likely efficient because adjacent lookups have ”near colors” (many hits per load) and only a limited set of combos is actually used, so it never needed the full 256 KB, hence low thrashing. What did you store in it that needed 32 bits? Hmm, was [pixA][pixB] two calculated/sampled pixels, and then the buffer held 4 pixels where two were interpolations between them?
Quote:
Fooling the viewer…. Yes! More specifics on this one please!
Classic trick of mine, abusing 256 color quantization for images for better compression.
Quote:
Optimus, nice one. The 256 KB table was likely efficient because adjacent lookups have ”near colors” (many hits per load) and only a limited set of combos is actually used, so it never needed the full 256 KB, hence low thrashing. What did you store in it that needed 32 bits? Hmm, was [pixA][pixB] two calculated/sampled pixels, and then the buffer held 4 pixels where two were interpolations between them?
I had a pseudo 8-bit buffer, 256 colors, rendered with a uint16 pal[256].
Then I generated a uint32 pal32[256*256]: I read two source bytes at once through a uint16 pointer, use that directly as the index, and get a uint32 value = two 16-bit pixels.
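Roughly, in code (a simplified sketch from memory, assuming little-endian byte order):
Code:
#include <stdint.h>

uint16_t pal[256];            /* normal 8-bit -> 16bpp palette                   */
uint32_t pal32[256 * 256];    /* every pair of 8-bit pixels -> two 16bpp pixels  */

void build_pal32(void)
{
    for (int hi = 0; hi < 256; hi++)
        for (int lo = 0; lo < 256; lo++)       /* little-endian: low byte = first pixel */
            pal32[(hi << 8) | lo] = ((uint32_t)pal[hi] << 16) | pal[lo];
}

void blit_8to16(const uint8_t *src, uint16_t *dst, int npixels)
{
    const uint16_t *s = (const uint16_t *)src;  /* read two source pixels at once  */
    uint32_t *d = (uint32_t *)dst;              /* write two dest pixels at once   */
    for (int i = 0; i < npixels / 2; i++)
        d[i] = pal32[s[i]];                     /* one lookup converts both pixels */
}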
I did some tests again in a clean project, 320x200 at 16bpp on my Pentium 166MHz. Yes, there was a bit of improvement on a plasma (smooth colors), but not as much as I remember from the old demo. I even paused the plasma update and just copied the buffer directly. I also rendered noise with random pixels to compare the worst case for the cache, and indeed there is one: the smooth plasma went from 132fps to 156fps, but the noise went from 132fps to 117fps, so there it lost.
Anyway, there are a few places in the demo where I use this. We also use a lot of other techniques. One other thing we struggled with at the time: we fade things out in the distance, so after you fetch the 16bpp texel, the common thing is to unpack RGB, multiply each channel by something in 0-255 and shift >> 8 to darken the color, then pack RGB back again. But we quickly realized it was very costly (I thought the Pentium would be faster); the tunnel barely made it to 30fps, often less. So we did the naive thing: a pseudo 16-bit palette with extra darker versions of it, uint16 fadePal[256 * NUM_SHADES]. The tunnel went to 50-60fps. The tunnel wasn't my code, but it looked very simple otherwise. I then reused this thing everywhere, to fade away clouds, voxels, etc. Much later I figured out how to do 32-bit tricks to blend or fade multiple color channels at once. We never moved to MMX, which would have given even more.
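The shade-table thing, roughly (NUM_SHADES and the 5:6:5 math here are illustrative):
Code:
#include <stdint.h>

#define NUM_SHADES 16
uint16_t pal[256];                      /* base 8-bit -> 16bpp (5:6:5) palette */
uint16_t fadePal[256 * NUM_SHADES];     /* pre-darkened copies of it           */

void build_fade_pal(void)
{
    for (int s = 0; s < NUM_SHADES; s++) {
        for (int c = 0; c < 256; c++) {
            /* unpack, scale and repack once at init time instead of per pixel */
            int r = (pal[c] >> 11) & 31;
            int g = (pal[c] >> 5) & 63;
            int b = pal[c] & 31;
            r = r * s / (NUM_SHADES - 1);
            g = g * s / (NUM_SHADES - 1);
            b = b * s / (NUM_SHADES - 1);
            fadePal[s * 256 + c] = (uint16_t)((r << 11) | (g << 5) | b);
        }
    }
}

/* Per pixel: dst = fadePal[shade * 256 + texel]; no unpack/mul/pack in the inner loop. */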
I kinda struggle though with the fact that most of my stuff so far wasn't very cache-aware. It just happened to be fast enough that I never investigated it. Like, if I wrote something and it was horribly slow, then I would start digging around it. But since things weren't that slow and LUTs seemed to help me, I'm not sure about it.
Like, what tools are there (especially in DOS) to run my code and get a report of the percentage of cache hits and misses? How do you even investigate what's going on?
There was one moment with the Voxelscape where I decided to do some cache optimization, but I didn't know what I was doing and got no improvement. I had the heightmap and colormap stored in separate arrays, each 1024x1024 at 8bpp (quite big). I simply made a struct unifying the two: struct voxeldata { ubyte height; ubyte color; }. I thought it was a good idea at the time, but the bottleneck was probably somewhere else and I don't remember any improvement. How do I know where I have many cache penalties and need to improve?
Final thing: would cache optimization give a 10x improvement in some cases? Or does the original code have to be really seriously bad? Yesterday I was watching this https://www.youtube.com/watch?v=ztkh1r1ioZo where it's claimed that instead of traversing voxels in a linearly stored volume, using a zig-zag storage order called a Lebesgue curve instantly gave 10x the speed. I find that hard to believe, but I think that's the next thing to try on a modern CPU with some huge voxel volume. I know linear traversal would jump far through memory from sample to sample, but not with such a penalty that fixing it would give 10x. Is it true?
It really depends. But if you're talking a modern machine: L1 cache hit = 2–3 cycles. Main memory lookup: 400–500 cycles. If your runtime is dominated by memory lookups (say, just traversing one big linked list that doesn't fit in the cache at all), then sure, you can gain 10x. But most code isn't really like that and modern CPUs are really good at hiding memory latency with out-of-order execution.
The numbers will be much less skewed for Pentium; RAM hasn't gotten that much faster in absolute terms (when talking about latency; in terms of bandwidth, it absolutely has), but the rest of the CPU has.
Optimus, ah I see. So instead of:
Code:
A16 = lut[A8]  B16 = lut[B8]
, you do
Code:
AB16 = x256x256[Apix8, Bpix8]
. Saving one load and the or/shift. I think I never tried that one (downside of thinking too much about the cache: it would immediately seem too expensive, hehe; the PPC I am doing stuff for stalls so badly on cache misses; should try this! As for shading I do the same: I have an 8-bit texture that maps to 16-bit, and then I have a shade table with only 32 shade values per color entry, so I can fade the LUT instead).
Having a mental model of the cache will help, and then just try some schemes and measure the differences. On some CPUs there are performance monitor registers so you can get cache hit/miss info; it seems the P6 has them!
Btw, GPUs do Morton-order tiling because that is very cheap to do in silicon (just rewire the bits), but on a CPU we have to just shake the bits up a little in some scheme that is fast.
You save one load, but you get (typically) the same amount of shifting and ORing. It's not like MOV supports reading multidimensional arrays natively (unless one of the dimensions is 1 or 2 or 4 bytes), so you just get them before the load instead of afterwards.
Well, OK, you can maybe save one shift, too, since you have AH/AL to combine 8-bit values (so you get the shift+or for free in the case of small LUT) but nothing similar for 16-bit (so you need to do the shift manually, but can get the or for free by moving into the lower half of the register after shifting).
Still requires your cache hit rate in the large LUT to be fantastic, though, to be able to win anything. And of course, using partial registers like that is going to be awful on Pentium Pro and such.