Go to bottom

Short intro to removing code branches

category: code [glöplog]
Hey guys,

I wrote a post where I've tried to explain how to write branchless code and thought it would be great to hear some feedback from you on it. Here's the link

Let me know what you think!
added on the 2015-11-22 12:08:11 by jabudcha jabudcha
Have you actually benchmarked these two? Any normal C compiler these days is likely to transform something as simple as the first one to a conditional move instruction anyway, which is faster than trying to implement it yourself.

Not to mention that branches are not a problem; unpredictable branches are.
added on the 2015-11-22 12:20:17 by Sesse Sesse
I just ran a quick benchmark on this, and you're right, when the branches are not predictable I get some mixed results on which one executes faster, but when the branching is more predictable then this the branchless code seems to run slower.
added on the 2015-11-22 14:47:59 by jabudcha jabudcha
I compiled both of your variants with GCC (5.2.1), locking “some_other_value” to an arbitrary value (89). The first yields:

Code:imul edx,edi,0x17 lea eax,[rdi-0x5] cmp edi,0x59 cmovge eax,edx

and the branchless:

Code:imul eax,edi,0x17 lea ecx,[rdi-0x5] mov edx,0x59 sub edx,edi sar edx,0x1f xor eax,ecx and eax,edx xor eax,ecx

I'll let you draw your own conclusions. :-)
added on the 2015-11-22 14:56:29 by Sesse Sesse
So far, on my graphics card at least, such tricks are still useful for shader programming. But for the CPU, yeah, hard to beat the compiler (except using parallelization and cache optimization).
Sesse, in your branchless disassembly, the xor eax, ecx could be executed before (interleaved between 2 operations uppon edx), so that it would be free. Strange that the compiler did not figure it out. Or did I miss something?
added on the 2015-11-23 08:47:24 by Soundy Soundy
I think most CPUs already do out-of-order execution on their own, so that shouln't matter much?
Essential optimization.

Great for parallel procedsing, like opencl or even just glsl.

Because there a compiler that often makes outdated suboptimal assumptions on how to optimite for modern Hardware has to decide how to compile segments to pipelines.

The More constants and The less branches, The faster and better will ANY compiler perform.

And any branching Path can easily pause flow controll, up to doubling The number of gpu cycles.

Even in glsl a dotproduct can often be done faster than an if() statement due to parallel pipes being best without valves branching between them.
added on the 2016-04-03 20:46:07 by ollj ollj


Go to top