Short intro to removing code branches
category: code [glöplog]
Hey guys,
I wrote a post where I've tried to explain how to write branchless code and thought it would be great to hear some feedback from you on it. Here's the link
Let me know what you think!
  
I wrote a post where I've tried to explain how to write branchless code and thought it would be great to hear some feedback from you on it. Here's the link
Let me know what you think!
Have you actually benchmarked these two? Any normal C compiler these days is likely to transform something as simple as the first one to a conditional move instruction anyway, which is faster than trying to implement it yourself.
Not to mention that branches are not a problem; unpredictable branches are.
  
Not to mention that branches are not a problem; unpredictable branches are.
I just ran a quick benchmark on this, and you're right, when the branches are not predictable I get some mixed results on which one executes faster, but when the branching is more predictable then this the branchless code seems to run slower.
  
I compiled both of your variants with GCC (5.2.1), locking “some_other_value” to an arbitrary value (89). The first yields:
and the branchless:
I'll let you draw your own conclusions. :-)
  
Code:
imul   edx,edi,0x17
lea    eax,[rdi-0x5]
cmp    edi,0x59
cmovge eax,edx
and the branchless:
Code:
imul   eax,edi,0x17
lea    ecx,[rdi-0x5]
mov    edx,0x59
sub    edx,edi
sar    edx,0x1f
xor    eax,ecx
and    eax,edx
xor    eax,ecx
I'll let you draw your own conclusions. :-)
So far, on my graphics card at least, such tricks are still useful for shader programming. But for the CPU, yeah, hard to beat the compiler (except using parallelization and cache optimization).
Sesse, in your branchless disassembly, the xor eax, ecx could be executed before (interleaved between 2 operations uppon edx), so that it would be free. Strange that the compiler did not figure it out. Or did I miss something?
  
Sesse, in your branchless disassembly, the xor eax, ecx could be executed before (interleaved between 2 operations uppon edx), so that it would be free. Strange that the compiler did not figure it out. Or did I miss something?
I think most CPUs already do out-of-order execution on their own, so that shouln't matter much?
  
Essential optimization.
Great for parallel procedsing, like opencl or even just glsl.
Because there a compiler that often makes outdated suboptimal assumptions on how to optimite for modern Hardware has to decide how to compile segments to pipelines.
The More constants and The less branches, The faster and better will ANY compiler perform.
And any branching Path can easily pause flow controll, up to doubling The number of gpu cycles.
Even in glsl a dotproduct can often be done faster than an if() statement due to parallel pipes being best without valves branching between them.
  
Great for parallel procedsing, like opencl or even just glsl.
Because there a compiler that often makes outdated suboptimal assumptions on how to optimite for modern Hardware has to decide how to compile segments to pipelines.
The More constants and The less branches, The faster and better will ANY compiler perform.
And any branching Path can easily pause flow controll, up to doubling The number of gpu cycles.
Even in glsl a dotproduct can often be done faster than an if() statement due to parallel pipes being best without valves branching between them.





