the effectiveness of code
category: general [glöplog]
Each time I add an if clause, I think: this will make my app slower, since it needs to process another clause. At the same time it is usually the right course of action, since the app requires that logic.
So I was wondering whether my concerns are valid at all. For instance, if I am running a timer and the OnTimer event has a clause which seems to be required there but which is true only rarely, it means that each time the OnTimer event fires, the processor has to evaluate this clause, and this eats up resources and time, even if only a little. Is this eating up significant, or can the computer actually process thousands of clauses without any noticeable delay?
Another usual situation is the processing of keyboard events. If the app routes everything through one handler, then the KeyDown function contains all the keyboard events, and that means a very big chain of clauses: if, else if, else if. And if the required key code is at the end of the chain, it seems that it will be processed much later.
Code:
#define likely(x) __builtin_expect((bool)(x), true)
#define unlikely(x) __builtin_expect((bool)(x), false)
I have yet to try this.
Code:
if(unlikely(esc_key_pressed)) exit_demo();
How much can be gained this way?
louigi: for the key press case, you could use a LUT of some kind instead of an if/else if chain for each key code. That would speed it up a lot.
This kind of stuff is usually only worth bothering with if it happens a whole lot though.. if you're talking of one if/else statement per second, the difference is likely to be unnoticeable. If you're doing it in a loop that runs 10k times, then it's surely worth some time.
Quote:
Is it true that this eating up is significant
It is if you're running on a 486.
Quote:
or can the computer actually process thousands of clauses without any significant delays?
It can if the clauses themselves don't contain something that's very expensive to calculate.
A totally simplified example: you are running on a 2.0 GHz single-core computer that executes 2 billion instructions per second (theoretically). You have an if clause that says "if (value == someothervalue && value2 == someothervalue2)". A very naive assembly implementation of that will compile into six instructions. Assuming things are in cache, those six instructions will take 0.0000003% of your processing time.
Quote:
Another usual situation is the processing of keyboard events. If the app routes everything through one handler, then the KeyDown function contains all the keyboard events, and that means a very big chain of clauses: if, else if, else if. And if the required key code is at the end of the chain, it seems that it will be processed much later.
Huh?
What neoneye said. It looks like that would take advantage of the branch prediction bits in the Intel architecture... maybe that's what happens with that?
Compile to assembly and look at the code the compiler produces to get a feel for how much your if adds. Also, try to move the if as close as possible to the event that causes it, and away from your timer code.
just a few thoughts.
There are conditional instructions on many platforms, and there's also branch prediction. That's if you're compiling code for a hardware platform and not running it in an interpreter (though that may have branch prediction and a JIT as well).
So: I wouldn't give a fuck.
In theory compilers can optimize code that is executed more often more than other parts of the code. One benefit would be that the code size could shrink and more code could fit into the cache.
__builtin_expect() looks nice! Didn't know about it. This is probably good for time critical code. But:
Quote:
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform.
At the risk of causing another Pouet-style flame-war:
It heavily depends on the platform you program for (as Preacher said).
As a rule of thumb, conditional branching will break the instruction pipeline with a probability of 50% and stall instruction execution until the pipeline is intact again. Branch prediction, branch target buffering, intelligent prefetching and, in some cases, instruction set extensions (like the PowerPC ISEL instruction) help to avoid breaking the pipeline or reduce the time to recover.
Then again, as far as I remember, IA-32 heavily depends on fairly long pipelines to perform, so stalling a pipeline will slow it down measurably. That also explains why Intel's official IA-32 "Optimization Reference Manual" says "Eliminate branches wherever possible" right at the start of chapter 3.4.1 (followed by how to eliminate branches directly below as chapter 3.4.1.1).
Quite honestly none of that __builtin_expect stuff is going to solve the top poster's problem, which is an algorithm problem.
And my general rule is that if you can't be bothered to measure it, don't even bother to optimize.
as someone said, for testing keys, if you want it fast you should use a lookup table instead of chaining a lot of if()s. Use a big table of pointers to the code you want to execute, like this (in no particular programming language, but well :)):
Code:
typedef void (*func)(void);
void moveUp(void)   { /* do things */ }
void moveLeft(void) { /* do things */ }
...
func table[4] = { moveUp, moveLeft, moveDown, moveRight };
...
key = getkey();
func function = table[key]; // get the function
function();                 // run it
This way there are no test nor branching involved.
Quote:
This way there are no test nor branching involved.
But if you think about it, indirect function calls would most probably lead to a pipeline flush just like a normal branch, or not? The compiler/processor doesn't know which function is called until "key" is evaluated at runtime...
On the other hand, it at least solves the problem of massive if/else clauses.
also, if you are really worried about possibly very large if/else if chains, you can organize them in a balanced binary tree fashion instead of a linear chain; that reduces the worst case of N comparisons to a fixed log2(N) comparisons.
but i wouldn't be very worried about this in most cases.
Quote:
But if you think about it, indirect function calls would most probably lead to a pipeline flush just like a normal branch, or not? The compiler/processor doesn't know which function is called until "key" is evaluated at runtime...
So long as the evaluation isn't on the jump, as it is in a conditional branch, the pipeline will flow, my friend. The location of the pointer can be evaluated far before the jump is reached *I think, in my head*
Are there any pipeline-bubble and cache coherence debugging tools out there? Out of curiosity
jua: right, but an if/else chain has a worst case of N comparisons (so up to N pipeline flushes), while the lookup always does only one.
Optimisation hint: don't care about technical things like pipeline flush and all that... first look at your algorithm and make sure it's good :)
I was hoping this thread would have been about the effectiveness of code but it's just some boring code optimisation stuff.
Quote:
So long as the evaluation isn't on the jump, as it is in a conditional branch, the pipeline will flow my friend. The location of the pointer can be evaluated far before the jump is reached *i think in my head*
I wonder if current CPUs do that. It would need to see that there is a call upcoming with the target address in some register. Now it needs to know when that register's content has reached its *final value* before the call, to use that address and feed the pipeline (after all, the compiler might use the register to do various other things before using it to store the call address). Modern CPUs do some quite impressive hardware scheduling, but is it so advanced?
I don't know how "jump to register" is handled, i think it will just take the current register value at the time of feeding the pipeline and hope that works. Maybe it's better if you try to separate the lookup from the actual jump with some other code to ensure it goes smoothly.
Anyway, the CPU pipeline usually does branch prediction using statistics from the previous runs of each branch; for keyboard reading, I think this would end up taking the "not taken" path every time, as you don't press keys that often. When there is no condition, there is no possible choice, and it's up to you to ensure the right value is in the register/address before the pipeline is fed with it. And I said 'no particular language'; it can be done in asm if you need, and it would still be better than the if/else approach even on an arch with no pipeline :)
Quote:
I wonder if current CPUs do that.
Me, too. Then there's an effective address calculation to read the entry from the table, which is a data access (other cache, other pipeline, depending on the architecture used) that costs some time, which might counter the benefit of having an unconditional branch that's easier to buffer and to predict (again, depending on the platform). Also, the function call will push and pop stuff to/from the stack when returning, which a little code block behind an IF/ELSE wouldn't do (unless, of course, the code behind the IF or the ELSE contains exactly that function call).
Also, little code blocks behind IF/ELSE statements usually terminate with an unconditional branch to a fixed address, and these branches can be folded on some platforms, causing no stall at all.
Hmm.. you know I'm not sure.
I guess then, what is more detrimental to the pipeline: a flush because of a conditional branch, or unpredictable memory reads? I feel that if table[key] were evaluated sufficiently far away from the jump, things would flow. With an unconditional branch the processor can continue fetching as if it were any other instruction, which would provide more space between the evaluation and the jump than a conditional jump would. In a conditional jump, the evaluation and branch will most likely be done in one or two instructions.
This is all theory, fellas. Are there any debugging tools where we can see exactly what happens? Maybe if you had a very accurate Intel emulator.
Quote:
This is all theory, fellas.
I agree, but i find it very interesting :-)
you can try this shit out with a performance analyzer. AMD has one, a bit crummy, but it shows all the events (cache stalls etc.).. I guess that's what Intel VTune is for, but I'm too old, so I am still under the impression it's a commercial product. It might not be anymore, though.
Doesn't help I have no interest in fixing up a windows installation any time soon.
yep, one of the better threads.
Louigi Verona: as long as it's about event handling, you can do that with an interpreter interpreting an interpreter and still get away with it ;) it really doesn't matter.
Otherwise: use jump tables (or prepare your code so that the compiler can build its own), and in general make sure that no superfluous code is executed in critical code paths.
(Although an if() is always faster than a function call, which otherwise would do the if() and then early-out most of the time.)
In case you missed it, here is an excerpt from the 1969 Apollo 11 Code:
Code:
# ***********************************************************************************
# DOUBLE PRECISION ROOT FINDER SUBROUTINE (BY ALLAN KLUMPP)
# ***********************************************************************************
#
# N N-1
# ROOTPSRS FINDS ONE ROOT OF THE POWER SERIES A X + A X + ... + A X + A
# N N-1 1 0
# USING NETON'S METHOD STARTING WITH AN INITIAL GUESS FOR THE ROOT. THE ENTERING DATA MUST BE AS FOLLOWS:
# A SP LOC-3 ADRES FOR REFERENCING PWR COF TABL
# L SP N-1 N IS THE DEGREE OF THE POWER SERIES
# MPAC DP X INITIAL GUESS FOR ROOT
#
# LOC-2N DP A(0)
# ...
# LOC DP A(N)
# LOC+2 SP PRECROOT PREC RQD OF ROOT (AS FRACT OF 1ST GUESS)
#
# Page 823
# THE DP RESULT IS LEFT IN MPAC UPON EXIT, AND A SP COUNT OF THE ITERATIONS TO CONVERGENCE IS LEFT IN MPAC+2.
# RETURN IS NORMALLY TO LOC(TC ROOTPSRS)+3. IF ROOTPSRS FAILS TO CONVERGE TO IN 8 PASSES, RETURN IS TO LOC+1 AND
# OUTPUTS ARE NOT TO BE TRUSTED.
#
# PRECAUTION: ROOTPSRS MAKES NO CHECKS FOR OVERFLOW OR FOR IMPROPER USAGE. IMPROPER USAGE COULD
# PRECLUDE CONVERGENCE OR REQUIRE EXCESSIVE ITERATIONS. AS A SPECIFIC EXAMPLE, ROOTPSRS FORMS A DERIVATIVE
# COEFFICIENT TABLE BY MULTIPLYINE EACH A(I) BY I, WHERE I RANGES FROM 1 TO N. IF AN ELEMENT OF THE DERIVATIVE
# COEFFICIENT TABLE = 1 OR >1 IN MAGNITUDE, ONLY THE EXCESS IS RETAINED. ROOTPSRS MAY CONVERGE ON THE COREECT
# ROOT NONETHELESS, BUT IT MAY TAKE AN EXCESSIVE NUMBER OF ITERATIONS. THEREFORE THE USER SHOULD RECOGNIZE:
# 1. USER'S RESPONSIBILITY TO ASSUR THAT I X A(I) < 1 IN MAGNITUDE FOR ALL I.
# 2. USER'S RESPONSIBILITY TO ASSURE OVERFLOW WILL NOT OCCUR IN EVALUTATING EITHER THE RESIDUAL OR THE DERIVATIVE
# POWER SERIES. THIS OVERFLOW WOULD BE PRODUCED BY SUBROUTINE POWRSERS, CALLED BY ROOTPSRS, AND MIGHT NOT
# PRECLUDE EVENTUAL CONVERGENCE.
# 3. AT PRESENT, ERASABLE LOCATIONS ARE RESERVED ONLY FOR N UP TO 5. AN N IN EXCESS OF 5 WILL PRODUCE CHAOS.
# ALL ERASABLES USED BY ROOTPSRS ARE UNSWITCHED LOCATED IN THE REGION FROM MPAC-33 OCT TO MPAC+7.
# 4. THE ITERATION COUNT RETURNED IN MPAC+2 MAY BE USED TO DETECT ABNORMAL PERFORMANCE.
# STORE ENTERING DATA, INITIALIZE ERASABLES
ROOTPSRS EXTEND
QXCH RETROOT # RETURN ADRES
TS PWRPTR # PWR TABLE POINTER
DXCH MPAC +3 # PWR TABLE ADRES, N-1
CA DERTABLL
TS DERPTR # DER TABL POINTER
TS MPAC +5 # DER TABL ADRES
CCS MPAC +4 # NO POWER SERIES DEGREE 1 OR LESS
TS MPAC +6 # N-2
CA ZERO # MODE USED AS ITERATION COUNTER. MODE
TS MODE # MUST BE POS SO ABS WON'T COMP MPAC+3 ETC.
# COMPUTE CRITERION TO STOP ITERATING
EXTEND
DCA MPAC # FETCH ROOT GUESS, KEEPING IT IN MPAC
DXCH ROOTPS # AND IN ROOTPS
INDEX MPAC +3 # PWR TABLE ADRES
CA 5 # PRECROOT TO A
TC SHORTMP # YIELDS DP PRODUCT IN MPAC
TC USPRCADR
CADR ABS # YIELDS ABVAL OF CRITERION ON DX IN MPAC
DXCH MPAC
DXCH DXCRIT # CRITERION
# SET UP DER COF TABL
# Page 824
EXTEND
INDEX PWRPTR
DCA 3
DXCH MPAC # A(N) TO MPAC
CA MPAC +4 # N-1 TO A
DERCLOOP TS PWRCNT # LOOP COUNTER
AD ONE
TC DMPNSUB # YIELDS DERCOF = I X A(I) IN MPAC
EXTEND
INDEX PWRPTR
DCA 1
DXCH MPAC # (I-1) TO MPAC, FETCHING DERCOF
INDEX DERPTR
DXCH 3 # DERCOF TO DER TABLE
CS TWO
ADS PWRPTR # DECREMENT PWR POINTER
CS TWO
ADS DERPTR # DECREMENT DER POINTER
CCS PWRCNT
TCF DERCLOOP
# CONVERGE ON ROOT
ROOTLOOP EXTEND
DCA ROOTPS # FETCH CURRENT ROOT
DXCH MPAC # LEAVE IN MPAC
EXTEND
DCA MPAC +5 # LOAD A, L WITH DER TABL ADRES, N-2
TC POWRSERS # YIELDS DERIVATIVE IN MPAC
EXTEND
DCA ROOTPS
DXCH MPAC # CURRENT ROOT TO MPAC, FETCHING DERIVATIVE
DXCH BUF # LEAVE DERIVATIVE IN BUF AS DIVISOR
EXTEND
DCA MPAC +3 # LOAD A, L WITH PWR TABL ADRES, N-1
TC POWRSERS # YIELDS RESIDUAL IN MPAC
TC USPRCADR
CADR DDV/BDDV # YIELDS -DX IN MPAC
EXTEND
DCS MPAC # FETCH DX, LEAVING -DX IN MPAC
DAS ROOTPS # CORRECTED ROOT NOW IN ROOTPS
TC USPRCADR
CADR ABS # YIELDS ABS(DX) IN MPAC
EXTEND
# Page 825
DCS DXCRIT
DAS MPAC # ABS(DX)-ABS(DXCRIT) IN MPAC
CA MODE
MASK BIT4 # KLUMPP SAYS GIVE UP AFTER EIGHT PASSES
CCS A
BADROOT TC RETROOT
INCR MODE # INCREMENT ITERATION COUNTER
CCS MPAC # TEST HI ORDER DX
TCF ROOTLOOP
TCF TESTLODX
TCF ROOTSTOR
TESTLODX CCS MPAC +1 # TEST LO ORDER DX
TCF ROOTLOOP
TCF ROOTSTOR
TCF ROOTSTOR
ROOTSTOR DXCH ROOTPS
DXCH MPAC
CA MODE
TS MPAC +2 # STORE SP ITERATION COUNT IN MPAC+2
INDEX RETROOT
TCF 2
DERTABLL ADRES DERCOFN -3
:*)
(not that I really understand it \o/)
(ok, looks messed up, found it here)