Help for vec3/4 Library Speedtest C/Linux
category: code [glöplog]
Thanks joooo :)
Doing some comparison tomorrow.
I have a hunch about the results, but let's see what comes out.
  
Quote:
So why not use intrinsics instead of raw assembler code, such as those defined in e.g. xmmintrin.h and emmintrin.h?
Very valid because of things that have been said already :) Also, using intrinsics doesn't cripple a compiler's optimization wits half as much as inline assembler (i.e. it has more information about what it is you're actually doing). Use inline ass. (yes, I went there) to tweak batchprocessing loops the compiler obviously screws up, not much else.
Please take your daily dose of ryg kool-aid: "The Ubiquitous SSE vector class: Debunking a common myth". Thank you.
  
that is, +1 to the "batchprocessing loops" of the poster above..
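
To make the "intrinsics in a batch loop" point concrete, here is a minimal sketch, not anyone's actual library code. The function name `add_floats` and the flat float-array layout are assumptions for illustration; the intrinsics themselves (`_mm_loadu_ps`, `_mm_add_ps`, `_mm_storeu_ps`) come from xmmintrin.h. It falls back to plain C when SSE isn't available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] over a packed float array.
 * The whole loop lives in one place, so the compiler sees the full
 * load/add/store pattern instead of isolated vec4 operations. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst && a && b);
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per iteration; unaligned loads keep the sketch simple. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when SSE is not compiled in). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Whether this beats what the compiler auto-vectorizes from the plain loop depends on compiler and flags, so it's worth timing both.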
  
Thanks :)
Ryg puts it nicely once again. Even with intrinsics it might not be the best of ideas to port all math primitives per se, kind of depends on the usage pattern. I once did it in 2003 for some commercial code (Xbox1 & PC) and in the end the performance figures were not staggering, partly because of the retro way the math stuff was used (lots of here-and-there float math on individual components et cetera -- written years and years before).
So with intrinsics as well, it's best to try and fit the datatypes and operations into a piece of program flow that isn't necessarily going to be interleaved all over the fucking place. I know it's a bit lame to bring this up but the first thing I noticed when looking over the Doom3 SDK is that id did pretty much the same thing (standard math in neat C++ classes and optimized batchprocessing functions in the 'SIMDProcessor' stuff).
It's a potentially powerful tool but no magic fix. Oh well :)
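
A rough sketch of that split in plain C: a boring scalar `vec4` for the here-and-there math, plus a small table of batch functions that gets picked once. This is only loosely modelled on the idea described above and is not the Doom3 SDK's actual interface; `batch_ops`, `batch_ops_select` and `scale_array` are made-up names, and the selection here is a compile-time check where real code would more likely query the CPU at startup.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Plain scalar type for scattered, per-component math. */
typedef struct { float x, y, z, w; } vec4;

static inline vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Hypothetical table of batch routines; only one entry in this sketch. */
typedef struct {
    const char *name;
    void (*scale_array)(float *dst, const float *src, float s, size_t n);
} batch_ops;

static void scale_array_generic(float *dst, const float *src, float s, size_t n)
{
    assert(dst && src);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * s;
}

#if defined(__SSE__)
static void scale_array_sse(float *dst, const float *src, float s, size_t n)
{
    assert(dst && src);
    const __m128 vs = _mm_set1_ps(s);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), vs));
    for (; i < n; ++i)          /* scalar tail */
        dst[i] = src[i] * s;
}
#endif

static const batch_ops ops_generic = { "generic", scale_array_generic };
#if defined(__SSE__)
static const batch_ops ops_sse = { "sse", scale_array_sse };
#endif

/* Pick a table once; callers never touch intrinsics directly. */
const batch_ops *batch_ops_select(int allow_simd)
{
#if defined(__SSE__)
    if (allow_simd)
        return &ops_sse;
#else
    (void)allow_simd;
#endif
    return &ops_generic;
}
```

Usage would be along the lines of `const batch_ops *ops = batch_ops_select(1); ops->scale_array(dst, src, 2.0f, n);`, with `vec4_add` and friends left alone for everything that isn't a big array.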
  
(I reverted the stuff I "optimized", needless to say :))
  
yeah, i also stole the SIMDProcessor idea from the D3 SDK back in the day. Actually, the first time i came across this concept was when reading the public brochures about Hybrid's/CNCD's SurRender3D API from around '99/2000 or so (dumpbin on the SR dlls included in project chinadoll can give you some idea of the API btw).
I remember being amazed when plain unrolling of the loops without SSE gave a super-big performance boost to my vertex deformation code :)
Geez, great times.
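
For anyone who hasn't done it by hand, this is roughly what "plain unrolling without SSE" looks like. The deformation itself (push each position component along its normal by a fixed amount) and the name `displace_unrolled` are assumptions for the sake of the example, not the original vertex code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical deformation: pos[i] += normal[i] * amount over flat arrays.
 * The body is unrolled four times by hand; no SIMD involved. */
void displace_unrolled(float *pos, const float *normal, float amount, size_t n)
{
    assert(pos && normal);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        pos[i + 0] += normal[i + 0] * amount;
        pos[i + 1] += normal[i + 1] * amount;
        pos[i + 2] += normal[i + 2] * amount;
        pos[i + 3] += normal[i + 3] * amount;
    }
    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; ++i)
        pos[i] += normal[i] * amount;
}
```

Current compilers often unroll (and vectorize) a loop like this on their own at higher optimization levels, so the hand-unrolled version is mostly worth it after measuring.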
  
I've had some great wins *with* intrinsic SSE: marching cubes, Sutherland-Hodgman software shadow clipping, modified ADPCM decoding, that kind of thing. Otherwise yeah, use with care :)
Once tried a 128-bit hash compare for fun, one with SSE intrinsics, one with inline assembler standard fare x86. The former actually only got faster after more than 100k compares on sequential memory :)
But the concern Ryg raises about haphazard memory access, which I've sure as hell seen with VC, is much better with the Intel compiler, as stated before.
Oh well I'll get back to being rusty, old, alcoholic and debilitated now.
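
A minimal sketch of what a 128-bit compare with intrinsics can look like, assuming SSE2 (emmintrin.h) and a 16-byte hash stored as raw bytes. `hash128_equal` is a made-up name and this is not the code from the experiment above; without SSE2 it just falls back to `memcmp`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns 1 if the two 16-byte hashes are identical, 0 otherwise. */
int hash128_equal(const uint8_t a[16], const uint8_t b[16])
{
    assert(a && b);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* cmpeq yields 0xFF per matching byte; movemask packs the top bit of
     * each byte into 16 bits, so all-equal means 0xFFFF. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
#else
    return memcmp(a, b, 16) == 0;
#endif
}
```

A single compare like this is tiny, which fits the observation that the SSE version only pays off over a long run of compares on sequential memory.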
  
d00m.org/~kmw/results/results.html
Sorry it took so long... just felt like giving you at least some fancy plots :)
  
these plots are weird! it would be more natural and readable to use x axis for array sizes, while using different lines/colors for different operations. 
  
i know, but matplotlib only did ugly bar charts, so i took this layout
not intuitive but well... better than nothing :P
  
well, if you still have the raw data, it's really easy to do all kinds of plots in gnuplot
  




