
Help for vec3/4 Library Speedtest C/Linux

category: code [glöplog]
Thanks joooo :)

Doing some comparison tomorrow.
I've got a feeling about the results, but let's see what comes around.

So why not use intrinsics instead of raw assembler code, such as those defined in e.g. xmmintrin.h and emmintrin.h?

Very valid because of things that have been said already :) Also, using intrinsics doesn't cripple a compiler's optimization wits half as much as inline assembler (i.e. it has more information about what it is you're actually doing). Use inline ass. (yes, I went there) to tweak batchprocessing loops the compiler obviously screws up, not much else.
added on the 2011-12-14 20:05:25 by superplek superplek
Please take your daily dose of ryg kool-aid: The Ubiquitous SSE vector class: Debunking a common myth. Thank you.
added on the 2011-12-15 16:42:08 by spike spike
that is, +1 to the "batchprocessing loops" remark of the poster above...
added on the 2011-12-15 16:43:27 by spike spike
Thanks :)

Ryg puts it nicely once again. Even with intrinsics it might not be the best of ideas to port all math primitives per se, kind of depends on the usage pattern. I once did it in 2003 for some commercial code (Xbox1 & PC) and in the end the performance figures were not staggering, partly because of the retro way the math stuff was used (lots of here-and-there float math on individual components et cetera -- written years and years before).

So with intrinsics as well, it's best to try and fit the datatypes and operations into a piece of program flow that isn't necessarily going to be interleaved all over the fucking place. I know it's a bit lame to bring this up but the first thing I noticed when looking over the Doom3 SDK is that ID did pretty much the same thing (standard math in neat CPP classes and optimized batchprocessing functions in the 'SIMDProcessor' stuff).

It's a potentially powerful tool but no magic fix. Oh well :)
added on the 2011-12-15 16:56:58 by superplek superplek
(I reverted the stuff I "optimized", needless to say :))
added on the 2011-12-15 16:57:15 by superplek superplek
yeah, i also stole the SIMDProcessor idea from the D3 SDK back in the days. Actually, the first time i came across this concept was when reading the public brochures about Hybrid's/CNCD's SurRender3D API from around '99/2000 or so (dumpbin on the SR dlls included in project chinadoll can give you some idea of the API btw).
I remember being amazed when plain unrolling of the loops without SSE gave a super-big performance boost to my vertex deformation code :)
Geez, great times.
added on the 2011-12-15 17:05:36 by spike spike
I've had some great wins *with* intrinsic SSE: marching cubes, Sutherland-Hodgman software shadow clipping, modified ADPCM decoding, that kind of thing. Otherwise yeah, use with care :)

Once tried a 128-bit hash compare for fun, one with SSE intrinsics, one with standard-fare x86 inline assembler. The former actually only got faster after more than 100k compares on sequential memory :)

But the concern Ryg raises about haphazard memory access, which I've sure as hell seen with VC, is handled much better by the Intel compiler, as stated before.

Oh well I'll get back to being rusty, old, alcoholic and debilitated now.
added on the 2011-12-15 17:10:42 by superplek superplek
Sorry it took so long... just felt like giving you at least some fancy plots :)
these plots are weird! it would be more natural and readable to put the array sizes on the x axis and use different lines/colors for the different operations.
added on the 2012-09-25 08:35:50 by provod provod
i know, but matplotlib only did ugly bar charts, so i took this layout
not intuitive, but well... better than nothing :P
well, if you still have the raw data, it's really easy to do all kinds of plots in gnuplot
added on the 2012-09-25 10:21:08 by provod provod