## does sse suck!!???

**category:** general [glöplog]

some time ago i wrote a nice realtime raytracer with single precision fpu operations, but some arithmetic operations like fsqrt or fdiv were really slow.

so i decided to buy a nice new computer with some more power and simd 2 support.

i thought the acceleration would be great, because of 4 operations at once and the approximations, but my results were real crap.

is it not possible to write a raytracer with sse because of the absolute error of the approximations?

i'm very frustrated, because i wanted to present a nice demo at breakpoint 2003.

does anyone know a solution?

...problem solved...

there is a possibility to raise the precision with some additional operations.

see:

http://www.agner.org

i'm not aware of any precision limitations with SSE or SSE2. The operands are supposed to be the same precision as IEEE floats or doubles.

the problem was the reciprocal approximation

the precision is just 12 bits... too little for primary ray intersection calculations.

but with the Newton-Raphson formula you can improve the precision to 23 bits

newton-raphson with the reciprocal approximation is still faster overall than using divps or sqrtps

here are the formulas:

reciprocal square root:

x0 = RSQRTSS(a)

x1 = 0.5 * x0 * (3 - (a * x0) * x0) (raised precision)

reciprocal:

x0 = RCPSS(d)

x1 = x0 * (2 - d * x0) = 2*x0 - d * x0 * x0

simple square root:

sqrt(x) = x * rsqrt(s)
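For readers who prefer C over raw opcodes, the formulas above can be sketched roughly like this with the SSE intrinsics from `<xmmintrin.h>`. This is only an illustration, not the code from the raytracer: the function names `rcp_nr`, `rsqrt_nr` and `sqrt_nr` are made up here, and the square root is taken as x * rsqrt(x).

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rcp_ss, _mm_rsqrt_ss, ... */

/* 1/d: hardware estimate (about 12 bits) refined by one Newton-Raphson step.
   rcp_nr is a hypothetical helper name. */
static float rcp_nr(float d)
{
    __m128 vd = _mm_set_ss(d);
    __m128 x0 = _mm_rcp_ss(vd);                        /* x0 = RCPSS(d) */
    /* x1 = x0 * (2 - d * x0) */
    __m128 x1 = _mm_mul_ss(x0,
                    _mm_sub_ss(_mm_set_ss(2.0f), _mm_mul_ss(vd, x0)));
    return _mm_cvtss_f32(x1);
}

/* 1/sqrt(a): hardware estimate refined by one Newton-Raphson step.
   rsqrt_nr is a hypothetical helper name. */
static float rsqrt_nr(float a)
{
    __m128 va = _mm_set_ss(a);
    __m128 x0 = _mm_rsqrt_ss(va);                      /* x0 = RSQRTSS(a) */
    /* x1 = 0.5 * x0 * (3 - (a * x0) * x0) */
    __m128 t  = _mm_mul_ss(_mm_mul_ss(va, x0), x0);
    __m128 x1 = _mm_mul_ss(_mm_mul_ss(_mm_set_ss(0.5f), x0),
                           _mm_sub_ss(_mm_set_ss(3.0f), t));
    return _mm_cvtss_f32(x1);
}

/* sqrt(x) = x * rsqrt(x). Note: x == 0 yields NaN here (0 * inf),
   so a real implementation would need to special-case it. */
static float sqrt_nr(float x)
{
    return x * rsqrt_nr(x);
}
```

The packed variants (`_mm_rcp_ps`, `_mm_rsqrt_ps`, `_mm_mul_ps`, `_mm_sub_ps`) follow the same pattern for four values at once.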

:-)

s=x

Use geometrical raytracing and not conventional raytracing. It is much faster, and needs less precision.

Does SSE suck? Compared to what? AltiVec? yes ;)

i use geometrical raytracing:

on my mobile p4 2.8 i get a framerate of ~25 at a resolution of 640*200 with my unoptimized version:

www.lunatic-site.de/rayasm.exe

there is no lighting and no shadows visible because of the reflection (depth 10)

sse does not suck anymore :-) (compared to fpu precision)

Hi again LuNAtiC. Sorry, but your raytracer looks very slow (although it does look size-optimized). Do you know what I mean by geometrical raytracing? I got 15 fps fullscreen at 320x240 on a PII 233, with 16 spheres, some of them reflecting. I used only integer calculations. That is possible with geometrical raytracing, whereas 32-bit integers do not seem to be enough for conventional raytracing. By geometrical raytracing I mean using rotations to simplify the calculations and reduce the equations by one degree, so the sphere intersection is just a first-degree equation. Doing the rotations and the intersection with only one degree is faster to do the second degree equations, and you also don't need a sqrt, since you don't need to normalize the ray vectors.

About the reflection depth of 10: your scene looks as if it would be the same with a depth of 4-5, and I'm sure it would be about as fast if you used a depth of only 2-3, since the deeper reflections affect only a very small part of the screen.

Anyway, it looks beautiful. Why don't you try to do some better water? Perlin noise works very well for water distortion, or at least use more harmonics in the water; right now it looks too "sinusoidal".

I wrote "is faster to do the second degree equations" and I wanted to write "is faster than to do..."

i don't know how you can reduce the equation by one degree. can you give an example?

my rays aren't normalized. they are just direction vectors, and i solve for the scalar i need to multiply them by to get the entry point.

the sqrt is needed to get the entry/leaving point of the sphere. how can you resolve this duality in any other way than with a sqrt?

does it work with other objects as well? for example, the ellipsoid is very important for me to build complex objects with booleans.

the raytracer is not optimized at all, neither for speed nor for size, but it's fully written with nasm.

i'm trying to create a 4k for breakpoint 2003.

using apack it is just 2.8kb.

but refraction, booleans, octree and sound are not implemented yet.

the water is bumped by one animated texture (256*256). it must be tileable, that's why it looks a bit linear.

That raytracer example was very beautiful. I am really eager to watch both lunatic's and texel's upcoming demos when they are finally released!

to texel: I got your email, thanks for that. I like big emails and this one is the nicest and most interesting I've gotten in ages. I will reply to you a bit later, whenever I am free again, because my PC fucked up again, I have to finish that CPC demo and I am pretty busy these days..

uhhh.. i think once upon a time people actually used fixed point and table lookups for vector rotations etc.

i'm not sure why my mind brought that up. ignore.

i remember those times, too...

...but those times are past since pentium fpu performance

Hi again LuNAtiC. Geometrical raytracing uses the advantages of parallel raytracing. In parallel raytracing your ray vector is always (0,0,1), so it is very easy to check for intersections. So what you do is rotate the whole world so that the ray vector in the rotated world is (0,0,1). That can be very good if you have something that rotates very fast, as SSE should. It works very well with integers.

And about fpu performance: it is nowhere near as fast as full integer programming, I mean fixed-point math with variable precision and the best possible optimization. That way, and without mmx or that shit, you can get about 3 to 4 times the performance of using only the fpu, even on new pentiums or athlons. It is much harder to code this way... but it is very good. About sse: I suppose that if it accelerates fpu calculations, it will get about the same power as good integer code, but with more precision, obviously. But if you use mmx, then I think full integers with mmx will be the fastest.

As for why people don't use fixed point and the like any more, the reasons are obvious: accelerators mean you don't need to convert floats to integers (a very slow task), new computers are very powerful... if you need to rotate and translate 20,000 vertices, for example, it is not a problem at all... and things like that. But in any case, integer calculation is the fastest if you are doing software rendering.
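The rotation idea in this post (rotate the world so the ray becomes (0,0,1)) might look roughly like the C sketch below for the sphere hit test. This is an interpretation, not texel's actual code: `vec3`, `mat3`, `mat3_mul` and `sphere_hit_rotated` are invented names, the rotation matrix is assumed to be given, and floats are used instead of the fixed-point integers the post talks about.

```c
typedef struct { float x, y, z; } vec3;    /* hypothetical vector type */
typedef struct { float m[3][3]; } mat3;    /* hypothetical row-major rotation */

static vec3 mat3_mul(const mat3 *R, vec3 v)
{
    vec3 out = {
        R->m[0][0]*v.x + R->m[0][1]*v.y + R->m[0][2]*v.z,
        R->m[1][0]*v.x + R->m[1][1]*v.y + R->m[1][2]*v.z,
        R->m[2][0]*v.x + R->m[2][1]*v.y + R->m[2][2]*v.z
    };
    return out;
}

/* R is assumed to rotate the ray direction onto (0,0,1); o is the ray origin.
   In the rotated frame the ray is the +z axis, so the hit test reduces to a
   2D distance check on the rotated sphere center: no sqrt, no quadratic. */
static int sphere_hit_rotated(const mat3 *R, vec3 o, vec3 c, float r)
{
    vec3 rel = { c.x - o.x, c.y - o.y, c.z - o.z };
    vec3 p   = mat3_mul(R, rel);

    if (p.z <= 0.0f)
        return 0;   /* center behind the origin: treated as a miss in this sketch */

    return p.x*p.x + p.y*p.y <= r*r;
}
```

This covers only the hit/miss decision. The exact entry depth along the ray would still be p.z - sqrt(r*r - (p.x*p.x + p.y*p.y)), and the post does not say how that is handled.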

ok... i see, it is a nice method to render spheres, because the shape is the same from every side you look at it. but as i said, this seems to be fast just for spheres. calculating a simple orthogonal plane is much easier in a non-rotated space, because it is a simple division you have to calculate just once per ray. anyway, it is possible to calculate planes this way as well, but ellipsoids.... forget it.

it was a nice piece of work to develop the formula for the ellipsoid as it is, but rotated, i think it is beyond my imagination.

i think it is more complex then.

one vehicle i constructed from these simple objects (most of them just cut something out of others)

consists of:

9 planes

7 cylinders

4 ellipsoids

and just 3 spheres

it will not be worth it if this method only profits from spheres. maybe combining these 2 methods is optimal.

i will check this out when i've got some time to take pencil & paper and think about it.

...hmmm junggler maybe works this way... yeah it's really fast

this is probably a little basic for what you want to do, but there is some info on using SSE for raytracing here

lunatic, for cylinders it is very fast too: just calculate the point-to-line distance (the orthogonal distance) if the cylinder's axis vector is not (0,0,1); if it is, treat it just like a sphere. And it is better for metaballs too. Well, for ellipsoids it is good too... just use your geometry knowledge, it is just a projection! Well, I'm not sure whether it is faster for ellipsoids or not... but in any case, if we suppose that the rotations are nearly free (I mean, accelerated by sse, or by the high speed of integers, or in any other way), then the geometrical approach is always better. But when you are rendering a plane, the rotation may cost too much, so the geometrical process is not a good fit.
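One possible reading of the cylinder case, sketched in C: once the world is rotated so the ray is the z axis through the origin, the cylinder axis projects to a line in the xy plane and the ray projects to the point (0,0), so the test is a 2D point-to-line distance. This is an assumption about what is meant, not texel's code; `vec3` and `cylinder_hit_rotated` are invented names, and the cylinder is treated as infinitely long.

```c
typedef struct { float x, y, z; } vec3;    /* hypothetical vector type */

/* a: any point on the cylinder axis, u: axis direction (need not be unit
   length), both already expressed in the ray-aligned frame where the ray
   is the z axis through the origin. Infinite cylinder, no depth check. */
static int cylinder_hit_rotated(vec3 a, vec3 u, float r)
{
    float len2 = u.x*u.x + u.y*u.y;        /* squared length of the projected axis */

    if (len2 == 0.0f)                      /* axis parallel to the ray:          */
        return a.x*a.x + a.y*a.y <= r*r;   /* projects to a point, like a sphere */

    /* 2D cross product; |cross| / sqrt(len2) is the distance from (0,0)
       to the projected axis line. Comparing squares avoids the sqrt. */
    float cross = a.y*u.x - a.x*u.y;
    return cross*cross <= r*r*len2;
}
```

Comparing squared quantities keeps the test free of divisions and square roots, which fits the integer-only approach mentioned earlier in the thread.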

I'm doing something for a demo, with about 500 textured cubes and some spheres and reflection... you will get a look at it soon... maybe before the 20th of this month, if I have enough time to finish the demo.

i am looking forward to it

isn't the size of sse opcodes a disadvantage for writing a 4k intro?

just asking ;-)

Well nystep, if it takes one SSE instruction to perform something that would require 8 x86 instructions, I guess it's beneficial.

unlike mmx, you can use fpu and sse at the same time.

so it's not a problem to take the shorter opcodes if there are just single scalar multiplications.

i don't know, but maybe using both at once gets parallelized in the pipeline, so there is also an advantage.. i have to read more about that and check it out.

I worked out the Newton-Raphson iterations for 1/x and 1/sqrt(x) for someone:

http://board.win32asmcommunity.net/showthread.php?s=80adfb5722b8039539ad2b43f5132f4e&threadid=12094

They might be useful here, as SSE/3dnow! do the same... This is how you can refine the outcome of the approximations.
