Fast approximations
category: code [glöplog]
@xernobyl What's your platform?
If your exponent is an integer, it might be faster to do repeated multiplication, particularly if your exponent is fixed. For maximum speed, repeatedly square your parameter to get the a^(2^n) terms, then multiply them together according to the bit pattern of the exponent.
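Something like this, as a quick untested sketch:
Code:
//exponentiation by squaring for a non-negative integer exponent
float powi(float a, unsigned int n)
{
    float r = 1.0f;
    while (n) {
        if (n & 1) r *= a; //multiply this a^(2^k) in if the exponent bit is set
        a *= a;            //a, a^2, a^4, a^8, ...
        n >>= 1;
    }
    return r;
}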
For any exponent, you need to dig into the floating point format.
pow(a,b)=exp2(log2(a)*b)
log2(a) = log2(mantissa_a) + exponent_a
exp2(c) = exponent from the integer part of c (top bits), mantissa from the fractional part (bottom bits)
You can speed up log2 and exp2 by using tables or less accurate polynomial approximations for the mantissa<->logarithm conversions.
But this depends on how easy it is to access the floating point representation directly. You might lose a lot of speed there.
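For example, since the bits of a positive IEEE-754 float, read as an integer, are roughly 2^23 * (log2(x) + 127), a very crude pow can stay in integer land entirely. A sketch (assuming 32-bit floats/ints; only accurate to a couple of bits, since it treats the mantissa<->logarithm conversion as the identity):
Code:
#include <string.h>

float pow_fast(float a, float b)
{
    int i; float r;
    memcpy(&i, &a, 4);  //i ~ 2^23 * (log2(a) + 127) for a > 0
    //scale the logarithm by b, then re-apply the bias (1065353216 = 127 * 2^23)
    i = (int)(b * (float)(i - 1065353216) + 1065353216.0f);
    memcpy(&r, &i, 4);
    return r;
}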
sin(x) ~ x near 0 (the error is only third order: x^3/6)
I implemented an approximate sin when I made a demo with SSE.
My idea for how to make an approximate sin/cos function:
Approximate it with a polynomial:
f(x) = a0*x^0 + a1*x^1 + a2*x^2 + ... + an*x^n
Approximating the whole sin function with a single polynomial is impractical,
so approximate part of sin (e.g. x = [-PI, PI])
and mirror/copy it for other x.
There are many choices to make when you build an approximate function.
The first choice is n (a natural number).
The next choice is how to get (a0, a1, ... an).
Once you have n and (a0, a1, ... an), you have a function.
Larger n gives a more precise function but requires more calculation.
n = 2~3 would be enough for applications that don't require high precision.
You get (a0, a1, ... an) by setting up restrictions and solving the resulting system of equations.
Example restrictions:
f(0) = 0, f(PI/2) = 1
f'(0) = 1, f'(PI/2) = 0
f''(0) = 0, f''(PI/2) = -1
Choose them by considering how sin is used in your code.
sin can be used for synthesizing sound, drawing circles or lines, animation, rotation matrices, plasma, random number generation, etc.
Once you have enough restrictions to make a solvable system of equations,
solve it with pen and paper.
You get (a0, a1, ... an) and your own original approximated sin! (A worked instance is sketched below.)
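For instance, n = 3 with an odd polynomial (so f(-x) = -f(x) comes for free) and just the restrictions f(PI/2) = 1, f'(PI/2) = 0 solves to a1 = 3/PI and a3 = -4/PI^3 on paper (f'(0) then comes out as 3/PI ~ 0.955 instead of 1; that's the trade-off). A sketch of the result, valid on [-PI/2, PI/2]:
Code:
#include <math.h>

//f(x) = (3/pi)*x - (4/pi^3)*x^3, from f(0)=0, f(pi/2)=1, f'(pi/2)=0
float sin_cubic(float x)
{
    const float A = 3.0f / (float)M_PI;                  //~0.95493
    const float B = -4.0f / (float)(M_PI * M_PI * M_PI); //~-0.12901
    return x * (A + B * x * x);
}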
Quote:
hfr: that's, like, soooooo outdated info, man :)
Kusma, probably true. So what do you think is different now? (except that you don't use lookups anymore).
Dunno if all of these are faster, what the error is etc... Go try them.
Code:
//from: http://www.devmaster.net/forums/showthread.php?t=5784
inline float cosf(float x)
{
    //shift by pi/2 (cos(x) = sin(x + pi/2)) and rescale so one period spans 2 units
    x = (x + (float)M_PI_2) * (float)M_1_PI;
    //wrap into [-1,1]: 25165824.0f = 1.5 * 2^24, so z has a ulp of 2 and the
    //add/subtract rounds x to its nearest multiple of 2
    volatile float z = (x + 25165824.0f); //must be volatile else it will be optimized out
    x -= (z - 25165824.0f);
    //low precision approximation: return 4.0f * (x - x * fabs(x));
    //higher precision path
    float y = x - x * fabs(x);
    //next iteration for higher precision
    const float Q = 3.1f;
    const float P = 3.6f;
    return y * (Q + P * fabs(y));
}
Code:
//from: http://www.devmaster.net/forums/showthread.php?t=5784
inline float sinf(float x)
{
    //rescale so one period spans 2 units
    x *= (float)M_1_PI;
    //wrap into [-1,1]: 25165824.0f = 1.5 * 2^24, so z has a ulp of 2 and the
    //add/subtract rounds x to its nearest multiple of 2
    volatile float z = (x + 25165824.0f); //must be volatile else it will be optimized out
    x -= (z - 25165824.0f);
    //low precision approximation: return 4.0f * (x - x * fabs(x));
    //higher precision path
    float y = x - x * fabs(x);
    //next iteration for higher precision
    const float Q = 3.1f;
    const float P = 3.6f;
    return y * (Q + P * fabs(y));
}
Code:
inline float acosf(float x)
{
    //polynomial fit targets x in [0,1]; use acos(-x) = PI - acos(x) for negative x
    return sqrtf(1.0f-x) * (1.5707963267f + x*(-0.213300989f + x*(0.077980478f + x*-0.02164095f)));
}
Code:
inline float asinf(float x)
{
    //asin(x) = PI/2 - acos(x), so subtract the acos approximation from PI/2
    return (float)M_PI_2 - sqrtf(1.0f-x) * (1.5707963267f + x*(-0.213300989f + x*(0.077980478f + x*-0.02164095f)));
}
Code:
inline float invsqrtf(float x)
{
    //Quake III style: the magic constant gives a first guess, Newton-Raphson refines it
    float xhalf = 0.5f * x;
    int i = *(int*)&x; //type-punning via pointer cast; strictly use memcpy or a union
    i = 0x5f3759df - (i >> 1);
    x = *(float*)&i;
    x *= (1.5f - xhalf * x * x);
    //x *= (1.5f - xhalf * x * x); repeat Newton-Raphson step for higher precision
    return x;
}
Code:
inline float fabs(float x)
{
    int f = ((*(int *)(&x)) & 0x7FFFFFFF); //mask off the sign bit
    return *(float *)(&f);
}
Code:
//from: http://ldesoras.free.fr/doc/articles/rounding_en.pdf
inline float ceilf(float x)
{
    //trick: -(rn(-2x - 0.5) >> 1) equals ceil(x) under the default round-to-nearest mode
    static float roundTowardsPI = -0.5f; //"PI" = plus infinity; use 0.5f to round to nearest integer instead
    int i;
    __asm {
        fld x
        fadd st, st
        fsubr roundTowardsPI
        fistp i
        sar i, 1
        neg i
    }
    return (float)i;
}
Code:
//from: http://ldesoras.free.fr/doc/articles/rounding_en.pdf
inline float floorf(float x)
{
    //trick: rn(2x - 0.5) >> 1 equals floor(x) under the default round-to-nearest mode
    static float roundTowardsMI = -0.5f; //"MI" = minus infinity; use 0.5f to round to nearest integer instead
    int i;
    __asm {
        fld x
        fadd st, st
        fadd roundTowardsMI
        fistp i
        sar i, 1
    }
    return (float)i;
}
Code:
//from: http://ldesoras.free.fr/doc/articles/rounding_en.pdf
inline float truncf(float x)
{
#if defined(USE_SSE3) || defined(USE_SSE4)
    int i; //fisttp stores an integer, so the destination must be an int
    __asm {
        fld x
        fisttp i //SSE3: store with truncation regardless of rounding mode
    }
    return (float)i;
#else
    static float roundTowardsMI = -0.5f; //"MI" = minus infinity; use 0.5f to round to nearest integer
    int i;
    __asm {
        fld x
        fadd st, st
        fabs //work on |x|: trunc(x) = sign(x) * floor(|x|)
        fadd roundTowardsMI
        fistp i
        sar i, 1
    }
    return (float)(x < 0 ? -i : i);
#endif
}
Code:
//From: http://www.iquilezles.org/www/articles/sfrand/sfrand.htm
float randf(int * seed)
{
    float res;
    seed[0] *= 16807; //cheap LCG step
    //put the top random bits into the mantissa of a float in [2,4)
    *((unsigned int *) &res) = ( ((unsigned int)seed[0])>>9 ) | 0x40000000;
    return(res-3.0f); //shift to [-1,1)
}
plus some more stuff here
@xernobyl/GoingDigital: On today's GPUs there is native exp2 and log2 support internally, so pow is basically for "free" if you have enough stuff happening around that code..
@T21: in your cbrt, why do you use an approximation for dividing by 3? just divide by 3 and then look at the generated asm.. pooof.. it vanished.. :)
@toxie Division is much more expensive than multiply on some platforms. Not sure what T21's on, but for the Raspberry Pi GPU here are rough timings from experiments so far:
multiply/add/subtract : 0.5/1 cycle
exp/log: 10 cycles (definitely not free)
division/sqrt: 15 cycles ouch!
trig: 15-30 cycles
Big speedups to be had by multiplying reciprocals rather than dividing.
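i.e. something like this (trivial sketch, and note hfr's caveat below: the result is not bit-identical to a real divide):
Code:
//one expensive divide plus n cheap multiplies instead of n divides
void scale_all(float *v, int n, float d)
{
    float inv = 1.0f / d; //pay for the divide once
    for (int i = 0; i < n; i++)
        v[i] *= inv;
}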
To me "nowadays GPUs" has an implicit "high-end" somewhere in there :)
nowadays GPUs includes 4 year old cards (and i think even older than that), so.. ;)
@GoingDigital: (constant) divides can be transformed into multiplies (and should be by every smart compiler nowadays), see http://www.hackersdelight.org/magic.htm
Quote:
divides can be transformed into multiplies (and should be by every smart compiler nowadays)
Sure, but it's numerically different and must be enabled in the compiler settings.
The code in question approximates a 32bit integer division, though (which requires a 64bit multiply).
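For reference, here is what the unsigned divide-by-3 case looks like after that transformation (a sketch; the magic constant is the standard one from Hacker's Delight):
Code:
#include <stdint.h>

//n / 3 for any uint32 n: a 64-bit multiply plus a shift, no divide
uint32_t div3(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xAAAAAAABu) >> 33);
}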
hfr: fragment shaders do more than just multiply-add these days, so no series expansions any more. For instance, all modern NV shader units do trig as two-instruction pairs (presin/sin etc).
Nowadays platforms are a mix of desktop and mobile, so the Raspberry Pi GPU actually IS pretty much a current-gen GPU. If you're working on such platforms (I am, so I'm very interested in this.. although I've already optimised out all the trig in all my current stuff as it happens) it's definitely relevant :)
I'll have to benchmark the current ios devices for this stuff at some point, it'd definitely help with optimising.
toxie: it doesn't just vanish, but it does look better :)
But I don't think the same mul-shift trick can be done with SSE?
We would need a _mm_mulhi_epi32(), and that doesn't exist (though see the sketch below).
(Ultimately, all of these would become SSE versions working on vec4f.)
I'm looking to collect a set of math functions that give me GPU-level precision, as fast as possible on the CPU.
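(Re the missing intrinsic: it can be pieced together from _mm_mul_epu32. A sketch for the unsigned case, SSE2 only; a signed version needs extra correction terms:)
Code:
#include <emmintrin.h>

//high 32 bits of the four 32x32 -> 64 bit unsigned products
static __m128i mulhi_epu32(__m128i a, __m128i b)
{
    __m128i p02 = _mm_srli_epi64(_mm_mul_epu32(a, b), 32);        //products of lanes 0,2
    __m128i p13 = _mm_srli_epi64(_mm_mul_epu32(_mm_srli_si128(a, 4),
                                               _mm_srli_si128(b, 4)), 32); //lanes 1,3
    __m128i lo = _mm_shuffle_epi32(p02, _MM_SHUFFLE(3, 1, 2, 0)); //h0 h2 . .
    __m128i hi = _mm_shuffle_epi32(p13, _MM_SHUFFLE(3, 1, 2, 0)); //h1 h3 . .
    return _mm_unpacklo_epi32(lo, hi);                            //h0 h1 h2 h3
}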
The first link doesn't seem to even tackle sin/cos, and does this for rsqrt(): div(sqrt(x)).
This breaks my goal of "GPU-level precision, as fast as possible on the CPU".
There is no shortage of crazy stupid approximations, as well as of well-written high-precision ones.
The trick is collecting the SOTA functions scattered all over the place.
Look at the thread on sin() approximation.. conclusion: "don't use a Newton-Raphson series". Yet what does 99% of the code on Google do?
I think I found two so far, a sin() and a cbrt().
My next one will be pow() (via exp2/log2).
Plenty of references, no shortage of that. Well, thanks for even more links :)
My point again: it's almost 2013, I wish this would have been settled by now.
Quote:
The first link doesn't seem to even tackle sin/cos, and does this for rsqrt(): div(sqrt(x))
This breaks my goal of "GPU-level precision, as fast as possible on the CPU"
It is actually ~50% of your goal.
Quote:
I think I found two so far, a sin() and a cbrt()
My next one will be pow() (via exp2/log2)
You've found cos(), pow(), and invsqrt() too...
Please do post a link to your stuff when you're done, for a definitive answer for everybody...
or "quote"... :/
Also take a look at this PDF.
kusma: thanks, that's a useful doc (the SGX one, no need for the others yet) :)
Actually, I should have thought of this before: imgtec's shader editor gives cycle counts per line. With a little test program (compiler set to SGX543):
Low precision:
Code:
sin: 14 cycles
cos: 15 cycles
tan: 18 cycles
asin: ? (probably 22 - the line count is missing, but 22 cycles get added at the end)
acos: ? (probably 22 cycles)
atan: ? (probably 32 cycles)
pow: 3 cycles
exp: 3 cycles
MADD: 1 cycle
divide: 2 cycles
min, max: 1 cycle
Medium or high precision:
Code:
sin: 10 cycles
cos: 11 cycles
tan: 13 cycles
asin: ? (probably 22 cycles)
acos: ? (probably 21 cycles)
atan: ? (probably 33 cycles)
pow: 3 cycles
exp: 3 cycles
MADD: 1 cycle
divide: 2 cycles
min, max: 1 cycle
Medium / high precision looks faster, but I suspect it's because the GPU is combining processing units - so you get higher shader throughput, but fewer shader units doing the work, making it slower overall. However, I'll test that to make sure at some point - faster sin/cos could be useful.
On the SGX 535 the results were pretty similar, sin/cos a bit slower (~17), asin/acos slower at ~30.
So really it's only the trig stuff that's slow, and it's probably worth optimising out where possible on these GPUs.
raer, so you see my point :) The info is out there, but all over the place. Hence the thread.
From your link (I do read them when I have the time), this was proposed to be more accurate than what you posted:
Code:
#include <stdint.h>

float inv_sqrt(float x)
{
    union { float f; uint32_t u; } y = {x};
    y.u = 0x5F1FFFF9ul - (y.u >> 1);
    return 0.703952253f * y.f * (2.38924456f - x * y.f * y.f);
}
max rel_error
0.001752 original 0x5F3759DF
0.000650 Řrřola 0x5F1FFFF9
0.000366 rsqrtss
abs alternative (only one mul !)
Code:
float abs(float x)
{
return sqrt(x*x);
}
FAST!11!!!!ONE1!
Somehow related http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0363d/CJAJIDFD.html
the table at the bottom might be useful for some.