Fast approximations
category: code [glöplog]
I'm looking for the ultimate collection of trig and math functions to run on the CPU that are at least as accurate as the GPU versions (medium quality profile, gaming card).
nvidia posted a bunch of reference functions, but I'm not sure they are the best:
http://http.developer.nvidia.com/Cg/cos.html
Looking around, I haven't found a single shrine to make sacrifices to the math hacker gods.
I have to check again, but I think Sony might have the most complete set.
Here are 3 I have so far... I would be interested to see better or more useful approximations, and to get reference sources.
I think the sin() one is from "Nick", not sure about the other two.
Code:
#include <math.h>   // fabsf

const float pi_f = 3.14159274f;

__inline float _cbrt(float x) {
    float y;
    int i = *(int *)&x;                  // View x as an int (type-punning).
    i = i/4 + i/16;                      // Approximate divide by 3.
    i = i + i/16;
    i = i + i/256;
    i = 0x2a5137a0 + i;                  // Initial guess.
    y = *(float *)&i;                    // View i as float.
    y = 0.33333333f*(2.0f*y + x/(y*y));  // Newton step.
    y = 0.33333333f*(2.0f*y + x/(y*y));  // Newton step again.
    return y;
}

// Parabolic sine approximation; input must be in [-pi, pi].
__inline float _sin(float x) {
    const float B = 4.0f/pi_f;
    const float C = -4.0f/(pi_f*pi_f);
    const float P = 0.225f;
    float y = B * x + C * x * fabsf(x);  // fabsf, not abs: abs() truncates to int.
    y = P * (y * fabsf(y) - y) + y;      // Extra precision step.
    return y;
}

// acos via the Taylor series of asin: acos(n) = pi/2 - asin(n).
// Accuracy degrades as |n| approaches 1.
__inline float _acos(float n) {
    const float h = 1.0f / 6.0f;
    const float j = 3.0f / 40.0f;
    const float k = 5.0f / 112.0f;
    const float l = 35.0f / 1152.0f;
    float u2 = n * n;
    float u3 = u2 * n;
    float u5 = u3 * u2;
    float u7 = u5 * u2;
    float u9 = u7 * u2;
    return pi_f/2 - n - h * u3 - j * u5 - k * u7 - l * u9;
}
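As a quick sanity check (my own snippet, not from the original post, assuming the three functions above are in scope), a tiny harness that compares them against libm:
Code:
#include <stdio.h>
#include <math.h>

int main(void)
{
    // Maximum absolute error of each approximation over a sample grid.
    float e_cbrt = 0.0f, e_sin = 0.0f, e_acos = 0.0f;
    for (int i = 0; i <= 1000; i++) {
        float t = i / 1000.0f;   // t in [0, 1]
        e_cbrt = fmaxf(e_cbrt, fabsf(_cbrt(t * 8.0f) - cbrtf(t * 8.0f)));
        e_sin  = fmaxf(e_sin,  fabsf(_sin((2*t - 1) * pi_f) - sinf((2*t - 1) * pi_f)));
        e_acos = fmaxf(e_acos, fabsf(_acos(2*t - 1) - acosf(2*t - 1)));
    }
    printf("max err: cbrt %g  sin %g  acos %g\n", e_cbrt, e_sin, e_acos);
    return 0;
}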
Yes, the sin is from Nick; there is even an improved quality version. Here's the original thread: http://www.devmaster.net/forums/showthread.php?t=5784
Also nicely works in fixed point :)
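For reference, a minimal Q15 fixed-point sketch of the same parabola (my own translation, not from the thread; the constant 7373 is P = 0.225 in Q15):
Code:
#include <stdint.h>

// Input: angle as a Q15 fraction of pi, i.e. [-32768, 32767] maps to [-pi, pi).
// Output: sine in Q15 (can reach 32768 at the peak, so keep it in int32_t).
int32_t sin_q15(int32_t x)
{
    int32_t ax = x < 0 ? -x : x;                         // |x|
    int32_t y  = (x * (32768 - ax)) >> 13;               // 4*x*(1 - |x|), Q15
    int32_t ay = y < 0 ? -y : y;                         // |y|
    return ((7373 * (((y * ay) >> 15) - y)) >> 15) + y;  // y = P*(y*|y| - y) + y
}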
for my shaders I've used a degree-5 polynomial for sine; it's enough to give my shaders both sine and cosine.
Why? I was under the impression the GPU can do sin() and cos() ...
If you have proper GLSL code replacements for sin / cos / acos / atan etc. with appropriate accuracy + faster than using the intrinsics - please share those functions :)
it was for vs_1_1 coding with my own HLL, but it may not be faster, and it needs another constant for the modulo into [-pi, pi]
here is a fast "approximation" of cos() :D
Code:
// GLSL-ish; a triangle wave, exact only at multiples of pi/2
float cos_fast(float x)
{
    return abs(mod(abs(x) / PI * 2.0, 4.0) - 2.0) - 1.0;
}
ouch.
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ...
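For instance (my sketch, not from the thread), the truncated series evaluates nicely with Horner's rule, which is essentially the chain of multiply-adds a shader compiler would emit:
Code:
// sin(x) via the first four Taylor terms, Horner form.
// Only accurate near 0; range-reduce to [-pi, pi] first.
float sin_taylor(float x)
{
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f/6.0f + x2 * (1.0f/120.0f - x2 * (1.0f/5040.0f))));
}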
Tigrou: you might wanna &3 instead of %4, yes?
every compiler knows that trick, no need to hardcode it.
MsK`: I guess my msvc compiler settings are all wrong then, since I have to hard code this in every 4k/1k intro :)
Code:
unsigned int D = GetTickCount();
if ( D % 8 == 0 )
printf("lulz");
if ( (D & 3) == 0 )
printf("lulz");
Code:
unsigned int D = GetTickCount();
00F1A993 call dword ptr [__imp__GetTickCount@0 (0F2105Ch)]
if ( D % 8 == 0 )
00F1A999 mov edi,dword ptr [__imp__printf (0F21270h)]
00F1A99F mov dword ptr [ebp-58h],eax
00F1A9A2 test al,7
00F1A9A4 jne HierarchicalMarchingCubes::Recurse+63h (0F1A9B3h)
printf("lulz");
00F1A9A6 push offset string "lulz" (0F65930h)
00F1A9AB call edi
00F1A9AD mov eax,dword ptr [ebp-58h]
00F1A9B0 add esp,4
if ( (D & 3) == 0 )
00F1A9B3 test al,3
00F1A9B5 jne HierarchicalMarchingCubes::Recurse+71h (0F1A9C1h)
printf("lulz");
00F1A9B7 push offset string "lulz" (0F65930h)
00F1A9BC call edi
I guess so :/
Around page 73 of this document there is a nice table of which optimizations the different compilers perform:
http://www.agner.org/optimize/optimizing_cpp.pdf
Out of interest, how is sin/cos/etc implemented on the GPU? Does the compiler break it down into something else? And what's the performance cost (in general that is).
psonice: since the ALU is only able to do multiply-add, all trigonometric functions are replaced by Taylor approximations at compile time. In the times of shader model 2 this let the shader code hit the maximum instruction count significantly earlier, and table lookups were an order of magnitude faster.
If you have a rough idea of how much precision your operations actually require, it still makes perfect sense to handle this manually.
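As an illustration of the lookup-table route (my own sketch, not from the thread): a small table plus linear interpolation was exactly the kind of thing that beat a long polynomial on that class of hardware.
Code:
#include <math.h>

#define LUT_SIZE 256
static float sin_lut[LUT_SIZE + 1];   // one extra entry so i+1 never wraps

void sin_lut_init(void)               // call once before sin_table()
{
    for (int i = 0; i <= LUT_SIZE; i++)
        sin_lut[i] = sinf(i * (2.0f * 3.14159265f / LUT_SIZE));
}

// sin via table + linear interpolation; x in radians, any range.
float sin_table(float x)
{
    float t = x * (LUT_SIZE / (2.0f * 3.14159265f));
    t -= floorf(t / LUT_SIZE) * LUT_SIZE;   // wrap into [0, LUT_SIZE)
    int   i = (int)t;
    float f = t - i;
    return sin_lut[i] + f * (sin_lut[i + 1] - sin_lut[i]);
}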
I was always told that modern GPUs do sin and cos in only one cycle, well, since the Xbox 360/PS3 era. Can they execute several MADDs in one cycle?
Nick's latest posted version.
A version can be made without the wrap if the input is known to be in [-PI, PI] (see the sketch after the code).
A version can be made without the normalizing if the input is already in [-1, 1].
From what I see this might be the uncontested sin/cos function for CPUs that require 'GPU class' precision?
If we all agree, let's move to the next one... pow()?
I'm especially interested in pow(x, CONSTANT), mainly for gamma processing.
Code:
float sin(float x)
{
    const float Q = 3.1f;
    const float P = 3.6f;
    // Convert the input value to a range of -1 to 1 (fractions of pi).
    x = x * (1.0f / PI);
    // Wrap around: adding 1.5 * 2^24 (where float spacing is exactly 2)
    // rounds to the nearest multiple of 2; subtracting it back leaves the
    // remainder in [-1, 1]. 'volatile' keeps the compiler from folding it.
    volatile float z = (x + 25165824.0f);
    x = x - (z - 25165824.0f);
    float y = x - x * fabsf(x);     // parabola
    return y * (Q + P * fabsf(y));  // precision-improving step
}
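For completeness, the no-wrap variant mentioned above would just drop the rounding trick (my reading of the post, assuming the input is already in [-PI, PI]):
Code:
// Same polynomial, wrap removed; input restricted to [-PI, PI].
float sin_nowrap(float x)
{
    const float Q = 3.1f;
    const float P = 3.6f;
    x = x * (1.0f / PI);            // normalize to [-1, 1]
    float y = x - x * fabsf(x);
    return y * (Q + P * fabsf(y));
}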
hfr: that's, like, soooooo outdated info, man :)
Here is the SSE2 version of the cube root.
I think this matches GPU shader accuracy?
Code:
const __m128 sign = _mm_castsi128_ps(_mm_set1_epi32(0x80000000));
const __m128 oneThird = _mm_set1_ps(0.33333333f);
const __m128i magic = _mm_set1_epi32(0x2a5137a0);

__inline __m128 _cbrt(__m128 x)
{
    __m128 y, s;
    __m128i i;
    s = _mm_and_ps(x, sign);   // sign(x)
    x = _mm_xor_ps(x, s);      // abs(x)
    // Initial guess: integer approximation of bits/3 plus magic offset
    i = _mm_castps_si128(x);
    i = _mm_add_epi32(_mm_srai_epi32(i,2), _mm_srai_epi32(i,4));
    i = _mm_add_epi32(i, _mm_srai_epi32(i,4));
    i = _mm_add_epi32(i, _mm_srai_epi32(i,8));
    i = _mm_add_epi32(i, magic);
    y = _mm_castsi128_ps(i);
    // Two Newton iterations: y = (2y + x/y^2) / 3
    y = _mm_mul_ps(oneThird, _mm_add_ps(_mm_add_ps(y,y), _mm_div_ps(x, _mm_mul_ps(y,y))));
    y = _mm_mul_ps(oneThird, _mm_add_ps(_mm_add_ps(y,y), _mm_div_ps(x, _mm_mul_ps(y,y))));
    // Restore sign
    y = _mm_or_ps(y, s);
    return y;
}
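A usage sketch (mine, not from the post; just loads four floats, runs the vector cube root, and prints the results):
Code:
#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    float in[4] = { 8.0f, -27.0f, 2.0f, 1000.0f };
    float out[4];
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _cbrt(v));
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);  // ~2, -3, 1.26, 10
    return 0;
}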
BTW, I almost feel lucky that Intel is not making us do some bit manipulation to do a subtract...
"You don't need a sub instruction, you can negate the sign and do an add"
"We gave you sub... you want a HW version of sincos??? Who do you think we are, GPU designers!"
me: not happy wasting time re-inventing the wheel.
"You dont need a sub instruction, you can negate the sign and do an add"
"We gave have you sub... you want a HW version of sincos ??? who do you think we are GPU designers! "
me < no happy wasting time re-inventing the wheel.
This is much faster than the intrinsic on some lowly GPUs (especially VideoCore IV):
Code:
float sinf(float x){
    x *= 0.159155;           // 1/(2*pi): radians -> turns
    x -= floor(x);           // fractional part, [0, 1)
    float xx = x*x;
    float y = -6.87897;      // degree-5 polynomial in x^2, Horner form
    y = y*xx + 33.7755;
    y = y*xx - 72.5257;
    y = y*xx + 80.5874;
    y = y*xx - 41.2408;
    y = y*xx + 6.28077;
    return x*y;
}

float cosf(float x){
    return sinf(x + 1.5708); // cos(x) = sin(x + pi/2)
}
And here's the Python to calculate the coefficients, so you can try higher/lower orders:
Code:
from numpy import *

order=5
x=arange(0.00015,1,0.0001)
y=sin(2*pi*sqrt(x))/sqrt(x)
w=sqrt(x)/sin(sqrt(x))
coeffs=polyfit(x,y,deg=order,w=w)
poly=poly1d(coeffs)
print """
float sinf(float x)
{
    x*="""+'%.6g'%(.5/pi)+""";
    x-=floor(x);
    float xx=x*x;"""
print "    float y="+'%.6g'%poly[order]+";"
for i in range(order):
    print "    y=y*xx+"+'%.6g'%poly[order-i-1]+";"
print """    return x*y;
}

float cosf(float x) {
    return sinf(x+"""+'%.6g'%(pi/2)+""");
}
"""
This is based on the Sony paper
@psonice: To view some of the internal intrinsic implementations on VideoCore IV, do a 'strings' on https://github.com/raspberrypi/firmware/blob/master/boot/start.elf and look near the end. Trig functions are all implemented in plain GLSL.
sin/cos/tan use a minimax polynomial, followed by squaring the (cos, sin) vector twice to multiply the angle by 4, then correcting the vector length back to 1.
asin/acos/atan use a CORDIC-like method: keep squaring the vector and watch which quadrant it moves to.
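To illustrate the angle-quadrupling idea (my own sketch, not the firmware's actual code; I use short Taylor polynomials where the firmware uses minimax ones): treat (cos, sin) as a complex number and square it twice, since (c + i*s)^2 = (c^2 - s^2) + i*2cs doubles the angle.
Code:
// Evaluate sin/cos at x/4 with cheap polynomials, then double the angle twice.
void sincos4(float x, float *s, float *c)
{
    float h  = x * 0.25f;   // quarter angle, small enough for a short poly
    float h2 = h * h;
    float sh = h * (1.0f - h2 * (1.0f/6.0f - h2 * (1.0f/120.0f)));  // ~sin(h)
    float ch = 1.0f - h2 * (0.5f - h2 * (1.0f/24.0f));              // ~cos(h)
    for (int i = 0; i < 2; i++) {   // complex squaring doubles the angle
        float c2 = ch*ch - sh*sh;
        float s2 = 2.0f*ch*sh;
        ch = c2; sh = s2;
    }
    *s = sh; *c = ch;
}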
CAN I HAS FAST POW?
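Since nobody posted one yet, here is the classic bit-trick sketch (my own, not from the thread): a float's bit pattern is roughly a fixed-point log2 of its value, so pow becomes a multiply in that domain. Accuracy is only a few percent, so treat it as a starting point for gamma-style pow(x, CONSTANT), not a drop-in replacement.
Code:
#include <stdint.h>

// pow(x, p) for x > 0, via the approximate identities
//   log2(x) ~ bits(x)/2^23 - 127   and   exp2(e) ~ bits_inv((e + 127)*2^23)
float pow_fast(float x, float p)
{
    union { float f; uint32_t i; } u;
    u.f = x;
    float lg = (float)u.i * (1.0f / 8388608.0f) - 127.0f;  // ~log2(x)
    float e  = p * lg;                                     // log2(x^p)
    u.i = (uint32_t)((e + 127.0f) * 8388608.0f);           // ~exp2(e)
    return u.f;
}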