Fast approximations
category: code [glöplog]
I'm looking for the ultimate collection of trig and math functions to run on the CPU that are at least as accurate as the GPU versions (medium quality profile, gaming card).
nvidia posted a bunch of reference functions, but I'm not sure they are the best:
http://http.developer.nvidia.com/Cg/cos.html
Looking around, I haven't found a single shrine to make sacrifices to the math hacker gods.
I have to check again, but I think Sony might have the most complete set.
Here are 3 I have so far... I would be interested to see better or more useful approximations, and to get reference sources.
I think the sin() one is from "Nick", not sure about the other two.
Code:
#include <math.h>   // fabsf

const float pi_f = 3.14159274f;

__inline float _cbrt(float x) {
    float y;
    int i = *(int *)&x;                  // View x as an int (type-punning).
    i = i/4 + i/16;                      // Approximate divide by 3.
    i = i + i/16;
    i = i + i/256;
    i = 0x2a5137a0 + i;                  // Initial guess.
    y = *(float *)&i;                    // View i as float.
    y = 0.33333333f*(2.0f*y + x/(y*y));  // Newton step.
    y = 0.33333333f*(2.0f*y + x/(y*y));  // Newton step again.
    return y;
}

// Parabolic sine approximation; input must be in [-pi, pi].
__inline float _sin(float x) {
    const float B = 4.0f/pi_f;
    const float C = -4.0f/(pi_f*pi_f);
    const float P = 0.225f;
    float y = B * x + C * x * fabsf(x);  // fabsf, not abs: abs() truncates to int.
    y = P * (y * fabsf(y) - y) + y;      // Extra precision step.
    return y;
}

// acos via the Taylor series of asin: acos(n) = pi/2 - asin(n).
// Accuracy degrades as |n| approaches 1.
__inline float _acos(float n) {
    const float h = 1.0f / 6.0f;
    const float j = 3.0f / 40.0f;
    const float k = 5.0f / 112.0f;
    const float l = 35.0f / 1152.0f;
    float u2 = n * n;
    float u3 = u2 * n;
    float u5 = u3 * u2;
    float u7 = u5 * u2;
    float u9 = u7 * u2;
    return pi_f/2 - n - h * u3 - j * u5 - k * u7 - l * u9;
}
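As a quick sanity check (my own snippet, not from the original post, assuming the three functions above are in scope), a tiny harness that compares them against libm:
Code:
#include <stdio.h>
#include <math.h>

int main(void)
{
    // Maximum absolute error of each approximation over a sample grid.
    float e_cbrt = 0.0f, e_sin = 0.0f, e_acos = 0.0f;
    for (int i = 0; i <= 1000; i++) {
        float t = i / 1000.0f;   // t in [0, 1]
        e_cbrt = fmaxf(e_cbrt, fabsf(_cbrt(t * 8.0f) - cbrtf(t * 8.0f)));
        e_sin  = fmaxf(e_sin,  fabsf(_sin((2*t - 1) * pi_f) - sinf((2*t - 1) * pi_f)));
        e_acos = fmaxf(e_acos, fabsf(_acos(2*t - 1) - acosf(2*t - 1)));
    }
    printf("max err: cbrt %g  sin %g  acos %g\n", e_cbrt, e_sin, e_acos);
    return 0;
}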
Yes, the sin is from Nick; there is even an improved quality version. Here's the original thread: http://www.devmaster.net/forums/showthread.php?t=5784
Also nicely works in fixed point :)
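For reference, a minimal Q15 fixed-point sketch of the same parabola (my own translation, not from the thread; the constant 7373 is P = 0.225 in Q15):
Code:
#include <stdint.h>

// Input: angle as a Q15 fraction of pi, i.e. [-32768, 32767] maps to [-pi, pi).
// Output: sine in Q15 (can reach 32768 at the peak, so keep it in int32_t).
int32_t sin_q15(int32_t x)
{
    int32_t ax = x < 0 ? -x : x;                         // |x|
    int32_t y  = (x * (32768 - ax)) >> 13;               // 4*x*(1 - |x|), Q15
    int32_t ay = y < 0 ? -y : y;                         // |y|
    return ((7373 * (((y * ay) >> 15) - y)) >> 15) + y;  // y = P*(y*|y| - y) + y
}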
for my shaders I've used a degree-5 polynomial for sine; it's enough to give my shaders both sine and cosine.
Why? I was under the impression the GPU can do sin() and cos() ...
If you have proper GLSL code replacements for sin / cos / acos / atan etc. with appropriate accuracy + faster than using the intrinsics - please share those functions :)
it was for vs_1_1 coding with my own HLL, but it may not be faster, and it needs another constant for the modulo into [-pi, pi]
here is a fast "approximation" of cos() :D
Code:
// GLSL-ish; a triangle wave, exact only at multiples of pi/2
float cos_fast(float x)
{
    return abs(mod(abs(x) / PI * 2.0, 4.0) - 2.0) - 1.0;
}
ouch.
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ...
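For instance (my sketch, not from the thread), the truncated series evaluates nicely with Horner's rule, which is essentially the chain of multiply-adds a shader compiler would emit:
Code:
// sin(x) via the first four Taylor terms, Horner form.
// Only accurate near 0; range-reduce to [-pi, pi] first.
float sin_taylor(float x)
{
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f/6.0f + x2 * (1.0f/120.0f - x2 * (1.0f/5040.0f))));
}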
Tigrou: you might wanna &3 instead of %4, yes?
every compiler knows that trick, no need to hardcode it.
MsK`: I guess my msvc compiler settings are all wrong then, since I have to hard code this in every 4k/1k intro :)
Code:
unsigned int D = GetTickCount();
if ( D % 8 == 0 )
printf("lulz");
if ( (D & 3) == 0 )
printf("lulz");
Code:
unsigned int D = GetTickCount();
00F1A993 call dword ptr [__imp__GetTickCount@0 (0F2105Ch)]
if ( D % 8 == 0 )
00F1A999 mov edi,dword ptr [__imp__printf (0F21270h)]
00F1A99F mov dword ptr [ebp-58h],eax
00F1A9A2 test al,7
00F1A9A4 jne HierarchicalMarchingCubes::Recurse+63h (0F1A9B3h)
printf("lulz");
00F1A9A6 push offset string "lulz" (0F65930h)
00F1A9AB call edi
00F1A9AD mov eax,dword ptr [ebp-58h]
00F1A9B0 add esp,4
if ( (D & 3) == 0 )
00F1A9B3 test al,3
00F1A9B5 jne HierarchicalMarchingCubes::Recurse+71h (0F1A9C1h)
printf("lulz");
00F1A9B7 push offset string "lulz" (0F65930h)
00F1A9BC call edi
I guess so :/
Around page 73 of this document there is a nice table of which optimizations the different compilers perform:
http://www.agner.org/optimize/optimizing_cpp.pdf
Out of interest, how is sin/cos/etc implemented on the GPU? Does the compiler break it down into something else? And what's the performance cost (in general that is).
psonice: since the ALU is only able to do multiply-add, all trigonometric functions are replaced by Taylor approximations at compile time. In the times of shader model 2 this let the shader code hit the maximum instruction count significantly earlier, and table lookups were an order of magnitude faster.
If you have a rough idea of how much precision your operations actually require, it still makes perfect sense to handle this manually.
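As an illustration of the lookup-table route (my own sketch, not from the thread): a small table plus linear interpolation was exactly the kind of thing that beat a long polynomial on that class of hardware.
Code:
#include <math.h>

#define LUT_SIZE 256
static float sin_lut[LUT_SIZE + 1];   // one extra entry so i+1 never wraps

void sin_lut_init(void)               // call once before sin_table()
{
    for (int i = 0; i <= LUT_SIZE; i++)
        sin_lut[i] = sinf(i * (2.0f * 3.14159265f / LUT_SIZE));
}

// sin via table + linear interpolation; x in radians, any range.
float sin_table(float x)
{
    float t = x * (LUT_SIZE / (2.0f * 3.14159265f));
    t -= floorf(t / LUT_SIZE) * LUT_SIZE;   // wrap into [0, LUT_SIZE)
    int   i = (int)t;
    float f = t - i;
    return sin_lut[i] + f * (sin_lut[i + 1] - sin_lut[i]);
}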
I was always told that modern GPUs do sin and cos in only one cycle, well, since the Xbox 360/PS3 era. Can they execute several MADDs in one cycle?
Nick's latest posted version.
A version can be made without the wrap if the input is known to be in [-PI, PI] (see the sketch after the code).
A version can be made without the normalizing if the input is already in [-1, 1].
From what I see this might be the uncontested sin/cos function for CPUs that require 'GPU class' precision?
If we all agree, let's move to the next one... pow()?
I'm especially interested in pow(x, CONSTANT), mainly for gamma processing.
Code:
float sin(float x)
{
    const float Q = 3.1f;
    const float P = 3.6f;
    // Convert the input value to a range of -1 to 1 (fractions of pi).
    x = x * (1.0f / PI);
    // Wrap around: adding 1.5 * 2^24 (where float spacing is exactly 2)
    // rounds to the nearest multiple of 2; subtracting it back leaves the
    // remainder in [-1, 1]. 'volatile' keeps the compiler from folding it.
    volatile float z = (x + 25165824.0f);
    x = x - (z - 25165824.0f);
    float y = x - x * fabsf(x);     // parabola
    return y * (Q + P * fabsf(y));  // precision-improving step
}
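For completeness, the no-wrap variant mentioned above would just drop the rounding trick (my reading of the post, assuming the input is already in [-PI, PI]):
Code:
// Same polynomial, wrap removed; input restricted to [-PI, PI].
float sin_nowrap(float x)
{
    const float Q = 3.1f;
    const float P = 3.6f;
    x = x * (1.0f / PI);            // normalize to [-1, 1]
    float y = x - x * fabsf(x);
    return y * (Q + P * fabsf(y));
}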
hfr: that's, like, soooooo outdated info, man :)
Here is the SSE2 version of the cube root.
I think this matches GPU shader accuracy?
Code:
const __m128 sign = _mm_castsi128_ps(_mm_set1_epi32(0x80000000));
const __m128 oneThird = _mm_set1_ps(0.33333333f);
const __m128i magic = _mm_set1_epi32(0x2a5137a0);

__inline __m128 _cbrt(__m128 x)
{
    __m128 y, s;
    __m128i i;
    s = _mm_and_ps(x, sign);   // sign(x)
    x = _mm_xor_ps(x, s);      // abs(x)
    // Initial guess: integer approximation of bits/3 plus magic offset
    i = _mm_castps_si128(x);
    i = _mm_add_epi32(_mm_srai_epi32(i,2), _mm_srai_epi32(i,4));
    i = _mm_add_epi32(i, _mm_srai_epi32(i,4));
    i = _mm_add_epi32(i, _mm_srai_epi32(i,8));
    i = _mm_add_epi32(i, magic);
    y = _mm_castsi128_ps(i);
    // Two Newton iterations: y = (2y + x/y^2) / 3
    y = _mm_mul_ps(oneThird, _mm_add_ps(_mm_add_ps(y,y), _mm_div_ps(x, _mm_mul_ps(y,y))));
    y = _mm_mul_ps(oneThird, _mm_add_ps(_mm_add_ps(y,y), _mm_div_ps(x, _mm_mul_ps(y,y))));
    // Restore sign
    y = _mm_or_ps(y, s);
    return y;
}
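A usage sketch (mine, not from the post; just loads four floats, runs the vector cube root, and prints the results):
Code:
#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    float in[4] = { 8.0f, -27.0f, 2.0f, 1000.0f };
    float out[4];
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _cbrt(v));
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);  // ~2, -3, 1.26, 10
    return 0;
}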
BTW, I almost feel lucky that Intel is not making us do some bit manipulation to do a subtract...
"You don't need a sub instruction, you can negate the sign and do an add"
"We gave you sub... you want a HW version of sincos??? Who do you think we are, GPU designers!"
me: not happy wasting time re-inventing the wheel.
"You dont need a sub instruction, you can negate the sign and do an add"
"We gave have you sub... you want a HW version of sincos ??? who do you think we are GPU designers! "
me < no happy wasting time re-inventing the wheel.
This is much faster than the intrinsic on some lowly GPUs (especially VideoCore IV):
Code:
float sinf(float x){
    x *= 0.159155;           // 1/(2*pi): radians -> turns
    x -= floor(x);           // fractional part, [0, 1)
    float xx = x*x;
    float y = -6.87897;      // degree-5 polynomial in x^2, Horner form
    y = y*xx + 33.7755;
    y = y*xx - 72.5257;
    y = y*xx + 80.5874;
    y = y*xx - 41.2408;
    y = y*xx + 6.28077;
    return x*y;
}

float cosf(float x){
    return sinf(x + 1.5708); // cos(x) = sin(x + pi/2)
}
And here's the Python to calculate the coefficients, so you can try higher/lower orders:
Code:
from numpy import *

order=5
x=arange(0.00015,1,0.0001)
y=sin(2*pi*sqrt(x))/sqrt(x)
w=sqrt(x)/sin(sqrt(x))
coeffs=polyfit(x,y,deg=order,w=w)
poly=poly1d(coeffs)
print """
float sinf(float x)
{
    x*="""+'%.6g'%(.5/pi)+""";
    x-=floor(x);
    float xx=x*x;"""
print "    float y="+'%.6g'%poly[order]+";"
for i in range(order):
    print "    y=y*xx+"+'%.6g'%poly[order-i-1]+";"
print """    return x*y;
}

float cosf(float x) {
    return sinf(x+"""+'%.6g'%(pi/2)+""");
}
"""
This is based on the Sony paper
@psonice: To view some of the internal intrinsic implementations on VideoCore IV, do a 'strings' on https://github.com/raspberrypi/firmware/blob/master/boot/start.elf and look near the end. Trig functions are all implemented in plain GLSL.
sin/cos/tan use a minimax polynomial, followed by squaring the (cos, sin) vector twice to multiply the angle by 4, then correcting the vector length back to 1.
asin/acos/atan use a CORDIC-like method: keep squaring the vector and watch which quadrant it moves to.
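To illustrate the angle-quadrupling idea (my own sketch, not the firmware's actual code; I use short Taylor polynomials where the firmware uses minimax ones): treat (cos, sin) as a complex number and square it twice, since (c + i*s)^2 = (c^2 - s^2) + i*2cs doubles the angle.
Code:
// Evaluate sin/cos at x/4 with cheap polynomials, then double the angle twice.
void sincos4(float x, float *s, float *c)
{
    float h  = x * 0.25f;   // quarter angle, small enough for a short poly
    float h2 = h * h;
    float sh = h * (1.0f - h2 * (1.0f/6.0f - h2 * (1.0f/120.0f)));  // ~sin(h)
    float ch = 1.0f - h2 * (0.5f - h2 * (1.0f/24.0f));              // ~cos(h)
    for (int i = 0; i < 2; i++) {   // complex squaring doubles the angle
        float c2 = ch*ch - sh*sh;
        float s2 = 2.0f*ch*sh;
        ch = c2; sh = s2;
    }
    *s = sh; *c = ch;
}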
CAN I HAS FAST POW?
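Since nobody posted one yet, here is the classic bit-trick sketch (my own, not from the thread): a float's bit pattern is roughly a fixed-point log2 of its value, so pow becomes a multiply in that domain. Accuracy is only a few percent, so treat it as a starting point for gamma-style pow(x, CONSTANT), not a drop-in replacement.
Code:
#include <stdint.h>

// pow(x, p) for x > 0, via the approximate identities
//   log2(x) ~ bits(x)/2^23 - 127   and   exp2(e) ~ bits_inv((e + 127)*2^23)
float pow_fast(float x, float p)
{
    union { float f; uint32_t i; } u;
    u.f = x;
    float lg = (float)u.i * (1.0f / 8388608.0f) - 127.0f;  // ~log2(x)
    float e  = p * lg;                                     // log2(x^p)
    u.i = (uint32_t)((e + 127.0f) * 8388608.0f);           // ~exp2(e)
    return u.f;
}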