Optimizing Closure
category: code [glöplog]
The wonderful thing about there being no standard for binary shader intermediates (so far as I know) is that the shaders for commercial games can be opened up in any text editor. I'm not sure if it's appropriate to dump code from a copyrighted game here, but if that's a problem I imagine deleting this thread would not be too difficult for the admins here.
This is an updated version of a Flash game I really liked, but the dev seems to have solved his art/style objectives by brute force, IMO.
{Copyright Eyebrow of course.
Any /* */ comments are my own commentary}
This is "hinnerglow.cg"
Code:
const float px = 1.0/1920.0*2.0*1.3;
const float py = 1.0/1080.0*2.0*1.3;
/*Designed specifically for 1080p!? */
void main (float2 texCoord : TEXCOORD0,
           sampler2D tex : TEXTUNIT0,
           uniform float2 scale,
           out float4 oColor : COLOR)
{
    float glow = 0.0;
    glow += tex2D(tex, float2(texCoord) + float2( 0.0*px/scale.x, 0.0)).r * (1.0/( 0.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 1.0*px/scale.x, 0.0)).r * (1.0/( 1.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 2.0*px/scale.x, 0.0)).r * (1.0/( 4.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 3.0*px/scale.x, 0.0)).r * (1.0/( 9.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 4.0*px/scale.x, 0.0)).r * (1.0/(16.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 5.0*px/scale.x, 0.0)).r * (1.0/(25.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 6.0*px/scale.x, 0.0)).r * (1.0/(36.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 7.0*px/scale.x, 0.0)).r * (1.0/(49.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-1.0*px/scale.x, 0.0)).r * (1.0/( 1.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-2.0*px/scale.x, 0.0)).r * (1.0/( 4.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-3.0*px/scale.x, 0.0)).r * (1.0/( 9.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-4.0*px/scale.x, 0.0)).r * (1.0/(16.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-5.0*px/scale.x, 0.0)).r * (1.0/(25.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-6.0*px/scale.x, 0.0)).r * (1.0/(36.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-7.0*px/scale.x, 0.0)).r * (1.0/(49.0+1.0));
    float b = (glow/2.888);
    oColor = float4(b, b, b, b);
}
I take it this means 15 texture samples per pixel;
No wonder it will not run reasonably on the Radeon 4200 IGP in my laptop.
Just wow... I have zero shading experience, so is this a normal way for a dev to do what he is trying to do? Would a typical shader compiler make all those constant divisions into multiplies?
I can already imagine 1.0/scale.x could be done once instead of 15 times.
Any other ideas?
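For what it's worth, that magic 2.888 divisor is just the sum of the fifteen 1/(i²+1) weights, so the normalization (and the 1.0/scale.x) could be baked into constants up front. A quick Python sketch (my own check, not from the game) to verify the arithmetic:

```python
# The shader's fifteen per-tap weights: 1/(i^2 + 1) for offsets i = -7..7.
weights = [1.0 / (i * i + 1.0) for i in range(-7, 8)]

total = sum(weights)
print(round(total, 4))  # 2.8886 -- where the "glow / 2.888" comes from

# Pre-normalized weights: bake the division into the constants so the
# shader needs no final divide (and 1.0/scale.x is hoisted out once too).
normalized = [w / total for w in weights]
print(round(sum(normalized), 6))  # 1.0
```

So the shader's final divide is slightly off (2.888 vs. ~2.8886) on top of being redundant.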
if it's a simple gaussian blur (which is what it looks like), you can just do it with fewer samples to a smaller rendertarget, and then your texture filtering will add an additional step of blur for free.
At least they are using a separable version of it - which is, complexity-wise, a pretty good idea; they are not going totally brute-force there ;)
This might be another good idea to speed things up:
http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/
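The core trick from that article, sketched in 1-D Python for illustration (the offsets/weights here are picked from the shader above): hardware bilinear filtering already computes (1-f)*t[i] + f*t[i+1], so two adjacent taps can be folded into one fetch at a fractional offset.

```python
def merge_taps(o1, w1, o2, w2):
    """Combine two adjacent taps (offset, weight) into one bilinear fetch."""
    w = w1 + w2
    o = (o1 * w1 + o2 * w2) / w   # weighted average of the two offsets
    return o, w

# Two of the shader's taps: offsets 1 and 2, weights 1/2 and 1/5.
o, w = merge_taps(1.0, 0.5, 2.0, 0.2)

# Verify against a fake 1-D texture with linear filtering.
tex = [0.3, 0.9, 0.4]  # texels at integer positions 0, 1, 2

def sample_linear(tex, x):
    i = int(x)
    f = x - i
    return (1.0 - f) * tex[i] + f * tex[i + 1]

two_taps = 0.5 * tex[1] + 0.2 * tex[2]
one_tap = w * sample_linear(tex, o)
print(abs(two_taps - one_tap))  # ~0: same result, half the fetches
```

Applied to the shader above, the 15 taps would drop to 8 fetches with identical output.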
I always end up downscaling to get blur. I take the average of 4 pixels and repeat until image_height < threshold, then apply one blur to the final downscaled image. Is there any reason not to use this? Isn't it faster than anything else? (I've never tested it against anything else.)
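That downsample chain, sketched in 1-D Python for illustration (the real thing averages 2x2 blocks on the GPU; names are mine):

```python
# Repeatedly halve the image by averaging neighbours (pairs here in 1-D,
# 2x2 blocks in 2-D) until it drops below a threshold, then blur once.
def downsample_half(img):
    return [(img[2 * i] + img[2 * i + 1]) / 2.0 for i in range(len(img) // 2)]

def blur_by_downsampling(img, threshold):
    while len(img) > threshold:
        img = downsample_half(img)
    return img

img = [float(i % 8) for i in range(64)]
small = blur_by_downsampling(img, 8)
print(len(small))  # 8: each output pixel now averages 8 input pixels
```

Each halving pass is cheap and the total work is geometric (n + n/2 + n/4 + ... < 2n samples), which is why it feels faster than a wide kernel at full resolution.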
do not modify the lookup coords in the fragment shader. it should be faster if you calculate the sample points in the vs and pass them to the fs. this allows the gpu to do the texture lookups 'before the fs', since the lookup positions do not depend on your fs code.
pretty good answers here. question: are some of you involved in game development? or have you just made too many demos?
Rale: You mean compute the sample positions in the VS of a fullscreen quad(/triangle), store them in varyings, and then let the interpolation produce all your sample coordinates on the fly? Not a bad idea.
But I don't think that the sample position computation is the main bottleneck - it might be a good additional optimization - especially if you target low end hardware/mobile.
The main bottleneck is still the texture lookups, and basically all the major optimizations for those have been mentioned here already. (Separation of the kernel, downsampling, abusing linear sampling to get 2 samples instead of 1 - without separation along X/Y you could even go for 4 samples, but for larger kernel sizes the computational complexity will show its ugly face.)
Besides that - there are nice approaches for box filters with almost arbitrary kernel sizes by using a summed-area table.
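A 1-D Python sketch of that idea (my illustration; the 2-D version uses a 2-D prefix sum): after one O(n) pass, a box average of any radius costs just two table lookups and a subtraction, independent of kernel size.

```python
# Build a prefix-sum table: sums[i] = img[0] + ... + img[i-1].
def prefix_sums(img):
    sums = [0.0]
    for v in img:
        sums.append(sums[-1] + v)
    return sums

# Box filter of arbitrary radius: each output pixel is the average of
# the window [i-radius, i+radius], clamped at the image borders.
def box_filter(img, radius):
    sums = prefix_sums(img)
    out = []
    for i in range(len(img)):
        lo = max(0, i - radius)
        hi = min(len(img), i + radius + 1)
        out.append((sums[hi] - sums[lo]) / (hi - lo))
    return out

print(box_filter([0.0, 0.0, 1.0, 0.0, 0.0], 1))
# the impulse spreads into equal 1/3 weights: [0.0, 1/3, 1/3, 1/3, 0.0]
```

Repeating the box filter a few times approximates a gaussian, so this also gives you large soft glows at fixed cost.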
Quote:
question: are some of you involved in game development? or have you just made too many demos?
No / Maybe. And other reasons. ;)
rale: true for certain (mobile) GPUs only. :)
:)
@rale, las, smash: Yeah, that sounds totally like black magic to me.
graga: no, its very sound advice on a certain kind of gpu.
So to sum it up: You don't want to do dependent texture lookups in fragment shader on deferred rendering GPUs.
It is in a way funny to see you all talking about "a certain mobile GPU" and "deferred rendering GPU" - strong NDAs anyone?
I'm not under NDA and I guess I could name the thing we are talking about - but let's leave it as an exercise for the interested reader to figure out which certain mobile GPU is bad at doing these things.
Back to it - the Closure guys are not doing it horribly wrong - so I recommend you either go for the hardware solution (a faster GPU) or just fix those shaders to do almost nothing - you won't have the nice glow stuff then, but at least you could play the game.
another opt you might want to try is to sort the samples left to right, for cache purposes. it helped me with SSAO in the past.
it would be cool if GPUs had some sort of prefetch, like _mm_prefetch in SSE: when doing your blur you'd touch sample number 7 first before proceeding to accumulate samples 0, 1, 2, 3, 4, 5, 6, so that when you continued with samples 8, 9, 10, 11, 12, 13 the data would already be in cache thanks to that initial touch. or perhaps GPU caches don't work this way; i've no idea what i'm talking about really.
but yeah, sorting left to right helped me a bit in the past.
IQ: I kind of lost you somewhere... what are you trying to sort?
that's a good one @iq. like sampling the scanline with a linear memory precache. i'm sure but i dunno if the compiler optimizes that anyway. ;)
Can someone please explain?
Read up on how caches work.
Basically it works like this: when reading from memory, depending on the hardware, extra data after the address you accessed is transferred into the processor cache for faster access (it's stored locally, so you don't need to go over the bus to get it). So when you sample the texture in order from left to right, the GPU might (I don't know the hardware specifics well enough to be more exact) already have the texels in its internal cache, saving cycles that would otherwise be spent on fetching.
gpu used to do texture swizzling, right?
Nice! I want to believe that the shader compiler is smart enough to do it internally for the above code...
iq: that's difficult to assume these days because there's so much running on the gpu at any point in time. when that first texture read occurs in the shader, what actually happens is there are like 63 other shader threads all executing the same instruction but on different parts of the texture.
you gain some benefit from the cache here because some of those different threads hit the same cache lines, but something like prefetch would be pointless on a modern desktop architecture.
modern desktop gpu architectures make very low assumptions about cache hit rates. the way gpus reduce the impact of reads from memory on performance is through latency hiding: having lots and lots of jobs in flight to choose from at any time, and being able to swap jobs as soon as one job stalls on a memory access.
mu6k: the downsampling isn't necessary anymore nowadays!
i just use the full res and apply like 128 steps of HypnoGlow to it f.e.
all of this is almost for free, doesn't hit the framerate at all! all on GPU via HLSL of course!
Quote:
all of this is almost for free, doesn't hit the framerate at all!
I tend to disagree. Just because it works for your 4k intros with a trivial glow post-process at real-time framerates doesn't mean it is "for free", or even "almost for free".
Blurs are (or can be - you can always screw things up) fast nowadays, agreed - but not free at all. You can get away with brute force (on fast enough hardware, and depending on what you want to do) - but maybe you should have read this thread: it was about speed optimizations.
yes, i didn't read the thread to the end when i posted, but read my post again carefully: i answered mu6k's question! nothing more!