GLSL Multithreaded Shader Compilation on NVIDIA
category: code [glöplog]
Hi.
Target Platform: Machine with latest NV Drivers (335.23)
Problem: Tons of complicated shaders need to be compiled (fast).
Straightforward solution: do it multicore - it's just shader compilation, what could possibly go wrong...
Well...
So far I have attempted several things (including creating fake contexts for each thread and moving program binaries around). I suspect the driver serializes all shader compiles - no matter what dirty trick I try.
Of course I also attempted this (end of the page) - no way.
Any ideas and/or more information on what the NV driver actually does(n't) do?
P.S.: For benchmarks, %APPDATA%\NVIDIA\GLCache\ should be deleted before every test; otherwise your shaders are already effectively precompiled.
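Something like this clears the cache between runs (a minimal sketch; assumes Windows and the default GLCache location, which may differ between driver versions):
Code:
// Clear the NVIDIA shader cache so every benchmark run compiles from scratch.
#include <cstdlib>
#include <filesystem>
#include <iostream>

int main()
{
    const char* appdata = std::getenv("APPDATA");
    if (!appdata)
    {
        std::cerr << "APPDATA not set\n";
        return 1;
    }
    std::filesystem::path cache = std::filesystem::path(appdata) / "NVIDIA" / "GLCache";
    std::error_code ec;
    std::filesystem::remove_all(cache, ec); // no-op if the directory is absent
    if (ec)
        std::cerr << "Could not clear " << cache << ": " << ec.message() << "\n";
    return 0;
}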
sure fucked up. interesting. it's valid. the compilers should run as tasks in another thread.
the 'easiest' is to write your own compiler and feed your own threads.
yumeji: write your own compiler for GLSL? Wat.
las: How about launching multiple processes that pre-compile the shaders, hoping that the shader cache kicks in when the "main" program is run?
las: Also, try to avoid checking the compile-status right away. That *might* actually be what causes the driver to serialize.
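What I mean is something like this (a minimal sketch; assumes a current OpenGL context with loaded function pointers, and 'sources' is a placeholder for your shader strings): issue all the compiles first, then query the results afterwards, so a driver that can compile in parallel isn't forced to finish each shader before the next one starts.
Code:
#include <string>
#include <vector>

std::vector<GLuint> shaders;
for (const std::string& src : sources)
{
    GLuint s = glCreateShader(GL_FRAGMENT_SHADER);
    const char* text = src.c_str();
    glShaderSource(s, 1, &text, nullptr);
    glCompileShader(s);                 // no status query here
    shaders.push_back(s);
}
// Only now force the results.
for (GLuint s : shaders)
{
    GLint ok = GL_FALSE;
    glGetShaderiv(s, GL_COMPILE_STATUS, &ok);
    if (!ok) { /* fetch and print the info log */ }
}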
Let's start with a simple question: Did anyone here get multithreaded GLSL shader compilation to work properly?
@kusma: I tried exactly these approaches before the initial post. Maybe I was tired and totally screwed up everything (shit happens) - but I don't think so.
@spike:
Quote:
Here is a path which would enable a GL driver to do parallel shader compiling and linking
NOT EQUAL TO
Quote:
Here is a path which enables the NVIDIA GL driver to do parallel shader compiling and linking
Any information on whether this works on any existing driver implementation?
And it's basically the same logic as in the "How to ensure that an OpenGL implementation will be able to build GLSL shaders in parallels?" paragraph of the link I posted before, which didn't seem to change anything.
@las: Timothy Lottes used to work for NVIDIA, so it wouldn't be too far-fetched to expect it to apply there.
Interestingly though, this exact issue was very recently brought up on the Mesa mailing list as well. In fact, it was posted there at almost exactly the same time as you posted here. Coincidence? Ian, are you reading this? ;)
@kusma: I know.
As mentioned before: I tried it - it's basically the same approach as in the g-truc post, and it does not seem to work.
Absolute coincidence - since I was just trying to overcome an existing problem... ;)
las, just making sure: you tried kusma's multi-process thing, and compile time was the same as with a single process?
If you had actually read my previous posts, you should be able to come to that conclusion.
Again: the approach Lottes proposes is basically the same thing as in the g-truc post, and it does not even ask you to do any fancy threading stuff in your application.
It's just about the order in which you call the OpenGL API, to give the driver the chance to handle the shader compilation in parallel - so it potentially enables multithreaded compilation on the driver side (the full calling pattern is sketched at the end of this post). If I'm not mistaken, Lottes does not mention at all whether this actually works on NVIDIA drivers or not. It seems it doesn't with the current one.
It also seems that there is little information available on these issues. It's 2014, come on...
If I have the time, I'll throw together some kind of minimal reproducer and make it available for testing, but I'll be pretty busy the next few weeks.
If somebody gets positive results in terms of compilation time - please let me know. Perhaps this stuff just doesn't work properly with the OpenGL compatibility contexts that I'm using.
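For reference, here is my reading of that calling pattern as a minimal sketch (not a confirmed-working recipe; assumes a current GL context, and 'shaderSources' is a placeholder container of vertex/fragment source pairs): compile all shaders and link all programs without any intermediate status queries, and only query at the very end.
Code:
#include <string>
#include <vector>

struct Job { GLuint vs, fs, prog; };

std::vector<Job> jobs;
for (const auto& src : shaderSources)
{
    Job j;
    j.vs = glCreateShader(GL_VERTEX_SHADER);
    j.fs = glCreateShader(GL_FRAGMENT_SHADER);
    const char* v = src.vertex.c_str();
    const char* f = src.fragment.c_str();
    glShaderSource(j.vs, 1, &v, nullptr);
    glShaderSource(j.fs, 1, &f, nullptr);
    glCompileShader(j.vs);              // no status check
    glCompileShader(j.fs);              // no status check
    j.prog = glCreateProgram();
    glAttachShader(j.prog, j.vs);
    glAttachShader(j.prog, j.fs);
    glLinkProgram(j.prog);              // no status check either
    jobs.push_back(j);
}
// Only now force the results.
for (const Job& j : jobs)
{
    GLint linked = GL_FALSE;
    glGetProgramiv(j.prog, GL_LINK_STATUS, &linked);
    if (!linked) { /* query the shader/program info logs here */ }
}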
Las, I might be missing something, but it seems that multi-process parallelization works for me.
1 process => ~13s
2 processes => ~14s
4 processes => ~16s
6 processes => ~25s
Platform: Quad Core, GTX-560, NV Drivers 335.23, Windows 7 x86
Just for reference:
Code:
// Get shaders
string fragShader = "... fragment shader code ...";
string vertexShader = "... vertex shader code ...";

// Compile loop
DWORD s1 = GetTickCount();
for (int i = 0; i < 200; i++)
{
    // Add a changing comment to the shader (to defeat the driver's cache).
    // Buffer sized generously: "%d %lu" with a tick count needs more than 12 chars.
    char postfix[64];
    snprintf(postfix, sizeof(postfix), "// %d %lu \r\n", i, (unsigned long)s1);

    // Compile and link shader
    compileAndLinkShader(vertexShader, fragShader + postfix);
}
DWORD s2 = GetTickCount();

// Print elapsed seconds
cout << (s2 - s1) / 1000.0;
Quote:
Las, I might be missing something, but it seems that multi-process parallelization works for me.
1 process => ~13s
2 processes => ~14s
4 processes => ~16s
6 processes => ~25s
TLM, I might be missing something, but shouldn't timings get lower with increasing process count?
I'm guessing he's compiling 200 shaders in each process, so the last case does 6x more shaders in only ~2x more time (1200 shaders in ~25s versus 200 in ~13s). So many other things could be affecting the times there, though, that I'm not sure it's safe to say the actual compilation is happening in parallel.
Ah, now I understand - thanks psonice
psonice, you are basically correct. A couple of notes:
1. It seems to scale nicely with the number of CPUs/cores.
2. Regarding "not sure it's safe to say the actual compilation is happening in parallel": I agree, it's hard to tell what is really going on. Just consider this: not clearing the shader cache makes things super fast.
3. If you change the sprintf to "sprintf(postfix, "// %d \r\n", i);", the total time gets faster by a factor of about 4x. This happens because all processes are then trying to compile the same set of 200 shaders: each process walks the entire list, but in some cases it can load from the cache a shader that was compiled a second earlier by a different process. This makes the whole thing really simple to implement - you don't need to manually transfer shader binaries between processes, you can just trust the shader cache to do it for you (see the sketch below).
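A minimal sketch of such a multi-process launcher, assuming a hypothetical worker executable ("precompile.exe") that compiles the full shader list and exits; the name and the process count are placeholders:
Code:
// Launch N copies of a worker that compiles the whole shader list, then
// wait for all of them; the driver's shader cache deduplicates the work.
#include <windows.h>
#include <vector>

int main()
{
    const int N = 4;                      // roughly the number of CPU cores
    std::vector<HANDLE> procs;
    for (int i = 0; i < N; i++)
    {
        STARTUPINFOA si = { sizeof(si) };
        PROCESS_INFORMATION pi;
        char cmd[] = "precompile.exe";    // hypothetical worker; CreateProcess may modify this buffer
        if (CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
        {
            CloseHandle(pi.hThread);
            procs.push_back(pi.hProcess);
        }
    }
    WaitForMultipleObjects((DWORD)procs.size(), procs.data(), TRUE, INFINITE);
    for (HANDLE h : procs)
        CloseHandle(h);
    return 0;
}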