
GLSL Multithreaded Shader Compilation on NVIDIA

category: code [glöplog]

Target Platform: Machine with latest NV Drivers (335.23)
Problem: Tons of complicated shaders need to be compiled (fast).
Straightforward solution: Do it multicore, it's just shader compilation, what could possibly go wrong...


So far I have tried several things (including creating fake contexts for each thread and moving program binaries around). I suspect that the driver serializes all the shader compiles, no matter what dirty trick I try.
Of course I also attempted this (end of the page) - no way.

Any ideas and/or more information on what the NV driver actually does(n't)?

P.S.: For benchmarks %APPDATA%\NVIDIA\GLCache\ should be deleted before every test, otherwise your shaders are already kind of precompiled.
added on the 2014-04-09 00:42:45 by las las
sure fucked up. interesting. it's valid. the compilers should run as task in another thread.

the 'easiest' is to write your own compiler and feed your own threads.
added on the 2014-04-09 01:51:42 by yumeji yumeji
yumeji: write your own compiler for GLSL? Wat.
added on the 2014-04-09 08:47:00 by kbi kbi
las: How about launching multiple processes that pre-compile the shader, hoping that the shader-cache kicks in when the "main" program is run?
added on the 2014-04-09 10:20:00 by kusma kusma
las: Also, try to avoid checking the compile-status right away. That *might* actually be what causes the driver to serialize.
added on the 2014-04-09 10:21:40 by kusma kusma
Let's start with a simple question: Did anyone here get multithreaded GLSL shader compilation to work properly?

@kusma: I tried exactly these approaches before the initial post. Maybe I was tired and totally screwed up everything (shit happens) - but I don't think so.


Here is a path which would enable a GL driver to do parallel shader compiling and linking


Here is a path which enables the NVIDIA GL driver to do parallel shader compiling and linking

Any information on whether this works on any existing driver implementation?

And it's basically pretty much the same logic which is in the How to ensure that an OpenGL implementation will be able to build GLSL shaders in parallels? paragraph of the link I posted before, which didn't seem to change anything.
added on the 2014-04-09 11:02:56 by las las
@las: Timothy Lottes used to work for NVIDIA, so it wouldn't be too far fetched to expect it to apply there.

Interestingly though, this exact issue was very recently brought up on the Mesa mailing list as well. In fact, it was posted there at almost exactly the same time as you posted here. Coincidence? Ian, are you reading this? ;)
added on the 2014-04-09 14:48:11 by kusma kusma
@kusma: I know.
As mentioned before: I tried it - it's basically the same approach as mentioned in the g-truc post and it does not seem to work.

Absolute coincidence - since I was just trying to overcome an existing problem... ;)
added on the 2014-04-09 17:20:46 by las las
las, just making sure, you tried kusma's multi-process thing and compile time was the same as with a single process?
added on the 2014-04-10 00:47:45 by TLM TLM
If you had actually read my previous posts, you should be able to come to that conclusion.

Again: The approach Lottes proposes is basically the same thing as in the g-truc post and it does not even ask you to do any fancy threading stuff in your application.
It's just the way you should call the OpenGL API to give the driver the chance to handle the shader compilation in parallel. So this potentially enables multithreaded compilation on the driver side. If I'm not mistaken, Lottes does not mention at all whether this actually works on NVIDIA drivers or not. It seems it doesn't with the current one.
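
For reference, the call ordering in question can be sketched roughly like this. This is only an illustration: GlCalls, PendingProgram and buildAllDeferred are hypothetical names, and the GL entry points are hidden behind callbacks so the ordering can be exercised without a context. In a real program the callbacks would forward to glCompileShader, glLinkProgram and glGetProgramiv with GL_LINK_STATUS. Whether a given driver actually overlaps the work is up to the driver.

```cpp
#include <functional>
#include <vector>

// Hypothetical thin wrapper around the GL entry points. In a real program
// these would forward to glCompileShader, glLinkProgram and
// glGetProgramiv(program, GL_LINK_STATUS, ...).
struct GlCalls
{
    std::function<void(unsigned)> compileShader;
    std::function<void(unsigned)> linkProgram;
    std::function<bool(unsigned)> linkSucceeded;
};

// Hypothetical record of one program whose shaders already have their
// sources set and are attached to the program object.
struct PendingProgram
{
    unsigned vertexShader;
    unsigned fragmentShader;
    unsigned program;
};

// Issues every compile, then every link, and only then queries any status.
// Returns the program ids whose link failed.
std::vector<unsigned> buildAllDeferred(const GlCalls& gl,
                                       const std::vector<PendingProgram>& batch)
{
    // Pass 1: kick off all compiles without asking for results.
    for (const PendingProgram& p : batch)
    {
        gl.compileShader(p.vertexShader);
        gl.compileShader(p.fragmentShader);
    }

    // Pass 2: kick off all links, still without asking for results.
    for (const PendingProgram& p : batch)
        gl.linkProgram(p.program);

    // Pass 3: the first status query happens only after all work is queued,
    // so a query cannot force the driver to finish one program before the
    // next one has even been submitted.
    std::vector<unsigned> failed;
    for (const PendingProgram& p : batch)
        if (!gl.linkSucceeded(p.program))
            failed.push_back(p.program);
    return failed;
}
```

The point of the three passes is that nothing in the first two needs a result back from the driver. The naive loop (compile, check, link, check, next shader) asks for a result after every single step.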

It also seems that there is little information available on these issues. It's 2014, come on...

If I have the time, I'll throw something together - some kind of minimal reproducer and make that available for testing, but I'll be pretty busy the next weeks.

If somebody gets positive results in terms of compilation time, please let me know. Possibly this stuff just doesn't work properly with the OpenGL compatibility contexts that I'm using.
added on the 2014-04-10 13:27:19 by las las
Las, I might be missing something, but it seems that multi-process parallelization works for me.
1 process => ~13s
2 processes => ~14s
4 processes => ~16s
6 processes => ~25s

Platform: Quad Core, GTX-560, NV Drivers 335.23, Windows 7 x86

Just for reference
Code:
// Get shaders
string fragShader = "... fragment shader code ...";
string vertexShader = "... vertex shader code ...";

// Compile loop
int s1 = GetTickCount();
for (int i = 0; i < 200; i++)
{
    // Append a changing comment to the shader so the shader cache never hits
    char postfix[64]; // large enough for two ints plus the comment syntax
    sprintf(postfix, "// %d %d \r\n", i, s1);

    // Compile shader
    compileAndLinkShader(vertexShader, fragShader + postfix);
}
int s2 = GetTickCount();

// Print elapsed time in seconds
cout << (s2 - s1) / 1000.0;
added on the 2014-04-13 23:39:53 by TLM TLM
Las, I might be missing something, but it seems that multi-process parallelization works for me.
1 process => ~13s
2 processes => ~14s
4 processes => ~16s
6 processes => ~25s

TLM, I might be missing something, but shouldn't timings get lower with increasing process count?
added on the 2014-04-14 10:34:21 by xTr1m xTr1m
i'm guessing he's compiling 200 shaders in each process, so it's doing 6x more shaders in ~2x more time in the last case. So many other things could be affecting the times there though, i'm not sure it's safe to say the actual compilation is happening in parallel.
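
The arithmetic behind that reading, as a small sketch. It assumes every process compiles its own 200 shaders, which is a guess at this point in the thread, and it uses the approximate timings quoted above. Run, kRuns, shadersPerSecond and speedupOverSingle are names made up for the example.

```cpp
// Timings reported above: process count and approximate wall-clock seconds.
struct Run
{
    int processes;
    double seconds;
};

const Run kRuns[] = {{1, 13.0}, {2, 14.0}, {4, 16.0}, {6, 25.0}};

// Total shaders compiled per second across all processes, assuming each
// process compiles `shadersPerProcess` distinct shaders.
double shadersPerSecond(const Run& run, int shadersPerProcess = 200)
{
    return run.processes * shadersPerProcess / run.seconds;
}

// Throughput relative to the single-process run.
double speedupOverSingle(const Run& run)
{
    return shadersPerSecond(run) / shadersPerSecond(kRuns[0]);
}
```

Under that assumption the throughput comes out at roughly 15, 29, 50 and 48 shaders per second for 1, 2, 4 and 6 processes, i.e. about 3.25x at 4 processes and no further gain at 6 on the quad core.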
added on the 2014-04-14 13:22:39 by psonice psonice
Ah now I understand, thanks psonice
added on the 2014-04-14 13:41:21 by xTr1m xTr1m
psonice, you are basically correct. A couple of notes:
1. It seems to scale nicely with the number of CPUs/cores.
2. Regarding the "not sure it's safe to say the actual compilation is happening in parallel": I agree, it's hard to tell what is really going on. Just consider this: not clearing the shader cache makes things super fast.
3. If you just change the sprintf to "sprintf(postfix, "// %d \r\n", i);" the total time drops by a factor of 4. This happens because all processes are then trying to compile the same set of 200 shaders. Each process works through the entire list, but in some cases it can load from the cache a shader that a different process compiled a second ago. This makes the entire thing really simple to implement: you don't need to manually transfer shader binaries between processes, you can just trust the shader cache to do this for you.
added on the 2014-04-15 02:38:23 by TLM TLM

