Threads (CPU not BBS)!
category: code [glöplog]
beeing afraid of threads is so 90th...
  
you could always go for an event based engine, that'll make it possible to do parallel tasks
  
Chaos+Ryg: thanks for mentionning my work. 
Design towards having only two kinds of threads:
- those that work a lot (tasks)
- those that wait a lot (IOs)
On a machine with n cores, you want n threads of the first kind, including the main (as Chaos said).
You can actually have quite a few threads of the second kind, those that make sure disk, network and speakers are kept busy. The key is to have them do very little when they're active (or spawn work for main+workers threads to execute).
All in all, diving big chunks of work in smaller chunks is the easy part.
The fun starts when you need to synchronize things together. Race conditions are not only hard to fix, they can be hard to spot/repeat. When faced with a race, I see people jumping to the easy solution : add mutex.
<hypnosis tone>
Stop. Pause. Think.
More mutexes = more race conditions
painting self in corner - dig self into hole - sawing off branch sat on...
A good mutex is a mutex not used
</hypnosis tone>
When you see a race condition, you want to make it not happen _by design_
This is the big software design challenge: how to create code that doesn't require tasks to be aware of what the others do. It's a slightly different way of thinking, a lot still needs to be invented...
For more explanations on how my sample achieves minimal use of mutexes, read the end of this article:
http://www.gamasutra.com/view/feature/4287/sponsored_feature_doityourself_.php
  
Design towards having only two kinds of threads:
- those that work a lot (tasks)
- those that wait a lot (IOs)
On a machine with n cores, you want n threads of the first kind, including the main (as Chaos said).
You can actually have quite a few threads of the second kind, those that make sure disk, network and speakers are kept busy. The key is to have them do very little when they're active (or spawn work for main+workers threads to execute).
All in all, diving big chunks of work in smaller chunks is the easy part.
The fun starts when you need to synchronize things together. Race conditions are not only hard to fix, they can be hard to spot/repeat. When faced with a race, I see people jumping to the easy solution : add mutex.
<hypnosis tone>
Stop. Pause. Think.
More mutexes = more race conditions
painting self in corner - dig self into hole - sawing off branch sat on...
A good mutex is a mutex not used
</hypnosis tone>
When you see a race condition, you want to make it not happen _by design_
This is the big software design challenge: how to create code that doesn't require tasks to be aware of what the others do. It's a slightly different way of thinking, a lot still needs to be invented...
For more explanations on how my sample achieves minimal use of mutexes, read the end of this article:
http://www.gamasutra.com/view/feature/4287/sponsored_feature_doityourself_.php
on such a machine i think threading your app may have benefits...

question : om such an system, what would you do?
  

question : om such an system, what would you do?
Quote:
The fun starts when you need to synchronize things together. Race conditions are not only hard to fix, they can be hard to spot/repeat. When faced with a race, I see people jumping to the easy solution : add mutex.
The very core of this problem lies in the fact that a lot of programmers still don't seem to get that certainly in these situations (and many others) less is more and that (added) complexity is not to be favored over simplicity; especially since the latter can be harder to attain.
tigrou: that is exactly the reason why the thread per task approach is doomed to fail. 
the real question is: with so many threads trying to steal work from each other, will it stall all the time? The critical section at the heart of the algorithm might become a bit crowded.
  
the real question is: with so many threads trying to steal work from each other, will it stall all the time? The critical section at the heart of the algorithm might become a bit crowded.
There's also the case of bad aliasing, where several threads try to access the same cache line at the same time, resulting in so much handshaking that the result is way slower than just using one thread.. =)
Anyway, in my build example, I think the reason why 21 tasks was better than 4 is the required I/O waits.
  
Anyway, in my build example, I think the reason why 21 tasks was better than 4 is the required I/O waits.
Any thoughts on apple's grand central dispatch thing? I've yet to use it in anger, but it seems to solve some of the issues - it manages the thread pool, so the number of threads available matches the number of available cores and threads aren't assigned to busy cores so stalls are avoided. The way it works seems pretty clean too. It's open sourced, no idea if a windows port is available though.
  
psonice, doesn't the GCD come with a syntax extension? I've seen the block declarations here and there and they look nice and very usable. At last a lot better than firing off your own threads via pthreads or whatever.
However as new language features it will take at least a decade until we see support on all relevant platforms.
  
However as new language features it will take at least a decade until we see support on all relevant platforms.
What would iq do with that machine?
  
Yeah, it uses 'blocks' which is a syntax extension. I seem to remember you can use functions instead of blocks which might get around that issue.
Taking a quick look though, it seems it's being ported to bsd and not much else as yet, guess it needs access to some lower level stuff that makes porting it outside of bsd hard work.
  
Taking a quick look though, it seems it's being ported to bsd and not much else as yet, guess it needs access to some lower level stuff that makes porting it outside of bsd hard work.
psonice: GCD seems like a sort of sensible linguistig approach to deferring the problem of sensibly-grained parallel processing to the OS -- it's quite high on my list of 'things to try' - if you get to take it for a spin don't hesitate to post any findings (too busy myself right now)
  
i have got a use for it actually - i want to offload a bunch of work from the gpu (which is slowly being crushed by the amount of work i'm dumping on it :) and it's highly suitable for threading. I was hoping to keep 10.5 support though. Laziness is kind of luring me towards 10.6/gcd (plus opencl is very appealing for this app) so we'll see. I'll post up what happens if i do it. (this is a video processing kind of app btw, not some uber demo unfortunately)
  
I couldn't care less what kind of app. it is actually - interested if you get some results / get to see some bottlenecks of the whole thing et cetera.
  
But... Why not OpenMP?
It works, it's easy, doesn't mess up the code too much...
  
It works, it's easy, doesn't mess up the code too much...
OpenMP is nice and good as long as you're just after pure data decomposition. When you start wanting to parallelize a whole game engine type of app, you encounter all sorts of situations that get more and more awkward to deal with in OpenMP only (oh, and _don't_ start mixing and matching threading libraries!)
My latest bit of fun has been about having the main thread submit draw calls while the rest of the processor would work on the next frame. There is something in latest TBB that makes it really easy: TaskGroups. nulstein also has this ability to spawn tasks while you do something else in the calling thread. I reckon this is a key feature when pipelining things and I really don't think you can even hope to see this in OpenMP.
But, generally, the main reason why I'm not using OpenMP is control. Having all this code hidden behind #pragma's makes me uncomfortable ;)
  
My latest bit of fun has been about having the main thread submit draw calls while the rest of the processor would work on the next frame. There is something in latest TBB that makes it really easy: TaskGroups. nulstein also has this ability to spawn tasks while you do something else in the calling thread. I reckon this is a key feature when pipelining things and I really don't think you can even hope to see this in OpenMP.
But, generally, the main reason why I'm not using OpenMP is control. Having all this code hidden behind #pragma's makes me uncomfortable ;)
Quote:
My latest bit of fun has been about having the main thread submit draw calls while the rest of the processor would work on the next frame.
Do we know, does typically each core have it's own cache? I would assume so but you never know. Also, I wonder how cores negotiate what is dirty between them.
Quote:
Do we know, does typically each core have it's own cache?
L1 yes, L2 depends (sometimes yes, sometimes no), L3 no (always just one L3 per package in all the architectures I've seen so far)
Quote:
I would assume so but you never know.
This is mentioned in virtually every in-depth processor review, it's in the architecture block diagrams, and it's also on Wikipedia.
Quote:
Also, I wonder how cores negotiate what is dirty between them.
http://en.wikipedia.org/wiki/Cache_coherence.
I'd add that with hyperthreading on, you have two "logical cores" running off the same cache ie L3 all the way up to L1... Add to this that the OS can reschedule your thread at any time and you have a nice rubiky toy to play with.
If you're trying to take advantage of how the cache is shared, you pretty much have to setup your thread affinities so the OS has got no room for anything "automagic" but then, you have to take care of anything automagic that might have helped you before...
What about Last Level Cache being shared always? it's not, in processors like the Q6600 (which has proved very popular), we really have two dual-core chips in one and you can pick cores that don't share any cache at all...
Thankfully, caches share information efficiently and are designed to work transparently. But if you really have to mess with this, you may find these articles useful:
http://www3.intel.com/cd/ids/developer/asmo-na/eng/276613.htm
http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/
  
If you're trying to take advantage of how the cache is shared, you pretty much have to setup your thread affinities so the OS has got no room for anything "automagic" but then, you have to take care of anything automagic that might have helped you before...
What about Last Level Cache being shared always? it's not, in processors like the Q6600 (which has proved very popular), we really have two dual-core chips in one and you can pick cores that don't share any cache at all...
Thankfully, caches share information efficiently and are designed to work transparently. But if you really have to mess with this, you may find these articles useful:
http://www3.intel.com/cd/ids/developer/asmo-na/eng/276613.htm
http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/
ahh hyperthreading, that's what I was after. I knew there was a particular distinction between threads and core that we weren't discussing... 
  
Wasnt PC GTA game that "fully utilized" 2 cores successfully?
  
main.c
output
Does it look good enough? =)
  
Code:
void TestFunction(void* a, void* b)
{
	printf("Function %i sleeping %i\n", (int)b, (int)a);
	
	Timer timer;
	while(timer.dGet()<(double)((int)a)/1000.0){};	//Sleep((int)a);
}
int main(char* argv[], int argc)
{
	Timer timer;
	ThreadManager ThreadMan;
	ThreadMan.InitTM();
	printf("Starting the pushing...\n");
	timer.Reset();
	ThreadMan.Push(TestFunction, (void*)1000, (void*)1);
	ThreadMan.Push(TestFunction, (void*)1000, (void*)2);
	ThreadMan.Push(TestFunction, (void*)1000, (void*)3);
	ThreadMan.Push(TestFunction, (void*)1000, (void*)4);
	ThreadMan.Push(TestFunction, (void*)1000, (void*)5);
	ThreadMan.Push(TestFunction, (void*)1000, (void*)6);
	ThreadMan.Push(TestFunction, (void*)1000, (void*)7);
	ThreadMan.Push(TestFunction, (void*)1000, (void*)8);
	ThreadMan.Push(TestFunction, (void*)1000, (void*)9);
	ThreadMan.WorkAndWait();
	printf("\n%f seconds!\n", timer.dGet());
	ThreadMan.KillTM();
	system("PAUSE");
	return 0;
}
output
Code:
Thread 1 initialized;
Starting the pushing...
Pushing 1...
Pushing 2...
Thread 2 initialized;
Thread 2 working: 2 running, 0 queued
Function 2 sleeping 1000
Pushing 3...
Pushing 4...
Pushing 5...
Pushing 6...
Pushing 7...
Pushing 8...
Pushing 9...
Thread 0 working: 3 running, 6 queued
Function 3 sleeping 1000
Thread 3 initialized;
Thread 3 working: 4 running, 5 queued
Function 4 sleeping 1000
Thread 1 working: 1 running, 0 queued
Function 1 sleeping 1000
Thread 2 working: 4 running, 4 queued
Function 5 sleeping 1000
Thread 0 working: 4 running, 3 queued
Function 6 sleeping 1000
Thread 3 working: 4 running, 2 queued
Function 7 sleeping 1000
Thread 1 working: 4 running, 1 queued
Function 8 sleeping 1000
Thread 2 working: 4 running, 0 queued
Function 9 sleeping 1000
3.002525 seconds!
Thread 2 says good bye;
Thread 1 says good bye;
Thread 3 says good bye;
Press any key to continue . . .
Does it look good enough? =)
It takes ~2 seconds with 8 cores. So at least for stupid sleep functions it works fine.
  
When doing multi-thread aware code, don't forget also to take into consideration all the APIs you call that may be thread-safe in a very crappy way. A typical thing in game development is to use a custom memory allocator, reuse it year after year, and finally realise that there is a global mutex added by somebody to make it thread safe, resulting in a game that runs slower on a dual core than it would on a single core...
  
My worker threads... what should they do when the queue is empty? Currently I have a Sleep(0), not sure if that's a good idea.
  













