Is SDL slow or is it my code that sucks?
category: general [glöplog]
Hello everyone,
I'm currently working on my first demo-ish thing, but I have a serious performance problem and hope someone can help me. I *do* trust that it's possible to get a serious answer on pouet BBS :P
So... I started coding some oldskoolish effects like plasma and stuff. I've optimized it quite a bit and it runs fine in 640x480, around 70fps, but in 1024x768 it's horribly slow (around 30fps). I tried to find out where the problem is and then noticed that it's not my plasma code. The plasma calculation routine actually needs very little CPU power... around 90% of the CPU power is used just to draw it on the screen.
The hardware I'm currently working on is an iBook G4 1.2GHz running OS X 10.4.4, but I'm planning to get a PC and port it over to Windows. I'm using the SDL API and currently use no hardware 3D.
I wrote a little test function, it looks like this:
Code:
for(i=0;i<HEIGHT;i++)
{
    for(j=0;j<WIDTH;j++)
    {
        *scr=0xffffffff;
        scr++;
    }
}
// back in the main function is an SDL_Flip(screen) and then the whole thing repeats from the start
Uint32 *scr is just a pointer to the framebuffer (actually I use page flipping and make scr always point to the buffer which is currently not on the screen). So this function does nothing but make all pixels white. It works, yeah... but I've measured the frame rate... and it's about 35 fps, in 1024x768 32-bit.
How can this be? Where does all the CPU power go? Is it SDL that sucks? Is it the SDL-OS X-Port that sucks? Is it OS X that sucks? Or is it my code that sucks? If the last one is true, what am I doing wrong?
I'd appreciate helpful comments/links a lot, since I'm quite a beginner in demo-coding.
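One way to separate the cost of the loop from the cost of getting pixels onto the screen is to time the same fill against an ordinary malloc'd buffer. This is only a minimal sketch in plain C: `fill_white` and `fills_per_second` are hypothetical names, not from the post, and the buffer here is system RAM, not an SDL surface.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Fill `pixels` 32-bit pixels with white, like the test loop in the post. */
void fill_white(uint32_t *dst, size_t pixels)
{
    size_t i;
    for (i = 0; i < pixels; i++)
        dst[i] = 0xffffffffu;
}

/* Time `frames` fills of a width*height buffer in system RAM.
   Returns fills per second of CPU time, or -1.0 if it cannot be measured. */
double fills_per_second(size_t width, size_t height, int frames)
{
    size_t pixels = width * height;
    uint32_t *buf;
    volatile uint32_t sink = 0; /* read back so the fills are not optimized away */
    clock_t start, end;
    int f;

    if (pixels == 0 || frames <= 0)
        return -1.0;
    buf = malloc(pixels * sizeof *buf);
    if (buf == NULL)
        return -1.0;

    start = clock();
    for (f = 0; f < frames; f++) {
        fill_white(buf, pixels);
        sink = buf[pixels - 1];
    }
    end = clock();
    (void)sink;
    free(buf);

    if (end <= start)
        return -1.0; /* too fast for clock() resolution */
    return frames * (double)CLOCKS_PER_SEC / (double)(end - start);
}
```

If something like `fills_per_second(1024, 768, 200)` comes out far above 35, the loop itself is probably not the bottleneck, and the time is going into the flip or the copy to video memory.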
ohno more coders!
i think it's because of the page flipping. if it renders in 1/69th of a second you have to wait an extra vblank, so it will run at 35 fps
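The effect described here can be sketched in a few lines of C. This assumes a 70 Hz refresh rate, which is only a guess based on the roughly 70 fps reported at 640x480; `vsynced_fps` is a hypothetical helper, not an SDL function.

```c
/* Effective frame rate when every flip waits for the next vertical blank.
   render_seconds: time needed to draw one frame
   refresh_hz:     display refresh rate (70 Hz is an assumption in the example) */
double vsynced_fps(double render_seconds, double refresh_hz)
{
    double needed = render_seconds * refresh_hz; /* frame time in refresh periods */
    unsigned vblanks = (unsigned)needed;         /* truncate ...                  */

    if ((double)vblanks < needed)
        vblanks++;                               /* ... then round up             */
    if (vblanks == 0)
        vblanks = 1;                             /* a flip waits at least one vblank */
    return refresh_hz / vblanks;
}
```

With these assumptions, a frame that takes 1/69 s needs two refresh periods, so `vsynced_fps(1.0 / 69.0, 70.0)` gives 35, while anything faster than 1/70 s stays at 70. The frame rate falls in steps (70, 35, 23.3, ...) instead of degrading smoothly.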
wrong question. they will say: your code suxx :D
Do a bit of maths...
1024*768=786432
786432*4 bytes (32 bits)=3145728
So, exactly 3 megabytes of data for every screen, and, of course, it is not cached. So, at 35 fps, and if you are only writing it linearly, that's 105 megabytes of data per second. But you are painting into a framebuffer which is in RAM and not in VRAM. First, you are using 32-bit accesses, and you may have DDR RAM, which needs 64-bit (or wider) accesses to run at full speed. Then, remember that the data has to be copied from RAM to VRAM. So you can get some more fps if you code it as well as possible, but software rendering at 1024x768 is slow even on the fastest computers.
micronuke: I really like your question. And that's because we have similar worries in the way of the coder I guess :)
Well, you are talking about Macintosh. I don't know how things are there, but I'll talk about my PC experience.
I had exactly the same worry. When I moved from my old habits (DOS, QuickBasic, assembly, 8-bit) to something far more modern on the PC (C and SDL for me too), I had the vision that I could easily do optimized 1024*768 plasmas running at 200fps!!! Those were my dreams. But reality told me it's too much for PCs, no matter how powerful they seemed to me.
And yes, if I comment out the SDL_Flip(screen) line so that only the calculation remains, my 1024*768 software plasma also gets 170fps instead of only(?) 40fps when there is output. But that's enough for most. And 30 is quite fast for 1024*768. Also, I still miss the old drivers for my gfx card (I've lost the CD actually ;P), where I got double the speed I get today. I don't know why the new ATI drivers drop performance in that respect, but anyway, I've stopped caring much now..
It's probably a matter of the slow gfx/RAM bandwidth of PCs (does the same happen on Macs too?), but still, 30fps is quite good for such a big resolution, and I am sure your plasma is quite well optimized. If you are still crazy enough about extreme performance (and I like that), well..
.. I also tried the TinyPTC library. On PC, it uses some MMX code to speed up memory/VRAM writes. In some old project I got at least twice the performance of SDL. But now, strangely, with the new gfx drivers, the TPTC tests I just tried seem to show the same or a bit less speed than the same SDL tests.
So I just stopped caring and currently prefer SDL, because 100fps in 640*480 is enough and it also has many more functions for keys/sound/etc. than TPTC.
But I still liked your "hey, my 1024*768 plasma does only 30fps" caring. Most people don't care. I used to be a performance freak once, but I got disappointed by those strange things and now just try to use what works and optimize the algorithm. Perhaps you will find a good solution to this :)
optimus, use a movntq version of tinyptc instead of the movq one, works much better
Well, it could also be some vsync there, perhaps Ye_ti is right. But you can check that..
Also, I have an old question, perhaps a silly one. Why does this happen? What are the inner workings of those two libs that make the difference? Would that still mean I could gain that performance from somewhere?
I did tests with SDL and TPTC in the past.
1) On SDL, I got, let's say, 100fps. But while the effect was running (windowed), I put the pointer somewhere outside the window, on another icon in the bar, so that the yellow box with the description of another program appeared. When this happened, I saw a boost in fps, to 130fps for example.
2) When I move part or all of the SDL window outside the screen, the FPS climbs. It's as if SDL were clever enough to only spend time drawing the visible portion of the window.
These two things don't happen in TPTC, which was much faster in the old days. I asked about some of these things on an SDL mailing list, but nobody knew. And now that I've changed my gfx card drivers, TPTC shows equal or slightly worse performance than SDL, unlike before.
Sometimes on the PC, I guess I have to choose a good gfx library and forget about all this stuff. After all, FPS counts look different on different PCs :(
>optimus, use an movntq instead of movq version of tinyptc, works much better
Where can I find that? I'd still like to check if there will be performance increase even if I am stuck to SDL now which I like. Or should I just replace that in the asm code and recompile TPTC? Ugh..
These are performance things about the PC that matter to me, but I got tired of caring. The most extreme one? What is that P4 bug that causes a 256b intro to run at 30-70fps here on an Athlon but only 2-3fps on an Intel Pentium 4??? Such a huge performance difference between similar PCs? And why is this bug really happening, can anyone tell me? I am curious..
SDL already searches for the best blitting performance, see here
You can always call SDL_GetVideoInfo() to see what your video hardware's capabilities are (e.g. hardware blitting).
Well, not that I know much about that topic, but I'll give my two cents anyways.
PCI has a bandwidth of 100 MB/s. So in 1024x768x32 the max. framerate is 32. In 640x480x32 it is 81. This somewhat corresponds to your numbers.
I just wonder why AGP isn't used. AGP 1x has 0.5 GB/s (if I'm right). That would give you 159 FPS in 1024x768. Don't ask me why this isn't the case, maybe the back buffer can't be accessed through AGP? I have no idea.
I'd go for a hardware-accelerated backend anyway, as it will give you stretching, alpha blending, bilinear filtering and whatnot for free, plus a dynamic texture that (hopefully) can be accessed through AGP.
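The frame-rate figures in this post follow from dividing bus bandwidth by the size of one frame. A small C sketch of that arithmetic, taking the post's rough figures (100 MB/s for PCI, 0.5 GB/s for AGP) as given, not as exact bus specifications; `max_fps` is a hypothetical helper name.

```c
/* Upper bound on frames per second when every frame has to cross a bus
   with the given bandwidth. Returns 0.0 for an empty frame. */
double max_fps(double bus_bytes_per_second,
               unsigned width, unsigned height, unsigned bytes_per_pixel)
{
    double frame_bytes = (double)width * height * bytes_per_pixel;

    if (frame_bytes <= 0.0)
        return 0.0;
    return bus_bytes_per_second / frame_bytes;
}
```

Called as `max_fps(100e6, 1024, 768, 4)`, `max_fps(100e6, 640, 480, 4)` and `max_fps(500e6, 1024, 768, 4)`, this gives roughly 31.8, 81.4 and 158.9, which round to the 32, 81 and 159 quoted above. It is only a ceiling: it ignores bus overhead and anything else sharing the bus.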
Optimus: In ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf Intel writes "Deep pipeline to enable industry-leading clock rates for desktop PCs and servers" (page 2-7) and "Up to 126 instructions in flight" (2-8). A deep pipeline, and a new branch prediction unit to compensate for the huge misprediction losses. I suppose that's the problem. And they even admit that it is all just for the clock rate...
micronuke: remember to enter it into the Rushed Metal Demo Coding Competition once it's finished.
why optimize? what about writing optimal code instead? chuck norris can....
p = buffer;
buffEnd = buffer + width * height;
for ( ; p != buffEnd; ++p ) *p = 0xFFFFFFFF;
for ( ; p != buffEnd; *p++ = 0xFFFFFFFF);
while ( p < buffEnd ) *p++ = 0xGAY.
for(unsigned ecx=w*h;ecx--;*p++=~00);
__int64* p = (__int64 *)buffer;
unsigned ecx = w*h/2;
do{ *p++ = 0xFFFFFFFFFFFFFFFF; } while(--ecx);
i'm impressed
optimizing blitter functions in 2006 \o/
remind me not to code ever again :D
that's so 1998!
lea buf,a0
moveq #-1,d0
move.w #w*h/4/8-1,d7
.loop REPT 8
move.l d0,(a0)+
ENDR
dbf d7,.loop
\o/