ptc & pixeltoaster demo's

category: general [glöplog]

xteraco, I have very little time right now, but I'm happy to explain everything I know about oldschool democoding... but hey, let's use the pouet bbs for this!

Optimization of oldschool effects using new hardware is really fun. There are (at least) two levels of optimization: the theoretical level and the implementation level. Since the theoretical level is not machine dependent, it has always been the same, and nearly all (99.99999%) of what you are going to do has already been studied and done. The implementation, on the other hand, is hardware dependent, and on new computers it is really fun since we now have a lot of parallelization possibilities (SIMD and dual/quad cores). Mixing both and trying for the fastest possible result on your computer is... why demoscene coding is fun :D Before 3d gfx cards that was a lot of what it was about... now it is still important, but not as much. But coding software rendering means doing the same thing again... this is why I continue doing it. You will be surprised by the horsepower of new computers if coded correctly.

added on the 2007-02-07 14:48:30 by texel

ok,

#1. you dont have to use std::vector, just pass in an array and it works too, you can even compile std out completely if you dont want it

#2. texel: the float to int conversion actually does a lot more than just cast, it also scales [0.0f,1.0f] into [0,255] and clamps - if you are so smart, try looking at the code and work out how it does it - teehee ;)

#3 - sure, SSE can be a little bit faster, but i am so very lazy these days :)
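For readers who don't want to dig through the source: the snippet below is only a minimal sketch of what "scale and clamp" means for a float to 8-bit conversion. It is not PixelToaster's actual code (which is not quoted in this thread), and the names float_to_u8 and floats_to_xrgb are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one float channel to 8 bits: clamp to [0.0f, 1.0f],
   scale to [0, 255] and round to nearest. NaN maps to 0. */
static uint8_t float_to_u8(float v)
{
    if (!(v > 0.0f)) return 0;          /* negative, zero or NaN */
    if (v >= 1.0f)   return 255;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Convert n RGB float triples into packed 0x00RRGGBB pixels
   (hypothetical layout, chosen only for this example). */
static void floats_to_xrgb(const float *rgb, uint32_t *out, size_t n)
{
    assert(rgb != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        uint32_t r = float_to_u8(rgb[3 * i + 0]);
        uint32_t g = float_to_u8(rgb[3 * i + 1]);
        uint32_t b = float_to_u8(rgb[3 * i + 2]);
        out[i] = (r << 16) | (g << 8) | b;
    }
}
```

A real library would most likely do this branch-free or with SIMD; the sketch only shows the arithmetic.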

cheers

added on the 2007-03-18 06:17:12 by Gaffer

kusma, your framework looks fine, i'll have some fun someday ;) but not now, i have other works in the pipeline..

added on the 2007-03-18 07:12:14 by nystep

While speaking of software rendering, does anybody know a really fast way of box car blurring? I'm currently doing two passes, one horizontal and one vertical but it's rather slow :S Anyone interested in discussing that?

added on the 2007-03-18 12:07:09 by pailes_

if you're in software you can try out a Finite response filter technique kind of thing, you might get better results... just take advantage of the direction in which you process the data since you can decide it in software..

added on the 2007-03-18 12:45:09 by nystep

pailes: how slow? I get 111 fps at 640x480 box blurring (with performance independent of filter size). And no SSE yet!

-----------

I don't use any library when doing pixel based stuff... The only thing I need to dump a framebuffer to the screen in windows is this:

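#include <windows.h>

// hDC is assumed to be a valid window device context defined elsewhere,
// and int32 a typedef for a 32-bit integer.
void drawBuffer( int32 *buffer, int xres, int yres )
{
    // 32 bpp, uncompressed; the negative height selects a top-down bitmap
    BITMAPINFO bmi = { { sizeof(BITMAPINFOHEADER), xres, -yres, 1,
                         32, BI_RGB, 0, 0, 0, 0, 0 }, { 0, 0, 0, 0 } };

    StretchDIBits( hDC, 0, 0, xres, yres, 0, 0, xres, yres, buffer, &bmi,
                   DIB_RGB_COLORS, SRCCOPY );
}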

End of story. You _really_ have to work hard before this is your performance bottleneck.

As for the floating point buffer, I don't think it is very useful for the final framebuffer to be floating point. I mean, just multiplying by 255 and clamping is not a very good tonemapping... and tonemapping is what you need if you want your HDR to look nice. I suppose one should do the rendering in one's own internal fp buffer, and then tonemap to a 4 Bpp (bytes per pixel) framebuffer supplied by TinyPTC; I don't really see the use of fp as the final target. Maybe 10 bits per color component framebuffers, yes, even 12. But fp...
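To make the difference concrete, here is a small sketch comparing plain scale-and-clamp with a simple tonemapping curve. The Reinhard-style x / (1 + x) operator is my own pick for illustration, nobody in this thread named a specific operator, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to [0, 1], scale to [0, 255], round to nearest. */
static uint8_t unit_to_u8(float v)
{
    if (!(v > 0.0f)) return 0;
    if (v >= 1.0f)   return 255;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Plain scale-and-clamp: everything above 1.0 saturates to white. */
static uint8_t clamp_only(float hdr)
{
    return unit_to_u8(hdr);
}

/* Reinhard-style curve x / (1 + x): compresses [0, inf) into [0, 1),
   so bright values keep some gradation instead of clipping. */
static uint8_t tonemap_reinhard(float hdr)
{
    assert(hdr >= 0.0f);
    return unit_to_u8(hdr / (1.0f + hdr));
}
```

With clamp_only, intensities 1.0 and 3.0 both end up as 255; with the curve they stay distinguishable. A real pipeline would usually also apply exposure and gamma, which are left out here.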

Anyway, you only need one function as I said (not counting setting windows to fullscreen, but that you can copy from your current opengl framework). In Linux you just need XPutImage() + XFlush().

Or you can use glDrawPixels(). Yeah, it sounded lame to me too in the beginning... but on many systems it is just faster than any GDI/X function, and it works everywhere...

added on the 2007-03-18 13:49:50 by iq

iq: 111fps on what machine and at what pixel depth?

added on the 2007-03-18 16:44:41 by pailes_

pailes: a separated box blur filter of variable filter-size should run in linear speed with only one memory-read, one mul, one add, one sub and one memory-write per pixel per pass. On most systems, this should result in something less than 15 cycles per pixel per pass. If you're running on a cached system, it might be an advantage to actually flip the x and y axis on write-back so the y-pass doesn't kill cache-coherency.
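A minimal sketch of the pass described above, assuming a single 8-bit channel, edge pixels replicated at the borders, and a 16.16 fixed-point reciprocal in place of the divide. The function names and the border handling are my assumptions, not anybody's actual code from this thread.

```c
#include <assert.h>
#include <stdint.h>

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* One box blur pass along the rows of a w x h image.
   The result is written transposed (dst is h wide and w high), so calling
   the same routine a second time blurs the other axis and restores the
   original orientation. */
static void box_blur_pass_transposed(const uint8_t *src, uint8_t *dst,
                                     int w, int h, int radius)
{
    const int n = 2 * radius + 1;              /* window size */
    assert(w > 0 && h > 0 && radius >= 0 && n <= 257);
    /* rounded 16.16 reciprocal of n: the divide becomes mul + shift */
    const uint32_t inv = (65536u + (uint32_t)n / 2u) / (uint32_t)n;

    for (int y = 0; y < h; ++y) {
        const uint8_t *row = src + (long)y * w;
        int sum = 0;
        for (int i = -radius; i <= radius; ++i)  /* prime the running sum */
            sum += row[clampi(i, 0, w - 1)];
        for (int x = 0; x < w; ++x) {
            dst[(long)x * h + y] =
                (uint8_t)(((uint32_t)sum * inv + 0x8000u) >> 16);
            /* slide the window: add the entering pixel, drop the leaving one */
            sum += row[clampi(x + radius + 1, 0, w - 1)]
                 - row[clampi(x - radius, 0, w - 1)];
        }
    }
}
```

A full blur would then be box_blur_pass_transposed(img, tmp, w, h, r) followed by box_blur_pass_transposed(tmp, img, h, w, r). The cost per pixel does not depend on the radius, only the priming loop does.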

added on the 2007-03-18 17:26:32 by kusma

IQ, and while we're at it, if color fidelity is the goal, why do software rendering in RGB, rather than XYZ or something like the IPT colorspace?

added on the 2007-03-18 17:49:51 by _-_-__

knos: rendering in CIE XYZ is quite common in high-end HDR-rendering, it seems...

added on the 2007-03-18 18:49:43 by kusma

@kus ma bite:
Sounds pretty much like what I have been doing already (flip x/y axis on write back). Still I'm not happy with the performance of the result, although I had a little bit of an overhead because I've been filtering 16 bits (565) image data.

added on the 2007-03-18 21:46:00 by pailes_

once we ship mercs2 i'll be adding tonemapping support to pixeltoaster, aimed at hardware with floating point framebuffer support

it'll be off by default (slow to emulate!) but you'll have a software version available too, at request

cheers

added on the 2007-03-18 22:29:09 by Gaffer

pailes: 32 bpp, Athlon 3200 (32 bit mode). I use two passes like everyone. In the first one, horizontal, I write the output pixels vertically so I can use the same routine for the horizontal and vertical passes (and the cache works fine too in the second pass). As kus ma bite said, it's one add, one sub, one mul and one shift per pixel component. Another thing is that you can accumulate the red and blue components at the same time, and green afterwards, reducing the work from 3 to 2 operations (but you have some bitmasking to do). Ah, I also split the scanline loop in three blocks to avoid "if" statements for image boundary checks in every iteration, and I also unrolled the inner loop 4 times. I did not use SSE yet in this piece of code, but it's easy anyway. I also didn't try to _mm_prefetch data yet, I don't know if it will help.
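The red+blue trick could look roughly like the sketch below. It assumes 0x00RRGGBB pixels and a window of at most 257 pixels so each 16-bit lane of the packed sum cannot overflow. Only the add/sub of the running sum is done on the packed value; the normalisation is done per channel to keep the example readable, so this is probably not as tight as the routine described above. All names are made up.

```c
#include <assert.h>
#include <stdint.h>

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Box blur of one row of 0x00RRGGBB pixels (the top byte comes out as 0).
   Red and blue share one 32-bit running sum, 16 bits per lane;
   green has its own sum. */
static void box_blur_row_xrgb(const uint32_t *src, uint32_t *dst,
                              int w, int radius)
{
    const int n = 2 * radius + 1;
    assert(w > 0 && radius >= 0 && n <= 257);  /* 255 * 257 fits in 16 bits */
    const uint32_t inv = (65536u + (uint32_t)n / 2u) / (uint32_t)n;

    uint32_t sum_rb = 0, sum_g = 0;
    for (int i = -radius; i <= radius; ++i) {
        uint32_t p = src[clampi(i, 0, w - 1)];
        sum_rb += p & 0x00FF00FFu;
        sum_g  += (p >> 8) & 0xFFu;
    }
    for (int x = 0; x < w; ++x) {
        uint32_t r = ((sum_rb >> 16)     * inv + 0x8000u) >> 16;
        uint32_t g = (sum_g              * inv + 0x8000u) >> 16;
        uint32_t b = ((sum_rb & 0xFFFFu) * inv + 0x8000u) >> 16;
        dst[x] = (r << 16) | (g << 8) | b;

        uint32_t in  = src[clampi(x + radius + 1, 0, w - 1)];
        uint32_t out = src[clampi(x - radius, 0, w - 1)];
        /* one add and one sub update both the red and the blue lane;
           unsigned wrap-around cancels out because each lane's true
           sum stays within 0..65535 */
        sum_rb += (in & 0x00FF00FFu) - (out & 0x00FF00FFu);
        sum_g  += ((in >> 8) & 0xFFu) - ((out >> 8) & 0xFFu);
    }
}
```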

If you are using YUV instead of RGB, blurring only luminance (Y) is probably good enough ;)

added on the 2007-03-19 12:14:00 by iq

Working on red and blue in the same register and green in a separate one is always a neat trick. But hey, if you generate the input-buffers yourself you can in some cases avoid the masking entirely ;)

added on the 2007-03-19 12:21:59 by kusma

"if you can't find the problem, try looking elsewhere"

perhaps neither the filter nor the library is the bottleneck.

added on the 2007-03-19 12:35:59 by rasmus/loonies

"rendering in CIE XYZ is quite common in high-end HDR-rendering"
yep, and multispectral rendering (i.e. more than 3 primaries) is gaining popularity: you need a finer resolution and a range outside the visible spectrum to get (wavelength-dependent) refraction-, fluorescence-, and phosphorescence-based phenomena to look right: very relevant for e.g. car paints and cloth rendering, because detergents contain substances that transfer short-wavelength (UV) light into the visible spectrum to get a more glowing white ("whiter than white", anyone? :)

added on the 2007-03-19 13:01:47 by ryg

@iq & kusma: Thanks for the hints, I'll give it a go again. I just don't feel like optimizing on the assembly level because the code should run on multiple CPU architectures and I'm pretty lazy ;)

added on the 2007-03-19 15:48:33 by pailes_

pailes: as long as you give your compiler a good chance to optimize for you, you usually get close enough to optimal speed. The tricky part is unrolling in C without losing readability, and generating LDM/STM sequences on platforms that support those. Inspecting the assembly output from the compiler is often nice to make sure you're doing the right thing :)

added on the 2007-03-19 15:53:27 by kusma

for multispectral rendering, I think there are two approaches. You either keep the complete spectrum with each ray (and not just 3 samples - rgb/xyz) or you give only one spectral sample to each ray (the final color is achieved by mere accumulation). Even if slower, I find the second one more elegant and simple to implement (especially for those wavelength dependent BRDFs). I wonder how maxwell renderer is doing...

added on the 2007-03-19 15:56:28 by iq

pailes: as iq said, the clincher is the vertical pass. getting a fast horizontal pass is easy, but a naive implementation of the vertical pass kills the cache completely and speed goes down the drain. this is not "10% slower" or sth like that, more like "takes 6x as long as it should". if you have enough registers, you can simply process 4 (or even 8) columns in parallel, which already helps a lot.
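A rough sketch of the "several columns in parallel" idea, assuming a single 8-bit channel and replicated borders. Four adjacent columns are blurred together so every row visit touches consecutive bytes of one cache line instead of a single byte. COLS = 4 and all names are illustrative choices, not taken from anyone's code here.

```c
#include <assert.h>
#include <stdint.h>

enum { COLS = 4 };  /* columns processed together; an arbitrary choice */

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Vertical box blur of a w x h single-channel image, no transposition. */
static void box_blur_vertical(const uint8_t *src, uint8_t *dst,
                              int w, int h, int radius)
{
    const int n = 2 * radius + 1;
    assert(w > 0 && h > 0 && radius >= 0 && n <= 257);
    const uint32_t inv = (65536u + (uint32_t)n / 2u) / (uint32_t)n;

    for (int x0 = 0; x0 < w; x0 += COLS) {
        const int cols = (w - x0 < COLS) ? (w - x0) : COLS;
        int sum[COLS] = { 0 };

        /* prime one running sum per column */
        for (int i = -radius; i <= radius; ++i) {
            const uint8_t *row = src + (long)clampi(i, 0, h - 1) * w + x0;
            for (int c = 0; c < cols; ++c) sum[c] += row[c];
        }
        for (int y = 0; y < h; ++y) {
            const uint8_t *in  = src + (long)clampi(y + radius + 1, 0, h - 1) * w + x0;
            const uint8_t *old = src + (long)clampi(y - radius, 0, h - 1) * w + x0;
            uint8_t *out = dst + (long)y * w + x0;
            for (int c = 0; c < cols; ++c) {
                out[c] = (uint8_t)(((uint32_t)sum[c] * inv + 0x8000u) >> 16);
                sum[c] += in[c] - old[c];
            }
        }
    }
}
```

With enough registers the inner loop over c can be fully unrolled by the compiler, and a wider strip (8 or 16 columns) maps naturally onto SIMD.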

transposing the image in between as iq mentioned also works beautifully. you can either just write out "vertically spaced" samples directly (whether this is a good idea very much depends on your cache architecture), do it completely separately, or interleave it with the blurring and do it every few lines (preferably enough to be writing whole cache lines for every horizontal strip of pixels that's being output).

transposing is a lot more convenient than doing 2 separate loops (especially if you're writing a 64k where the extra code counts :), has nice performance and can be implemented in a quite machine-independent way (the only assumption being that the machine HAS a cache :), so i'd go for that.

added on the 2007-03-19 19:10:54 by ryg

Rubicon 2 (win32 port) uses PTC

added on the 2007-03-19 20:28:56 by bdk

ok seems i have some optimisation to do ;-) .. what about gpu blurs guys?

added on the 2007-03-19 21:23:14 by nystep

ryg: if i understood him correctly, he already does the transposing on write-back...

added on the 2007-03-19 23:27:49 by kusma

@ryg: Nice to see I'm not completely stupid, as I've already been transposing the image on the write back from the horizontal pass :)
@kusma:
If you guys are interested we can move to some forum so I can post my code and we can discuss it further ;)

I've just looked at the code again and I've realized that I've been doing all the r/g/b components of the 16 bit color with separate mul/add/subs for clarity, and I've never tried to optimize this further as I was convinced the cache would be the bottleneck :S

added on the 2007-03-20 10:52:13 by pailes_

the right thing to do is use a profiler to find out where the bottleneck is :)

added on the 2007-03-20 11:11:27 by ryg
