Sorting particles (for Flash Molehill)
category: code [glöplog]
This seems to be an issue discussed every now and then, but I'll poke at it again anyway, because of new platform constraints.
I'm trying to figure out a fast way of doing alpha-blended particles on Flash 11 "Molehill", and I just realized that I actually have to presort the particles before inputting them to the shaders.
Now what I don't have:
- CPU speed: slow as molasses (AS3) compared to anything on the GPU side.
- Texture readback: Only by rendering to the backbuffer, requesting a copy and reading it pixel by pixel using BitmapData#getPixel32, with premultiplied alpha of course! SLOW
- Textures as data input: Nope, vertex shaders don't allow texture sampling
What I'm attempting to do is depth of field on the particles, so my first idea was texture atlasing 8x16 images of the particle, blurred by different amounts and scaled back to a uniform size. This works if I get the sorting to work. Currently it draws around 300-500k particles when using a trivial blend mode (1->1).
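Roughly what I mean, as a quick AS3 sketch (focusZ / focusRange and the exact atlas layout are made-up names, just for illustration); the resulting offset goes into the vertex data, since there's no vertex texture fetch:
Code:
// Map a particle's view-space depth to a cell in the 8x16 atlas of
// pre-blurred sprites. The offset is written into the per-vertex data,
// because the vertex shader can't sample textures.
function blurAtlasOffset(viewZ:Number, focusZ:Number, focusRange:Number, out:Vector.<Number>):void
{
    // 0 = in focus, 1 = maximally blurred
    var blur:Number = Math.min(Math.abs(viewZ - focusZ) / focusRange, 1.0);
    var cell:int = int(blur * 127);   // 8 x 16 = 128 blur levels in the atlas
    out[0] = (cell & 7) / 8.0;        // u offset (8 columns)
    out[1] = (cell >> 3) / 16.0;      // v offset (16 rows)
}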
My other plan is to just forget alpha blending, do alpha-kill in the fragment shader and then use the depth buffer to blur the image by different amounts. This probably causes weird bleeding, but it could be fast. It also limits the particle graphics I can use somewhat.
If I do sorting, I need to transform all particles, not just the emitter, to view space every frame just for the sort? This also means there is little chance of doing a stateless particle system on the GPU, so I get to calculate the positions on the CPU as well, dropping my particle count back to around 30-40k particles :F
Oh, I could use a volume sort and just move the particles in a more limited way, but that requires 48 index lists? (Oh, and a single index buffer can contain only 65k indices, so you need to batch your draws; same thing with vertex buffers, just 65k vertices.)
I haven't read your problem thoroughly, but check out smash's blog (directtovideo.wordpress.com); I think he has written something about particle sorting there once.
Have you looked at PixelBender to run simple shaders on the CPU?
http://life.neophi.com/danielr/2008/11/sorting_out_pixels_or_how_i_di.html
http://www.adobe.com/devnet/pixelbender/articles/animating-particle-system.html
I wasn't aware MoleHill was out. Do we really need that now that only Internet Explorer has no WebGL support?
you could drop the sorting and go for 50% / 50% blend - that's what's done in most demos up until 2007
xernobyl: well, there's WebGL support, and then there is "WebGL support" - http://caniuse.com/#search=webgl
Yeah, Molehill is out, as an incubator release:
http://labs.adobe.com/technologies/flashplatformruntimes/incubator/
The proper shading language is still in such a preview state that I trust my AGAL bytecodes and write in AGAL assembler; more control.
Yeah, I could sort the particles and animate them in Pixel Bender. I spent yesterday night porting a floating-point radix sort to an AS3 + TDSI combo. It sorts about 70k particles in 20ms -> 40fps if nothing else eats any time.
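The core of it looks roughly like this (simplified sketch with plain Vectors instead of the TDSI/Alchemy memory tricks; keys[] is assumed to already hold a depth-monotonic uint encoding):
Code:
// Indirect LSD radix sort: 4 passes of 8 bits over uint depth keys, carrying
// particle indices as the payload. keys[] must already be a monotonic uint
// encoding of view-space depth (the real code sorts the raw float bits with
// the usual sign-flip trick).
function radixSortIndices(keys:Vector.<uint>, indices:Vector.<uint>):Vector.<uint>
{
    var n:int = indices.length;
    var tmp:Vector.<uint> = new Vector.<uint>(n, true);
    var count:Vector.<int> = new Vector.<int>(256, true);
    var i:int;
    for (var shift:int = 0; shift < 32; shift += 8)
    {
        for (i = 0; i < 256; ++i) count[i] = 0;
        // histogram of the current byte
        for (i = 0; i < n; ++i) count[(keys[indices[i]] >>> shift) & 0xFF]++;
        // exclusive prefix sum -> start offset of each byte value
        var sum:int = 0;
        for (i = 0; i < 256; ++i)
        {
            var c:int = count[i];
            count[i] = sum;
            sum += c;
        }
        // stable scatter of the indices
        for (i = 0; i < n; ++i)
        {
            var idx:uint = indices[i];
            tmp[count[(keys[idx] >>> shift) & 0xFF]++] = idx;
        }
        // ping-pong: after an even number of passes the result is back in indices
        var t:Vector.<uint> = indices;
        indices = tmp;
        tmp = t;
    }
    return indices;
}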
Perhaps it's possible to use multisampling + alpha-to-coverage functionality to do the effect?
kusma: is that the technique that requires DX, where you have good control over the mask? (I was looking into alpha-to-coverage and how it works in OpenGL vs. DX, and it seemed that OpenGL gives much less control over the mask.) Just curious.
hyde: No, it's supported in OpenGL through ARB_multisample. glEnable(GL_SAMPLE_ALPHA_TO_COVERAGE_ARB) and glEnable(GL_SAMPLE_ALPHA_TO_ONE_ARB), and you should be good to go.
Through GL_ARB_multisample, that is.
You can avoid much of the sorting overhead (in particular memory traffic) by using a plain bucket sort where you fold the first pass into the particle simulation and the second pass into the upload to the vertex buffer. It is not perfect, as particles that end up in the same bucket will not be sorted relative to each other, but if you have enough buckets, it will work just fine. ;)
Thus:
First pass:
For each particle:
- Calculate particle position
- Calculate appropriate bucket for particle
- Put particle directly into bucket
Second pass:
For each bucket:
- Copy bucket contents to vertex buffer
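A minimal AS3 sketch of the idea (the fixed z-range and the names are just assumptions; the first function is what you fold into your simulation loop, the second into the vertex buffer upload):
Code:
const NUM_BUCKETS:int = 256;
var buckets:Vector.<Vector.<int>> = new Vector.<Vector.<int>>(NUM_BUCKETS, true);

function initBuckets():void
{
    for (var i:int = 0; i < NUM_BUCKETS; ++i) buckets[i] = new Vector.<int>();
}

// first pass: call this at the end of each particle's simulation step
function bucketParticle(index:int, viewZ:Number, zNear:Number, zFar:Number):void
{
    var b:int = int((viewZ - zNear) / (zFar - zNear) * NUM_BUCKETS);
    if (b < 0) b = 0; else if (b >= NUM_BUCKETS) b = NUM_BUCKETS - 1;
    buckets[b].push(index);
}

// second pass: walk the buckets back-to-front while filling the vertex buffer data
function emitSorted(writeParticle:Function):void
{
    for (var b:int = NUM_BUCKETS - 1; b >= 0; --b)
    {
        var list:Vector.<int> = buckets[b];
        for (var i:int = 0; i < list.length; ++i)
            writeParticle(list[i]);
        list.length = 0;   // empty the bucket for the next frame
    }
}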
Blueberry, shouldn't that include a pass to find the z-range first (which can often be optimized away)? For a 4k you might just want to generate your particles in some pre-defined space, but if code size isn't a factor, I doubt that's worth it...
what about this: http://developer.download.nvidia.com/SDK/10.5/opengl/src/dual_depth_peeling/doc/DualDepthPeeling.pdf
Also, Molehill != OpenGL; its closest feature set is OpenGL ES 2.0 (with an almost analogous API). The shaders are pretty much SM2.x stuff.
Also: 256 instruction limit..
Quote:
I wasn't aware MoleHill was out. Do we really need that now that only Internet Explorer has no WebGL support?
I was going to say of course we need even more proprietary standards, security holes + crashes from Adobe, but this morning I see a bunch of security advisories recommending that WebGL be disabled because there are holes in it. Wake me up when it's all fixed.
Kusma: Optimizing the z-range will give a better quality sort of course, but adding another pass adds more memory traffic, which is easily the bottleneck in a sort like this. How were you thinking of optimizing it away?
You could keep track of the z range in the first pass and use it to restrict the range of buckets traversed in the second pass, but traversing empty buckets is quite cheap if you keep a separate list of the number of particles in each bucket, so I doubt you win much.
It all depends on how flexible the renderer needs to be, of course.
jalava: Depth peeling is quite costly (on the GPU) unless you implement some clever early-outs.
Even OpenGL ES supports alpha-to-coverage, though. So don't rule it out for that reason, check if Molehill can do it ;)
Another alternative is to do a dithered alpha test and post-process blur it to get rid of the dithering, but I doubt you'll get that looking good. You'd need a wide filter to account for the 1-bit alpha, and then you'd lose too many relevant high frequencies.
If the particle placement is static, you can partition the particles into a 3D grid and pre-sort the particles in each grid node along a set of directions (+/-xyz is probably enough). Traverse the grid nodes in sorted order (you can look at the relationships between the x, y and z components and walk from the node closest to the camera), and render the pre-sorted order that best fits the view direction.
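Picking the per-node list is then just something like this (sketch; the six-list layout is an assumption):
Code:
import flash.geom.Vector3D;

// lists[0..5] hold particle indices pre-sorted along +x, -x, +y, -y, +z, -z
function pickPresorted(lists:Vector.<Vector.<uint>>, viewDir:Vector3D):Vector.<uint>
{
    var ax:Number = Math.abs(viewDir.x);
    var ay:Number = Math.abs(viewDir.y);
    var az:Number = Math.abs(viewDir.z);
    if (ax >= ay && ax >= az) return lists[viewDir.x > 0 ? 0 : 1];
    if (ay >= az)             return lists[viewDir.y > 0 ? 2 : 3];
    return lists[viewDir.z > 0 ? 4 : 5];
}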
Blueberry: my point was that you MUST have a range, otherwise you have an infinite number of buckets. If you know that you only generate particles within a range, then that's a fine optimization. And it's probably better to use a fixed range than to simply change the bucket positions dynamically, since that can and will lead to popping (this can of course be combated in other ways as well).
running a particle sort on the gpu relies on vertex texture fetch to read the results, which you don't have apparently, so it's time for plan b..
you can render single points with z-testing to a large render target, then downsample and blur that buffer out to give soft particles. to do it right you need to keep depths and sort the pixels with a small bucket sort. that gives you a pretty big shader but it does work, i've tried it. :) of course it means you can't texture the particles, though. works best with small particles.
alpha to coverage can be mimicked using a large render target and dither, and then a bilateral filter to generate the final result. you can always do multiple passes with different dither patterns and blend the results together. it also works, i've tried it too.. :) it just gets slow because of the fillrate hammering, but it's a nice technique because you can deferred-light the buffers too.. works best with relatively solid particles. if your particles don't move too much, use temporal reprojection of the previous frame with a varying dither pattern per frame - it might work. :)
i've got screenshots of most of those techniques somewhere.. (i have proof!) :)
if you want to sort properly, don't forget sorting is spatial in 2d too - who cares about sorting a particle against another one on the other side of the screen? so you can put the particles into 2d buckets (tiles) in screen space, and then sort the particles in each bucket with something slower like a radix (or even insertion) sort on the cpu.
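rough as3 sketch of that (tile counts and the data layout are made up):
Code:
// bin particle indices into screen tiles, then depth-sort each tile's
// (usually short) list with an insertion sort.
const TILES_X:int = 16;
const TILES_Y:int = 9;

function sortPerTile(sx:Vector.<Number>, sy:Vector.<Number>, depth:Vector.<Number>,
                     width:Number, height:Number):Vector.<Vector.<int>>
{
    var tiles:Vector.<Vector.<int>> = new Vector.<Vector.<int>>(TILES_X * TILES_Y, true);
    var t:int;
    for (t = 0; t < tiles.length; ++t) tiles[t] = new Vector.<int>();

    for (var i:int = 0; i < depth.length; ++i)
    {
        var tx:int = int(sx[i] / width * TILES_X);
        var ty:int = int(sy[i] / height * TILES_Y);
        if (tx < 0) tx = 0; else if (tx >= TILES_X) tx = TILES_X - 1;
        if (ty < 0) ty = 0; else if (ty >= TILES_Y) ty = TILES_Y - 1;
        tiles[ty * TILES_X + tx].push(i);
    }

    // back-to-front (largest depth first) insertion sort inside each tile
    for (t = 0; t < tiles.length; ++t)
    {
        var list:Vector.<int> = tiles[t];
        for (var j:int = 1; j < list.length; ++j)
        {
            var cur:int = list[j];
            var k:int = j - 1;
            while (k >= 0 && depth[list[k]] < depth[cur]) { list[k + 1] = list[k]; --k; }
            list[k + 1] = cur;
        }
    }
    return tiles;
}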
if you are sorting on the cpu, use temporal coherence - split the sort over n frames and use an odd-even merge sort.
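simplest version of the temporal idea - one odd-even transposition pass per frame instead of the full merge sort (sketch):
Code:
// with slow-moving particles the order converges over a few frames and then
// stays roughly correct.
var oddPhase:Boolean = false;

function transpositionPass(order:Vector.<int>, depth:Vector.<Number>):void
{
    for (var i:int = oddPhase ? 1 : 0; i + 1 < order.length; i += 2)
    {
        // back-to-front: the larger depth should come first
        if (depth[order[i]] < depth[order[i + 1]])
        {
            var t:int = order[i];
            order[i] = order[i + 1];
            order[i + 1] = t;
        }
    }
    oddPhase = !oddPhase;
}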
also check out bitonic sort, which is one of the fastest parallel sorts. on the ps3 spu i use a bucket sort to split the particles up roughly and then a bitonic sort inside each bucket to make it accurate.
of course the best way is to work with the system / fx you've got - if you don't have much particle movement you can probably emit in roughly sorted order. also a bit of inaccuracy is usually ok as long as it doesn't flicker badly.
smash:
About your first "plan b" suggestion: he's trying to do DOF particles, so texturing the particles is kind of a must.
The point about 2D overlap is a good one, but it's not as easy as you make it out to be. Because the particles have a size, particles at the border between two bins can cause problems. Also, even though two particles do not overlap, they can be tied to a specific sort order by a third particle that overlaps both, or even a chain of particles that overlap each other. Putting them in multiple bins and enabling scissoring around the bins should fix these problems, though.
kusma: I was thinking about the function glSampleCoverage(..) and how GL and DX differ in specifying the bitmask for alpha-to-coverage. In glSampleCoverage() you supply a clamped float that "specifies a single floating-point sample coverage value", whereas in DX you get to specify the bitmask explicitly.
When reading your initial post, I was wondering what algorithm you were thinking of to solve the OP's problem and if it depended on DX in a serious way.
There's an algorithm (in one particular paper that I cannot seem to find right now) that uses this bitmask in a non-trivial way. I was hoping you would link to that article and explain how one would go about implementing the same thing in GL :)
Hyde: I'm not sure which algorithm you're referring to, but what I suggested is simply to enable two states and have the GPU do the work for you (generate a coverage mask with the number of covered samples proportional to the alpha, and override the alpha to 1). There are a bunch of white papers and articles on it out there, just use Google.
kusma: ok, it wasn't the thing I was thinking of, but thanks.