phöng shading and clipping?

category: general [glöplog]

No more dents in Lenna's face ;)
BB Image

I'll try the Triage Mask shit at the weekend.

added on the 2009-02-13 11:28:18 by raer

I would suggest that you don't spend time and power on polygon intersections. Just do it manually when the demo is done :)

As for dynamically allocating tiles, I would definitely not do that either. Spending time checking masks and looking up tiles when drawing scanlines is not really worth it, is it?

If you are using 16 color tiles, you should have more than enough VRAM for a few layers, especially if you cheat (like us) and go 240x128 instead of 240x160.

added on the 2009-02-13 11:50:48 by Lord Graga

graga: I'm rendering triangles in 8x8 pixel blocks, not in scanlines. The algorithm is here. It might be slower than the traditional scanline approach, but I think it is quite elegant. The coverage detection loop (or outer loop) for the triangle is just adding/subtracting 3 numbers, that's it.

I won't allocate the tiles dynamically, I allocate a fixed amount. When I hit a filled 8x8 block, I can just skip it. If I hit a semi-filled one I can check exactly which pixels to draw. I think it might be worth being able to discard whole blocks of pixels if you do complicated stuff like texture mapping or gouraud shading...

How do you use 240x128? Do you set the BG2Px registers?

In 240x128 I'd need 480 bytes for the tiles array and ~2kB for the tile masks (253*8). Not much.
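A rough sketch of how that bookkeeping could look, assuming one state byte per 8x8 block (30x16 = 480 bytes) and a fixed pool of 253 coverage masks with one bit per pixel (8 bytes each). The names, the reserved values 0 and 255, and the lack of a fallback when the pool runs out are all assumptions made for illustration; the post only gives the sizes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TILES_X = 30, TILES_Y = 16, MAX_MASKS = 253 };
enum { TILE_EMPTY = 0, TILE_FULL = 255 };    /* values 1..253 are pool index + 1 */

static uint8_t tiles[TILES_X * TILES_Y];     /* 480 bytes of block states */
static uint8_t masks[MAX_MASKS][8];          /* 2024 bytes, one byte per block row */
static int     masks_used;

void coverage_reset(void) {
    memset(tiles, TILE_EMPTY, sizeof tiles);
    masks_used = 0;
}

/* want[] holds the pixels a triangle covers in block (tx,ty).
   out[] receives the pixels that are not covered yet and must be drawn.
   Returns 0 if nothing has to be drawn in this block. */
int coverage_claim(int tx, int ty, const uint8_t want[8], uint8_t out[8]) {
    uint8_t *state, *m;
    int r, any = 0, full = 1;

    assert(tx >= 0 && tx < TILES_X && ty >= 0 && ty < TILES_Y);
    state = &tiles[ty * TILES_X + tx];

    if (*state == TILE_FULL) return 0;       /* filled block: skip it entirely */
    if (*state == TILE_EMPTY) {
        assert(masks_used < MAX_MASKS);      /* sketch only: no fallback here */
        m = masks[masks_used++];
        memset(m, 0, 8);
        *state = (uint8_t)masks_used;        /* pool index + 1 */
    } else {
        m = masks[*state - 1];
    }
    for (r = 0; r < 8; r++) {
        out[r] = (uint8_t)(want[r] & ~m[r]); /* pixels still free in this row */
        m[r] |= want[r];
        any |= out[r];
        full &= (m[r] == 0xFF);
    }
    if (full) *state = TILE_FULL;            /* the mask slot is not recycled */
    return any != 0;
}
```

This only works as an occlusion test when polygons are submitted front to back, since whatever claims a pixel first keeps it.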

Quote:

I would suggest that you don't spend time and power on polygon intersections. Just do it manually when the demo is done :)

So first throw stuff at it and then fix the glitches? :D Ok, that works too...

Btw. How do you handle transparent polys?

added on the 2009-02-13 12:38:26 by raer

I use hardware-windows to border up the screen, like this:

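REG_WIN0H = 240;                                       // left edge 0 (high byte), right edge 240 (low byte)
REG_WIN0V = (160-BORDER_HEIGHT) | (BORDER_HEIGHT<<8);  // top = BORDER_HEIGHT, bottom = 160-BORDER_HEIGHT
REG_WININ = 0xFF;                                      // show all layers inside window 0
REG_WINOUT = 0;                                        // show nothing outside the window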

That will give you a border of BORDER_HEIGHT height in both top and bottom of the screen.

Our engine does not support transparent polygons, but it could easily be implemented for our 2x1 pixel filling algorithm at the cost of just 1 cycle/pixel by skipping 0'ed pixel values. I hope that makes sense to you (If it doesn't, my bad).

I can't wait to see how you are going to abuse 3d in tiled mode :). Is it for BP?

added on the 2009-02-13 14:48:01 by Lord Graga

Quote:

I hope that makes sense to you (If it doesn't, my bad).

Uhm. Not really :D I meant like alpha-blended polygons.

I'm not using tiled mode at all. I wanted to use mode 4. I just render triangles in 8x8 pixel blocks.

Quote:

Is it for BP?

Uhm. I hope so... I have this working more or less in fixed-point on PC, so I need to port it to GBA now. Let's see if I manage to do some small wild. There's also that Heaven7-project... :/

added on the 2009-02-13 15:12:30 by raer

Quote:

The algorithm is here. It might be slower than the traditional scanline approach, but I think it is quite elegant. The coverage detection loop (or outer loop) for the triangle is just adding/subtracting 3 numbers, that's it.

Btw, the way he checks whether a block covers the triangle completely (or not at all) does a lot of unnecessary work. There's no need to check all 4 corners of the block, you can determine which corner to test from the signs of the deltas alone (Warning about notation: my "dx" and "dy" correspond to "-DY12" and "DX12" in Nick's article, respectively): A block is completely outside a given halfspace if and only if max { C + dx*x + dy*y | (x,y) in block } <= 0, and to compute the max all you need to do is this (C-like pseudocode):

Code:dp = C + dx * (dx>=0 ? x1 : x0) + dy * (dy>=0 ? y1 : y0);

Which corner to use only depends on the edge function and is constant for the whole triangle, as are the differences between the different dot products for the corners, so you can reorganize matters a bit:

Code:

// during setup
int edgeMaxC = C + max(dx,0) * (q-1) + max(dy,0) * (q-1);
// during rasterization (per block)
dp = edgeMaxC + dx*x0 + dy*y0;
if(dp <= 0) reject_block(); // block completely outside this edge

The "fully-inside" test is the same with a min instead of a max, and boils down to another constant offset. The "dx*x0 + dy*y0" is the same for both, so that part only needs to be calculated once.

Presto, 6 muls 18 adds instead of 24 muls 24 adds for a block rejection test (and of course you can still do everything incrementally and shave off even more muls).
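A minimal sketch of the block test described above, written with the edge function in the form E(x,y) = C + dx*x + dy*y and "inside" meaning E > 0. The struct, the function names and the three-way result are made up for illustration; q is the block size and (x0,y0) the block's top-left pixel.

```c
#include <assert.h>

typedef struct {
    int dx, dy;   /* E(x,y) = c + dx*x + dy*y, inside where E > 0 */
    int max_c;    /* c plus the offset of the corner that maximises E */
    int min_c;    /* c plus the offset of the corner that minimises E */
} Edge;

enum { BLOCK_OUTSIDE, BLOCK_PARTIAL, BLOCK_INSIDE };

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Once per triangle and edge: fold the corner choice into two constants. */
void edge_setup(Edge *e, int c, int dx, int dy, int q) {
    assert(q > 0);
    e->dx = dx;
    e->dy = dy;
    e->max_c = c + imax(dx, 0) * (q - 1) + imax(dy, 0) * (q - 1);
    e->min_c = c + imin(dx, 0) * (q - 1) + imin(dy, 0) * (q - 1);
}

/* Once per block: one shared dot product per edge, then two compares. */
int classify_block(const Edge e[3], int x0, int y0) {
    int i, inside = 1;
    for (i = 0; i < 3; i++) {
        int base = e[i].dx * x0 + e[i].dy * y0;           /* shared by both tests */
        if (e[i].max_c + base <= 0) return BLOCK_OUTSIDE; /* best corner is outside */
        if (e[i].min_c + base <= 0) inside = 0;           /* worst corner is outside */
    }
    return inside ? BLOCK_INSIDE : BLOCK_PARTIAL;
}
```

In a real rasterizer the per-edge dot product would be stepped incrementally from block to block instead of being recomputed with multiplies.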

Another fun thing to try: Subtract 1 from all constants so the tests become ">=0" instead of ">0". That's a sign bit test, which leads to extra simplifications: You can do "(CX1|CX2|CX3) >=0" (yes, bitwise or) instead of "CX1>0 && CX2>0 && CX3>0", for example.
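A small sketch of that sign bit trick, assuming 32-bit two's complement edge values. The function names are hypothetical; bias() stands for the "subtract 1 from all constants" step that would happen once during setup.

```c
#include <assert.h>
#include <stdint.h>

/* Plain version: strict "> 0" on every edge value. */
int inside_and(int32_t cx1, int32_t cx2, int32_t cx3) {
    return cx1 > 0 && cx2 > 0 && cx3 > 0;
}

/* Setup step: v > 0 is the same as (v - 1) >= 0. */
int32_t bias(int32_t v) {
    assert(v > INT32_MIN);   /* keep the subtraction from overflowing */
    return v - 1;
}

/* Biased version: the OR of the three values has its sign bit set
   exactly when at least one of them is negative. */
int inside_or(int32_t bx1, int32_t bx2, int32_t bx3) {
    return (bx1 | bx2 | bx3) >= 0;
}
```

Both versions agree for every input as long as the second one is fed biased values, e.g. inside_or(bias(CX1), bias(CX2), bias(CX3)).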

For some more extra goodness: Since ">=0" is invariant under arithmetic right shifts, there's no need to perform everything with shifted coordinates to get subpixel accurate rasterization. You perform the subpixel correction once during setup, then do the shift right, and all other steps are pixel steps (e.g. you do "CX1 -= DY12" instead of FDY12 in the inner loop). Doesn't directly make anything faster in terms of arithmetic complexity, but it's worthwhile if you're tight on registers, if your multiplies are faster with small values, if you have limited word sizes, if you want high resolutions, or of course everything at once :).

Time to get back to work.

added on the 2009-02-13 15:46:27 by ryg

:D ryg.

I do some of that. I interpolate the 3 edge values instead of recalculating them (just 3 adds and an occasional shift) and I calculate the offset I need to add per edge for testing.

Code:


if ((CX1+offset[0]) > 0 && (CX2+offset[1]) > 0 && (CX3+offset[2]) > 0)  {
   if((CX1+offset[3]) > 0 && (CX2+offset[4]) > 0 && (CX3+offset[5]) > 0) {
      //fully inside
   }
   else {
      //partially inside
   }
}

Your OR trick is nice. Will try that.

added on the 2009-02-13 17:13:48 by raer

Quote:

Btw. How do you handle transparent polys?

Render all tris back-to-front in order OR if you are using zbuffers you just have to draw them after normal rendering, back-to-front. That will improve speed some since alpha is a bit more expensive.

added on the 2009-02-13 17:59:44 by thec

Yeah. I know how it would usually be done. The problem is that there's no real z-buffer.

Maybe I could get away with rendering every opaque poly in front of the first transparent poly front to back while filling my structure and then rendering the rest back-to-front.

added on the 2009-02-13 18:30:10 by raer

I also seem to render a bit differently than OpenGL. The software rendered triangle is one pixel right and above the one OpenGL draws:
BB Image

added on the 2009-02-13 19:18:00 by raer

GL sampling rules state that the triangle is sampled at the center of fragments, i.e. the edge functions for the bottom left pixel are evaluated at (0.5,0.5) if there's no multisampling. Same rules go for D3D10 and up. D3D up to version 9 had the "pixel centers" (i.e. sampling points) at integer coordinates instead, and that's the behavior that Nick was going for since he was writing a software D3D implementation at the time.
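To make the difference concrete, here is a small sketch that evaluates the same edge function under both sampling rules. It assumes 4 subpixel bits and an edge function of the form (x1-x2)*(y-y1) - (y1-y2)*(x-x1); the function names and the `centre` flag are invented for this example.

```c
#include <assert.h>

enum { SUB_BITS = 4, SUB_ONE = 1 << SUB_BITS };   /* assumed subpixel precision */

/* Edge function through (x1,y1)-(x2,y2), everything in subpixel units. */
long edge_at(long x1, long y1, long x2, long y2, long px, long py) {
    return (x1 - x2) * (py - y1) - (y1 - y2) * (px - x1);
}

/* 1 if pixel (ix,iy) lies on the positive side of the edge.
   centre = 0: sample at integer coordinates (D3D9 style)
   centre = 1: sample at the pixel centre, i.e. +0.5 (GL, D3D10 and up) */
int pixel_inside(long x1, long y1, long x2, long y2, int ix, int iy, int centre) {
    long half = centre ? SUB_ONE / 2 : 0;
    assert(centre == 0 || centre == 1);
    return edge_at(x1, y1, x2, y2,
                   (long)ix * SUB_ONE + half,
                   (long)iy * SUB_ONE + half) > 0;
}
```

For a vertical edge at x = 0.25 pixels (4 subpixel units) with the inside to its right, pixel 0 is rejected when sampled at x = 0 but accepted when sampled at x = 0.5, so the two rules can disagree along triangle edges.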

added on the 2009-02-13 19:49:26 by ryg

Ah, ok. Didn't realize that... Thanks.

added on the 2009-02-13 20:03:32 by raer

I like this method but the case where the tile is partially covered (in my case more than 75%) is so slow. It could be really nice with an mmx implementation but my amiga doesn't have this option :)

added on the 2009-02-13 21:03:58 by Jamie2009

Why is it so much slower then than the traditional approach?

added on the 2009-02-13 21:14:13 by raer

because with the traditional approach i don't need to check whether each pixel needs to be drawn or not. Btw i use a coverage buffer with the traditional approach and i compute 2 edges with one division (thanks Winden)

added on the 2009-02-13 21:19:11 by Jamie2009

I like this method but the case where the tile is partially covered (in my case more than 75%) is so slow. It could be really nice with an mmx implementation but my amiga doesn't have this option :)

added on the 2009-02-13 21:26:50 by Jamie2009

Quote:

I'm not using tiled mode at all. I wanted to use mode 4. I just render triangles in 8x8 pixel blocks.

Tile modes are actually slightly faster for that if you are rendering linearly from top to bottom of each 8x8 block (TILE!).
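One way to see why, as a sketch: in mode 4 the rows of an 8x8 block are a full scanline (240 bytes) apart, while in 8-bit tile data all 64 bytes of a tile are contiguous. The helpers below only compute byte offsets, under the assumption of one 8bpp tile per block; 16 color tiles would use 32 bytes per tile instead. The function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of pixel (x,y) in a linear 240x160, 8 bits per pixel bitmap. */
size_t mode4_offset(int x, int y) {
    assert(x >= 0 && x < 240 && y >= 0 && y < 160);
    return (size_t)y * 240 + (size_t)x;
}

/* Byte offset of pixel (px,py) inside 8bpp tile number `tile`. */
size_t tile8_offset(int tile, int px, int py) {
    assert(tile >= 0 && px >= 0 && px < 8 && py >= 0 && py < 8);
    return (size_t)tile * 64 + (size_t)py * 8 + (size_t)px;
}
```

Stepping down one row inside a block costs +240 in the bitmap but only +8 in tile data, so a block can be filled with a single running pointer.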

Seriously, it has so many possibilities that I feel I'm shooting myself in the foot just by telling you :P

added on the 2009-02-13 21:33:13 by Lord Graga

sorry for the double post:)

Tile rendering has some advantages: i like the fact that you don't need to compute a division for the edges, the clipping is so clean and easy, and you can implement a cool occlusion scheme. BUT this fucking case where you need to test each pixel kills the performance (on amiga in my case).

Now i use a special coverage buffer that makes very good use of the 8 KB of data cache.

added on the 2009-02-13 21:39:44 by Jamie2009

other tricks: if any edge is fully inside on the per-block test, there's no point evaluating it per pixel. making different versions of the inner loop for the 0 (i.e. fully inside)/1/2/3 "active" edge cases is probably a good idea.
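a sketch of the "active edge" idea, assuming edge functions of the form E(x,y) = c + dx*x + dy*y with inside meaning E > 0. the names are invented, and the generic loop below stands in for the specialised 0/1/2/3-edge loops you would actually write.

```c
#include <assert.h>

typedef struct { int c, dx, dy; } Edge;   /* E(x,y) = c + dx*x + dy*y */

/* Bit i is set if edge i is NOT fully inside for the q x q block at (x0,y0),
   i.e. if it still has to be tested per pixel. */
unsigned active_edges(const Edge e[3], int x0, int y0, int q) {
    unsigned mask = 0;
    int i;
    assert(q > 0);
    for (i = 0; i < 3; i++) {
        int mn = e[i].c + e[i].dx * x0 + e[i].dy * y0
               + (e[i].dx < 0 ? e[i].dx : 0) * (q - 1)
               + (e[i].dy < 0 ? e[i].dy : 0) * (q - 1);   /* minimum over the block */
        if (mn <= 0) mask |= 1u << i;
    }
    return mask;
}

/* Count covered pixels in the block, evaluating only the active edges. */
int count_covered(const Edge e[3], unsigned mask, int x0, int y0, int q) {
    int n = 0, x, y, i;
    for (y = y0; y < y0 + q; y++) {
        for (x = x0; x < x0 + q; x++) {
            int in = 1;
            for (i = 0; i < 3; i++)
                if (((mask >> i) & 1u) && e[i].c + e[i].dx * x + e[i].dy * y <= 0)
                    in = 0;
            n += in;
        }
    }
    return n;
}
```

a mask of 0 means the block is fully covered and can be filled without any per-pixel test; for a block that was not rejected, testing only the active edges gives the same coverage as testing all three.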

depending on how big your triangles are, you might want more than two levels (e.g. larrabee apparently uses three: 16x16, 4x4 and 1x1 pixel "blocks"). this and the block size is obviously something you want to experiment with (the nice thing being that changing this is easy).

interpolating attributes is another interesting bit. the edge functions you're computing are effectively scaled barycentric coordinates, so the sum of the three edge functions is always constant (twice the area of the triangle in subpixels). calculate the reciprocal of that once, and you can convert from your edge functions to "proper" (normalized) barycentric coordinates with a multiply; you can then use the barycentric coords for interpolation. mainly interesting if you have a fast multiplier or a lot of triangles that cover a very small number of pixels, where the setup overhead of a more conventional interpolation scheme wouldn't amortize.
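a minimal sketch of that conversion, assuming 16.16 fixed point for the reciprocal and non-negative edge values and attributes. function names and the 64-bit intermediates are choices made for this example (on a GBA you would want to keep everything in 32 bits).

```c
#include <assert.h>
#include <stdint.h>

enum { FRAC = 16 };   /* fractional bits of the reciprocal */

/* Once per triangle: reciprocal of e0+e1+e2 (twice the triangle area). */
int32_t bary_recip(int32_t area2) {
    assert(area2 > 0);
    return (int32_t)(((int64_t)1 << FRAC) / area2);
}

/* Per pixel: e0,e1,e2 are the three edge function values at that pixel,
   a0,a1,a2 the attribute values at the vertices opposite those edges. */
int32_t bary_interp(int32_t e0, int32_t e1, int32_t e2, int32_t recip,
                    int32_t a0, int32_t a1, int32_t a2) {
    int64_t w0 = (int64_t)e0 * recip;   /* normalised weights in 16.16 */
    int64_t w1 = (int64_t)e1 * recip;
    int64_t w2 = (int64_t)e2 * recip;
    return (int32_t)((w0 * a0 + w1 * a1 + w2 * a2) >> FRAC);
}
```

the three weight multiplies happen once per pixel; each additional attribute then costs three more multiplies. the result is only exact when the reciprocal is, so expect small rounding errors for areas that are not powers of two.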

actually, the relatively low triangle setup overhead of this algorithm is a big plus in general (plus it's much easier to get the fill convention etc. right).

as for mmx - what in particular do you think would be helpful here?

added on the 2009-02-13 21:44:59 by ryg

i'm totally ok with this, except for the rasterisation: you can optimise the pixel evaluation as much as you want, it will never be faster than the classical algo.

For mmx, the idea is to always fill the full tile, and when you write the tile you mask in only the visible pixels, so you will have more texel accesses but fewer comparisons and branches.
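A plain C sketch of that masked write, four 8-bit pixels at a time in a 32-bit word instead of real mmx registers. It assumes a little-endian layout where byte i of the word is pixel i; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Expand 4 coverage bits into four 0x00/0xFF byte lanes.
   A 16-entry lookup table would do the same without the loop. */
uint32_t expand4(unsigned bits) {
    uint32_t m = 0;
    int i;
    assert(bits < 16u);
    for (i = 0; i < 4; i++)
        if ((bits >> i) & 1u) m |= (uint32_t)0xFF << (8 * i);
    return m;
}

/* Shade all four pixels unconditionally, then keep only the visible ones. */
uint32_t masked_write4(uint32_t dst, uint32_t src, uint32_t mask) {
    return (dst & ~mask) | (src & mask);
}
```

The inner loop then has no per-pixel compare or branch; the price is shading pixels that end up being thrown away.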

added on the 2009-02-13 21:59:26 by Jamie2009

For interpolating attributes, yep, that's another question :) Personally i compute the gradients only for visible polygons (by visible i mean at least one pixel of the poly is visible), and yep, i need to compute the area.

Another interesting thing with the old method: i use polygons (with constant gradients of course) and not only triangles.

added on the 2009-02-13 22:02:55 by Jamie2009

Graga: Then I'll take a look at tile modes :)
What I had thought about was filling triangles with tiles. But I'm not sure how that'll work.

Jamie: You might be right about the testing. Probably you can't have best of both worlds...

added on the 2009-02-13 22:07:01 by raer

Well, if you are naturally splitting the screen into 8x8 tiles, then I don't see a reason why you shouldn't make use of the hardware mode which does *exactly* that.

Like I said, it has so many possibilities. Here's a freebie: You can render to sprites as well as tile layers.

added on the 2009-02-13 22:11:52 by Lord Graga

personally i believe in a hybrid version.

I'm trying to find a way to get perfect sorting with the tile approach and no zbuffer, because the amiga is so slow for this.

For the rasterisation i use the classical way with shared edges, and 2 edges in parallel. After this i have the coverage buffer pass, which computes the visible spans.

After that i sort the polygons by texture for better cache use.

And then the "shader" pass.

added on the 2009-02-13 22:17:37 by Jamie2009

Jamie: The problem on GBA is that there's really not much memory available. How does your tile approach work?

ryg: I get the idea with barycentric coordinates. Not bad actually. You'd need 3 multiplies per interpolated value. But you'd still need the reciprocal of w right? Not a good choice for GBA I think :/
Maybe one could do it tile-wise and linear interpolation in between...

added on the 2009-02-13 22:25:50 by raer
