phöng shading and clipping?

category: general [glöplog]

http://www.jeuxvideo.fr/odema-un-concurrent-pour-rayman-actu-24257.html

i liked to code on gba, it's really cool:) Where is your texture data? In the fast memory?

One thing I use that could be really nice on GBA is VQ texture compression: it reduces the size of my textures by a factor of 3. You get one more indirection in your texture inner loop, but you save a lot of memory.
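To make the extra indirection concrete, here is a rough sketch of a VQ texel fetch. Everything in it is an assumption for illustration, not the layout used in the engine mentioned above: 8-bit paletted texels, 2x2 texel blocks, a 256-entry codebook, and the names `VqTexture`, `vq_fetch` and `vq_size`.

```c
#include <stdint.h>

/* Hypothetical layout: 8-bit texels, 2x2 blocks, 256 codewords.
   None of these sizes come from the thread. */
enum { TEX_W = 64, TEX_H = 64, VQ_BLOCK = 2, VQ_CODES = 256 };

typedef struct {
    /* each codeword holds the 4 texels of one 2x2 block */
    uint8_t codebook[VQ_CODES][VQ_BLOCK * VQ_BLOCK];
    /* one codeword index per 2x2 block of the texture */
    uint8_t indices[(TEX_H / VQ_BLOCK) * (TEX_W / VQ_BLOCK)];
} VqTexture;

/* Fetch one texel: read the block's codeword index first, then the
   texel inside that codeword. The second read is the extra indirection. */
static uint8_t vq_fetch(const VqTexture *t, unsigned u, unsigned v)
{
    unsigned code = t->indices[(v / VQ_BLOCK) * (TEX_W / VQ_BLOCK) + (u / VQ_BLOCK)];
    return t->codebook[code][(v % VQ_BLOCK) * VQ_BLOCK + (u % VQ_BLOCK)];
}

/* Compressed size in bytes: index map plus codebook. */
static unsigned vq_size(void)
{
    return (unsigned)sizeof(VqTexture);
}
```

With these assumed sizes a 64x64 texture takes 2048 bytes instead of 4096. The actual ratio depends on the block size, the codebook size and whether several textures share one codebook.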

For perfect sorting, my idea is to use large tiles and sort the polygons inside each tile; afterwards I merge all the tile results into one perfectly sorted list.

It's a more or less a dream for me to have a perfect sorting without zbuffer

added on the 2009-02-13 22:34:05 by Jamie2009

Quote:

http://www.jeuxvideo.fr/odema-un-concurrent-pour-rayman-actu-24257.html

i liked to code on gba, it's really cool:)

Were you involved in that one? Looks nice!

Quote:

Where is your texture data? In the fast memory?

To be honest: At the moment I don't know. :D I'm still learning...

Quote:

It's a more or less a dream for me to have a perfect sorting without zbuffer.

Oh yes, sir! ;) Not easy though.

added on the 2009-02-13 22:45:04 by raer

I coded the engine and the game, it was not released :(

32 KB seems so little for texture storage.

added on the 2009-02-13 22:47:44 by Jamie2009

Yeah. Considering that, compression might be a good idea, also because you could put the codewords into tiles.

added on the 2009-02-13 22:52:15 by raer

rarefluid: two multiplies per interpolated value, actually (vert[0].attr + u*(vert[1].attr-vert[0].attr) + v*(vert[2].attr-vert[0].attr)), and yeah, you need the reciprocal of w if you want to interpolate with perspective correction.
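A small sketch of that formula, assuming u and v are the screen-space barycentric weights of vert[1] and vert[2]. The function names `interp_affine` and `interp_persp` are made up for this example, and floats are used only to keep it short (on a GBA this would be fixed point).

```c
/* Affine interpolation: a0 + u*(a1 - a0) + v*(a2 - a0).
   The two differences are per-triangle constants, so this costs
   two multiplies per interpolated value. */
static float interp_affine(float a0, float a1, float a2, float u, float v)
{
    return a0 + u * (a1 - a0) + v * (a2 - a0);
}

/* Perspective-correct version: interpolate attr/w and 1/w linearly
   in screen space, then divide to recover the attribute. */
static float interp_persp(float a0, float a1, float a2,
                          float w0, float w1, float w2,
                          float u, float v)
{
    float num = interp_affine(a0 / w0, a1 / w1, a2 / w2, u, v);
    float den = interp_affine(1.0f / w0, 1.0f / w1, 1.0f / w2, u, v);
    return num / den;
}
```

Halfway along the edge from vert[0] (attr 0, w 1) to vert[1] (attr 8, w 2), the affine result is 4 while the perspective-correct one is 2/0.75, roughly 2.67.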

the main case where this is interesting is when you have hw support for dot products or multiply-accumulate. fewer values to interpolate per pixel and a very straightforward, easy-to-parallelize dataflow.

added on the 2009-02-13 23:14:31 by ryg

ryg: I read about the edge inside thing again for partially filled blocks. I don't think there's much benefit, because there's some checks and setup to do.
Maybe dynamic compilation? GBA has no cache anyway.

added on the 2009-02-13 23:28:12 by raer

ryg: might be nice with SSE.

The possibilities are really endless :) but time to try them is sadly limited ;)

added on the 2009-02-13 23:31:56 by raer

what checks and setup?

Code:


if(((CX1+off0) | (CX2+off1) | (CX3+off2)) >= 0)
{
  int variant = ((CX1+off3 >= 0)<<2) + ((CX2+off4 >= 0)<<1) + ((CX3+off5 >= 0)<<0);

  switch(variant)
  {
  case 0:
    // need to test all three edge functions (small triangle)
    break;

  case 1: case 2: case 4:
    // only two edge functions to test
    // permute the variables accordingly (into temps), then jump
    // into common inner loop
    break;

  case 3: case 5: case 6:
    // one edge function to test, works similar to above three cases
    break;

  case 7:
    // block fully covered, don't need any edge functions at all
  }
}

In the barycentric rasterizer design, you always need to have at least two edge functions ready so you have the barycentric coords (you don't need to do any per-pixel tests on them, though).

added on the 2009-02-14 00:00:42 by ryg

Why don't you need to do per-pixel tests? In cases 0,1,2,3,4,5,6 you do need them, no?

I think I'm missing something. Btw, I like the 3-OR optimisation.

added on the 2009-02-14 00:18:33 by Jamie2009

there's no per-pixel tests only if the block is completely in (ie case 7). case 0 is the "regular" one where you have the three half-edge functions, cases 1,2,4 are equivalent and only do per-pixel tests for two of the three functions (these cases are taken if and only if there's two edges intersecting a given block), cases 3,5,6 are the same with just one per-pixel test for one single edge function (just one edge intersecting the block) and 7 is the trivial one where none of the edges intersects the block.
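The classification can be sketched as a standalone function. This assumes that off0..off2 are the offsets to the block corner where each edge function is largest (trivial reject) and off3..off5 the offsets to the corner where it is smallest (trivial accept); the thread does not spell that out. `Edge`, `max_off`, `min_off` and `classify_block` are hypothetical names, and the sign-bit trick assumes two's complement ints.

```c
enum { BLOCK_SIZE = 8 };

/* One edge function relative to the block origin:
   e(x, y) = c + dx*x + dy*y, pixel is inside when e >= 0. */
typedef struct { int c, dx, dy; } Edge;

/* Offset to the block corner where the edge function is largest. */
static int max_off(const Edge *e)
{
    return ((e->dx > 0 ? e->dx : 0) + (e->dy > 0 ? e->dy : 0)) * (BLOCK_SIZE - 1);
}

/* Offset to the block corner where the edge function is smallest. */
static int min_off(const Edge *e)
{
    return ((e->dx < 0 ? e->dx : 0) + (e->dy < 0 ? e->dy : 0)) * (BLOCK_SIZE - 1);
}

/* Returns -1 if the block is entirely outside some edge, otherwise
   the variant 0..7 with one bit per edge that needs no per-pixel test. */
static int classify_block(const Edge e[3])
{
    int hi0 = e[0].c + max_off(&e[0]);
    int hi1 = e[1].c + max_off(&e[1]);
    int hi2 = e[2].c + max_off(&e[2]);

    /* The OR is negative as soon as one value has its sign bit set,
       so a single compare checks all three edges. */
    if ((hi0 | hi1 | hi2) < 0)
        return -1;

    return ((e[0].c + min_off(&e[0]) >= 0) << 2)
         + ((e[1].c + min_off(&e[1]) >= 0) << 1)
         + ((e[2].c + min_off(&e[2]) >= 0) << 0);
}
```

For example, a block crossed only by the first edge yields variant 3 (one edge to test), and a block crossed by all three edges yields variant 0.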

added on the 2009-02-14 01:16:22 by ryg

Code:


if(((CX1+off0) | (CX2+off1) | (CX3+off2)) >= 0)
{
  int variant = ((CX1+off3 >= 0)<<2) + ((CX2+off4 >= 0)<<1) + ((CX3+off5 >= 0)<<0);

  switch(variant)
  {
  case 0:
    //regular loop
    int tcx1 = CX1;
    int tcx2 = CX2;
    int tcx3 = CX3;
    for Y
       for X
          if ((tcx1 | tcx2 | tcx3) >= 0)
             //draw pixel
    break;

  case 1:  case 2:  case 4:
    int tcx1;
    int tcx2;
    int tfdx1;
    int tfdx2;
    int tfdy1;
    int tfdy2;
    if (variant | 4) {
       tcx1 = CX1;
       tfdx1 = DX12;
       tfdy1 = DY12 - (DX12 << 3);
    }
    else {
       tcx1 = CX2;
       tfdx1 = DX23;
       tfdy1 = DY23 - (DX23 << 3);
    }
    if (variant | 1) {
       tcx2 = CX3;
       tfdx2 = DX31;
       tfdy2 = DY31 - (DX31 << 3);
    }
    else {
       tcx2 = CX2;
       tfdx2 = DX23;
       tfdy2 = DY23 - (DX23 << 3);
    }
    for Y
       for X
          if ((tcx1 | tcx2) >= 0)
             //draw pixel
          add tfdx1 to tcx1 and tfdx2 to tcx2
       add tfdy1 to tcx1 and tfdy2 to tcx2
    break;

  case 3: case 5: case 6:
    // one edge function to test, works similar to above three cases
    break;

  case 7:
    // block fully covered, don't need any edge functions at all
  }
}

untested, off the top of my head.

In the 2-edge case you'd save 64 ORs and 64 ADDs, and lose some time on the simple setup. The 1-edge case makes this really worthwhile, because you'd also only need 3 temporaries.
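A runnable sketch of such a 2-edge inner loop over one 8x8 block, counting covered pixels instead of drawing them. The `Edge` struct and both function names are invented for the example; the row step is the per-row delta minus the 8 per-pixel steps already taken, which is the idea behind the tfdy values above.

```c
enum { BLOCK_SIZE = 8 };

/* Edge function value at the block origin plus its per-pixel
   step in x (dx) and per-row step in y (dy). */
typedef struct { int c, dx, dy; } Edge;

/* Incremental version: only two edge functions are stepped and tested. */
static int count_two_edges(Edge a, Edge b)
{
    int count = 0;
    int ca = a.c, cb = b.c;
    /* going to the next row also undoes the 8 x steps of this row */
    int row_a = a.dy - a.dx * BLOCK_SIZE;
    int row_b = b.dy - b.dx * BLOCK_SIZE;

    for (int y = 0; y < BLOCK_SIZE; y++) {
        for (int x = 0; x < BLOCK_SIZE; x++) {
            if ((ca | cb) >= 0)   /* both values non-negative */
                count++;          /* a real loop would draw the pixel here */
            ca += a.dx;
            cb += b.dx;
        }
        ca += row_a;
        cb += row_b;
    }
    return count;
}

/* Reference version: evaluate both edge functions directly per pixel. */
static int count_reference(Edge a, Edge b)
{
    int count = 0;
    for (int y = 0; y < BLOCK_SIZE; y++)
        for (int x = 0; x < BLOCK_SIZE; x++)
            if (a.c + a.dx * x + a.dy * y >= 0 &&
                b.c + b.dx * x + b.dy * y >= 0)
                count++;
    return count;
}
```

Compared to the 3-edge loop, each of the 64 pixels has one OR and one ADD less, which is where the 64 + 64 figure comes from.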

added on the 2009-02-14 14:01:59 by raer

if (variant | 4)

and

if (variant | 1)

will always return true. Did you mean & (bitwise and)?
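A tiny check of the difference; `or_test` and `and_test` are just names made up for this example.

```c
/* ORing with a non-zero mask can never give 0, so this is always true. */
static int or_test(int variant, int mask)
{
    return (variant | mask) != 0;
}

/* ANDing is true only if the masked bit is actually set in variant. */
static int and_test(int variant, int mask)
{
    return (variant & mask) != 0;
}
```

In the 2-edge cases variant is 1, 2 or 4, so `and_test(variant, 4)` is true only for 4, while `or_test(variant, 4)` is true for all of them.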

added on the 2009-02-14 21:55:48 by Lord Graga

uhm yeah. :D

added on the 2009-02-15 02:12:37 by raer

got another one :)

as i mentioned earlier, C:=C1+C2+C3 is a per-triangle constant. this is pretty easy to verify since DX12+DX23+DX31=(x2-x1)+(x3-x2)+(x1-x3)=0 and the same for y.

this is not hugely interesting for the innermost loop since computing e.g. C3 as C-C1-C2 takes one more add than just stepping it would, and it uses the same amount of registers (instead of DX31, you now need to keep C). you can also update C3 as "C3 -= DX12; C3 -= DX23;" (since DX31=-DX12-DX23), trading one extra add for a register, which might help if they're tight. in any case, this is obviously incompatible with the 2-edge and 1-edge loops.
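A quick way to convince yourself, as a sketch: the edge function below uses one common form with DX = xb - xa and DY = yb - ya, and the names `Pt`, `edge_at`, `edge_sum` and `third_edge` are hypothetical. The sign convention may differ from the rasterizer discussed here, but the sum is constant either way.

```c
typedef struct { int x, y; } Pt;

/* Edge function of the edge a->b evaluated at pixel (x, y):
   DX*(y - ya) - DY*(x - xa), with DX = xb - xa and DY = yb - ya. */
static int edge_at(Pt a, Pt b, int x, int y)
{
    int dx = b.x - a.x;
    int dy = b.y - a.y;
    return dx * (y - a.y) - dy * (x - a.x);
}

/* Sum of the three edge functions at (x, y). The x and y terms cancel
   because the DX values sum to 0 and so do the DY values. */
static int edge_sum(Pt p1, Pt p2, Pt p3, int x, int y)
{
    return edge_at(p1, p2, x, y) + edge_at(p2, p3, x, y) + edge_at(p3, p1, x, y);
}

/* Third edge value derived from the constant instead of being stepped. */
static int third_edge(int c, int e1, int e2)
{
    return c - e1 - e2;
}
```

For the triangle (2,1), (20,5), (7,30) the sum is 502 at every pixel, which is also the cross product of two of its edge vectors (twice the signed area).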

the real win for this is in the outer loops (per-block and maybe per-line), though. there's a whole bunch of extra increments that are rarely used, and trading their registers against an extra sub outside the inner loop should be a pretty good deal.

added on the 2009-08-30 13:59:42 by ryg
