phöng shading and clipping?
category: general [glöplog]
http://www.jeuxvideo.fr/odema-un-concurrent-pour-rayman-actu-24257.html
I liked coding on the GBA, it's really cool :) Where is your texture data? In the fast memory?
One thing I use that could be really nice on the GBA is VQ texture compression: it reduces the size of my textures by a factor of 3. You get one more indirection in your texture inner loop, but you save a lot of memory.
For perfect sorting, my idea is to use large tiles and sort the polygons inside each tile, then merge all the tile results into one perfect list.
It's more or less a dream of mine to have perfect sorting without a z-buffer.
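To make the "one more indirection" of VQ textures concrete, here is a minimal sketch of a texel fetch. The layout is an assumption, not something stated in the post: 2x2 texel blocks, 8-bit codebook indices, and 8-bit (palettized) texels. `VqTexture` and `vq_fetch` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical VQ texture layout (assumed): one byte per 2x2 block. */
typedef struct {
    const uint8_t *indices;        /* (width/2) * (height/2) codebook indices */
    const uint8_t (*codebook)[4];  /* up to 256 entries, each a 2x2 block, row-major */
    int width, height;             /* texel dimensions, both even */
} VqTexture;

static uint8_t vq_fetch(const VqTexture *t, int u, int v)
{
    /* First lookup: which codebook entry covers the 2x2 block containing (u, v). */
    uint8_t code = t->indices[(v >> 1) * (t->width >> 1) + (u >> 1)];
    /* Second lookup (the extra indirection): the texel inside that entry. */
    return t->codebook[code][((v & 1) << 1) | (u & 1)];
}
```

With this layout the index map is a quarter of the uncompressed size, plus the codebook itself (1 KB for 256 entries), so the real saving depends on texture size and codebook sharing.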
Quote:
http://www.jeuxvideo.fr/odema-un-concurrent-pour-rayman-actu-24257.html
I liked coding on the GBA, it's really cool :)
Were you involved in that one? Looks nice!
Quote:
Where is your texture data? In the fast memory?
To be honest: At the moment I don't know. :D I'm still learning...
Quote:
It's more or less a dream of mine to have perfect sorting without a z-buffer.
Oh yes, sir! ;) Not easy though.
I coded the engine and the game, but it was never released :(
32 KB seems like very little for texture storage.
Yeah. Considering that, compression might be a good idea. Also because you could put the codewords into tiles.
rarefluid: two multiplies per interpolated value, actually (vert[0].attr + u*(vert[1].attr-vert[0].attr) + v*(vert[2].attr-vert[0].attr)), and yeah, you need the reciprocal of w if you want to interpolate with perspective correction.
the main case where this is interesting is when you have hw support for dot products or multiply-accumulate. fewer values to interpolate per pixel and a very straightforward, easy-to-parallelize dataflow.
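A small sketch of the two-multiply interpolation described above, assuming u and v are the screen-space barycentric coordinates of vertices 1 and 2. `AttrPlane`, `attr_setup`, `attr_eval` and `attr_eval_persp` are hypothetical names; the perspective variant uses the usual trick of interpolating attr/w and 1/w linearly and dividing per pixel, which is my reading of the "reciprocal of w" remark rather than anything spelled out in the post.

```c
/* Per-triangle setup: attr(u, v) = base + u*du + v*dv. */
typedef struct { float base, du, dv; } AttrPlane;

static AttrPlane attr_setup(float a0, float a1, float a2)
{
    AttrPlane p = { a0, a1 - a0, a2 - a0 };
    return p;
}

/* Two multiplies (or two multiply-accumulates) per interpolated value. */
static float attr_eval(const AttrPlane *p, float u, float v)
{
    return p->base + u * p->du + v * p->dv;
}

/* Perspective-correct value: both planes are set up from per-vertex
   attr/w and 1/w, interpolated linearly, then divided. */
static float attr_eval_persp(const AttrPlane *attr_over_w,
                             const AttrPlane *one_over_w,
                             float u, float v)
{
    return attr_eval(attr_over_w, u, v) / attr_eval(one_over_w, u, v);
}
```

The divide can be shared by all attributes of a pixel: compute 1 / attr_eval(one_over_w, u, v) once and multiply.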
ryg: I read about the edge inside thing again for partially filled blocks. I don't think there's much benefit, because there's some checks and setup to do.
Maybe dynamic compilation? GBA has no cache anyway.
ryg: might be nice with SSE.
The possibilities are really endless :) but time to try them is sadly limited ;)
what checks and setup?
In the barycentric rasterizer design, you always need to have at least two edge functions ready so you have the barycentric coords (you don't need to do any per-pixel tests on them, though).
Code:
if(((CX1+off0) | (CX2+off1) | (CX3+off2)) >= 0)
{
    int variant = ((CX1+off3 >= 0)<<2) + ((CX2+off4 >= 0)<<1) + ((CX3+off5 >= 0)<<0);
    switch(variant)
    {
    case 0:
        // need to test all three edge functions (small triangle)
        break;
    case 1: case 2: case 4:
        // only two edge functions to test
        // permute the variables accordingly (into temps), then jump
        // into common inner loop
        break;
    case 3: case 5: case 6:
        // one edge function to test, works similar to above three cases
        break;
    case 7:
        // block fully covered, don't need any edge functions at all
    }
}
In the barycentric rasterizer design, you always need to have at least two edge functions ready so you have the barycentric coords (you don't need to do any per-pixel tests on them, though).
Why don't you need per-pixel tests? In cases 0 to 6 you do need them, no?
I think I'm missing something. By the way, I like the 3-OR optimisation.
there's no per-pixel tests only if the block is completely in (ie case 7). case 0 is the "regular" one where you have the three half-edge functions, cases 1,2,4 are equivalent and only do per-pixel tests for two of the three functions (these cases are taken if and only if there's two edges intersecting a given block), cases 3,5,6 are the same with just one per-pixel test for one single edge function (just one edge intersecting the block) and 7 is the trivial one where none of the edges intersects the block.
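For reference, a sketch of how the block classification could be computed. The offsets off0..off5 are not defined in the thread, so everything here is an assumption: edge functions have the form E(x, y) = a*x + b*y + c with "inside" meaning E >= 0, blocks are 8x8, and the offsets are the distances from the block's top-left pixel to the corner where each edge function is largest (trivial reject) or smallest (trivial accept). `Edge` and `classify_block` are hypothetical names.

```c
#include <stdint.h>

enum { BLOCK = 8 };

typedef struct { int32_t a, b, c; } Edge;   /* E(x, y) = a*x + b*y + c */

/* Returns -1 if some edge is negative on the whole block (trivial reject),
   otherwise a 3-bit variant: bit 2, 1, 0 is set when edge 0, 1, 2 is >= 0
   on every pixel of the block, so that edge needs no per-pixel test. */
static int classify_block(const Edge e[3], int bx, int by)
{
    int32_t hi[3], lo[3];
    for (int i = 0; i < 3; ++i) {
        int32_t base = e[i].a * bx + e[i].b * by + e[i].c;
        int32_t span_a = e[i].a * (BLOCK - 1);
        int32_t span_b = e[i].b * (BLOCK - 1);
        /* A linear function over a rectangle has its extremes at corners. */
        hi[i] = base + (span_a > 0 ? span_a : 0) + (span_b > 0 ? span_b : 0);
        lo[i] = base + (span_a < 0 ? span_a : 0) + (span_b < 0 ? span_b : 0);
    }
    /* The OR has its sign bit set if any operand is negative:
       one compare instead of three. */
    if ((hi[0] | hi[1] | hi[2]) < 0)
        return -1;
    return ((lo[0] >= 0) << 2) | ((lo[1] >= 0) << 1) | (lo[2] >= 0);
}
```

The reject test is conservative: a block near a corner of the triangle can pass it and still end up with no covered pixels, which the per-pixel tests then sort out.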
Code:
if(((CX1+off0) | (CX2+off1) | (CX3+off2)) >= 0)
{
    int variant = ((CX1+off3 >= 0)<<2) + ((CX2+off4 >= 0)<<1) + ((CX3+off5 >= 0)<<0);
    switch(variant)
    {
    case 0:
        //regular loop
        int tcx1 = CX1;
        int tcx2 = CX2;
        int tcx3 = CX3;
        for Y
            for X
                if ((tcx1 | tcx2 | tcx3) >= 0)
                    //draw pixel
        break;
    case 1: case 2: case 4:
        int tcx1;
        int tcx2;
        int tfdx1;
        int tfdx2;
        int tfdy1;
        int tfdy2;
        if (variant | 4) {
            tcx1 = CX1;
            tfdx1 = DX12;
            tfdy1 = DY12 - (DX12 << 3);
        }
        else {
            tcx1 = CX2;
            tfdx1 = DX23;
            tfdy1 = DY23 - (DX23 << 3);
        }
        if (variant | 1) {
            tcx2 = CX3;
            tfdx2 = DX31;
            tfdy2 = DY31 - (DX31 << 3);
        }
        else {
            tcx2 = CX2;
            tfdx2 = DX23;
            tfdy2 = DY23 - (DX23 << 3);
        }
        for Y
            for X
                if ((tcx1 | tcx2) >= 0)
                    //draw pixel
                add tfdx1 to tcx1 and tfdx2 to tcx2
            add tfdy1 to tcx1 and tfdy2 to tcx2
        break;
    case 3: case 5: case 6:
        // one edge function to test, works similar to above three cases
        break;
    case 7:
        // block fully covered, don't need any edge functions at all
    }
}
untested, off the top of my head.
In the 2-edge case you'd save 64 ORs and 64 ADDs, and lose some time on the simple setup. The 1-edge case makes this really worthwhile, because you'd also only need 3 temporaries.
if (variant | 4)
and
if (variant | 1)
will always return true. Did you mean & (bitwise and)?
uhm yeah. :D
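Here is a compact sketch of the reduced-edge inner loop with the bitwise AND applied. It is not the code from the post: it assumes edge functions of the form E(x, y) = a*x + b*y + c (inside when E >= 0), 8x8 blocks, and a variant whose bit 2, 1, 0 is set when edge 0, 1, 2 is non-negative over the whole block. It tests only the edges whose bit is clear, which is my reading of the intent. It counts covered pixels instead of drawing, and uses one generic loop where a real version would have separate unrolled loops for 3, 2, 1 and 0 edges. `Edge` and `cover_block` are hypothetical names.

```c
#include <stdint.h>

enum { BLOCK = 8 };

typedef struct { int32_t a, b, c; } Edge;   /* E(x, y) = a*x + b*y + c */

/* Counts the covered pixels of the 8x8 block at (bx, by), stepping and
   testing only the edges that are not already known to be >= 0. */
static int cover_block(const Edge e[3], int variant, int bx, int by)
{
    int32_t cx[3] = {0}, dx[3] = {0}, dy[3] = {0};
    int n = 0;
    for (int i = 0; i < 3; ++i) {
        if (variant & (4 >> i))
            continue;                       /* edge i cannot fail in this block */
        cx[n] = e[i].a * bx + e[i].b * by + e[i].c;
        dx[n] = e[i].a;                     /* step per pixel in x */
        dy[n] = e[i].b - e[i].a * BLOCK;    /* undo 8 x steps, move down one row */
        ++n;
    }
    int covered = 0;
    for (int y = 0; y < BLOCK; ++y) {
        for (int x = 0; x < BLOCK; ++x) {
            int32_t any = 0;
            for (int k = 0; k < n; ++k)
                any |= cx[k];               /* sign bit set if any edge is negative */
            if (any >= 0)
                ++covered;                  /* a rasterizer would draw the pixel here */
            for (int k = 0; k < n; ++k)
                cx[k] += dx[k];
        }
        for (int k = 0; k < n; ++k)
            cx[k] += dy[k];
    }
    return covered;
}
```

The row step is written with a multiply rather than `<< 3` because left-shifting a negative value is undefined in C; on a compiler where that shift behaves arithmetically the two are equivalent.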
got another one :)
as i mentioned earlier, C:=C1+C2+C3 is a per-triangle constant. this is pretty easy to verify since DX12+DX23+DX31=(x2-x1)+(x3-x2)+(x1-x3)=0 and the same for y.
this is not hugely interesting for the innermost loop since computing e.g. C3 as C-C1-C2 takes one more add than just stepping it would, and it uses the same amount of registers (instead of DX31, you now need to keep C). you can also update C3 as "C3 -= DX12; C3 -= DX23;" (since DX31=-DX12-DX23), trading one extra add for a register, which might help if they're tight. in any case, this is obviously incompatible with the 2-edge and 1-edge loops.
the real win for this is in the outer loops (per-block and maybe per-line), though. there's a whole bunch of extra increments that are rarely used, and trading their registers against an extra sub outside the inner loop should be a pretty good deal.
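The constant-sum property is easy to check in code. This sketch assumes the edge function for the edge a->b is DX*(p.y - a.y) - DY*(p.x - a.x) with DX = b.x - a.x and DY = b.y - a.y; the sign convention and the names `Vec2`, `edge_value`, `edge_sum` and `third_edge` are mine, not from the thread.

```c
#include <stdint.h>

typedef struct { int32_t x, y; } Vec2;

/* Half-space value of pixel p for the edge a->b. */
static int32_t edge_value(Vec2 a, Vec2 b, Vec2 p)
{
    int32_t dx = b.x - a.x, dy = b.y - a.y;
    return dx * (p.y - a.y) - dy * (p.x - a.x);
}

/* C = C1 + C2 + C3. The x and y terms cancel because the three DX sum
   to zero and the three DY sum to zero, so p does not affect the result. */
static int32_t edge_sum(Vec2 v1, Vec2 v2, Vec2 v3, Vec2 p)
{
    return edge_value(v1, v2, p) + edge_value(v2, v3, p) + edge_value(v3, v1, p);
}

/* Third edge function recovered from the constant instead of being stepped. */
static int32_t third_edge(int32_t c, int32_t c1, int32_t c2)
{
    return c - c1 - c2;
}
```

For the triangle (0,0), (10,0), (0,10) the sum comes out as 100 at every pixel, which is twice the triangle's area under this convention.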