chunky to planar
category: code [glöplog]
Well, I fail to see what is new - apart from using fewer bitplanes.
Oswald: OooO_ooh.. The hate is swelling in you now.
Really, C64 demos are lame. It's like they all use the same palette, and only like 16 cols or summin. Like a little bit of originality would kill you. I also think if the C64 coders knew more about cache optimization, that'd help a lot. And I hate how AHX is so popular on the C64. I mean, it's a nice sound and all, but really, it's been done too much already.
Stelthz: The speed! And the color adjust prediction trick for the second pixel is new. In Ludde's c2p a sprite screen was used to hide an ugly black pixel mask; I use the 5th bitplane instead.
.
The CPU pass is actually just OR'ing the nibbles together and not a real merge. Ludde's 1995 c2p had around 30 instructions per longword write.
(old c2p merge technology and "real merges").
This c2p loop can be unrolled for faster results.
.swap4c2p
movem.l (a0)+,d0-d1
lsl.l #4,d0
lsl.l #4,d1
or.l (a0)+,d0
or.l (a0)+,d1
movem.l d0-d1,(a1)+
cmpa.l a2,a1
bne.b .swap4c2p
To make an even faster c2p, one blitter pass can be removed by rearranging the byte order of the chunky buffer (scrambling).
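For anyone who does not read 68k assembly, here is a minimal C model of what the CPU pass above computes. It is only a sketch: the name `or_nibbles_pass` is made up, and it assumes every chunky byte carries one 4-bit pixel in its low nibble with the high nibble clear, which the thread does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the .swap4c2p loop: four longwords in, two out.
 * Assumption: each source byte holds a 4-bit pixel in its low nibble. */
static void or_nibbles_pass(const uint32_t *src, uint32_t *dst, size_t groups)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < groups; i++) {
        uint32_t d0 = src[0];   /* movem.l (a0)+,d0-d1 */
        uint32_t d1 = src[1];
        d0 <<= 4;               /* lsl.l #4,d0 */
        d1 <<= 4;               /* lsl.l #4,d1 */
        d0 |= src[2];           /* or.l (a0)+,d0 */
        d1 |= src[3];           /* or.l (a0)+,d1 */
        dst[0] = d0;            /* store both merged longwords */
        dst[1] = d1;
        src += 4;
        dst += 2;
    }
}
```

The shift moves each pixel into the high nibble of its byte, so the OR packs two pixels per byte without any masking.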
I had to do some modifications. This might be done faster on the MC68000 by moving words from the chunky buffer. 14 instructions for 16 pixels is still pretty good.
CPU pass code, not optimized (not pipelined for 020+):
.swap4c2p
movem.l (a0)+,d0-d3
move.l d1,d4 ;swap16 (1X2) (3X4)
move.l d3,d5
move.w d0,d4
move.w d2,d5
swap d4
swap d5
move.w d4,d0 ;1
move.w d5,d2 ;3
move.w d1,d4 ;2
move.w d3,d5 ;4
lsl.l #4,d0 ;Or nibbles (swap4)
lsl.l #4,d2
or.l d0,d4
or.l d2,d5
movem.l d4-d5,(a1)+
dbf d7,.swap4c2p
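As a rough C model of what the `swap16 (1X2) (3X4)` comments describe, the sketch below exchanges 16-bit halves between two longwords and then ORs the nibbles as before. The function names are hypothetical, and the same assumption applies as earlier: each chunky byte holds a 4-bit pixel in its low nibble.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* swap16: a becomes Ah:Bh, b becomes Al:Bl (h/l = high/low word). */
static void swap16_pair(uint32_t *a, uint32_t *b)
{
    assert(a != NULL && b != NULL);
    uint32_t hi = (*a & 0xffff0000u) | (*b >> 16);
    uint32_t lo = (*a << 16) | (*b & 0x0000ffffu);
    *a = hi;
    *b = lo;
}

/* One loop iteration: four longwords (16 pixels) in, two longwords out. */
static void swap16_or_nibbles(const uint32_t in[4], uint32_t out[2])
{
    uint32_t r[4] = { in[0], in[1], in[2], in[3] };
    swap16_pair(&r[0], &r[1]);      /* (1X2) */
    swap16_pair(&r[2], &r[3]);      /* (3X4) */
    out[0] = (r[0] << 4) | r[1];    /* OR nibbles (swap4) */
    out[1] = (r[2] << 4) | r[3];
}
```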
Quote:
movem.l d4-d5,(a1)+
I'd like to see you doing that. :D
re: Stingray movem.l d4/d5,(a1)+ ? or 2 moves.
.
This might be a faster Mc68000 loop that will do the same thing as the code above.
.loop
REPT 4
move.w (a0)+,d0
lsl.w #4,d0
or.w (a0)+,d0
move.w d0,(a1)+
ENDR
dbf d7,.loop
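A quick way to check the "same thing" claim is to model both variants in C and compare them. This is a sketch with made-up function names; the source is treated as a stream of big-endian 16-bit words. The two agree as long as the high nibble of every chunky byte is zero, because `lsl.l` carries bits from the lower word into the upper one while `lsl.w` does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word loop: out[i] = (w[2i] << 4) | w[2i+1], truncated to 16 bits like lsl.w. */
static void word_loop(const uint16_t *w, uint16_t *out, size_t n_out)
{
    assert(w != NULL && out != NULL);
    for (size_t i = 0; i < n_out; i++)
        out[i] = (uint16_t)((uint16_t)(w[2 * i] << 4) | w[2 * i + 1]);
}

/* Longword variant: swap16 two longwords, then shift and OR as 32-bit values. */
static void long_loop(const uint16_t *w, uint16_t *out, size_t n_out)
{
    assert(w != NULL && out != NULL && n_out % 2 == 0);
    for (size_t i = 0; i < n_out; i += 2) {
        const uint16_t *p = w + 2 * i;                 /* A = p0:p1, B = p2:p3 */
        uint32_t hi = ((uint32_t)p[0] << 16) | p[2];   /* Ah:Bh */
        uint32_t lo = ((uint32_t)p[1] << 16) | p[3];   /* Al:Bl */
        uint32_t r = (hi << 4) | lo;
        out[i] = (uint16_t)(r >> 16);
        out[i + 1] = (uint16_t)r;
    }
}
```

If the chunky buffer ever contains garbage in the high nibbles, the longword version leaks it into the neighbouring word, so the two loops are only interchangeable on clean data.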
hehe, I checked now. only -(a1) is legal.. So two moves
Exactly. :D
Fullscreen 160*128 c2p timings (WinUAE set to match A500 speed)
Timed with the CIA timer. I need to run the test on a real A500. Anyone?
Scrambled loop: 215 rasterlines (first suggestion)
Longword loop: 282 rasterlines
...
Word loop: 354 rasterlines
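To put the rasterline counts into CPU terms, here is a small conversion sketch. It assumes a PAL A500, where one rasterline is 227 colour clocks, i.e. 454 CPU cycles at about 7.09 MHz; the thread does not say whether the test ran in PAL or NTSC, and the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CPU_CYCLES_PER_LINE = 454,   /* assumption: PAL, 227 colour clocks x 2 */
    PIXELS = 160 * 128           /* fullscreen size quoted in the post */
};

/* Total 68000 clock cycles elapsed in the given number of rasterlines. */
static uint32_t lines_to_cpu_cycles(uint32_t rasterlines)
{
    assert(rasterlines <= UINT32_MAX / CPU_CYCLES_PER_LINE);
    return rasterlines * CPU_CYCLES_PER_LINE;
}

/* Average CPU cycles spent per chunky pixel. */
static double cycles_per_pixel(uint32_t rasterlines)
{
    return (double)lines_to_cpu_cycles(rasterlines) / PIXELS;
}
```

Under these assumptions the three results come out at roughly 4.8, 6.3 and 7.8 CPU cycles per pixel. A full PAL frame is about 312 lines, so all three take more or less one frame.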
I can test it on one of my A500's here if you like?
sp, are these measurements for the c2p only, or for the combined routine + c2p? I should dig out my sources from trashcan 3 intro to "rediscover" how fast my c2p'ing was back then...
I recall the 160x100 size 2x2 res 2 bpl rotozoomer ran at 25fps...
doom: I reassure the opinion that we are entitled to * H * A * T * E * you for the next 5 years for reinserting the c2p drama into our scene :P
sp+stingray, something else to test... i recall a real a500 ran a bit slower when upping from 4bpl to 5bpl, and then really a lot slower when upping from 5bpl to 6bpl... so if UAE is not doing cycle-exact chipmem-bus emulation, it will definitely change the timing on the real machine...
Winden: exactly! :)
Doom, you are an ugly troll.
Really interesting thread. Now I understand how a simple A500 could do those fast flat polyfillers with few colors...
Good reason to code another A500 killer demo. It's too bad so few people do. A500 would deserve to live as a demo platform. It's much better defined than some crazy PPC Amiga with a two-digit user count.
texel: hooray for xor fillers.
(and no, that's not an amiga invention, this technique is pretty old :)
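For anyone who has not met the idea, this is roughly how an XOR fill works: only the polygon edges are plotted (one pixel per edge per scanline), then a running XOR along each row turns the span between two edge pixels into a solid run. The sketch uses one byte per pixel for clarity and a made-up function name; it is not a model of any particular hardware fill mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill one scanline in place. Each edge pixel (1) toggles the inside state. */
static void xor_fill_row(uint8_t *row, size_t width)
{
    assert(row != NULL);
    uint8_t inside = 0;
    for (size_t x = 0; x < width; x++) {
        inside ^= (uint8_t)(row[x] & 1);   /* crossing an edge flips the state */
        row[x] = inside;                   /* write the running state back */
    }
}
```

Because XOR cancels itself, overlapping edges need no special handling, which is a large part of why this is so cheap per bitplane.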
Well, it's been another 5 years, why not bump it ?
So, since then, what's the state of the art in c2p? I just discovered the Amiga CD32 had it in hardware, but everyone says it's useless. Then @lx bumped me to this thread, sorry.
...I'm still doing my stuff on bitplanes and pestering the custom chips ... and I'm happy with it
Thread revival ftw! :)
My approach is "Another intro, another C2P". The interesting part is not as much the C2P itself as the things that are mixed into it to fill the cycles otherwise wasted waiting for chip writes to complete.
To give some examples, here's what I have merged into the C2P in some of my intros:
Ikanim: Motion blur
Noxie: 4-point radial or axial blur, dithering
Planet Loonies: Antialiased downscaling
Rapo Diablo 5000: Color interpolation and environment mapping
Luminagia: Dithering (interleaved with rendering per scanline; final pass and expansion to double width done by the blitter)
Ikadalawampu: Clamping, dithering
Finding something suitable to interleave with the C2P is the heart of "modern" Amiga coding IMO.
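To illustrate the general idea of folding an effect into the conversion loop, here is a small C sketch that applies ordered dithering while packing two pixels into one byte. It models only a nibble-packing stage, not a full bitplane conversion, and it is not taken from any of the intros listed above: the 2x2 dither table, the 8-bit source format and all names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2x2 ordered-dither offsets, scaled to one 4-bit step (16). */
static const uint8_t dither2x2[2][2] = { { 0, 8 }, { 12, 4 } };

/* Reduce an 8-bit value to 4 bits after adding the dither offset. */
static uint8_t quantize4(uint8_t v, uint8_t offset)
{
    unsigned q = ((unsigned)v + offset) >> 4;
    return (uint8_t)(q > 15 ? 15 : q);
}

/* Dither and pack one row: two 8-bit source pixels become one output byte. */
static void dither_and_pack_row(const uint8_t *src, uint8_t *dst,
                                size_t pairs, unsigned y)
{
    assert(src != NULL && dst != NULL);
    const uint8_t *d = dither2x2[y & 1];    /* offsets for even/odd columns */
    for (size_t i = 0; i < pairs; i++) {
        uint8_t left = quantize4(src[2 * i], d[0]);
        uint8_t right = quantize4(src[2 * i + 1], d[1]);
        dst[i] = (uint8_t)((left << 4) | right);
    }
}
```

On real hardware the point is that the extra arithmetic runs while the CPU would otherwise be stalled on chip memory writes, so the effect costs little on top of the conversion itself.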
msk, read my answer on demoscene.fr, and the links to ada's forum :)
by the way, blueberry, I can't find the thread on ada's forum where winden gave that clamping trick with no tests on integers... where is that?
"modern" Amiga coding *chuckles*
@krabob: IIRC it's hidden somewhere here:
http://graphics.stanford.edu/~seander/bithacks.html
krabob: Are you thinking of this thread? This was indeed the thread that gave me the idea for the clamping in Ikadalawampu, which in turn inspired me to the whole rendering engine. :)
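For reference, clamping "with no tests" usually means building all-ones or all-zeros masks from the comparison results and applying them with AND/OR instead of branching. Below is one common formulation in C, clamping to a 4-bit range. It is not claimed to be the trick from the ADA thread or the one used in Ikadalawampu; the function name and the 0..15 range are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp x to 0..15 using masks instead of if statements. */
static uint8_t clamp_nibble(int32_t x)
{
    uint32_t below = 0u - (uint32_t)(x < 0);    /* all ones if x < 0 */
    uint32_t above = 0u - (uint32_t)(x > 15);   /* all ones if x > 15 */
    uint32_t u = (uint32_t)x;
    u &= ~below;                                /* negative values become 0 */
    u |= above;                                 /* large values become all ones */
    uint8_t r = (uint8_t)(u & 15u);             /* keep the low nibble */
    assert(r <= 15);
    return r;
}
```

Whether the compiler actually emits branch-free code for the comparisons depends on the target; on the 68k family they typically map to set-on-condition instructions.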