The ASM instruction you always wanted, but never had?
category: code [glöplog]
Quote:
hm, I guess a nice shuffle instruction would speed up chunky-to-planar routines on the Amiga quite a bit, no?
Depends. There's the memory bandwidth issue Psycho mentioned. (Chipram is *slow*).
Also, shuffle instructions are usually not as awesome as you'd like them to be. This is bumping into another limit - number of register read/write ports (=number of regs that can be read/written in a single cycle). The problem is that any transpose-style operation scatters values from one register with N independent "fields" (whether it be single bits of a pixel like in C2P or multiple bits like when doing a float32 matrix transpose) into N other registers. In a 2-operand ISA with one register write per instruction, you still need 2*log2(N) "shuffle step" style ops and log2(N) moves to do a full transpose. You can work around the ISA issue by having a special opcode that requires a specific data layout in registers, but the port limit is substantially harder (and register file area grows with the square of the number of ports!). You can microcode it and get rid of the moves (which are not a big problem in a superscalar processor because they are independent and can be paired with something else). But that's still 2*log2(N) cycles in the "execute" stage, and in an in-order superscalar design, it will most likely be a non-pairable instruction.
In short, you're unlikely to get a real win from this, and it's an awfully specialized opcode. What you really want for this kind of functionality is some simple fixed-function asynchronous DMA engine that can do the C2P on the fly, with none of the problems mentioned above.
But it's very easy to screw that up as well :)
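To make the "shuffle step" idea from the post above concrete, here is a rough C sketch of a 4x4 byte transpose across four 32-bit words, using the mask-and-shift swap familiar from C2P routines. The names `swap_step` and `transpose4x4_bytes` are made up for illustration, and the byte order (byte 0 = most significant) is an assumption. Note that every step has to update two words at once, which is exactly where a one-write-per-instruction ISA ends up needing extra ops and moves.

```c
#include <stdint.h>

/* One "shuffle step": exchange the bits of *a selected by mask
   with the bits of *b selected by (mask << shift).
   Both words are written, i.e. two register writes per step. */
static void swap_step(uint32_t *a, uint32_t *b, unsigned shift, uint32_t mask)
{
    uint32_t t = (*a ^ (*b >> shift)) & mask;
    *a ^= t;
    *b ^= t << shift;
}

/* Transpose a 4x4 matrix of bytes held in four 32-bit words
   (assumed layout: byte 0 is the most significant byte).
   N = 4 fields per word, so there are log2(4) = 2 stages. */
void transpose4x4_bytes(uint32_t r[4])
{
    /* Stage 1: exchange 16-bit halves between rows 0/2 and 1/3. */
    swap_step(&r[0], &r[2], 16, 0x0000FFFFu);
    swap_step(&r[1], &r[3], 16, 0x0000FFFFu);
    /* Stage 2: exchange alternating bytes between rows 0/1 and 2/3. */
    swap_step(&r[0], &r[1], 8, 0x00FF00FFu);
    swap_step(&r[2], &r[3], 8, 0x00FF00FFu);
}
```

The same pattern extends to single-bit fields (the actual C2P case) by adding stages with shifts of 4, 2 and 1 and matching masks.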
Since it's all fantasy anyway, what you really want is a chunky mode in the AGA chipset. Oh, and 3d transform and texture mapping in hw, too.
many c64 effects could be 20-30% faster just by having one more index register, or having stuff like lda ($..),x. I have spent many many hours trying to optimize 'one more register please' or 'why doesn't that addressing mode exist' in a satisfactory way.
Something that would also help on the 6502 is to be able to set the flags without having to overwrite a register, the equivalent of the "tst" instruction on 68000.
Chunky mode or faster C2P won't make much difference. A full c2p of 320x200 pixels takes about 20% of a frame. What we really need is faster chip ram. Except that would spoil all the fun of interleaving computations with the chip ram writes. :)
PulkoMandy, because you can use EX DE, HL. ;)
move.l godis,oron
move.l demo,screen
rts
Real 16-bit addressing in C64... or floating (or some kind of fixed) point in both Z80 and C64...
saturated add on 68k would have been nice.
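For anyone wondering what a saturated add buys you: the sum clamps at the maximum instead of wrapping around, which is what you want for things like additive blending. A minimal C sketch for unsigned 8-bit values; the function name `sat_add_u8` is invented here, not an actual 68k mnemonic or library call.

```c
#include <stdint.h>

/* Unsigned 8-bit saturating add: clamps to 255 instead of wrapping.
   Without hardware support this costs an add plus a compare/select
   (or a branch) per value. */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}
```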
Can someone explain to me why interleaving instructions with writes to video RAM was the fastest method on Amiga? I have similarly slow video ram on my platform and was curious if I could use the same method -- but I need to understand the rationale first.
trixter: not knowing the platform, sounds like the video memory has some waitstates and the memory bus is busy whenever you write to it, leaving some extra cycles for the cpu.
Platform is IBM CGA. Adds a single wait state. I don't think anyone's ever done experiments to find out if interleaving is faster than writing to system ram then doing REP MOVSW so it looks like I'll have to check it out myself. I was just curious about the Amiga details.
@Zerkman:
Quote:
Code:
add r0, r0, r0, lsl #8 ; multiplies r0 by 257 mod 2^32
add r0, r0, #47 ; adds a prime number
RNG in only one instruction (iirc it was devised by Pervect/Topix):
Code:
REM Creates a new random number in m0, affects flags
DEFFNrandom(m0):[opt opt%:rsb m0,m0,m0,ror#11:]
But well, i would add nothing to ARM processors, except people using them instead of those fucking !ntel processors! :(
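In case the ARM shifter syntax is unfamiliar, here is a small C model of what the two generators above compute per step. The function names are invented for illustration, and this says nothing about how good either sequence is statistically.

```c
#include <stdint.h>

/* Rotate right by n bits, n assumed to be in 1..31. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Two-instruction version from the quote:
   add r0, r0, r0, lsl #8   -> r0 = r0 + (r0 << 8) = r0 * 257 (mod 2^32)
   add r0, r0, #47          -> r0 = r0 + 47 */
uint32_t rng_two_adds(uint32_t r0)
{
    r0 = r0 + (r0 << 8);
    return r0 + 47u;
}

/* One-instruction version:
   rsb m0, m0, m0, ror #11  -> m0 = ror(m0, 11) - m0 (mod 2^32) */
uint32_t rng_rsb(uint32_t m0)
{
    return ror32(m0, 11) - m0;
}
```

One thing the model makes obvious: the rsb version maps 0 to 0, so it needs a non-zero seed.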
@trixter: 68k cpus have plenty of regs compared to contemporary x86's, though, so interleaving ops is more complicated.. =)
Since when is rep movsw valid for 68k? :P
Way to put words in my mouth, dude =)
For Intel and maybe ARM too:
Code:
INLINE java
System.out.println("here be the cool code");
END INLINE
pants_off
t
POST topic_id, $pointer_to_raw_image_data