pouët.net


The ASM instruction you always wanted, but never had?

category: code [glöplog]
Quote:
hm, I guess a nice shuffle instruction would speed up chunky-to-planar routines on the Amiga quite a bit, no?

Depends. There's the memory bandwidth issue Psycho mentioned. (Chipram is *slow*).

Also, shuffle instructions are usually not as awesome as you'd like them to be. This is bumping into another limit - number of register read/write ports (=number of regs that can be read/written in a single cycle). The problem is that any transpose-style operation scatters values from one register with N independent "fields" (whether it be single bits of a pixel like in C2P or multiple bits like when doing a float32 matrix transpose) into N other registers. In a 2-operand ISA with one register write per instruction, you still need 2*log2(N) "shuffle step" style ops and log2(N) moves to do a full transpose. You can work around the ISA issue by having a special opcode that requires a specific data layout in registers, but the port limit is substantially harder (and register file area grows with the square of the number of ports!). You can microcode it and get rid of the moves (which are not a big problem in a superscalar processor because they are independent and can be paired with something else). But that's still 2*log2(N) cycles in the "execute" stage, and in an in-order superscalar design, it will most likely be a non-pairable instruction.
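To make the log2(N) structure concrete, here is a minimal Python sketch of an 8x8 bit-matrix transpose done in log2(8) = 3 "swap" stages. It is only an illustration and not an actual C2P routine: a single 64-bit integer stands in for eight 8-bit registers (row r in byte r, column c in bit c of that byte), and the names `delta_swap` and `transpose8x8` are made up for this example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def delta_swap(x, mask, shift):
    """Swap every bit selected by mask with the bit `shift` positions above it."""
    t = (x ^ (x >> shift)) & mask
    return (x ^ t ^ (t << shift)) & MASK64

def transpose8x8(x):
    """Transpose an 8x8 bit matrix packed as bit index 8*row + col."""
    # Stage 1: swap the off-diagonal bits inside every 2x2 block.
    x = delta_swap(x, 0x00AA00AA00AA00AA, 7)
    # Stage 2: swap the off-diagonal 2x2 blocks inside every 4x4 block.
    x = delta_swap(x, 0x0000CCCC0000CCCC, 14)
    # Stage 3: swap the two off-diagonal 4x4 blocks of the full matrix.
    x = delta_swap(x, 0x00000000F0F0F0F0, 28)
    return x

if __name__ == "__main__":
    row0_all_set = 0xFF  # row 0 has every column bit set
    print(hex(transpose8x8(row0_all_set)))  # column 0 set in every row
```

Every stage touches all eight "rows" at once, which is the part that maps badly onto an ISA with one register write per instruction: with real registers, each stage becomes several ops plus moves.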

In short, you're unlikely to get a real win from this, and it's an awfully specialized opcode. What you really want for this kind of functionality is some simple fixed-function asynchronous DMA engine that can do the C2P on the fly, with none of the problems mentioned above.

But it's very easy to screw that up as well :)
added on the 2010-07-14 20:03:14 by ryg ryg
Since it's all fantasy anyway, what you really want is a chunky mode in the AGA chipset. Oh, and 3d transform and texture mapping in hw, too.
added on the 2010-07-14 20:23:34 by xeron xeron
Many C64 effects could be 20-30% faster just by having one more index register, or having stuff like lda ($..),x. I have spent many, many hours trying to optimize around 'one more register please' or 'why doesn't that addressing mode exist' in a satisfactory way.
added on the 2010-07-14 22:11:54 by Oswald Oswald
Something that would also help on the 6502 is to be able to set the flags without having to overwrite a register, the equivalent of the "tst" instruction on 68000.
added on the 2010-07-14 23:05:20 by Dbug Dbug
Chunky mode or faster C2P won't make much difference. A full C2P of 320x200 pixels takes about 20% of a frame. What we really need is faster chip RAM. Except that would spoil all the fun of interleaving computations with the chip RAM writes. :)
added on the 2010-07-14 23:58:06 by Blueberry Blueberry
PulkoMandy, because you can use EX DE, HL. ;)
added on the 2010-07-15 17:58:08 by MuffinHop MuffinHop
move.l godis,oron
move.l demo,screen
rts
added on the 2010-07-15 23:25:20 by Photon Photon
Real 16-bit addressing in C64... or floating (or some kind of fixed) point in both Z80 and C64...
added on the 2010-07-16 00:22:57 by merry merry
saturated add on 68k would have been nice.
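For anyone unsure what is being asked for, a rough Python sketch of saturating versus wrapping addition follows. The post does not say which width or signedness is meant, so unsigned 8-bit values are assumed here, and both function names are made up for the example.

```python
def add_wrap_u8(a, b):
    """Ordinary add: the result wraps around modulo 256."""
    return (a + b) & 0xFF

def add_sat_u8(a, b):
    """Saturating add: the result clamps at 255 instead of wrapping."""
    return min(a + b, 0xFF)

if __name__ == "__main__":
    print(add_wrap_u8(200, 100))  # wraps around
    print(add_sat_u8(200, 100))   # clamps at the maximum
```

Without such an instruction, the clamp costs an extra compare or branch per value, which adds up when brightening or blending pixels.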
added on the 2010-07-16 03:26:23 by loaderror loaderror
Can someone explain to me why interleaving instructions with writes to video RAM was the fastest method on Amiga? I have similarly slow video ram on my platform and was curious if I could use the same method -- but I need to understand the rationale first.
added on the 2010-07-16 06:21:26 by trixter trixter
trixter: not knowing the platform, sounds like the video memory has some waitstates and the memory bus is busy whenever you write to it, leaving some extra cycles for the cpu.
added on the 2010-07-16 07:53:35 by sol_hsa sol_hsa
Platform is IBM CGA. Adds a single wait state. I don't think anyone's ever done experiments to find out if interleaving is faster than writing to system ram then doing REP MOVSW so it looks like I'll have to check it out myself. I was just curious about the Amiga details.
added on the 2010-07-16 22:11:30 by trixter trixter
@Zerkman:
Quote:

Code:
add r0, r0, r0, lsl #8   ; multiplies r0 by 257 mod 2^32
add r0, r0, #47          ; adds a prime number
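The two quoted instructions amount to a linear congruential generator, r0 = (257 * r0 + 47) mod 2^32. A small Python model, assuming a 32-bit register (the name `lcg_step` is made up for this sketch):

```python
MASK32 = 0xFFFFFFFF

def lcg_step(r0):
    """Model the two ARM instructions on a 32-bit register."""
    r0 = (r0 + (r0 << 8)) & MASK32  # add r0, r0, r0, lsl #8  -> r0 * 257
    r0 = (r0 + 47) & MASK32         # add r0, r0, #47
    return r0

if __name__ == "__main__":
    value = 0
    for _ in range(4):
        value = lcg_step(value)
        print(hex(value))
```

Since 47 is odd and 257 - 1 is a multiple of 4, the Hull-Dobell conditions hold and the sequence runs through all 2^32 values before repeating. The low byte is weak, though: 257 * r0 equals r0 modulo 256, so the bottom 8 bits simply step by 47 each call.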




RNG in only one instruction (iirc it was devised by Pervect/Topix):
Code:
REM Creates a new random number in m0, affects flags
DEFFNrandom(m0):[opt opt%:rsb m0,m0,m0,ror#11:]
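For readers without an ARM at hand: RSB computes operand2 minus Rn, so the instruction above works out to m0 = ror(m0, 11) - m0. A Python model, assuming a 32-bit register (the helper names `ror32` and `rsb_random` are made up for this sketch):

```python
MASK32 = 0xFFFFFFFF

def ror32(x, n):
    """Rotate a 32-bit value right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def rsb_random(m0):
    """Model rsb m0, m0, m0, ror #11: m0 = ror(m0, 11) - m0 (mod 2^32)."""
    return (ror32(m0, 11) - m0) & MASK32

if __name__ == "__main__":
    value = 1
    for _ in range(4):
        value = rsb_random(value)
        print(hex(value))
```

Note that zero maps to zero, so the seed has to be non-zero; this sketch makes no claim about the period or quality of the sequence.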


But well, I would add nothing to ARM processors, except people using them instead of those fucking !ntel processors! :(
added on the 2010-07-21 14:35:00 by baah baah
@trixter: 68k cpus have plenty of regs compared to contemporary x86's, though, so interleaving ops is more complicated.. =)
added on the 2010-07-21 15:40:58 by sol_hsa sol_hsa
Since when is rep movsw valid for 68k? :P
added on the 2010-07-21 17:04:58 by ferris ferris
Way to put words in my mouth, dude =)
added on the 2010-07-21 19:55:54 by sol_hsa sol_hsa
For Intel and maybe ARM too:
Code:
INLINE java
System.out.println("here be the cool code");
END INLINE
added on the 2010-07-21 20:16:45 by waffle waffle
pants_off
added on the 2010-07-21 20:59:34 by nosfe nosfe
POST topic_id, $pointer_to_raw_image_data
added on the 2010-07-21 23:07:26 by Tigrou Tigrou
