The ASM instruction you always wanted, but never had?

category: code [glöplog]

Quote:

hm, I guess a nice shuffle instruction would speed up chunky-to-planar routines on the Amiga quite a bit, no?

Depends. There's the memory bandwidth issue Psycho mentioned. (Chipram is *slow*).

Also, shuffle instructions are usually not as awesome as you'd like them to be. This is bumping into another limit - number of register read/write ports (=number of regs that can be read/written in a single cycle). The problem is that any transpose-style operation scatters values from one register with N independent "fields" (whether it be single bits of a pixel like in C2P or multiple bits like when doing a float32 matrix transpose) into N other registers. In a 2-operand ISA with one register write per instruction, you still need 2*log2(N) "shuffle step" style ops and log2(N) moves to do a full transpose. You can work around the ISA issue by having a special opcode that requires a specific data layout in registers, but the port limit is substantially harder (and register file area grows with the square of the number of ports!). You can microcode it and get rid of the moves (which are not a big problem in a superscalar processor because they are independent and can be paired with something else). But that's still 2*log2(N) cycles in the "execute" stage, and in an in-order superscalar design, it will most likely be a non-pairable instruction.
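To make the log2(N) structure concrete, here is a rough sketch of a butterfly-style transpose over N "registers" of N fields each, modelled as plain Python integers. The names `transpose_fields`, `pack` and `unpack` are made up for this illustration, and the op counting (two shuffle-style ops plus one move per register pair per round) is just one way to read the numbers above, not a model of any real ISA.

```python
def pack(fields, width):
    """Pack a list of small ints into one register, field 0 in the low bits."""
    reg = 0
    for c, value in enumerate(fields):
        reg |= value << (c * width)
    return reg

def unpack(reg, n, width):
    """Split a register back into its n fields."""
    field = (1 << width) - 1
    return [(reg >> (c * width)) & field for c in range(n)]

def transpose_fields(regs, width):
    """Transpose N registers of N fields each in log2(N) rounds."""
    n = len(regs)
    assert n > 0 and n & (n - 1) == 0, "N must be a power of two"
    full = (1 << (n * width)) - 1
    field = (1 << width) - 1
    regs = list(regs)
    shuffles = moves = 0
    j = 1
    while j < n:
        # Mask selecting the fields whose index has bit j clear.
        mask = 0
        for c in range(n):
            if c & j == 0:
                mask |= field << (c * width)
        shift = j * width
        for r in range(n):
            if r & j:
                continue
            a, b = regs[r], regs[r + j]  # keeping the old a is the "move"
            # Two "shuffle step" ops: each writes exactly one register.
            regs[r] = (a & mask) | ((b & mask) << shift)
            regs[r + j] = ((a >> shift) & mask) | (b & ~mask & full)
            shuffles += 2
            moves += 1
        j <<= 1
    return regs, shuffles, moves

# 4 registers x 4 byte-sized fields; register r holds r*4 + c in field c.
rows = [pack([r * 4 + c for c in range(4)], 8) for r in range(4)]
cols, shuffles, moves = transpose_fields(rows, 8)
print([unpack(x, 4, 8) for x in cols], shuffles, moves)
```

With a field width of 1 and eight registers, the same routine is a bit-level 8x8 transpose, which is the core step of a C2P conversion.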

In short, you're unlikely to get a real win from this, and it's an awfully specialized opcode. What you really want for this kind of functionality is some simple fixed-function asynchronous DMA engine that can do the C2P on the fly, with none of the problems mentioned above.

But it's very easy to screw that up as well :)

added on the 2010-07-14 20:03:14 by ryg

Since it's all fantasy anyway, what you really want is a chunky mode in the AGA chipset. Oh, and 3D transform and texture mapping in hw, too.

added on the 2010-07-14 20:23:34 by xeron

Many C64 effects could be 20-30% faster just by having one more index register, or addressing modes like lda ($..),x. I have spent many, many hours trying to work around 'one more register, please' or 'why doesn't that addressing mode exist' in a satisfactory way.

added on the 2010-07-14 22:11:54 by Oswald

Something that would also help on the 6502 is to be able to set the flags without having to overwrite a register, the equivalent of the "tst" instruction on 68000.

added on the 2010-07-14 23:05:20 by Dbug

Chunky mode or faster C2P won't make much difference. A full C2P of 320x200 pixels takes about 20% of a frame. What we really need is faster chip RAM. Except that would spoil all the fun of interleaving computations with the chip RAM writes. :)

added on the 2010-07-14 23:58:06 by Blueberry

PulkoMandy, because you can use EX DE, HL. ;)

added on the 2010-07-15 17:58:08 by MuffinHop

move.l godis,oron
move.l demo,screen
rts

added on the 2010-07-15 23:25:20 by Photon

Real 16-bit addressing on the C64... or floating (or some kind of fixed) point on both the Z80 and the C64...

added on the 2010-07-16 00:22:57 by merry

saturated add on 68k would have been nice.
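For anyone unsure what is being asked for: a saturating add clamps at the maximum value instead of wrapping around. A minimal sketch for unsigned bytes, assuming the usual emulation of "add, then OR in an all-ones mask if the carry was set" (the function name `sat_add_u8` is invented here):

```python
def sat_add_u8(a, b):
    """Unsigned 8-bit add that clamps at 0xFF instead of wrapping."""
    total = (a & 0xFF) + (b & 0xFF)
    carry = total >> 8             # 1 if the 8-bit add overflowed, else 0
    # -carry is all ones when the add overflowed, so the OR forces 0xFF.
    return (total | -carry) & 0xFF

print(sat_add_u8(200, 100), sat_add_u8(10, 20))
```

Without the instruction, that carry test and mask cost extra ops on every add, which adds up in inner loops for things like additive blending.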

added on the 2010-07-16 03:26:23 by loaderror

Can someone explain to me why interleaving instructions with writes to video RAM was the fastest method on Amiga? I have similarly slow video ram on my platform and was curious if I could use the same method -- but I need to understand the rationale first.

added on the 2010-07-16 06:21:26 by trixter

trixter: not knowing the platform, sounds like the video memory has some waitstates and the memory bus is busy whenever you write to it, leaving some extra cycles for the cpu.

added on the 2010-07-16 07:53:35 by sol_hsa

Platform is IBM CGA. Adds a single wait state. I don't think anyone's ever done experiments to find out if interleaving is faster than writing to system ram then doing REP MOVSW so it looks like I'll have to check it out myself. I was just curious about the Amiga details.

added on the 2010-07-16 22:11:30 by trixter

@Zerkman:

Quote:

Code:

add r0, r0, r0, lsl #8  ; multiplies r0 by 257 mod 2^32
add r0, r0, #47         ; adds a prime number

RNG in only one instruction (iirc it was devised by Pervect/Topix):

Code:

REM Creates a new random number in m0, affects flags
DEFFNrandom(m0):[opt opt%:rsb m0,m0,m0,ror#11:]
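For readers without an ARM at hand, both generators can be modelled on 32-bit values roughly like this. The function names are made up for the sketch; the arithmetic follows the instructions as written (RSB computes operand2 minus the first source register), and nothing here says anything about the statistical quality of either generator.

```python
MASK32 = 0xFFFFFFFF

def ror32(x, n):
    """Rotate a 32-bit value right by n bits."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def lcg_step(r0):
    """Two-instruction version: r0 = r0 * 257 + 47 (mod 2^32)."""
    r0 = (r0 + (r0 << 8)) & MASK32   # add r0, r0, r0, lsl #8
    return (r0 + 47) & MASK32        # add r0, r0, #47

def rsb_step(m0):
    """One-instruction version: m0 = ror(m0, 11) - m0 (mod 2^32)."""
    return (ror32(m0, 11) - m0) & MASK32  # rsb m0, m0, m0, ror #11

seed = 1
for _ in range(4):
    seed = rsb_step(seed)
    print(hex(seed))
```

One thing the model makes obvious: the one-instruction version maps 0 to 0, so it needs a nonzero seed.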

But well, I would add nothing to ARM processors, except people using them instead of those fucking !ntel processors! :(

added on the 2010-07-21 14:35:00 by baah

@trixter: 68k cpus have plenty of regs compared to contemporary x86's, though, so interleaving ops is more complicated.. =)

added on the 2010-07-21 15:40:58 by sol_hsa

Since when is rep movsw valid for 68k? :P

added on the 2010-07-21 17:04:58 by ferris

Way to put words in my mouth, dude =)

added on the 2010-07-21 19:55:54 by sol_hsa

For Intel and maybe ARM too:

Code:


INLINE java
System.out.println("here be the cool code");
END INLINE

added on the 2010-07-21 20:16:45 by waffle

pants_off

added on the 2010-07-21 20:59:34 by nosfe

added on the 2010-07-21 21:06:03 by absadhkjsaduoiw1

POST topic_id, $pointer_to_raw_image_data

added on the 2010-07-21 23:07:26 by Tigrou

pouët.net
