Tiny Intro Toolbox Thread

category: code [glöplog]

*FCOMI ST(0), ST(1)
*6 bytes only
:-( I would delete my last post if I could

added on the 2019-04-06 16:28:12 by TomCatAbaddon

Quote:

And as long as it runs with a non-emulated DOS on a current machine - you have to deal with the fact that it is a totally valid demoscene-related coding platform.

(that was las's response to xTr1m and hArDy)

True.
For me, everything that can run during the boot process on any current machine (from the boot sector), or that at least runs from a USB-booted DOS, fits into the DOS platform :-)

included:
- mode 13h
- vesa 640x480 truecolor
- PC speaker

I'm not sure about midi music. No hw midi support in current machines.
So I don't feel the midi intro is a DOS product, just a DOSBox product (or maybe a retro DOS product).

added on the 2019-04-06 17:02:42 by TomCatAbaddon

A recent find for me, which might have been overlooked because the fisttp instruction was only introduced with SSE3: if you want to truncate a float for further use, you can do it like this without changing the rounding mode:

Code:

fisttp word[address]
fild word[address]

Even going through memory it's also much faster than using frndint.
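
A rough Python model of the difference, in case it helps: fisttp always chops toward zero, while fistp and frndint follow the current rounding mode (round to nearest, ties to even, by default). The function names are made up for this sketch, and only finite inputs that fit in a signed word are covered.

```python
import math


def fisttp_word(x):
    """Model of 'fisttp word[address]': always truncates toward zero."""
    n = math.trunc(x)
    assert -32768 <= n <= 32767, "sketch only covers values that fit in a word"
    return n


def fistp_word_default(x):
    """Model of 'fistp word[address]' in the default rounding mode."""
    n = round(x)  # Python's round() also rounds ties to even
    assert -32768 <= n <= 32767, "sketch only covers values that fit in a word"
    return n


if __name__ == "__main__":
    for x in (2.5, 2.7, -2.7, 3.5):
        print(x, fisttp_word(x), fistp_word_default(x))
```

The fisttp | fild pair then simply reloads the truncated integer as a float.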

added on the 2019-04-06 18:50:52 by Kuemmel

Quote:

Puls uses this - it's about 6 bytes shorter:

Code:

for (uint16 i = 0; i < 65536; i++) // automatic wrap
{
  int16 dx = (i * 0xCCCD - Center) >> 8;  // automatic modulo to ±32767
  int16 dy = (i * 0xCCCD - Center) >> 16; // ±20480
  // use dx, dy
}

Center is a constant that maps (159.5, 99.5) to (0,0), can be killed using segment magic.
The >>8 and >>16 are also free (pusha + byte addressing into the stack).
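
Why the constant works can be checked with a small Python model (a sketch only: the function name is made up, and the Center offset is left out). Since 320 * 0xCCCD = 0x1000040, every screen row adds 1 to the top byte of the 32-bit product, so the top byte is the row and the next byte is the column scaled by 4/5.

```python
RRROLA = 0xCCCD  # roughly 0.8 * 65536


def split_pixel_index(i):
    """Return the top two bytes (DH, DL) of the 32-bit product i * 0xCCCD."""
    product = (i * RRROLA) & 0xFFFFFFFF
    return (product >> 24) & 0xFF, (product >> 16) & 0xFF


if __name__ == "__main__":
    for x, y in ((0, 0), (319, 0), (160, 100), (319, 199)):
        print((x, y), split_pixel_index(320 * y + x))
```

For every pixel of the 320x200 screen this gives exactly (y, 4*x//5), which is the "Y:X in DH:DL" approximation; the lower 16 bits carry the extra X precision.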

Pyrit goes one step further, it uses add instead of mul:

Code:

  mov dx,0xA000-10-20-20-4
  mov es,dx    ; dx:bx = YX:XX = 0x9fca:0

; the visible pixels are A0000..AF9FF, I want X=0 Y=0 in the center
;Each pixel: cx=T dx:bx=YX:XX(init=9fca:0) di=adr(init=-4)

X:inc dx       ; part of "dx:bx += 0x0000CCCD"
X2:
  stosb
  pusha        ; adr:     -18 -16 -14 -12 -10  -8  -6  -4  -2
  fninit       ; stack:    di  si  bp  sp  bx  dx  cx  ax   0
  mov bx,es    ; s16:  pixadr 100 9??  -2  ..X..Y  T result
  mov di,-4 ;di = address of pushed ax

  ...

  add bx,0xCCCD; dx:bx = YXX += 0000CCCD
  jnc X2
  jnz X        ; do 65536 pixels

Tricky

added on the 2019-04-06 19:11:59 by TomCatAbaddon

Both Puls and Pyrit use ADD and both spend about 26 bytes on the loop with different tradeoffs.

Puls has YYXX in memory. The long memory addition doesn't set the correct flags at the end, so [ES:BP+SI] is used for the pixel address and INC BP for the loop test. BP is wrong during the first frame.

Code:


8 push 0x9FCE | pop es ... pop 4 bytes | push es | push bp  ; INIT
5 XY: L: fild word[di] | dec di | jpo L  ; FILD
7 add dword[di],0xCCCD        ; YYXX++
4 inc bp | mov [es:bp+si],al  ; DRAW, ADDR++
2 jnz XY

Pyrit has YYXX in DX:BX. The low-word addition sets the zero flag correctly at the end, so [ES:DI] could be used for the address. But DI needs to be -4 in the inner loop for other reasons.

Code:


5 assume bx=0 | mov dx,0x9FCA | mov es,dx     ; INIT
1 XY: inc dx | XY2:     ; YYXX++ (first part)
11 pusha | mov di,-4 | fild word[di+4-9] | fild word[di+4-8] | popa  ; FILD
1 stosb                 ; DRAW, ADDR++
6 add bx,0xCCCD | jnc XY2  ; YYXX++ (second part)
2 jnz XY
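
To convince yourself that stepping with ADD/INC visits the same values as multiplying, here is a small Python model of the Pyrit loop structure (a sketch: add_walk is a made-up name, and the start value of DX is set to 0 instead of the segment-based init).

```python
RRROLA = 0xCCCD


def add_walk(dx=0):
    """Yield (dx, bx) for every pixel, stepping like the ADD/JNC/JNZ loop."""
    bx = 0
    while True:
        yield dx, bx
        bx += RRROLA              # add bx,0xCCCD
        carry = bx >> 16
        bx &= 0xFFFF
        if not carry:
            continue              # jnc XY2
        if bx == 0:
            return                # jnz XY not taken: all pixels done
        dx = (dx + 1) & 0xFFFF    # XY: inc dx


if __name__ == "__main__":
    states = list(add_walk())
    print(len(states), states[:3], states[-1])
```

Because 0xCCCD is odd, BX only returns to 0 after exactly 65536 additions, which is why the zero flag of the low-word ADD can double as the end-of-frame test.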

I think Pyrit's version is more versatile. Puls uses PUSHA/POPA anyway to get more registers in the pixel computation, so they're almost free. And having the approximation of Y:X in DH:DL is useful (some intros need only that).

The MUL version can probably be smaller because of more free registers. I don't know yet.

added on the 2019-04-06 23:29:34 by rrrola

Storytime!

Back then, when i was diving into all the floating point stuff (2015/16), i came across this thread, of course, and learned about what we (the sizecoders) coined the "Rrrola constant" (0xCCCD). When p01 was amazed about how short the idea *really* is, i looked into the source of "puls" and was a bit puzzled by that 7-byte ADD: that's not really short, is it? In my own attempts at making that magic constant work, i reverted to using MUL and aligning manually with the opcode of "PUSH <word>", which works the same way as offsetting DX with ES (which i find really funny, btw). The most optimized version i can think of now is the 52-byte version of the tunnel included in Neontube 64b. However, i was initially satisfied with 8-bit precision of X, which can look chunky. It's not too hard to align the values on the stack though (and to correct BX again to optimize the stack accesses), which leads to this 25-byte version (following the previous comment convention):

Code:

4 push 0x9FBA | pop es ; [SI] contains alignment(!) ; INIT
3 X: mov ax,0xCCCD
5 mul di | sub dh,[si] | xchg bx,ax ; DX:BX = YYXX
10 pusha | xor bx,bx | fild word [bx-8] | fild word [bx-9] | popa ; FILD
3 stosb | jmp short X ; DRAW, ADDR

SI is "locked" in that version, but two more bytes can be saved if the stack is accessed in some way other than [BX +- signed byte]. With 8-bit precision, it is 22 bytes.

Also, i recently found that the "Rrrola trick" works in text mode, although you'd have to spend 6 more bytes on conversion and aspect ratio (see here).

In this 25-byte version, as well as in the two 26-byte versions, there is no explicit frame counter. Back then i found a really nice trick to get a frame counter from MUL, which sets the carry flag every time but twice; keeping CL as is (0xFF) and reusing the mod byte of "INT 0x10" (ADC) saves another byte, explained here. In the Pyrit code a similar idea could save one byte (ADC with zero in reg/mem instead of JNC + INC), but only if a zero is available in a register or an easily accessible memory location.

It's "better" to use a synced timer anyway [0:0x46C], but as of now that requires more space than the versions above.

added on the 2019-04-07 14:11:15 by HellMood

Another idea is to put the values on the stack already in order. That "locks" BX too, but leads to this 24-byte version:

Code:

4 push 0x9FBA | pop es ; [SI] contains alignment(!) ; INIT
3 X: mov ax,0xCCCD
6 mul di | sub dh,[si] | push dx | push ax ; stack = YYXX
8 fild word [bx-4] | fild word [bx-5] | pop dx | pop ax ; FILD
3 stosb | jmp short X ; DRAW, ADDR++

added on the 2019-04-07 20:26:43 by HellMood

Note: both of my versions already loop infinitely, so strictly speaking their core is two bytes shorter (23/22 bytes). They do need additional space later on, for example for checking frame ends or for continuous sync to the timer, which again costs about two more bytes: inc reg + jmp short (3) in rrrola's examples vs. pop ds + add/sub anim_reg,[0x46c] (5), or alternatively the mentioned ADC trick reusing int 10h (3-5), which is hard to pull off and locks yet another register.

I guess the approaches are not directly comparable byte for byte due to the different constraints. The number of possibilities is amazing =)

added on the 2019-04-07 20:51:38 by HellMood

Historytime!

Which was the first product to use the CCCDh constant?
Was it Puls from 2009?

Yes, probably.
I used to think I had seen it earlier, at the HugiCompos or the ChristmasCompos, maybe from Digimind.

But yesterday I did a little research. With the help of TotalCommander I searched for the string CCCD in *.ASM and for the hex bytes CDCC in *.COM, and I didn't find anything earlier (only some false positives caused by float constants).

So yes, Rrrola, it's from you!
:)

added on the 2019-04-10 00:22:07 by TomCatAbaddon

Quote:

FCMOVB ST(0),ST(1)
FSTP ST(1)

I needed to make a float max() and came up with the same solutions as in your post, but I'm unable to test the FCMOV variant because it requires a Pentium Pro. DOSBox doesn't seem to be able to emulate it, and I don't have a way to run FreeDOS. Is there a good emulator which could run an intro using these instructions?

added on the 2019-04-11 14:01:44 by fizzer

Okay, I just noticed DOSBox-X is able to emulate it. Thanks!

added on the 2019-04-11 14:06:05 by fizzer

Maybe this is a common task that you'd like to avoid but sometimes can't: clamping a float (sign unknown) to an unsigned integer byte. I came up with this. It's not sexy, but it seems short (15 bytes). Any shorter ideas?

Code:

fistp word[si]			;store float in int 
test word[si],0xff00
mov al,byte[si]
jz skip_clamp			;...already in range of 0...255
  stc				;? > 255 => carry = 1
jns skip_min
  clc				;? <   0 => carry = 0
skip_min:
salc				;carry=0=> al=0, carry=1=> al=255
skip_clamp:

added on the 2019-04-11 19:18:49 by Kuemmel

11 bytes:

Code:

fistp word [si]
lodsw
test ah, ah
jz skip_clamp
add ah, ah
cmc
salc
skip_clamp:
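
For anyone who wants to check this without a DOS box at hand, here is a Python model of the 11-byte sequence, compared against a plain clamp for all 65536 possible words (a sketch; the function names are made up, and fistp is assumed to have already stored the 16-bit integer).

```python
def reference_clamp(word):
    """Interpret the word as signed 16-bit and clamp it to 0..255."""
    signed = word - 0x10000 if word & 0x8000 else word
    return max(0, min(255, signed))


def clamp_11_bytes(word):
    """Model of: lodsw / test ah,ah / jz / add ah,ah / cmc / salc."""
    al, ah = word & 0xFF, word >> 8       # lodsw
    if ah == 0:                           # test ah,ah / jz skip_clamp
        return al
    carry = ah >> 7                       # add ah,ah: CF = old sign bit
    carry ^= 1                            # cmc
    return 0xFF if carry else 0x00        # salc: al = CF ? FF : 00


if __name__ == "__main__":
    bad = [w for w in range(0x10000) if clamp_11_bytes(w) != reference_clamp(w)]
    print("mismatches:", len(bad))
```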

added on the 2019-04-11 20:24:49 by frag

Oops, 9 bytes:

Code:

fistp word [si]
lodsw
rol ah, 1
jz skip_clamp
cmc
salc
skip_clamp:

added on the 2019-04-11 20:30:33 by frag

@frag are you sure about the zero flag after rol ah,1 ?

this was my first thought:

Code:

 FISTP WORD [SI]
 LODSW
 TEST AH,AH
 JZ ok
 JNS negative
 STC
negative:
 SALC
ok:

added on the 2019-04-11 20:47:19 by TomCatAbaddon

Nice! I can't afford to have si incremented, but that still gets it down to 10 bytes :-)

added on the 2019-04-11 20:51:19 by Kuemmel

but use SHL or ADD AH,AH instead of ROL

added on the 2019-04-11 21:14:10 by TomCatAbaddon

Shit, forgot all the x86 asm lol. Of course rol will not change ZF.
shl, add would not work for 0x80xx.
Still pretty sure it can be done in 10 bytes.

Your JNS must be JS by the way.

added on the 2019-04-11 22:03:33 by frag

Kuemmel knows... but 80xxh is a very low negative number, so maybe it works for him.

added on the 2019-04-11 22:56:17 by TomCatAbaddon

This thread has been a tremendous help for me to get started on Tiny Intro coding for DOS. It's time I give something back. :)

Include this function in your Mode 13h intro and call it after producing each frame to dump a sequence of BMP images that can be merged in VirtualDub or similar.

Does not assume any specific state on entry. Preserves all registers, segment registers, flags and the palette index. Trashes 1078 bytes just prior to A0000.

If you want more than 9999 frames, you can just increase the number of trailing '0' in the filename (while keeping it at max 8 chars total) and change the "mov cx, 4" accordingly.

Code:

FrameDump:
	pusha
	push ds
	push es
	lahf
	push ax
	cld

	push cs
	pop ds

	; Update filename
	mov bx, Extension
	mov cx, 4
.incloop:
	dec bx
	inc byte [bx]
	cmp byte [bx], '9'
	jle .incdone
	mov byte [bx], '0'
	loop .incloop
	; Exit after 9999 frames
	mov ax, 0x3
	int 0x10
	int 0x20
.incdone:

	push 0xa000-1536/16
	pop es

	; Copy header
	mov si, BMP
	mov di, 1536-(14+40+256*4)
	mov cx, 14+40
	rep movsb

	; Get palette
	mov dx, 0x3c8
	in al, dx
	push ax

	mov dx, 0x3c7
	mov al, 0
	out dx, al

	mov dx, 0x3c9
	mov cx, 256
.palette:
	xor eax, eax
	in al, dx
	shl eax, 8
	in al, dx
	shl eax, 8
	in al, dx
	shl eax, 2
	stosd
	loop .palette

	mov dx, 0x3c8
	pop ax
	out dx, al

	; Create File
	mov cx, 0
	mov dx, Filename
	mov ah, 0x3c
	int 0x21
	jc .done
	push ax

	; Write data
	pop bx
	push bx
	mov cx, 14+40+256*4+320*200
	push es
	pop ds
	mov dx, 1536-(14+40+256*4)
	mov ah, 0x40
	int 0x21

	; Close file
	pop bx
	mov ah, 0x3e
	int 0x21

.done:
	pop ax
	sahf
	pop es
	pop ds
	popa
	ret

Filename:
	db "dump0000"
Extension:
	db ".bmp",0

BMP:
	; File header
	db "BM"
	dd 14+40+256*4+320*200
	dw 0,0
	dd 14+40+256*4

	; Info header
	dd 40, 320, -200 ; Header size, width, height
	dw 1, 8 ; Planes, depth
	dd 0,0,0,0,0,0

Use as you wish. :)
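
For reference, the same header layout and palette packing can be written out in Python to double-check the sizes (a sketch only; the function and constant names are made up and it is not part of the intro code).

```python
import struct

WIDTH, HEIGHT = 320, 200
PIXELS_OFFSET = 14 + 40 + 256 * 4   # file header + info header + palette


def bmp_header():
    """Build the 54 header bytes that the BMP: data block above describes."""
    file_header = struct.pack(
        "<2sIHHI", b"BM", PIXELS_OFFSET + WIDTH * HEIGHT, 0, 0, PIXELS_OFFSET
    )
    info_header = struct.pack(
        "<IiiHH6I", 40, WIDTH, -HEIGHT, 1, 8, 0, 0, 0, 0, 0, 0
    )
    return file_header + info_header


def palette_entry(r6, g6, b6):
    """Pack 6-bit DAC values like the .palette loop: bytes are B*4, G*4, R*4, 0."""
    eax = ((r6 << 16) | (g6 << 8) | b6) << 2
    return struct.pack("<I", eax)


if __name__ == "__main__":
    print(len(bmp_header()), PIXELS_OFFSET, palette_entry(63, 0, 1).hex())
```

The negative height marks the image as top-down, so the screen memory can be written as is; PIXELS_OFFSET is also the 1078 bytes that get trashed just below A0000.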

added on the 2019-04-15 12:13:32 by Blueberry

Thanks, Blueberry!
I use this to save a series of paletted, vertically-flipped TGA files. Just "call SCREENSHOT". Uses the stack and 768 bytes of memory right after the intro.

Code:

SCREENSHOT:
  pusha
  pushf
  push ds

  push cs
  pop  ds

;read palette
  mov  ax,255
  mov  di,PALETTE+256*3-1
PALREAD:
  mov  dx,3C7h
  out  dx,al
  inc  dx
  inc  dx
  mov  cx,3
  push ax
PALRGB:
  in   al,dx
  shl  al,2
  mov  [di],al  ; [di] = b*4 g*4 r*4
  dec  di
  loop PALRGB
  pop  ax
  dec  ax
  jns  PALREAD

;increase filename number

  mov  di,FILENAME + (HEADER-FILENAME) - 5
INCNAME:
  inc  byte[di]
  cmp  byte[di],':'
  jb   ENDINCNAME
  mov  byte[di],'0'
  dec  di
  jmp  INCNAME
ENDINCNAME:

;write the TGA file and return

  mov  ah,3Ch  ; create file
  mov  dx,FILENAME
  xor  cx,cx
  int  21h

  xchg ax,bx   ; bx=handle

  mov  ah,40h  ; write header and palette
  mov  dx,HEADER
  mov  cx,18+256*3
  int  21h

  push 0A000h
  pop  ds

  mov  ah,40h  ; write pixels
  cwd          ; dx=0
  mov  cx,320*200
  int  21h

  mov  ah,3Eh  ; close file
  int  21h

  pop  ds
  popf
  popa
  ret

FILENAME db "0000/.tga" ;,0
HEADER   db 0,1,1
         dw 0,256
         db 24
         dw 0,0,320,200
         db 8,00100000b

section .bss align=1
PALETTE: resb 256*3
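
Two details of the listing are easy to miss: the filename starts with '/' in the last digit position so that the first increment yields "00000.tga", and the first header byte (0) doubles as the filename terminator. A small Python model of both (a sketch with made-up function names):

```python
import struct


def next_name(name):
    """Model of the INCNAME loop: bump the digit before '.tga', carrying left."""
    chars = bytearray(name)
    i = len(chars) - 5                 # the character just before ".tga"
    while True:
        chars[i] += 1
        if chars[i] < ord(":"):        # cmp byte[di],':' / jb ENDINCNAME
            break
        chars[i] = ord("0")
        i -= 1                         # like the asm, no check for running out of digits
    return bytes(chars)


def tga_header():
    """The 18 header bytes: color-mapped, 256 x 24-bit palette, 320x200x8, top-down."""
    return struct.pack(
        "<BBBHHBHHHHBB", 0, 1, 1, 0, 256, 24, 0, 0, 320, 200, 8, 0b00100000
    )


if __name__ == "__main__":
    name = b"0000/.tga"
    for _ in range(3):
        name = next_name(name)
        print(name.decode())
    print(len(tga_header()))
```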

added on the 2019-04-15 23:54:14 by rrrola

10-byte clamp to unsigned byte. The trick is to test for negatives first.

Code:

  fistp word[si]
  lodsw
  add ah,ah
  jc NEGATIVE ; 8000..FFFF -> FF
  jz OK       ; 0000..00FF
              ; 0100..7FFF -> 00 (carry=0 here)
NEGATIVE:
  salc
OK:

added on the 2019-04-16 00:26:36 by rrrola

Disregard that, I forgot CMC. It's still 11 bytes.

added on the 2019-04-16 00:28:59 by rrrola

Signed clamp is easier. Still 11 bytes, but you can use other multipliers, which might save space elsewhere. Result in AH.

Code:

  fistp word[di]  ; assume di=sp
  pop ax
  imul si  ; si=100h -> dh:dl:ah:al = signbit:ah:al:0
  jnc OK
  mov ah,7Fh
  sub ah,dh   ; ah: FF->80, 00->7F
OK:

Instead of pop|imul, you can also do mov ax,si | imul word[di].
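
A Python model of the signed version with si=100h, checked against a plain clamp to -128..127 (a sketch; the function names are made up):

```python
def reference_signed_clamp(word):
    """Clamp a signed 16-bit value to -128..127 and return it as a byte."""
    signed = word - 0x10000 if word & 0x8000 else word
    return max(-128, min(127, signed)) & 0xFF


def signed_clamp_imul(word):
    """Model of: pop ax / imul si (si=100h) / jnc OK / mov ah,7Fh / sub ah,dh."""
    ax = word - 0x10000 if word & 0x8000 else word
    product = ax * 0x100                              # imul si
    carry = not (-0x8000 <= product <= 0x7FFF)        # CF: result does not fit in AX
    p32 = product & 0xFFFFFFFF                        # DX:AX
    ah = (p32 >> 8) & 0xFF                            # old AL
    dh = (p32 >> 24) & 0xFF                           # 00 or FF (sign)
    if carry:
        ah = (0x7F - dh) & 0xFF                       # FF -> 80, 00 -> 7F
    return ah


if __name__ == "__main__":
    for w in (0x0005, 0x00C8, 0xFF7F, 0xFFFB):
        print(hex(w), hex(signed_clamp_imul(w)))
```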

added on the 2019-04-16 01:12:54 by rrrola

Quote:

I use this to save a series of paletted, vertically-flipped TGA files. Just "call SCREENSHOT". Uses the stack and 768 bytes of memory right after the intro.

Ah, I completely forgot about the pushf/popf instructions. :)

This tendency of old, simple image formats to store the image bottom-up is quite annoying. Good thing that (an appropriate variation of) the BMP format allows you to put a negative height to flip the image to top-down.

I had an earlier version that split the write into two in order not to trash the memory before the screen area. But it seemed the one-write version was faster (though still quite slow) when writing to a USB stick in FreeDOS. I could be imagining things, though...

added on the 2019-04-16 14:10:35 by Blueberry

pouët.net
