Tiny Intro Toolbox Thread
category: code [glöplog]
*FCOMI ST(0), ST(1)
*6 bytes only
:-( I would delete my last post if I could
Quote:
(This was las's response to xTr1m and hArDy.) And as long as it runs on a non-emulated DOS on a current machine, you have to deal with the fact that it is a totally valid demoscene-related coding platform.
True.
For me, the DOS platform covers everything that can run during the boot process on any current machine (from the boot sector), or at least runs from a USB-booted DOS :-)
This includes:
- mode 13h
- vesa 640x480 truecolor
- PC speaker
I'm not sure about MIDI music: current machines have no hardware MIDI support.
So I don't feel a MIDI intro is a DOS product, just a DOSBox product (or maybe a retro DOS product).
A recent find for me, which may have been overlooked because the fisttp instruction was only introduced with SSE3: if you want to truncate a float for further use, you can do it without changing the rounding mode:
Code:
fisttp word[address]
fild word[address]
Even going through memory it's also much faster than using frndint.
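The difference between truncation and the FPU's default rounding can be sketched in Python. This is only a model under assumptions: the function names are made up for illustration, the default rounding mode is taken to be round-to-nearest-even, and 16-bit overflow is ignored.

```python
import math

def fisttp_model(x):
    # Models FISTTP: always truncates toward zero, whatever the rounding mode is.
    return math.trunc(x)

def fistp_model(x):
    # Models FISTP under the assumed default mode (round to nearest, ties to even).
    return round(x)

for x in (2.7, -2.7, 2.5, 3.5):
    print(x, fisttp_model(x), fistp_model(x))
```

The fisttp/fild pair in the post corresponds to fisttp_model followed by converting the integer back to a float.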
Quote:
Puls uses this - it's about 6 bytes shorter:
Code:
for (uint16 i = 0; i < 65536; i++) // automatic wrap
{
    int16 dx = (i * 0xCCCD - Center) >> 8;  // automatic modulo to ±32767
    int16 dy = (i * 0xCCCD - Center) >> 16; // ±20480
    // use dx, dy
}
Center is a constant that maps (159.5, 99.5) to (0,0), can be killed using segment magic.
The >>8 and >>16 are also free (pusha + byte addressing into the stack).
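Why the constant works can be checked with a small Python model. It is a sketch, not Puls' actual code: the name rrrola_coords is invented here, the Center offset is left out, and a 320-pixel-wide Mode 13h screen is assumed. Since 0xCCCD is roughly 0.8 * 65536, the high word of i * 0xCCCD holds the row in its high byte and the column scaled by 4/5 in its low byte.

```python
RRROLA = 0xCCCD  # roughly 0.8 * 65536

def rrrola_coords(i):
    """Return (DH, DL) of DX after a 16x16 -> 32 bit MUL of pixel index i by 0xCCCD."""
    product = (i * RRROLA) & 0xFFFFFFFF   # DX:AX
    dx = product >> 16
    return dx >> 8, dx & 0xFF             # DH = row, DL = column * 4 // 5

for i in (0, 319, 320, 63999):
    print(i, rrrola_coords(i))
```

The column only has 8 bits here; the extra X precision discussed in the thread lives in the low word (AX).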
Pyrit goes one step further, it uses add instead of mul:
Code:
; Tricky
mov dx,0xA000-10-20-20-4
mov es,dx ; dx:bx = YX:XX = 0x9fca:0
; the visible pixels are A0000..AF9FF, I want X=0 Y=0 in the center
;Each pixel: cx=T dx:bx=YX:XX(init=9fca:0) di=adr(init=-4)
X:inc dx ; part of "dx:bx += 0x0000CCCD"
X2:
stosb
pusha ; adr: -18 -16 -14 -12 -10 -8 -6 -4 -2
fninit ; stack: di si bp sp bx dx cx ax 0
mov bx,es ; s16: pixadr 100 9?? -2 ..X..Y T result
mov di,-4 ;di = address of pushed ax
...
add bx,0xCCCD; dx:bx = YXX += 0000CCCD
jnc X2
jnz X ; do 65536 pixels
Both Puls and Pyrit use ADD and both spend about 26 bytes on the loop with different tradeoffs.
Puls has YYXX in memory. The long memory addition doesn't set the correct flags at the end, so [ES:BP+SI] is used for the pixel address and INC BP for the loop test. BP is wrong during the first frame.
Code:
8 push 0x9FCE | pop es ... pop 4 bytes | push es | push bp ; INIT
5 XY: L: fild word[di] | dec di | jpo L ; FILD
7 add dword[di],0xCCCD ; YYXX++
4 inc bp | mov [es:bp+si],al ; DRAW, ADDR++
2 jnz XY
Pyrit has YYXX in DX:BX. The low-word addition sets the zero flag correctly at the end, so [ES:DI] could be used for the address. But DI needs to be -4 in the inner loop for other reasons.
Code:
5 assume bx=0 | mov dx,0x9FCA | mov es,dx ; INIT
1 XY: inc dx | XY2: ; YYXX++ (first part)
11 pusha | mov di,-4 | fild word[di+4-9] | fild word[di+4-8] | popa ; FILD
1 stosb ; DRAW, ADDR++
6 add bx,0xCCCD | jnc XY2 ; YYXX++ (second part)
2 jnz XY
I think Pyrit's version is more versatile. Puls uses PUSHA/POPA anyway to get more registers in the pixel computation, so they're almost free. And having the approximation of Y:X in DH:DL is useful (some intros need only that).
The MUL version can probably be smaller because of more free registers. I don't know yet.
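The control flow of the Pyrit loop (inc dx / add bx,0xCCCD / jnc / jnz) can be modelled in Python to see that it visits exactly 65536 pixels. This is a sketch under assumptions: pyrit_loop is an invented name, BX starts at 0 as in the listing, and the pixel computation itself is left out.

```python
def pyrit_loop(dx=0x9FCA):
    """Model of the DX:BX accumulator loop; returns (pixels drawn, final DX)."""
    bx, pixels = 0, 0
    while True:
        dx = (dx + 1) & 0xFFFF      # X:  inc dx (carry from the low word)
        while True:
            pixels += 1             # X2: stosb draws one pixel
            total = bx + 0xCCCD     # add bx,0xCCCD
            bx = total & 0xFFFF
            if total > 0xFFFF:      # carry set, so jnc X2 falls through
                break
        if bx == 0:                 # jnz X not taken: all pixels done
            return pixels, dx

print(pyrit_loop())
```

Because 0xCCCD is odd, BX only returns to zero after 65536 additions, and that last addition also carries, so the zero flag can double as the loop test.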
Storytime!
Back then, when i was diving into all the floating point stuff (2015/16), i came across this thread - of course - and learned about what we (the sizecoders) coined the "Rrrola constant" (0xCCCD). When p01 was amazed about how short the idea *really* is, i looked into the source of "puls" and was a bit puzzled by that 7 byte ADD - that's not really short, is it? In my own attempts at making that magic constant work, i fell back to using MUL and aligning manually with the opcode of "PUSH <word>", which works the same way as offsetting DX with ES (which i find really funny btw). The most optimized version i can think of now is the 52 byte version of the tunnel included in Neontube 64b. However, i was initially satisfied with 8 bit precision of X, which can look chunky. It's not too hard to align the values on the stack though (and to correct BX again to optimize the stack accesses), which leads to this 25 byte version (following the previous comment convention):
Code:
4 push 0x9FBA | pop es ; [SI] contains alignment(!) ; INIT
3 X: mov ax,0xCCCD
5 mul di | sub dh,[si] | xchg bx,ax ; DX:BX = YYXX
10 pusha | xor bx,bx | fild word [bx-8] | fild word [bx-9] | popa ; FILD
3 stosb | jmp short X ; DRAW, ADDR
SI is "locked" in that version, but two further bytes can be saved if the stack access happens in other ways than [BX +- signed byte]. With 8 bit precision, it is 22 bytes.
Also, recently, i found that the "Rrrola trick" works in textmode, although you'd have to spend 6 more bytes for conversion and aspect ratio (see here).
In this 25 byte version, as well as in the two 26 byte versions, there is no explicit framecounter. Back then i found a really nice trick to get a framecounter from MUL, which sets the Carry Flag every time but twice; keeping CL as is (0xFF) and reusing the mod byte of "INT 0x10" (ADC) saves another byte, explained here. In the Pyrit code a similar idea could save one byte (ADC with zero in reg/mem instead of JNC + INC), but only if there is a zero available in a register or an easily accessible memory location.
It's "better" to use a synced timer anyway [0:0x46C], but as of now, that requires more space than the versions above.
Another idea is to put the values on the stack already in order. That "locks" BX, too, but leads to this 24 byte version:
Code:
4 push 0x9FBA | pop es ; [SI] contains alignment(!) ; INIT
3 X: mov ax,0xCCCD
6 mul di | sub dh,[si] | push dx | push ax ; stack = YYXX
8 fild word [bx-4] | fild word [bx-5] | pop dx | pop ax ; FILD
3 stosb | jmp short X ; DRAW, ADDR++
Note: both my versions already loop infinitely, so strictly speaking their core is two bytes shorter (23/22 bytes), but they require additional space later on, for example for checking frame ends or continuous sync to the timer, which again needs ~two more bytes (inc reg + jmp short (3) in Rrrola's examples vs pop ds + add/sub anim_reg,[0x46c] (5), or alternatively the mentioned int10-reusing ADC trick (3-5), which is hard to perform and locks yet another register).
I guess the approaches are not downright comparable bit by bit due to the different constraints. The number of possibilities is amazing =)
Historytime!
Which was the first production to use the CCCDh constant?
Was it Puls from 2009?
Yes, probably.
I used to think I had seen this earlier at the HugiCompos or the ChristmasCompos, maybe from Digimind.
But yesterday I did a little research. With the help of TotalCommander I searched for the string CCCD in *.ASM and for the hex bytes CDCC in *.COM. And I didn't find anything earlier (only some false positives caused by float constants).
So yes, Rrrola, it's from you!
:)
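The binary search described above can be sketched in a few lines of Python. This is an illustration only: find_cccd is an invented helper, not what TotalCommander does internally. It also shows why float constants give false positives, since the single-precision value 0.1 is stored as CD CC CC 3D.

```python
import struct

def find_cccd(data):
    """Return all offsets where the little-endian word 0xCCCD (bytes CD CC) occurs."""
    pattern = b"\xCD\xCC"
    hits = []
    pos = data.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = data.find(pattern, pos + 1)
    return hits

mov_ax = b"\xB8\xCD\xCC"               # mov ax,0xCCCD
float_tenth = struct.pack("<f", 0.1)   # 0x3DCCCCCD, a typical false positive
print(find_cccd(mov_ax), find_cccd(float_tenth))
```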
Quote:
FCMOVB ST(0),ST(1)
FSTP ST(1)
I needed to make a float max() and came up with the same solutions as in your post, but I'm unable to test the FCMOV variant because it requires a Pentium Pro. DOSBox doesn't seem to be able to emulate it, and I don't have a way to run FreeDOS. Is there a good emulator which could run an intro using these instructions?
Okay, I just noticed DOSBox-X is able to emulate it. Thanks!
This may be a common task you'd rather avoid but sometimes can't: clamping a float (sign unknown) to an integer byte. I came up with this. It's not sexy, but seems short (15 bytes). Any shorter ideas?
Code:
fistp word[si] ;store float in int
test word[si],0xff00
mov al,byte[si]
jz skip_clamp ;...already in range of 0...255
stc ;? > 255 => carry = 1
jns skip_min
clc ;? < 0 => carry = 0
skip_min:
salc ;carry=0=> al=0, carry=1=> al=255
skip_clamp:
11 bytes:
Code:
fistp word [si]
lodsw
test ah, ah
jz skip_clamp
add ah, ah
cmc
salc
skip_clamp:
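The flag logic of the 11-byte version can be checked against a plain clamp with a Python model. This is a sketch: both function names are invented, and v stands for the signed 16-bit value that fistp stored.

```python
def clamp_reference(v):
    # What the routine is supposed to compute for a signed 16-bit input.
    return max(0, min(255, v))

def clamp_11_bytes(v):
    ax = v & 0xFFFF
    al, ah = ax & 0xFF, ax >> 8
    if ah == 0:                   # test ah,ah / jz skip_clamp: already 0..255
        return al
    carry = (ah >> 7) & 1         # add ah,ah: CF = old bit 7 of AH (the sign)
    carry ^= 1                    # cmc: CF = 1 for positive overflow
    return 0xFF if carry else 0   # salc: AL = FF if CF else 00

print(clamp_11_bytes(-5), clamp_11_bytes(100), clamp_11_bytes(300))
```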
Oops, 9 bytes:
Code:
fistp word [si]
lodsw
rol ah, 1
jz skip_clamp
cmc
salc
skip_clamp:
@frag are you sure about the zero flag after rol ah,1 ?
this was my first thought:
Code:
FISTP WORD [SI]
LODSW
TEST AH,AH
JZ ok
JNS negative
STC
negative:
SALC
ok:
Nice! I can't afford to have si incremented, but even so it's down to 10 bytes :-)
but use SHL or ADD AH,AH instead of ROL
Shit, forgot all the x86 asm lol. Of course rol will not change ZF.
shl, add would not work for 0x80xx.
Still pretty sure it can be done in 10 bytes.
Your JNS must be JS by the way.
Kuemmel knows... but 80xxh is a very low negative number, so maybe it works for him.
This thread has been a tremendous help for me to get started on Tiny Intro coding for DOS. It's time I give something back. :)
Include this function in your Mode 13h intro and call it after producing each frame to dump a sequence of BMP images that can be merged in VirtualDub or similar.
Does not assume any specific state on entry. Preserves all registers, segment registers, flags and the palette index. Trashes 1078 bytes just prior to A0000.
If you want more than 9999 frames, you can just increase the number of trailing '0' in the filename (while keeping it at max 8 chars total) and change the "mov cx, 4" accordingly.
Code:
FrameDump:
pusha
push ds
push es
lahf
push ax
cld
push cs
pop ds
; Update filename
mov bx, Extension
mov cx, 4
.incloop:
dec bx
inc byte [bx]
cmp byte [bx], '9'
jle .incdone
mov byte [bx], '0'
loop .incloop
; Exit after 9999 frames
mov ax, 0x3
int 0x10
int 0x20
.incdone:
push 0xa000-1536/16
pop es
; Copy header
mov si, BMP
mov di, 1536-(14+40+256*4)
mov cx, 14+40
rep movsb
; Get palette
mov dx, 0x3c8
in al, dx
push ax
mov dx, 0x3c7
mov al, 0
out dx, al
mov dx, 0x3c9
mov cx, 256
.palette:
xor eax, eax
in al, dx
shl eax, 8
in al, dx
shl eax, 8
in al, dx
shl eax, 2
stosd
loop .palette
mov dx, 0x3c8
pop ax
out dx, al
; Create File
mov cx, 0
mov dx, Filename
mov ah, 0x3c
int 0x21
jc .done
push ax
; Write data
pop bx
push bx
mov cx, 14+40+256*4+320*200
push es
pop ds
mov dx, 1536-(14+40+256*4)
mov ah, 0x40
int 0x21
; Close file
pop bx
mov ah, 0x3e
int 0x21
.done:
pop ax
sahf
pop es
pop ds
popa
ret
Filename:
db "dump0000"
Extension:
db ".bmp",0
BMP:
; File header
db "BM"
dd 14+40+256*4+320*200
dw 0,0
dd 14+40+256*4
; Info header
dd 40, 320, -200 ; Header size, width, height
dw 1, 8 ; Planes, depth
dd 0,0,0,0,0,0
Use as you wish. :)
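For anyone who wants to verify the dumped files on the host side, the header layout from the BMP: table can be rebuilt in Python. This is a sketch that mirrors the db/dw/dd lines above; the function names are invented, and the palette helper assumes 6-bit VGA DAC values as read from port 0x3c9.

```python
import struct

WIDTH, HEIGHT = 320, 200
HEADER_SIZE = 14 + 40 + 256 * 4   # file header + info header + palette

def bmp_header():
    # File header: "BM", file size, two reserved words, offset to pixel data.
    file_header = struct.pack("<2sIHHI", b"BM",
                              HEADER_SIZE + WIDTH * HEIGHT, 0, 0, HEADER_SIZE)
    # Info header: size, width, negative height (top-down), planes, depth, six zero dwords.
    info_header = struct.pack("<IiiHH6I", 40, WIDTH, -HEIGHT, 1, 8,
                              0, 0, 0, 0, 0, 0)
    return file_header + info_header

def bmp_palette_entry(r6, g6, b6):
    # Same result as the shl/stosd sequence: bytes B*4, G*4, R*4, 0.
    return bytes([b6 << 2, g6 << 2, r6 << 2, 0])

print(len(bmp_header()), bmp_palette_entry(63, 0, 1))
```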
Thanks, Blueberry!
I use this to save a series of paletted, vertically-flipped TGA files. Just "call SCREENSHOT". Uses the stack and 768 bytes of memory right after the intro.
Code:
SCREENSHOT:
pusha
pushf
push ds
push cs
pop ds
;read palette
mov ax,255
mov di,PALETTE+256*3-1
PALREAD:
mov dx,3C7h
out dx,al
inc dx
inc dx
mov cx,3
push ax
PALRGB:
in al,dx
shl al,2
mov [di],al ; [di] = b*4 g*4 r*4
dec di
loop PALRGB
pop ax
dec ax
jns PALREAD
;increase filename number
mov di,FILENAME + (HEADER-FILENAME) - 5
INCNAME:
inc byte[di]
cmp byte[di],':'
jb ENDINCNAME
mov byte[di],'0'
dec di
jmp INCNAME
ENDINCNAME:
;write the TGA file and return
mov ah,3Ch ; create file
mov dx,FILENAME
xor cx,cx
int 21h
xchg ax,bx ; bx=handle
mov ah,40h ; write header and palette
mov dx,HEADER
mov cx,18+256*3
int 21h
push 0A000h
pop ds
mov ah,40h ; write pixels
cwd ; dx=0
mov cx,320*200
int 21h
mov ah,3Eh ; close file
int 21h
pop ds
popf
popa
ret
FILENAME db "0000/.tga" ;,0
HEADER db 0,1,1
dw 0,256
db 24
dw 0,0,320,200
db 8,00100000b
section .bss align=1
PALETTE: resb 256*3
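The INCNAME loop is a decimal counter that works directly on the ASCII digits of the filename. A Python model of it, as a sketch with the invented name next_name, shows how the odd-looking initial name "0000/" turns into "00000" on the first call ('/' is the character just before '0'). Note also that with the ",0" commented out, the first byte of HEADER (0) appears to serve as the string terminator.

```python
def next_name(digits):
    """Increment an ASCII decimal counter in place, last digit first."""
    for i in reversed(range(len(digits))):
        digits[i] += 1                 # inc byte[di]
        if digits[i] < ord(":"):       # cmp byte[di],':' / jb ENDINCNAME
            break
        digits[i] = ord("0")           # wrap this digit and carry into the next
    return digits

name = bytearray(b"0000/")
for _ in range(3):
    print(next_name(name).decode())
```

Unlike the assembly, this model stops at the first digit instead of carrying further into memory when all digits overflow.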
10-byte clamp to unsigned byte. The trick is to test for negatives first.
Code:
fistp word[si]
lodsw
add ah,ah
jc NEGATIVE ; 8000..FFFF -> FF
jz OK ; 0000..00FF
; 0100..7FFF -> 00 (carry=0 here)
NEGATIVE:
salc
OK:
Disregard that, I forgot CMC. It's still 11 bytes.
Signed clamp is easier. Still 11 bytes, but you can use other multipliers, which might save space elsewhere. Result in AH.
Code:
fistp word[di] ; assume di=sp
pop ax
imul si ; si=100h -> dh:dl:ah:al = signbit:ah:al:0
jnc OK
mov ah,7Fh
sub ah,dh ; ah: FF->80, 00->7F
OK:
Instead of pop|imul, you can also do mov ax,si | imul word[di].
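The signed clamp can be modelled in Python to see why the IMUL carry flag does the range check. This is a sketch with an invented function name; it assumes SI = 100h as in the comment and takes v as the signed 16-bit value stored by fistp.

```python
def signed_clamp_imul(v):
    """Model of pop ax / imul si (si=100h) / jnc / mov ah,7Fh / sub ah,dh."""
    product = (v * 0x100) & 0xFFFFFFFF       # DX:AX = AX * SI (signed)
    dx, ax = product >> 16, product & 0xFFFF
    ah = ax >> 8
    fits = -32768 <= v * 0x100 <= 32767      # CF = 0 when the product fits in AX
    if not fits:
        dh = dx >> 8                         # FF for negative, 00 for positive
        ah = (0x7F - dh) & 0xFF              # FF -> 80, 00 -> 7F
    return ah - 256 if ah >= 0x80 else ah    # read AH as a signed byte

print(signed_clamp_imul(-1000), signed_clamp_imul(42), signed_clamp_imul(1000))
```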
Quote:
I use this to save a series of paletted, vertically-flipped TGA files. Just "call SCREENSHOT". Uses the stack and 768 bytes of memory right after the intro.
Ah, I completely forgot about the pushf/popf instructions. :)
This tendency of old, simple image formats to store the image bottom-up is quite annoying. Good thing that (an appropriate variation of) the BMP format allows you to put a negative height to flip the image to top-down.
I had an earlier version that split the write into two in order not to trash the memory before the screen area. But it seemed the one-write version was faster (though still quite slow) when writing to a USB stick in FreeDOS. I could be imagining things, though...