[Question] Rotozoomer that will run on low specs

category: code [glöplog]

Hello, I am newschool. I've coded a rotozoomer for DOS that runs at ~9.86 FPS when DOSBox is set to 12010 cycles/ms (i.e. approximately the speed of a 486DX 33MHz). It already:
- computes texture coordinates using increments
- uses inline assembly for the inner loop.

How could the effect be done in a significantly faster way? I reckon at this point it's not about shaving off a few cycles from the inner loop, but about doing some 'magic' that accomplishes the effect in a different way. I know it has been done on even lower specs (like the Amiga 500), but how?

You can see the code here: https://github.com/tazumeki/rotozoom-dos/blob/main/rotozoom.c.

added on the 2026-01-29 22:30:12 by tazumeki

Assuming you've already reordered the texels in the texture to be tiled/cache-efficient etc. (and note that I haven't looked at the actual math), looking at the code (without executing it), I see:

* mov al, [si] + stosb

movsb?

* dec x

You probably want to do this in a register, like cx. You will need to free cx from the outer loop (push/pop?)

* add bx, dux + add dx, dvx

Again, it seems a bad idea to reach out to memory. Something I did in a rotozoomer like this at the time was to make the code segment where this routine lives writable. Then I made the code self-modifying: before entering the loop, the code would write dux and dvx as literal (immediate) arguments into the locations of the add bx and add dx instructions.

added on the 2026-01-29 23:25:08 by iq

Unroll your loops.

added on the 2026-01-30 00:29:07 by Gargaj

Just some thoughts:
All the shift and mask instructions disappear when you use a 256x256 texture and keep the integer parts of your U and V in the H and L halves of a register.
You can add-carry the overflows of the fractional parts directly into the integer parts.
This should end up at about 5 instructions per pixel.
Don't send single bytes to the VGA, unroll to store aligned dwords.
At this point you should see that the code runs significantly slower at ~90deg texture rotations because the data cache doesn't work efficiently there.
You have a 32bit CPU but only use 16bit instructions.
Smaller platforms unroll speed-code for a quarter scanline or so and fix the interpolation errors in between. This precalculated code can be reused for all scanlines because the deltas are constant. That should result in 3 instructions for 2 pixels.
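
A minimal C sketch of the 256x256 idea above, assuming 8.8 fixed-point coordinates. The name `roto_span` and its parameters are hypothetical, not taken from the original code. The texel offset is just the integer byte of V followed by the integer byte of U, and 16-bit overflow tiles the texture without a mask:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: draw `count` pixels of one scanline from a 256x256
 * 8-bit texture. u and v are 8.8 fixed point (high byte = texel index,
 * low byte = fraction), so uint16_t overflow wraps around the texture. */
void roto_span(uint8_t *dst, size_t count, const uint8_t *tex,
               uint16_t u, uint16_t v, uint16_t du, uint16_t dv)
{
    for (size_t i = 0; i < count; i++) {
        /* offset = (integer v) * 256 + (integer u). In assembly this is
         * two byte moves between register halves, not shifts and ANDs. */
        uint16_t offset = (uint16_t)((v & 0xFF00u) | (u >> 8));
        dst[i] = tex[offset];
        u = (uint16_t)(u + du);   /* the fraction carries into the integer byte */
        v = (uint16_t)(v + dv);
    }
}
```

The C compiler still emits a shift and a mask here; the point is only that the address is a plain concatenation of two bytes, which an assembly inner loop gets for free from the register layout.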

added on the 2026-01-30 00:38:40 by hfr

Quote:

At this point you should see that the code runs significantly slower at ~90deg texture rotations because the data cache doesn't work efficiently there.

On this note: you can solve this by creating a 90deg rotated version of your texture and switching between the two depending on your angle. If you zoom far out you may even get some benefits of 2x/4x tiled "mip" versions.
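
A rough C sketch of the two-texture switch, under some assumptions: the texture is 256x256, coordinates are 8.8 fixed point, and the second copy is a transposed one (a mirrored relative of the 90-degree rotation, chosen here only because its index math is simplest). All names are hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

/* Build the transposed copy: texT[u][v] = tex[v][u]. */
void make_transposed(uint8_t *texT, const uint8_t *tex)
{
    for (unsigned v = 0; v < 256; v++)
        for (unsigned u = 0; u < 256; u++)
            texT[(u << 8) | v] = tex[(v << 8) | u];
}

/* Decide once per frame: if a scanline moves mostly along v, consecutive
 * reads are contiguous in the transposed copy instead of 256 bytes apart. */
int prefer_transposed(int16_t du, int16_t dv)
{
    return abs(dv) > abs(du);
}

/* Fetch one texel from whichever copy the per-frame decision selected. */
uint8_t fetch(const uint8_t *tex, const uint8_t *texT,
              uint16_t u, uint16_t v, int16_t du, int16_t dv)
{
    if (prefer_transposed(du, dv))
        return texT[(uint16_t)((u & 0xFF00u) | (v >> 8))];
    return tex[(uint16_t)((v & 0xFF00u) | (u >> 8))];
}
```

In a real routine the test would sit outside the pixel loops, and the inner loop would simply be called with the other texture pointer and with the roles of u and v swapped.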

added on the 2026-01-30 01:42:21 by Gargaj

Besides what everyone says, x86 can be tough because of its very few registers. In this code I see memory being used per pixel for at least three variables, and there might be ways to avoid that. I have a rotozoomer that runs at about 30 fps on a 386DX where I fit everything in regs for the inner loop, although I unrolled the whole X to avoid looping over X and free a reg.

Your loop is kinda different: it runs over 64000 pixels but also checks for 320 at every pixel. If you could rewrite it as a double loop over 320 and 200, then if registers are not enough, do the dec [memory] only on outside 200 loop, so inside use cx instead of 320. But there might be a lot of things to change, so I might be asking much :)

Also, if your FP_SHIFT was 8, maybe there is a possibility to avoid the shifts: after add bx,step and add dx,step you can use the high bytes (bh or dh) directly, without a shift, to construct the tex offset. On my roto I kinda went too far (bad for cache though) with a 256x256 texture, so that after these adds I could easily construct the new 16-bit address in a reg from the high bytes and read directly (and so also avoid the AND with a mask for tiling).

added on the 2026-01-30 02:11:28 by Optimus

Quote:

do the dec [memory] only on outside 200 loop, so inside use cx instead of 320.

Or unroll the innerloop x2 to get down to 160 and use "dec cl" for inner and "dec ch" for the outer loop, voila :)

added on the 2026-01-30 08:40:13 by hfr

Are there advantages to not using 32-bit protected mode on a 486?

added on the 2026-01-30 10:17:51 by absence

Oh, the folks above have already pointed out most of the important things :)

Yeah, if it’s not critical for you, switch to using a 256×256 texture, and you’ll be surprised how much simpler your inner loop becomes (hint: you won’t need any shr or and instructions to calculate the pixel address in the texture).

And yes, of course, there should be no memory variable accesses inside the inner loop, registers only. Yes, there are never enough registers, but in this case it’s a solvable problem.

Something more specific:

Instead of stosb, it’s better to use:
mov es:[di], al
inc di
Oddly enough, this is faster on a 486.

And if you unroll the inner loop, you’ll end up with just:
mov es:[di], ax
add di, 2

To solve the uneven FPS issue (depending on the angle) that was mentioned earlier, I split the rendering into 16×16 tiles. It’s a bit of a hassle, but the resulting FPS becomes much more consistent.
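
A sketch of the tile order in C, assuming a 320x200 linear frame buffer, a 256x256 texture and 8.8 fixed-point steps. The function name and the choice of (-dv, +du) as the per-row step are assumptions for illustration, not the actual routine:

```c
#include <stdint.h>

enum { SCREEN_W = 320, SCREEN_H = 200, TILE = 16 };

/* Render the frame tile by tile so that each 16x16 screen block samples a
 * small neighbourhood of the texture, whatever the rotation angle is. */
void roto_tiled(uint8_t *screen, const uint8_t *tex,
                uint16_t u0, uint16_t v0, uint16_t du, uint16_t dv)
{
    for (int ty = 0; ty < SCREEN_H; ty += TILE) {
        /* 200 is not a multiple of 16: the last tile row is only 8 high. */
        int rows = SCREEN_H - ty < TILE ? SCREEN_H - ty : TILE;
        for (int tx = 0; tx < SCREEN_W; tx += TILE) {
            /* Texture coordinate of the tile's top-left pixel (mod 65536). */
            uint16_t ur = (uint16_t)(u0 + (uint32_t)tx * du - (uint32_t)ty * dv);
            uint16_t vr = (uint16_t)(v0 + (uint32_t)tx * dv + (uint32_t)ty * du);
            for (int y = 0; y < rows; y++) {
                uint8_t *dst = screen + (ty + y) * SCREEN_W + tx;
                uint16_t u = ur, v = vr;
                for (int x = 0; x < TILE; x++) {
                    dst[x] = tex[(uint16_t)((v & 0xFF00u) | (u >> 8))];
                    u = (uint16_t)(u + du);
                    v = (uint16_t)(v + dv);
                }
                ur = (uint16_t)(ur - dv);   /* step one screen row down */
                vr = (uint16_t)(vr + du);
            }
        }
    }
}
```

The output is identical to a plain scanline renderer; only the order in which pixels are visited changes, which is what evens out the cache behaviour across angles.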

Happy coding :)

added on the 2026-01-30 10:38:15 by bitl

P.S. And yes, DOSBox of course doesn’t reproduce all the nuances, such as how the CPU cache behaves. Some things that run faster in DOSBox actually run slower on real hardware, and vice versa.

added on the 2026-01-30 10:44:41 by bitl

If you still think there’s some kind of magic involved, take a look at the rotozoomer code for the 8086 (8 MHz): https://github.com/mills32/CUTE_DEMO-MS-DOS/blob/main/src/rotozoom.asm - check out this on [youtube]

but this makes sense specifically for 8086-80286.

added on the 2026-01-30 11:46:45 by bitl

Quote:

On this note: you can solve this by creating a 90deg rotated version of your texture and switching between the two depending on your angle

Just swizzle the texture! Googling for pasroto.zip should give you an era-appropriate 1996 explanation of how to do that.

added on the 2026-01-30 11:56:26 by sagacity

Since these two approaches were just mentioned:
Filling your screen in 16x16 tiles and sampling your texture from 16x16 blocks give very similar results.
I remember that back in the day I found the latter a bit surprising, as a 16x16 source block simply consists of 16 (seemingly) independent cachelines.
But as the 486 uses a 4-way associative cache, it makes a big difference from where you fetch your data. In the 90deg-case all the cachelines from an unswizzled texture are 256 bytes apart and all fall into the same cache bin, basically reducing your cache size to 4x16 bytes for this block. But 16x16 bytes from a linear location all fit into the cache at once.
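
One possible way to store a texture in 16x16 blocks, sketched in C. This is an assumed layout for a 256x256 texture with hypothetical names; it is not claimed to be the scheme pasroto.zip describes:

```c
#include <stdint.h>

/* Offset of texel (u, v) when the 256x256 texture is stored as 16x16 blocks
 * of 16x16 texels, each block occupying 256 contiguous bytes. */
uint16_t swizzle_offset(uint8_t u, uint8_t v)
{
    return (uint16_t)(((v >> 4) << 12) |   /* block row              */
                      ((u >> 4) << 8)  |   /* block column           */
                      ((v & 15) << 4)  |   /* row inside the block   */
                      (u & 15));           /* column inside the block */
}

/* Convert a row-major texture into the blocked layout. */
void swizzle_texture(uint8_t *dst, const uint8_t *src)
{
    for (unsigned v = 0; v < 256; v++)
        for (unsigned u = 0; u < 256; u++)
            dst[swizzle_offset((uint8_t)u, (uint8_t)v)] = src[(v << 8) | u];
}
```

With this layout a 16x16 neighbourhood that falls inside one block covers 256 consecutive bytes regardless of the walking direction. A fast inner loop would need the fixed-point coordinates arranged so that this bit shuffle costs nothing per pixel, which the sketch does not attempt.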

added on the 2026-01-30 13:24:53 by hfr

Thanks for your nice replies and the optimization advice. Turns out the 'magic' I was looking for was precalculation. When I look at mills32's rotozoom.asm, I see he's using tables with 180 frames worth of precalculated pointer increments. I don't understand the math yet TBH, but I guess this answers my question as to how the effect is realized on much slower machines (like an 8 MHz 8086 in this case).

added on the 2026-01-31 15:15:22 by tazumeki

Quote:

Thanks for your nice replies and the optimization advice. Turns out the 'magic' I was looking for was precalculation. When I look at mills32's rotozoom.asm, I see he's using tables with 180 frames worth of precalculated pointer increments. I don't understand the math yet TBH, but I guess this answers my question as to how the effect is realized on much slower machines (like an 8 MHz 8086 in this case).

Still, I think you can achieve 70 FPS on a 486 at 33 MHz even without a precalculated table. From a coding practice perspective, that’s actually more interesting.

Post your improved versions of the rotozoomer here and we’ll discuss them :)

I’d be curious too - what tricks are used on the Amiga or the C64? I suspect there’s some stuff that’s way cooler than the PC rotozoomer implementations.

added on the 2026-01-31 16:16:46 by bitl

Agreed with @bitl, there are so many things to learn from all comments about optimizing the existing code and get an idea of how to improve things even for other effects.

On the other hand, I was thinking about implementations like the 8088 one posted. It reminds me of something I presume they do on older platforms and that I never tried myself. I remember some CPC rotozoomers where the texel precision is so broken when zoomed in that quite possibly they do similar things, although in this 8088 one maybe there is different data to alleviate this. My thought might differ: your rotozoomer animation is brief in your demo, so someone could make crude unrolled code for a single line, as the step is the same. The unrolled code would read and inc either U or V, or wait if it's in the same texel. Hardcoded, without any fixed-point additions or anything, just like people hardcode Wolfenstein column stretches for different scale levels. But only for the few frames of your effect, as hardcoding every possible zoom/rotate level would be too much. Still, maybe too much precalc for something that can relatively easily run fast enough on a 386/486. I would try it with one single scale/rotate value on older hardware to see the difference in performance, though.

added on the 2026-01-31 16:36:43 by Optimus

So… I actually went all-in on this ;)

Is it really possible to make a rotozoomer that runs stably at 70 FPS on a 486-33 MHz? After all, I told the topic starter that it is.

I made 3 variants:

1. Standard rendering (line by line), plus an additional texture rotated by 90 degrees (to fight CPU cache misses)

2. Rendering in 16x16 tiles

3. Same tiles 16x16, but with precomputed offsets for 16 pixels of a line (applied to the whole frame, low precision, but renders much faster)
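
To illustrate variant 3, here is one way precomputed span offsets can work, sketched in C. It assumes a 256x256 texture and 8.8 steps; the names are hypothetical and the actual TAIL_LOW routine may differ:

```c
#include <stdint.h>

enum { SPAN = 16 };

/* Once per frame: texel offset of every pixel in a 16-pixel span relative to
 * the span's first pixel. The steps are constant for the whole frame, so one
 * table serves every span. */
void build_span_offsets(uint16_t off[SPAN], uint16_t du, uint16_t dv)
{
    for (unsigned x = 0; x < SPAN; x++) {
        uint16_t ou = (uint16_t)(x * du) >> 8;   /* whole texels moved in u */
        uint16_t ov = (uint16_t)(x * dv) >> 8;   /* whole texels moved in v */
        off[x] = (uint16_t)((ov << 8) | ou);
    }
}

/* Per span: no fixed-point adds, just base + table entry. Only the integer
 * parts of the start coordinate are used, which is where precision is lost. */
void draw_span(uint8_t *dst, const uint8_t *tex, const uint16_t off[SPAN],
               uint16_t u, uint16_t v)
{
    uint16_t base = (uint16_t)((v & 0xFF00u) | (u >> 8));
    for (unsigned x = 0; x < SPAN; x++)
        dst[x] = tex[(uint16_t)(base + off[x])];
}
```

Two inaccuracies come with this shortcut: the fraction of the span's start is dropped, so a texel boundary can land up to one pixel off, and a carry out of the u byte spills into v at the texture seam.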

Considering the topic-starter measured performance in DOSBox, I want to point out that DOSBox does not emulate CPU cache behavior at all. There it makes no difference whether memory is read sequentially or randomly.

Because of that, the result on a real machine can differ a lot from what you expect. For example, I usually set cycles=15000 in DOSBox to roughly match my 486DX2-66 MHz PC; not super precise, but within +-10 FPS for other effects. Rotozoomers (and not only the ones I coded) run much faster on real hardware than in DOSBox (at the same 15000 cycles). In fact, about twice as fast, but only if you successfully apply tricks to avoid cache misses.

if anyone’s interested, here’s what I got:
(On my 486, with fixed Scale 1:1)

TWO_TXTR.EXE (two textures) — 114 FPS

TAIL_HIG.EXE (16x16 tiles) — 108 FPS

TAIL_LOW.EXE (tiles + low precision) — 167 FPS

With a varying scale in the range 0.7 - 1.7, performance drops by about 6 - 8 FPS.

In DOSBox, routine types 1 and 2 only reach these numbers at about 37000 cycles, but at that same setting TAIL_LOW.EXE shoots up to 300 FPS. So measuring rotozoomer performance in DOSBox is… not a great idea :)

If anyone’s curious, you can test it or look at the sources:

http://chiptown.ru/stuff/rotozoom.zip

Sorry for the Turbo Pascal, but of course the main routine is written in assembly.

And if someone shows me faster routines, I’d appreciate it :)

added on the 2026-02-17 17:40:19 by bitl

Top tip: pasroto.zip

added on the 2026-02-18 14:55:39 by superplek

Quote:

Top tip: pasroto.zip

This has already been mentioned here.

But still, the Pascal/Cubik Team routine is about the same speed as my two versions (the third one is even faster, but with some loss of accuracy).

And yet it’s optimized for a top-end 486 and Pentium. What about a 386? Or a 286? :)

added on the 2026-02-18 15:58:47 by bitl

Oh I apologise for not reading each and every answer in detail.

I just mentioned it because the principle discussed in there is pretty paramount in graphics and performance programming alike.

added on the 2026-02-18 17:30:06 by superplek

Tried on a Pocket 386 (386SX at 40 MHz; a memset of 64K to VGA maxes out at 60 fps in my benchmark, compared to 80 fps with Tseng Labs)

TWO_TXTR or TAIL_HIGH: 20fps
TAIL_LOW: 27fps

For comparison: my old simple rotozoomer test on this hardware (per pixel, 320x200, full assembly code, using regs as much as possible and always a 256x256 texture so there is no AND for tiling) runs at 23 fps.
But of course on a 486 the fps varies because of the cache.

p.s. I am curious about the shifted tiles, will check it later, not sure how it works yet. Is it preshifted pixel and copy entire word instead of byte or something, hmm. I'll check the code anyway.

added on the 2026-02-19 12:46:50 by Optimus

Quote:

Tried on Pocket 386 (386sx at 40mhz, memset 64k to vga in my benchmark max 60fps, compared to tseng labs at 80fps)

TWO_TXTR or TAIL_HIGH: 20fps
TAIL_LOW: 27fps

Thanks for the tests!
Actually, it’s even faster than I expected for a 386.

Most likely, if I optimize the code specifically for the 386 a bit more, I could squeeze out a couple more fps (my 386DX/40 MHz is currently somewhat disassembled).

In any case, if it were 160×100 resolution like in Second Reality, it would probably run at around 70 FPS. That’s encouraging :)

Quote:

For comparison. My old simple rotozoomer test on this hardware, per pixel 320x200, full assembly code and trying to use regs as much possible and always use 256x256 texture to not have to AND for tiling, 23fps
But of course on 486 fps variations because of cache.

Is the source code secret? :)
Did you use lookup tables for sin/cos?

I didn’t bother with that, since on a 486 computing a sine/cosine pair per frame isn’t significant (even without FPU). But on 386 it probably does affect the FPS.

Quote:

p.s. I am curious about the shifted tiles, will check it later, not sure how it works yet. Is it preshifted pixel and copy entire word instead of byte or something, hmm. I'll check the code anyway.

Hmm...

added on the 2026-02-19 15:14:23 by bitl

Quote:

copy entire word instead of byte

you've given me an idea :) Indeed... this could be added to one of the variations of my routine.

added on the 2026-02-19 15:24:42 by bitl

I don't have comments in the code yet, but the main asm routine is below: a bit of unrolling to win one reg, using regs as much as possible, collecting samples in al and ah to write two pixels at once, and using bp as an extra register.

Outside I do sin/cos with LUTs. But I'd guess it would be minimal loss even if I did float.

Also, I notice my rotozoomer has low precision when zooming in close because of the 8:8 fixed point. I was looking at the code of Second Reality, where it's smoother. The code doesn't look much different from mine, but after they do add reg16,reg16 for the interpolation step, they later do an adc reg8,reg8, if I recall. That's 24-bit, 8:16, for smoother precision; I might try it at some point. Register usage seemed tight too. The Second Reality rotozoomer, while impressive, seems to run smoothly on a 386SX too, probably because they dropped to 160x100.
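
A small C sketch of why the extra fraction bits matter when zoomed in. It compares where a coordinate ends up after a number of steps in 8:8 and in 8:16; the function names and the example step are made up for illustration:

```c
#include <stdint.h>

/* Texel index reached after n pixels when the per-pixel step (in texels)
 * is rounded to 8:8 fixed point. */
unsigned walk_8_8(double step, unsigned n)
{
    uint16_t d = (uint16_t)(step * 256.0 + 0.5);     /* 8 fraction bits */
    uint16_t u = 0;
    for (unsigned i = 0; i < n; i++)
        u = (uint16_t)(u + d);
    return u >> 8;
}

/* Same walk with 8:16 fixed point, kept to 24 bits like a 16-bit fraction
 * register plus an 8-bit integer register fed by the carry. */
unsigned walk_8_16(double step, unsigned n)
{
    uint32_t d = (uint32_t)(step * 65536.0 + 0.5);   /* 16 fraction bits */
    uint32_t u = 0;
    for (unsigned i = 0; i < n; i++)
        u = (u + d) & 0xFFFFFFu;
    return u >> 16;
}
```

With a step of 0.0137 texels per pixel, the exact position after 320 pixels is about 4.38, so texel 4. The 8:8 walk rounds the step up to 4/256 and ends on texel 5, while the 8:16 walk ends on texel 4.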

Code:


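fxRotoRunAsm320x_:			; in: ax = u, dx = v, bx = du, cx = dv (all 8.8 fixed point)
		push bp
		mov bp,sp
		pusha
		push ds

		push word 08000h	; ds = texture segment (hard-coded 8000h)
		pop ds

		mov es,[bp+6]		; es = destination segment (stack argument)

		mov si,bx		; si = du
		mov bp,cx		; bp = dv
		mov cx,ax		; cx = u

		xor di,di		; start at offset 0 of the destination

		mov ah,100		; row counter
		roty320:
			push ax		; save row counter
			push cx		; save row-start u
			push dx		; save row-start v

%rep 160				; 160 x 2 pixels = one 320-pixel row
			mov bl,ch	; bl = integer part of u
			mov bh,dh	; bh = integer part of v, so bx = v*256 + u
			mov al,[bx]	; first texel
			add cx,si	; u += du
			add dx,bp	; v += dv
			mov bl,ch
			mov bh,dh
			mov ah,[bx]	; second texel
			add cx,si
			add dx,bp
			stosw		; write both pixels at once
%endrep

			pop dx		; back to row-start v
			pop cx		; back to row-start u

			add cx,bp	; u += 2*dv: step to the next row
			add cx,bp
			sub dx,si	; v -= 2*du
			sub dx,si

			pop ax
			dec ah

		jz fthis320		; unrolled body is too long for a short jump back
		jmp roty320
		fthis320:

	pop ds
	popa
	pop bp

	retf 0x0002			; far return, dropping the 2-byte stack argument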

added on the 2026-02-20 11:25:46 by Optimus

mov ah,100 for Y loop count?
and twice add cx,bp:add cx,bp?

That might be from the 320x100 test, my mistake. But the 320x200 code is similar anyway.

I tried 160x100, 160x50, 80x50 and others on a 286. They are smoother.
Even the 320x200 version reached 17 fps, but on a beefy 286 at 20 MHz, with 0 wait states for memory set in the BIOS and a maxed-out ISA Tseng Labs card.

added on the 2026-02-20 11:30:15 by Optimus

pouët.net
