[Question] Rotozoomer that will run on low specs
category: code [glöplog]
Hello, I am newschool. I've coded a rotozoomer for DOS that runs at ~9.86 FPS when DOSBox is set to 12010 cycles/ms (i.e. approximately the speed of a 486DX 33MHz). It already:
- computes texture coordinates using increments
- uses inline assembly for the inner loop.
How could the effect be done in a significantly faster way? I reckon at this point it's not about shaving off a few cycles from the inner loop, but about doing some 'magic' that accomplishes the effect in a different way. I know it has been done on even lower specs (like the Amiga 500), but how?
You can see the code here: https://github.com/tazumeki/rotozoom-dos/blob/main/rotozoom.c.
I'm assuming you've already reordered the texels in the texture to be tiled/cache-efficient etc. I haven't looked at the actual math, but looking at the code (without executing it), I see:
* mov al, [si] + stosb
movsb?
* dec x
You probably want to do this in a register, like cx. You will need to free cx from the outer loop (push/pop?)
* add bx, dux + add dx, dvx
Again, it seems a bad idea to reach out to memory. Something I did in a rotozoomer like this at the time was to make the code segment where this routine lives writable. Then I made the code self-modifying: before entering the loop, the code would write to the locations of the add bx and add dx instructions and patch in dux and dvx as immediate operands.
Unroll your loops.
Just some thoughts:
All the shifting- and masking-instructions disappear when you use a 256x256 texture and have the integer part of your UV in the H,L-parts of a register.
You can add-carry the overflows of the fractional parts directly into the integer parts.
This should end up at about 5 instructions per pixel.
Don't send single bytes to the VGA; unroll to store aligned dwords.
At this point you should see that the code runs significantly slower at ~90deg texture rotations because the data cache doesn't work efficiently there.
You have a 32bit CPU but only use 16bit instructions.
Smaller platforms unroll speed-code for a quarter scanline or so and fix the interpolation errors in between. This precalculated code can be reused for all scanlines because the deltas are constant. That should result in 3 instructions for 2 pixels.
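The 256x256 texture, the integer part in the high byte, and the dword-store suggestions above can be sketched in plain C roughly as follows. This is only an illustration under stated assumptions, not the code from the linked repository: it assumes a 256x256 8-bit texture, 8.8 fixed-point coordinates, a little-endian CPU (as on x86), and a width that is a multiple of 4. The function names are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 256x256 texels, one byte each, row-major (offset = v*256 + u). */

/* Reference version: one byte per store. u and v are 8.8 fixed point. */
static void roto_row_bytes(uint8_t *dst, const uint8_t *tex, size_t width,
                           uint16_t u, uint16_t v, uint16_t du, uint16_t dv)
{
    for (size_t x = 0; x < width; x++) {
        /* High byte of v selects the row, high byte of u the column.
           No shift for v and no AND mask for tiling are needed. */
        dst[x] = tex[(v & 0xFF00u) | (u >> 8)];
        u = (uint16_t)(u + du);   /* fraction overflow carries into the integer byte */
        v = (uint16_t)(v + dv);   /* wrap at 65536 gives texture tiling for free */
    }
}

/* Same pixels, but four texels are packed into one 32-bit store.
   Assumes little-endian byte order and width % 4 == 0. */
static void roto_row_dwords(uint32_t *dst, const uint8_t *tex, size_t width,
                            uint16_t u, uint16_t v, uint16_t du, uint16_t dv)
{
    for (size_t x = 0; x < width / 4; x++) {
        uint32_t packed = 0;
        for (int i = 0; i < 4; i++) {   /* a compiler (or you) would unroll this */
            uint32_t texel = tex[(v & 0xFF00u) | (u >> 8)];
            packed |= texel << (8 * i); /* texel i lands in byte i of the dword */
            u = (uint16_t)(u + du);
            v = (uint16_t)(v + dv);
        }
        dst[x] = packed;
    }
}
```

In assembly the same idea maps to keeping u and v in 16-bit registers and reading their high halves (e.g. bh, dh) directly; the C version just makes the arithmetic visible.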
Quote:
At this point you should see that the code runs significantly slower at ~90deg texture rotations because the data cache doesn't work efficiently there.
On this note: you can solve this by creating a 90deg rotated version of your texture and switching between the two depending on your angle. If you zoom far out you may even get some benefits of 2x/4x tiled "mip" versions.
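A rough C sketch of the texture-switching idea, under the same assumptions as before (256x256 8-bit texture, 8.8 fixed point). Note that it uses a transposed copy, i.e. a mirrored 90-degree rotation, rather than a true rotation; that is a substitution made here because it keeps the sampled image identical while giving the same cache behaviour. The function names and the simple "which delta is larger" test are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a copy with rows and columns swapped:
   rot[u*256 + v] == tex[v*256 + u]. */
static void build_transposed(uint8_t *rot, const uint8_t *tex)
{
    for (unsigned v = 0; v < 256; v++)
        for (unsigned u = 0; u < 256; u++)
            rot[u * 256 + v] = tex[v * 256 + u];
}

/* Render one scanline, picking whichever copy is walked mostly along its rows. */
static void roto_row_pick(uint8_t *dst, const uint8_t *tex, const uint8_t *rot,
                          size_t width, uint16_t u, uint16_t v,
                          int16_t du, int16_t dv)
{
    long adu = du < 0 ? -(long)du : (long)du;
    long adv = dv < 0 ? -(long)dv : (long)dv;

    if (adv > adu) {
        /* v moves fastest: in the transposed copy v is the column,
           so consecutive reads stay close together in memory. */
        for (size_t x = 0; x < width; x++) {
            dst[x] = rot[(u & 0xFF00u) | (v >> 8)];
            u = (uint16_t)(u + (uint16_t)du);
            v = (uint16_t)(v + (uint16_t)dv);
        }
    } else {
        /* u moves fastest: the original layout is already row-friendly. */
        for (size_t x = 0; x < width; x++) {
            dst[x] = tex[(v & 0xFF00u) | (u >> 8)];
            u = (uint16_t)(u + (uint16_t)du);
            v = (uint16_t)(v + (uint16_t)dv);
        }
    }
}
```

The copy costs another 64 KB, so on a real-mode DOS target it would need its own segment.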
Besides what everyone says, x86 can be tough because it has very few registers. In this code I see at least three variables read from memory per pixel, and there might be ways to avoid that. I have a rotozoomer that runs at around 30fps on a 386DX where I fit everything in registers for the inner loop, although I unrolled the whole X loop so I don't have to loop over X and can free a register.
Your loop is kinda different: it runs over 64000 pixels but also checks for 320 on every pixel. If you could rewrite it as a double loop over 320 and 200, then if registers are not enough, do the dec [memory] only in the outer 200 loop, and inside use cx for the 320 count. But there might be a lot of things to change so I might be asking much :)
Also, if your FP_SHIFT were 8, it might be possible to avoid the shifts: after add bx,step and add dx,step you can use the high bytes (bh, dh) directly, without shifting, to construct the texture offset. On my roto I kinda went too far (bad for cache though) with a 256x256 texture, so that after these adds I could easily construct the new 16-bit address in a register from the high bytes and read directly (and so also avoid the AND with the mask for tiling).
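The restructuring into a double loop might look something like this in C. It is a sketch based on the description in this thread, not on the actual repository code: `roto_flat` imitates the single 64000-pixel loop with a per-pixel column check, `roto_nested` is the suggested 200x320 form with the per-pixel deltas copied into locals. The struct, the `duy`/`dvy` row-step names, and the 256x256 texture with 8.8 fixed point are all assumptions.

```c
#include <stdint.h>

#define SCREEN_W 320
#define SCREEN_H 200

/* Hypothetical step set: dux/dvx per pixel, duy/dvy per scanline (8.8 fixed point). */
typedef struct { uint16_t dux, dvx, duy, dvy; } RotoSteps;

/* Single loop over all pixels, counting columns on every pixel. */
static void roto_flat(uint8_t *dst, const uint8_t *tex,
                      uint16_t u0, uint16_t v0, const RotoSteps *s)
{
    uint16_t row_u = u0, row_v = v0, u = u0, v = v0;
    int x = SCREEN_W;
    for (long i = 0; i < (long)SCREEN_W * SCREEN_H; i++) {
        dst[i] = tex[(v & 0xFF00u) | (u >> 8)];
        u = (uint16_t)(u + s->dux);          /* deltas read through memory */
        v = (uint16_t)(v + s->dvx);
        if (--x == 0) {                      /* end-of-row test on every pixel */
            x = SCREEN_W;
            row_u = (uint16_t)(row_u + s->duy);
            row_v = (uint16_t)(row_v + s->dvy);
            u = row_u;
            v = row_v;
        }
    }
}

/* Double loop: the row bookkeeping happens only 200 times per frame. */
static void roto_nested(uint8_t *dst, const uint8_t *tex,
                        uint16_t u0, uint16_t v0, const RotoSteps *s)
{
    const uint16_t dux = s->dux, dvx = s->dvx;   /* hoisted so they can live in registers */
    uint16_t row_u = u0, row_v = v0;
    for (int y = SCREEN_H; y != 0; y--) {
        uint16_t u = row_u, v = row_v;
        for (int x = SCREEN_W; x != 0; x--) {    /* count-down loop, like dec cx / jnz */
            *dst++ = tex[(v & 0xFF00u) | (u >> 8)];
            u = (uint16_t)(u + dux);
            v = (uint16_t)(v + dvx);
        }
        row_u = (uint16_t)(row_u + s->duy);
        row_v = (uint16_t)(row_v + s->dvy);
    }
}
```

Both functions fill the same 64000 bytes; the second just moves everything that is not needed per pixel out of the inner loop, which is what frees up cx in the assembly version.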