DosBox & SALC, AAM, XLAT
category: code [glöplog]
After being confused for about two evenings, I just want to share the finding that using SALC, AAM and XLAT in DosBox (0.74.3) can heavily reduce performance. I was codegolfing my Revision entry, size-optimizing a section of a few lines, when I noticed that the maximum number of cycles I could use while keeping animation and sound fluent dropped from 400k down to about 130k. It became very apparent when I reduced one effect to just a SALC (to get a black pixel value), which became incredibly slow. I don't know how many other instructions suffer from similar problems. If you experience something similar and want to keep high performance AND sound (so you need DosBox), first look for these instructions and replace them.
Anybody else experienced similar problems?
This is how they are implemented in "normal" core.
SALC looks so simple that it shouldn't be the reason for the slowdown.
Maybe XLAT or AAM...
Code:
CASE_B(0xd4) /* AAM Ib */
    AAM(Fetchb());break;
CASE_B(0xd6) /* SALC */
    reg_al = get_CF() ? 0xFF : 0;
    break;
CASE_B(0xd7) /* XLAT */
    if (TEST_PREFIX_ADDR) {
        reg_al=LoadMb(BaseDS+(Bit32u)(reg_ebx+reg_al));
    } else {
        reg_al=LoadMb(BaseDS+(Bit16u)(reg_bx+reg_al));
    }
    break;
Thanks for looking that up, that makes things weirder though :D
I did a little OBS composition to show the effect of replacing SALC with MOV AL,0 side by side (I know it is not the same thing, but it's a sizecoding trick to get "black" into AL which seems to be heavily punished). Both DosBoxes run at about 400k cycles (dynamic, pentium_slow), and yes, my computer can handle multiple DosBoxes at once ;)
See here : DosBox / Salc
I removed everything unnecessary that could interfere. Whatever is going on under the hood, using SALC to save bytes seems to be really bad if, let's say, another effect needs a high number of cycles to run smoothly in emulation.
I'm open to any ideas. I shall replace SALC with MOV AL,0 + SBB AL,0 to have a functionally equivalent comparison.
Code for convenience:
Code:
mov al,0x13           ; 320x200x256 VGA mode
int 0x10
push 0xa000+10        ; ES = 0xA00A (VGA segment, deliberately offset by 10 paragraphs)
pop es
mov al,128
out 40h,al            ; write new PIT channel 0 reload value (low byte, then high byte)
out 40h,al
top:
mov ax,0xcccd
mul di                ; map the screen offset in DI to rough coordinates in DX (Rrrola trick)
;salc                 ; version left
mov al,0              ; version right
xchg dx,ax
add al,ah             ; combine the coordinates into a colour value
sub ax,[fs:0x46c]     ; subtract the BIOS tick count at 0040:006Ch (relies on FS being 0)
stosb
jmp short top
So the functional equivalent (mov al,0 / sbb al,0) is as fast as the mov al,0 version...
You are using the dynarec; the code I pasted was for normal core.
Can you run the two programs with normal core and see if there are differences?
Bingo, they're equally fast, or rather equally slow, like, horribly slow! Can we conclude that some instructions are not optimized at all by the dynamic core?
I checked the source, and those opcodes (if I'm right) are "missing" from the dynamic core.
I don't know what it does when an opcode is unimplemented.
Actually, SALC can be replaced by SBB AL,AL, which is 1 byte longer (SBB AL,0 only works properly for AL=0, hence the preceding MOV AL,0). The only drawback is that SBB changes flags while SALC doesn't (see linky, and the short sketch below).
It shouldn't be surprising that all instructions not emulated by dynrec are executed via the "normal" code fallback and as a result they're horribly slow, so the best you can do is to avoid/replace them, or post a request to emulate them through dynrec :)
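To make that concrete, here is a quick NASM-style overview of the variants being discussed; the byte counts and flag behaviour are the standard documented x86 ones, and the snippet is just an illustration, not code from the entry:
Code:
; goal: AL = 0xFF if CF is set, 0x00 otherwise

salc              ; 1 byte (undocumented), leaves all flags untouched,
                  ; but hits the slow fallback in DosBox's dynamic core
sbb  al, al       ; 2 bytes, same value in AL, CF itself is preserved,
                  ; but the other arithmetic flags get rewritten
mov  al, 0        ; \ 4 bytes in total: the "functional equivalent" pair
sbb  al, 0        ; / used for the timing comparison; also rewrites flags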
@wbc: Thanks, you're right, spot on (I replaced it with mov al,0 initially, because all I wanted was "black" in AL, then added something to behave identically). So I looked up the code myself to find the unoptimized instructions which will be really slow. These are:
DAA, DAS, AAA, AAS, BOUND, ARPL, INS, OUTS, LAHF, CMPS, STOS, LODS, SCAS, AAM, AAD, SALC, XLAT
(from \dosbox\src\cpu\core_dynrec\decoder.h)
I can't quickly figure out how the fallback to the normal core is handled; the switch jumps into "illegal opcode", but I guess that's what happens.
It's worth noting that using these opcodes is not a total slowdown per se; it rather happens when the number of unoptimized opcodes "outweighs" the number of optimized ones over large sequences. For example, the SALC version of the code above runs *faster* if I randomly insert more optimized opcodes into the loop (see the loop sketch further down). Also worth noting that it only happens when you fix the number of cycles to a custom value (DosBox slows down heavily if it fails to compute the desired number of cycles).
Now I'm really curious what the actual time consumption of each of these instructions is in normal core mode, compared to the time consumption of optimized instructions, but I should rather finish my entry first.
(LODS, STOS are in fact optimized, sorry)
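For illustration, roughly what "inserting more optimized opcodes" means in practice; this is a contrived sketch based on the loop above, not the actual test code, and the extra instructions are placed where they don't touch the CF that SALC depends on:
Code:
top:
    mov  ax,0xcccd
    mul  di            ; sets CF, recompiled by the dynamic core
    salc               ; still takes the slow normal-core fallback...
    xchg dx,ax
    add  al,ah
    inc  bx            ; ...but cheap register ops like these are recompiled,
    dec  bx            ; so they shift the fast-to-slow ratio of the block and
                       ; make the fixed cycle budget easier to hit in real time
    sub  ax,[fs:0x46c]
    stosb
    jmp  short top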
Makes sense: the more unoptimized opcodes you have between the recompiled ones, the more illegal ones you have to handle? Which probably goes through a real CPU exception, no?
DOS is still an existing platform. It's nonsense that you have to optimize code for a compo emulator which behaves very differently.
I agree. But for now, if you want music and high performance in a Revision 256b compo, you need to care about stuff like this.
Quote:
DOS is still an existing platform. It's nonsense that you have to optimize code for a compo emulator which behaves very differently.
THIS.
Isn't return by value slow in loops, when looking at that C code? This:
Code:
reg_al=LoadMb(BaseDS+(Bit32u)(reg_ebx+reg_al));
would be slow compared to something like:
Code:
LoadMb(&reg_al, BaseDS+(Bit32u)(reg_ebx+reg_al));
But then again, I haven't looked at the DosBox code at all...
I would guess my assumptions are wrong, though, if all instructions are done this way and I am missing something?
Basically my point is that using "return" at the end of each function is much slower than not doing it at all.
There are forks of dosbox; dosbox-x is actively maintained by codeholio, and if you pointed this out to him on Twitter he might be able to fix it (in his fork).
@rudi: C++ does have RVO/NRVO, not sure about plain C
It's returning a byte, most compilers will return that in a register anyway.
Common calling conventions on x86/x64 use eax/rax for integer and pointer return values which fit into the register's bit width.
Also, you forget that passing a memory address for a return value consumes one of the few parameter slots which may otherwise be used for register-passed arguments (if not passed via the stack). And, unless auto-inlined by the compiler, looking up a pointer and writing through it isn't a no-op either.
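To put the calling-convention point into code: a hand-written sketch of what the two signatures typically compile down to on x86-64 (System V convention assumed; names like emu_reg_al and LoadMb_out are just placeholders, and this is simplified rather than actual DosBox output):
Code:
; Bit8u LoadMb(PhysPt addr) -- return by value:
    mov  edi, ebx            ; address goes in the first argument register
    call LoadMb
    mov  [emu_reg_al], al    ; the byte comes straight back in AL

; hypothetical LoadMb_out(Bit8u *dst, PhysPt addr) -- "return" via pointer:
    lea  rdi, [emu_reg_al]   ; one argument register is now spent on the
    mov  esi, ebx            ; destination, and the callee has to write
    call LoadMb_out          ; through that pointer itself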
Ok. I get this point. Hopefully the function is not too complex:
Quote:
...you can always try to facilitate RVO and NRVO by returning only one object from all the return paths of your functions, and by limiting the complexity in the structure of your functions.
This will avoid incurring performance costs when returning by value from a function, thus letting you benefit from better code clarity and expressiveness.
Anyways, someone could just throw this into a profiler and figure it out.
I've also had a problem with LOOPZ and LOOPNZ, maybe for the same reasons.
Look for LOOPNZ in Gyroid comments.