4k intro, C vs assembler?
category: code [glöplog]
I have no doubt that when it comes to performance, modern compilers are more effective than manual asm coding.
However, when it comes to code size, I'm not sure that this is the case.
For example, see the code below:
Code:
static void BuildSkin()
{
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
        {
            int cellValue = *GetCellPtr(x, y);
            if (cellValue)
            {
                CreateFace(x, 1, y, 4, cellValue);
                for (int nSide = 0; nSide < 4; nSide++)
                {
                    int CellValue = *GetCellPtr(x + DirMapX[nSide], y + DirMapY[nSide]);
                    if (!CellValue)
                        CreateFace(x, 0, y, nSide, cellValue);
                }
            }
            else
                CreateFace(x, 0, y, 5, 0);
        }
}
The compiled code is 128 bytes and it shrinks to 90 bytes after compression.
To me that seems quite big.
The question: would manually tweaked asm be much more effective?
For reference, this is the generated asm:
Code:
00420629 55 PUSH EBP
0042062A 8B EC MOV EBP, ESP
0042062C 51 PUSH ECX
0042062D 83 65 FC 00 AND DWORD [EBP-0x4], 0x0
00420631 53 PUSH EBX
00420632 56 PUSH ESI
00420633 57 PUSH EDI
00420634 33 FF XOR EDI, EDI
00420636 8B 4D FC MOV ECX, [EBP-0x4]
00420639 8B C7 MOV EAX, EDI
0042063B E8 4D FB FF FF CALL ?GetCellPtr@@YGPAHHH@Z
00420640 8B 18 MOV EBX, [EAX]
00420642 33 F6 XOR ESI, ESI
00420644 3B DE CMP EBX, ESI
00420646 74 40 JZ 0x420688
00420648 53 PUSH EBX
00420649 6A 04 PUSH 0x4
0042064B 57 PUSH EDI
0042064C 6A 01 PUSH 0x1
0042064E FF 75 FC PUSH DWORD [EBP-0x4]
00420651 E8 53 01 00 00 CALL ?CreateFace@@YGXHHHEF@Z
00420656 0F BE 8E 84 19 42 00 MOVSX ECX, BYTE [ESI+_DirMapX]
0042065D 0F BE 86 88 19 42 00 MOVSX EAX, BYTE [ESI+_DirMapY]
00420664 03 4D FC ADD ECX, [EBP-0x4]
00420667 03 C7 ADD EAX, EDI
00420669 E8 1F FB FF FF CALL ?GetCellPtr@@YGPAHHH@Z
0042066E 83 38 00 CMP DWORD [EAX], 0x0
00420671 75 0D JNZ 0x420680
00420673 53 PUSH EBX
00420674 56 PUSH ESI
00420675 57 PUSH EDI
00420676 6A 00 PUSH 0x0
00420678 FF 75 FC PUSH DWORD [EBP-0x4]
0042067B E8 29 01 00 00 CALL ?CreateFace@@YGXHHHEF@Z
00420680 46 INC ESI
00420681 83 FE 04 CMP ESI, 0x4
00420684 7C D0 JL 0x420656
00420686 EB 0D JMP 0x420695
00420688 56 PUSH ESI
00420689 6A 05 PUSH 0x5
0042068B 57 PUSH EDI
0042068C 56 PUSH ESI
0042068D FF 75 FC PUSH DWORD [EBP-0x4]
00420690 E8 14 01 00 00 CALL ?CreateFace@@YGXHHHEF@Z
00420695 47 INC EDI
00420696 83 FF 20 CMP EDI, 0x20
00420699 7C 9B JL 0x420636
0042069B FF 45 FC INC DWORD [EBP-0x4]
0042069E 83 7D FC 20 CMP DWORD [EBP-0x4], 0x20
004206A2 7C 90 JL 0x420634
004206A4 5F POP EDI
004206A5 5E POP ESI
004206A6 5B POP EBX
004206A7 C9 LEAVE
004206A8 C3 RET
for 4k most people use C/C++ nowadays, but if you're an asm lover and can't get enough of it, by all means make a demo about it, or a 4k in this case.
crinkler should be enough for anyone!
lol
I think I would suggest C++ & inline ASM for the things you want to optimize. I know that our coder normally writes everything in C++ and then refactors it in asm by tweaking some patterns...
4k -> nasm/yasm.
las, what do you mean?
ehh. well. trick the compressor? pretty hard. i dunno much about compression beyond reading the crinkler reports and how it does and looks.
but i'd say there's definitely potential to optimize this assembly here. see, register edx isn't used at all. the c++ way: make it a "class" pointer with some offset jumptable that makes the first calls smaller. or force a forward declaration to get a positive offset. the thing is, zeros compress better. change the function argument order to make the pattern more entropy-friendly repetitive. this is just a little though, the function got random arguments anyway. also DirMapX and DirMapY need to get outta the pattern, which is hard.
*my two cheap cents here* :D
*just thought re-reading my post*
how about using edx for the second loop counter? that'd kill those DWORD [EBP-0x4] stack locals and the same negative-offset stuff. or you decrease ebp to get positive stack offsets for locals. more zeros to compress.
anyway, that compiler did a shit job. an assembly betterization would definitely kill that. period.
Just a quick look, 102 or 96 bytes uncompressed.
EDX can be used for X (previously [EBP-4]). -14 bytes
You don't need the stack frame anymore. -4 bytes.
Use PUSHA/POPA (slower, but smaller). -4 bytes.
EBP is now free, use it for GetCellPtr. -4 bytes.
Code:
PUSHA ; -3-2
MOV EBP, ?CreateFace@@YGXHHHEF@Z; +5
XOR EDX, EDX ; -3
__XLOOP:
XOR EDI, EDI
__YLOOP:
MOV ECX, EDX ; -1
MOV EAX, EDI
CALL ?GetCellPtr@@YGPAHHH@Z
MOV EBX, [EAX]
XOR ESI, ESI
CMP EBX, ESI
JE __ELSE
PUSH EBX
PUSH 0x4
PUSH EDI
PUSH 0x1
PUSH EDX ; -2
CALL EBP ; -3
__NSIDELOOP:
MOVSX ECX, BYTE [ESI+_DirMapX]
MOVSX EAX, BYTE [ESI+_DirMapY]
ADD ECX, EDX ; -1
ADD EAX, EDI
CALL ?GetCellPtr@@YGPAHHH@Z
CMP DWORD [EAX], 0x0
JNZ __SKIP
PUSH EBX
PUSH ESI
PUSH EDI
PUSH 0x0
PUSH EDX ; -2
CALL EBP ; -3
__SKIP:
INC ESI
CMP ESI, 0x4
JL __NSIDELOOP
JMP __THEN
__ELSE:
PUSH ESI
PUSH 0x5
PUSH EDI
PUSH ESI
PUSH EDX ; -2
CALL EBP ; -3
__THEN:
INC EDI
CMP EDI, 0x20
JL __YLOOP
INC EDX ; -2
CMP EDX, 0x20 ; -1
JL __XLOOP
POPA ; -2
RET ; -1
You can save 6 more bytes by looping down instead of up, but the face order will be reversed.
Code:
PUSHA ; -3-2
MOV EBP, ?CreateFace@@YGXHHHEF@Z; +5
PUSH 0x1F ; -3
POP EDX ;+1
__XLOOP:
PUSH 0x1F
POP EDI ;+1
__YLOOP:
MOV ECX, EDX ; -1
MOV EAX, EDI
CALL ?GetCellPtr@@YGPAHHH@Z
MOV EBX, [EAX]
OR EBX, EBX
JZ __ELSE
PUSH EBX
PUSH 0x4
PUSH EDI
PUSH 0x1
PUSH EDX ; -2
CALL EBP ; -3
PUSH 3
POP ESI ;+1
__NSIDELOOP:
MOVSX ECX, BYTE [ESI+_DirMapX]
MOVSX EAX, BYTE [ESI+_DirMapY]
ADD ECX, EDX ; -1
ADD EAX, EDI
CALL ?GetCellPtr@@YGPAHHH@Z
CMP DWORD [EAX], 0x0
JNZ __SKIP
PUSH EBX
PUSH ESI
PUSH EDI
PUSH 0x0
PUSH EDX ; -2
CALL EBP ; -3
__SKIP:
DEC ESI ;-3
JNS __NSIDELOOP
JMP __THEN
__ELSE:
PUSH ESI
PUSH 0x5
PUSH EDI
PUSH ESI
PUSH EDX ; -2
CALL EBP ; -3
__THEN:
DEC EDI ;-3
JNS __YLOOP
DEC EDX ;-3 ; -2-1
JNS __XLOOP
POPA ; -2
RET ; -1
Slightly off-topic: I have no experience with 4K, but I see C++ mentioned in this thread a few times. I wonder if it really is a good idea to make use of C++ in 4Ks. Many C++ features tend to blow up code a lot. Shouldn't it be safer to stick to C to minimize code size?
nice one. :D
still needs to compress. argument order is still valid. and the forward offset for GetCellPtr. am i right?
zgrg: depends on what you want from "C++". classes, templates, etc have no place in a 4k (with the exception of COM objects like d3d/dsound) really.
for our revision exegfx, i made the odd choice to do the effect itself in C but did the stub/init/glue in ASM for a bunch of reasons:
- i like doing a bit of ASM every now and then, just for the feel of it.
- it was slightly smaller for a window init + data parser/loader code (at least the VS compiler didn't seem to realize the upsides of lods* / stos*)
- with a 4k, you often want to create multiple versions for the final product (different resolutions, windowed, etc) and i just found being able to compile ASM from the command line with different preprocessor definitions works better than it does with C (visual studio is a bit of a bastard with all its environment variables), so I just set up a build script that was able to pop out all the versions from one .bat file.
- with crinkler it's often beneficial to have control over your sections, which works out slightly better if you're running ASM.
note that these are all edge cases and the unpredictability of compression makes everything a "possible" good idea instead of a surefire one.
So what does crinkler say about the hand-optimized code? Does it actually compress better?
yumeji: you’re right.
yumeji, thanks for the input!
rrrola, this is very impressive! I will try that tonight and post the compression results.
Just to make sure that I'm on the right track here, assembler inlining is not an option as I want to change the function framing. So I have to nasm/yasm it and link it together, right?
Quote:
I have no doubt that when it comes to performance, modern compilers are more effective than manual asm coding.
That's an overrated statement.
Pure asm coding is always best. No compiler is smart enough to optimize down to the byte level. Some are quite good, yes, but with assembler you can do tricks to save bytes that a compiler can't necessarily do.
rudi, I have seen compilers do crazy things that a real person would not consider. I believe that writing assembler that fully utilizes the internal CPU architecture is REALLY REALLY complicated.
I think it is a matter of knowledge + (abstract) thinking. And fun. And I also guess that there will always be 1 byte that can be saved by hand-optimizing non-asm-coded stuff.
But maybe not impossibru (=
Quote:
I believe that writing assembler that fully utilizes the internal CPU architecture is REALLY REALLY complicated.
But maybe not impossibru (=
that's because your algorithm changes during assembly optimization.
I'd not really say "classes don't belong in a 4K". There are some situations where classes can actually help, e.g. with addressing.
Case in point (just typed down and compiled with VS2012)
Code:
class TestClass
{
public:
    int a, b;
    int Test()
    {
        return a + b;
    }
};

static int c, d;

int CTest()
{
    return c + d;
}
Now, the final machine code for CTest() looks like this (shortened):
Code:
?CTest@@YAHXZ PROC ; CTest, COMDAT
00000 a1 00 00 00 00 mov eax, DWORD PTR _d
00005 03 05 00 00 00 00 add eax, DWORD PTR _c
0000b c3 ret 0
?CTest@@YAHXZ ENDP ; CTest
while the "OMG C++ BLOAT" code of TestClass::Test() looks like this:
Code:
?Test@TestClass@@QAEHXZ PROC ; TestClass::Test, COMDAT
; _this$ = ecx
00000 8b 41 04 mov eax, DWORD PTR [ecx+4]
00003 03 01 add eax, DWORD PTR [ecx]
00005 c3 ret 0
?Test@TestClass@@QAEHXZ ENDP ; TestClass::Test
Addressing of the variables just got way smaller and clearer. Of course YMMV and after some crinkling the difference will mostly vanish in the depths of context modelling, but I wouldn't discount those C++ features entirely.
tlm: there may be a VC optimization option for stubless functions (like -fomit-frame-pointer in GCC).
Some assembly-level optimizations can also be done directly from C: looping down instead of up, indexed loops ending at zero, do-while loops to put the condition at the end, the do-undo trick to get rid of "else" jumps, …
TLM, rrrola: It's /Oy (aka "omit frame pointer")
Quote:
Pure asm coding is always best.
Surely you want to keep track of all the pipeline stalls, branch predictions and whatnot? I bet you can go insane from that pretty quickly, while a compiler certainly won't complain about reordering statements for the best execution times.
Quote:
rudi, I have seen compilers do crazy things that a real person would not consider.
Well, obviously compilers are not sentient. So these 'crazy things' they do have been considered by humans, and programmed into the optimization logic of the compiler.
But, back to the original topic... I suppose compiled code will not have all that much entropy, since compilers only use a small subset of the total x86 instruction set and always apply the same code generation routines to the same type of input. Which means that the compiled code will compress very nicely.
I'm not sure how much you can really win by writing everything by hand. Taking the compression into account seems to add yet another level of complexity to optimizing assembly code.
Or perhaps it even means that you should NOT try to be complex, but just stick to the same simple patterns, like a compiler does. Perhaps saving bytes in the original code because you think you're being 'smart' ends up in a lower compression ratio, because the compressor has to use a different encoding for that special case than for the rest of the code, where this would not happen with a compiler.
Discuss...
Or whatever :P