4k intro, C vs assembler?
category: code [glöplog]
I have no doubt that when it comes to performance, modern compilers are more effective than manual asm coding.
However, when it comes to code size, I'm not sure that this is the case.
For example, see the code below:
Code:
static void BuildSkin()
{
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
        {
            int cellValue = *GetCellPtr(x, y);
            if (cellValue)
            {
                CreateFace(x, 1, y, 4, cellValue);
                for (int nSide = 0; nSide < 4; nSide++)
                {
                    int CellValue = *GetCellPtr(x + DirMapX[nSide], y + DirMapY[nSide]);
                    if (!CellValue)
                        CreateFace(x, 0, y, nSide, cellValue);
                }
            }
            else
                CreateFace(x, 0, y, 5, 0);
        }
}
The compiled code is 128 bytes and it shrinks to 90 bytes after compression.
To me that seems quite big.
The question: would manually tweaked asm be much more effective?
For reference, this is the generated asm:
Code:
00420629 55 PUSH EBP
0042062A 8B EC MOV EBP, ESP
0042062C 51 PUSH ECX
0042062D 83 65 FC 00 AND DWORD [EBP-0x4], 0x0
00420631 53 PUSH EBX
00420632 56 PUSH ESI
00420633 57 PUSH EDI
00420634 33 FF XOR EDI, EDI
00420636 8B 4D FC MOV ECX, [EBP-0x4]
00420639 8B C7 MOV EAX, EDI
0042063B E8 4D FB FF FF CALL ?GetCellPtr@@YGPAHHH@Z
00420640 8B 18 MOV EBX, [EAX]
00420642 33 F6 XOR ESI, ESI
00420644 3B DE CMP EBX, ESI
00420646 74 40 JZ 0x420688
00420648 53 PUSH EBX
00420649 6A 04 PUSH 0x4
0042064B 57 PUSH EDI
0042064C 6A 01 PUSH 0x1
0042064E FF 75 FC PUSH DWORD [EBP-0x4]
00420651 E8 53 01 00 00 CALL ?CreateFace@@YGXHHHEF@Z
00420656 0F BE 8E 84 19 42 00 MOVSX ECX, BYTE [ESI+_DirMapX]
0042065D 0F BE 86 88 19 42 00 MOVSX EAX, BYTE [ESI+_DirMapY]
00420664 03 4D FC ADD ECX, [EBP-0x4]
00420667 03 C7 ADD EAX, EDI
00420669 E8 1F FB FF FF CALL ?GetCellPtr@@YGPAHHH@Z
0042066E 83 38 00 CMP DWORD [EAX], 0x0
00420671 75 0D JNZ 0x420680
00420673 53 PUSH EBX
00420674 56 PUSH ESI
00420675 57 PUSH EDI
00420676 6A 00 PUSH 0x0
00420678 FF 75 FC PUSH DWORD [EBP-0x4]
0042067B E8 29 01 00 00 CALL ?CreateFace@@YGXHHHEF@Z
00420680 46 INC ESI
00420681 83 FE 04 CMP ESI, 0x4
00420684 7C D0 JL 0x420656
00420686 EB 0D JMP 0x420695
00420688 56 PUSH ESI
00420689 6A 05 PUSH 0x5
0042068B 57 PUSH EDI
0042068C 56 PUSH ESI
0042068D FF 75 FC PUSH DWORD [EBP-0x4]
00420690 E8 14 01 00 00 CALL ?CreateFace@@YGXHHHEF@Z
00420695 47 INC EDI
00420696 83 FF 20 CMP EDI, 0x20
00420699 7C 9B JL 0x420636
0042069B FF 45 FC INC DWORD [EBP-0x4]
0042069E 83 7D FC 20 CMP DWORD [EBP-0x4], 0x20
004206A2 7C 90 JL 0x420634
004206A4 5F POP EDI
004206A5 5E POP ESI
004206A6 5B POP EBX
004206A7 C9 LEAVE
004206A8 C3 RET
for 4k most people use C/C++ nowadays, but if you're an asm lover and can't get enough of it, by all means make a demo about it, or a 4k in this case.
crinkler should be enough for anyone!
lol
I think I would suggest C++ & inline ASM for the things you want to optimize. I know that our coder normally writes everything in C++ and then refactors it in asm by tweaking some patterns...
4k -> nasm/yasm.
las, what do you mean?
ehh. well. trick the compressor? pretty hard. i dunno much about compression beyond reading the crinkler reports and how it does and looks.
but i'd say there's definitely potential to optimize this assembly here. see, register edx isn't used at all. the c++ way: make it a "class" pointer with some offset jumptable that makes the first calls smaller. or force a forward declaration to get a positive offset. the thing is, zeros compress better. change the function argument order to make the pattern more entropy-friendly repetitive. this is just a little though, the function got random arguments anyway. also DirMapX and DirMapY need to get outta the pattern, which is hard.
*my two cheap cents here* :D
*just thought re-reading my post*
how about using edx for the second loop counter? that'd kill those DWORD [EBP-0x4] stack locals and the same negative-offset stuff. or you decrease ebp to get positive stack offsets for locals. more zeros to compress.
anyway, that compiler did a shit job. an assembly betterization would definitely kill that. period.
Just a quick look, 102 or 96 bytes uncompressed.
EDX can be used for X (previously [EBP-4]). -14 bytes
You don't need the stack frame anymore. -4 bytes.
Use PUSHA/POPA (slower, but smaller). -4 bytes.
EBP is now free, use it for GetCellPtr. -4 bytes.
Code:
PUSHA ; -3-2
MOV EBP, ?CreateFace@@YGXHHHEF@Z; +5
XOR EDX, EDX ; -3
__XLOOP:
XOR EDI, EDI
__YLOOP:
MOV ECX, EDX ; -1
MOV EAX, EDI
CALL ?GetCellPtr@@YGPAHHH@Z
MOV EBX, [EAX]
XOR ESI, ESI
CMP EBX, ESI
JE __ELSE
PUSH EBX
PUSH 0x4
PUSH EDI
PUSH 0x1
PUSH EDX ; -2
CALL EBP ; -3
__NSIDELOOP:
MOVSX ECX, BYTE [ESI+_DirMapX]
MOVSX EAX, BYTE [ESI+_DirMapY]
ADD ECX, EDX ; -1
ADD EAX, EDI
CALL ?GetCellPtr@@YGPAHHH@Z
CMP DWORD [EAX], 0x0
JNZ __SKIP
PUSH EBX
PUSH ESI
PUSH EDI
PUSH 0x0
PUSH EDX ; -2
CALL EBP ; -3
__SKIP:
INC ESI
CMP ESI, 0x4
JL __NSIDELOOP
JMP __THEN
__ELSE:
PUSH ESI
PUSH 0x5
PUSH EDI
PUSH ESI
PUSH EDX ; -2
CALL EBP ; -3
__THEN:
INC EDI
CMP EDI, 0x20
JL __YLOOP
INC EDX ; -2
CMP EDX, 0x20 ; -1
JL __XLOOP
POPA ; -2
RET ; -1
You can save 6 more bytes by looping down instead of up, but the face order will be reversed.
Code:
PUSHA ; -3-2
MOV EBP, ?CreateFace@@YGXHHHEF@Z; +5
PUSH 0x1F ; -3
POP EDX ;+1
__XLOOP:
PUSH 0x1F
POP EDI ;+1
__YLOOP:
MOV ECX, EDX ; -1
MOV EAX, EDI
CALL ?GetCellPtr@@YGPAHHH@Z
MOV EBX, [EAX]
OR EBX, EBX
JZ __ELSE
PUSH EBX
PUSH 0x4
PUSH EDI
PUSH 0x1
PUSH EDX ; -2
CALL EBP ; -3
PUSH 3
POP ESI ;+1
__NSIDELOOP:
MOVSX ECX, BYTE [ESI+_DirMapX]
MOVSX EAX, BYTE [ESI+_DirMapY]
ADD ECX, EDX ; -1
ADD EAX, EDI
CALL ?GetCellPtr@@YGPAHHH@Z
CMP DWORD [EAX], 0x0
JNZ __SKIP
PUSH EBX
PUSH ESI
PUSH EDI
PUSH 0x0
PUSH EDX ; -2
CALL EBP ; -3
__SKIP:
DEC ESI ;-3
JNS __NSIDELOOP
JMP __THEN
__ELSE:
PUSH ESI
PUSH 0x5
PUSH EDI
PUSH ESI
PUSH EDX ; -2
CALL EBP ; -3
__THEN:
DEC EDI ;-3
JNS __YLOOP
DEC EDX ;-3 ; -2-1
JNS __XLOOP
POPA ; -2
RET ; -1
Slightly off-topic: I have no experience with 4K, but I see C++ mentioned in this thread a few times. I wonder if it really is a good idea to make use of C++ in 4Ks. Many C++ features tend to blow up code a lot. Shouldn't it be safer to stick to C to minimize code size?
nice one. :D
still needs to compress. argument order is still valid. and the forward offset for GetCellPtr. am i right?
zgrg: depends on what you want from "C++". classes, templates, etc have no place in a 4k (with the exception of COM objects like d3d/dsound) really.
for our revision exegfx, i made the odd choice to do the effect itself in C but did the stub/init/glue in ASM for a bunch of reasons:
- i like doing a bit of ASM every now and then, just for the feel of it.
- it was slightly smaller for a window init + data parser/loader code (at least the VS compiler didn't seem to realize the upsides of lods* / stos*)
- with a 4k, you often want to create multiple versions for the final product (different resolutions, windowed, etc) and i just found being able to compile ASM from the command line with different preprocessor definitions works better than it does with C (visual studio is a bit of a bastard with all its environment variables), so I just set up a build script that was able to pop out all the versions from one .bat file.
- with crinkler it's often beneficial to have control over your sections, which works out slightly better if you're running ASM.
note that these are all edge cases and the unpredictability of compression makes everything a "possible" good idea instead of a surefire one.
So what does crinkler say about the hand-optimized code? Does it actually compress better?
yumeji: you’re right.
yumeji, thanks for the input!
rrrola, this is very impressive! I will try that tonight and post the compression results.
Just to make sure that I'm on the right track here, assembler inlining is not an option as I want to change the function framing. So I have to nasm/yasm it and link it together, right?
Quote:
I have no doubt that when it comes to performance, modern compilers are more effective than manual asm coding.
That's an overrated statement.
Pure asm coding is always best. No compiler is smart enough to optimize down to the byte level. Some are quite good, yes, but with assembler you can do tricks to save bytes that a compiler can't necessarily do.
rudi, I have seen compilers do crazy things that a real person would not consider. I believe that writing assembler that fully utilizes the internal CPU architecture is REALLY REALLY complicated.
I think it is a matter of knowledge + (abstract) thinking. And fun. And I also guess that there will always be 1 byte that can be saved by hand-optimizing non-asm-coded stuff.
But maybe not impossibru (=
Quote:
I believe that writing assembler that fully utilizes the internal CPU architecture is REALLY REALLY complicated.
But maybe not impossibru (=
that's because your algorithm changes during assembly optimization.
I'd not really say "classes don't belong in a 4K". There are some situations where classes can actually help, e.g. with addressing.
Case in point (just typed down and compiled with VS2012)
Code:
class TestClass
{
public:
    int a, b;
    int Test()
    {
        return a + b;
    }
};

static int c, d;

int CTest()
{
    return c + d;
}
Now, the final machine code for CTest() looks like this (shortened):
Code:
?CTest@@YAHXZ PROC ; CTest, COMDAT
00000 a1 00 00 00 00 mov eax, DWORD PTR _d
00005 03 05 00 00 00 00 add eax, DWORD PTR _c
0000b c3 ret 0
?CTest@@YAHXZ ENDP ; CTest
while the "OMG C++ BLOAT" code of TestClass::Test() looks like this:
Code:
?Test@TestClass@@QAEHXZ PROC ; TestClass::Test, COMDAT
; _this$ = ecx
00000 8b 41 04 mov eax, DWORD PTR [ecx+4]
00003 03 01 add eax, DWORD PTR [ecx]
00005 c3 ret 0
?Test@TestClass@@QAEHXZ ENDP ; TestClass::Test
Addressing of the variables just got way smaller and clearer. Of course YMMV and after some crinkling the difference will mostly vanish in the depths of context modelling, but I wouldn't discount those C++ features entirely.
tlm: there may be a VC optimization option for stubless functions (like -fomit-frame-pointer in GCC).
Some assembly-level optimizations can also be done directly from C: looping down instead of up, indexed loops ending at zero, do-while loops to put the condition at the end, the do-undo trick to get rid of "else" jumps, …
TLM, rrrola: It's /Oy (aka "omit frame pointer")
Quote:
Pure asm coding is always best.
Surely you want to keep track of all the pipeline stalls, branch predictions and whatnot? I bet you can go insane from that pretty quickly, while a compiler certainly won't complain about reordering statements for the best execution times.
Quote:
rudi, I have seen compilers do crazy things that a real person would not consider.
Well, obviously compilers are not sentient. So these 'crazy things' they do have been considered by humans, and programmed into the optimization logic of the compiler.
But, back to the original topic... I suppose compiled code will not have all that much entropy, since compilers only use a small subset of the total x86 instruction set and always apply the same code generation routines to the same type of input. Which means that the compiled code will compress very nicely.
I'm not sure how much you can really win by writing everything by hand. Taking the compression into account seems to add yet another level of complexity to optimizing assembly code.
Or perhaps it even means that you should NOT try to be complex, but just stick to the same simple patterns, like a compiler does. Perhaps saving bytes in the original code because you think you're being 'smart' ends up in a lower compression ratio, because the compressor has to use a different encoding for that special case than for the rest of the code, where this would not happen with a compiler.
Discuss...
Or whatever :P