Amiga democoding?
category: general [glöplog]
"I know some people able to produce faster code than any compilers out there but in the end such stunts are a waste of time & totally out of the question for any decent-sized project (even a 64k intro)."
It's a rumour that it's hard to beat current C/C++ compilers' code generation. Compilers being good doesn't mean that they generate code that's impossible to beat by hand, it just means that in 98% of the cases the results are good enough to make additional hand-tweaking uninteresting.
Also, using ASM is definitely not a "stunt". Very little of our codebase for e.g. Werkkzeug3 is handoptimized, in terms of "lines of code", but in the precalc phase, easily 50% and more of the actual code executed will be handoptimized innerloops. For massive pixel pushing such as you'll have in texture generators, C/C++ code is not really an alternative.
But of course, writing programs in 100% assembler is just a total waste of time. A large percentage of your code will be executed so infrequently that, in most programs anyway, it wouldn't matter if it was interpreted BASIC. Using ASM for that type of code is just utterly uneconomical, because no matter how nice your machine's assembly language is, you'll still easily have 10x the typing and debugging overhead if you use ASM instead of some HLL.
  
I never said using asm was a stunt. I'm just saying that, while it's not impossible, trying to make faster code than e.g. a compiler such as Intel's is not worth the time spent.
You can't even be sure that your hardcore optimized routines will work optimally on every processor around (in fact they probably won't); also, processors recognize certain sequences of instructions usually produced by compilers & execute them faster, etc.
To my view, asm is only interesting for size reduction.
  
haha.. assembler is for weenies. Coding directly in machine language (hex dump) is much more efficient.
78 A9 00 8D 20 D0 8D 21 D0 A9 00 8D 14 03 A9 21 ....
  
Btw.. being able to code assembler does not mean knowing all the mnemonics off the top of your head. It means knowing by heart what the CPU is doing. And most preferably it also means knowing where a C compiler is better used.
  
ryg: isn't stuff like mmx/sse-intrinsics supposed to be just fine for "massive pixel pushing"? also, let's not forget that it's initcode... sure, it's nice to have low loading times, but it's not critical imo. (ie, i'd be pleased with compiler-code + intrinsics for precalc-time)
  
"trying to make faster code than e.g. a compiler such as Intel's is not worth the time spent"
As said, that very much depends on the actual code. And Intel CC's image is in some aspects a lot better than the reality of the code it generates. For example, when using SSE intrinsics (which is, btw, a sore spot both with MSVC and with Intel) I've seen MSVC-compiled code being more than twice as fast as the Intel CC counterpart. (No, not some arbitrary contrived example, an actual application - refer to http://www.flipcode.com/cgi-bin/fcmsg.cgi?thread_show=25914).
Putting blind faith in what your compiler does is faith misplaced.
"You can't even be sure that your hardcore optimized routines will work optimally on every processor around (in fact they probably won't)"
Bullshit. As long as you use instructions supported everywhere, your code will run everywhere.
"also processors recognize certain sequences of instructions usually produced by compilers & execute them faster"
Again bullshit. x86 instruction decoding is complicated enough as it is - it's just that the subset of x86 opcodes that current compilers use happens to be fast instructions, for the biggest part (with certain notable exceptions, like bit shifting instrs/certain addressing modes on P4).
"To my view, asm is only interesting for size reduction."
Contrary to what you seem to think, with things like SIMD instruction sets (both automatically vectorized stuff and "compiler-aided asm" like intrinsics) you can still expect major gains from writing the code yourself instead of letting the compiler do it.
I agree that in most cases it's not worth the hassle, but sometimes it just has to be as fast as possible. Video (de)coding is a prime example. (Check the ASM code percentage in current state-of-the-art video codecs...)
kusma: as mentioned above, code generated using intrinsics is usually between average and awful, but almost never really good. in our case, faster init is a nice side effect, but what we're really optimizing is editing speed, which is somewhat critical if you want your artists to stay happy :). besides, there are some things like software skinning and outline detection (damn stencil shadows) that tend to be a noticeable cpu hog unless optimized properly.
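To make the intrinsics question concrete, here is a minimal sketch of the kind of pixel-pushing inner loop being argued about: a saturating add of two 8-bit-per-channel buffers using SSE2 intrinsics, with a plain C fallback. The function name `add_pixels_saturated` and the buffer layout are assumptions for illustration only; nothing here is taken from Werkkzeug3 or any other code mentioned in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] = min(255, a[i] + b[i]) for n bytes. */
void add_pixels_saturated(uint8_t *dst, const uint8_t *a,
                          const uint8_t *b, size_t n)
{
    size_t i = 0;
    assert(dst != NULL && a != NULL && b != NULL);
#if defined(__SSE2__)
    /* 16 bytes (four RGBA pixels) per iteration; unaligned loads and
       stores keep the sketch simple at some cost in speed. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Scalar tail, and the whole job on targets without SSE2. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Whether a given compiler turns this into good machine code is exactly what the posts above disagree on; looking at the generated assembly is the only way to find out for a particular compiler and version.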
  
Quote:
"trying to make faster code than e.g. a compiler such as Intel's is not worth the time spent"
As said, very much depends on the actual code. And Intel CCs image is in some aspects a lot better than the reality of the code it generates. For example, when using SSE intrinsics (which is, btw, a sore spot both with MSVC and with Intel) I've seen MSVC-compiled code being more than twice as fast as the Intel CC counterpart. (No, not some arbitrary contrived example, an actual application - refer to http://www.flipcode.com/cgi-bin/fcmsg.cgi?thread_show=25914).
That's possible, it's just that I'm talking about time that would be better spent on finding a better algorithm instead of juggling with opcodes
(unless you're specifically paid for this job, that is).
Quote:
"You can't even be sure that your hardcore optimized routines will work optimally on every processor around (in fact they probably won't)"
Bullshit. As long as you use instructions supported everywhere, your code will run everywhere.
optimally
Quote:
"also processors recognize certain sequences of instructions usually produced by compilers & execute them faster"
Again bullshit. x86 instruction decoding is complicated enough as it is - it's just that the subset of x86 opcodes that current compilers use happens to be fast instructions, for the biggest part (with certain notable exceptions, like bit shifting instrs/certain addressing modes on P4).
I was talking about the pattern learning/recognition used for branch schemes. But it's not just about using fast instructions: cache misses, branch mispredictions, unrecognized patterns, unpaired instructions, misalignments, register & memory stalls, address interlocks, trace cache delivery rate, and more can have a serious impact on the speed of the code.
Quote:
"To my view, asm is only interesting for size reduction."
Contrary to what you seem to think, with things like SIMD instruction sets (both automatically vectorized stuff and "compiler-aided asm" like intrinsics) you can still expect major gains from writing the code yourself instead of letting the compiler do it.
Probably, but see the first point; and over Intel's history we've seen that an optimisation that is valid for one generation of their processors can be obsolete (outdated) for the next one.
@Slummy
I thought the reason there are so few 512 or 256 byte Amiga intros is that the overhead of the hunk format is so much larger than the overhead of the MS-DOS executable header. It's like trying to write 256 byte PE executables, I guess.
  
There are some routines that I have written in ARM ASM that I am sure never could get to run just as fast as the C code that did the exact same thing. All effects in EffekWerk, except for the outtro, would never have been just as fast as they are if they were written in C.
  
>There are some routines that I have written in ARM ASM that I am sure never could get to run just as fast as the C code that did the exact same thing.
So,. C was producing faster code than ARM?
>All effects in EffekWerk, except for the outtro, would never have been just as fast as they are if they were written in C.
So,. ARM was producing faster code than C?
p.s. Can't make up my mind which of the two cases you mean ;)
  
xeron: well, for a 512b intro it's not *that* big a deal to waste 32b on a single-hunk-header, but for all I know that might be a lot compared to other platforms.
I think people should do bootblock intros instead, since those are crammed with THE SPIRIT OF AMIGA!!
  
I did a friendly bootblock competition back in the early 9x; it was a rather interesting exercise.
  
hitchhikr: the one arranged by the eurochart-crew? should probably dig those entries out from somewhere...
  
you seem to compare C and ARM, but ARM is a cpu-architecture and C is a programming-language. thus, i'm assuming that by "ARM" you mean ARM asm.
the thing about arm (and more specifically GBA) is that the compilers produce good enough code most of the time, and as the memory isn't cached on a gba, you usually end up being more or less bound by memory-transfers rather than actual calculations. imo the important thing to know about optimizing (no matter the platform) is to always optimize based on detailed profiling, to ensure you're optimizing your actual bottlenecks.
  
uhm, that post was aimed at Optimus...
  
"That's possible, it's just that i'm talking about a time that should be spent on finding a better algorithm instead of juggling with opcodes (unless you're specifically paid for this job, that is)."
I am specifically talking about applications where you are already using a (provably) optimal algorithm. Picking a suitable algorithm first is obviously necessary; I never argued against that. Architecture-specific optimization is what you do when that still isn't fast enough (and it can easily make a difference of a factor of 12 and more).
The optimisation stuff: Misread you there, I thought you wanted to say that the optimized code wouldn't run at all, not that the code didn't run optimally. In any case, you have the same problem with compiler-generated code, so it is completely irrelevant to this discussion.
"I was talking about the pattern learning/recognition used for branch schemes. [..]"
All the things you mentioned are well and good, but compilers don't actually optimize for most of them. They can't compensate for cache misses (because those depend on the data layout, which is specified by the programmer and which a compiler is not allowed to change in any significant way from what is specified in the program), nor for branch mispredictions (which depend on the conditions being checked themselves; aside from obvious things like moving the most likely case to the top and replacing simple conditionals by computation, there's nothing a compiler can do to make code more predictable).
Compilers are by now rather good at scheduling instructions and minimizing dependencies in generated code. That is about the only thing in your list that compilers actually do.
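As an illustration of "replacing simple conditionals by computation", here is a small sketch in C. The function names are made up for this example, and whether the branch-free form is actually faster depends on the CPU and on what the compiler already does with the branchy version, so treat it as a demonstration of the idea and not as a recommendation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward version: contains a data-dependent branch. */
int32_t min_branchy(int32_t a, int32_t b)
{
    if (a < b)
        return a;
    return b;
}

/* Same result computed with a mask instead of a branch.
   mask is all ones when a < b and zero otherwise.
   Assumes the usual two's complement int32_t conversion. */
int32_t min_branchless(int32_t a, int32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < b);
    return (int32_t)(((uint32_t)a & mask) | ((uint32_t)b & ~mask));
}

/* Counting without an if: the comparison result (0 or 1) is added
   directly, so the loop body has no conditional jump in the source. */
size_t count_above(const int32_t *data, size_t n, int32_t threshold)
{
    size_t count = 0;
    size_t i;
    assert(data != NULL);
    for (i = 0; i < n; ++i)
        count += (size_t)(data[i] > threshold);
    return count;
}
```

Modern compilers often perform this transformation themselves (for example by emitting a conditional move), which is consistent with the point above that it is one of the few things on the list they can do.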
  
Quote:
I am specifically talking about applications where you are already using a (provably) optimal algorithm. First picking a suited algorithm is obviously necessary, I never argued against that. Architecture-specific optimization is what you do when that still isn't fast enough (and it can easily make a difference of factor 12 and more).
In most cases, algorithms can be optimized enough without having to juggle with opcodes. Nowadays, asm remains marginal, unless the target platform isn't powerful enough to run C (or any other language) compiled code at decent speed. Resistance is useless.
Quote:
The optimisation stuff: Misread you there, I thought you wanted to say that the optimized code wouldn't run at all, not that the code didn't run optimally. In any case, you have the same problem with compiler-generated code, so it is completely irrelevant to this discussion.
Quite the contrary ;D
There's no need to waste hours doing hardcore optimisations because of this.
(No need to speak about the timing & architecture differences between amd & intel processors, etc.).
Quote:
"I was talking about the pattern learning/recognition used for branch schemes. [..]"
All things you mentioned are nice and well, but compilers don't actually optimize for most of these. They can't compensate for cache misses (because that is dependent on the data layout, which a compiler is not allowed to change in any significant way from what is specified in the program), branch mispredictions (which depends on the conditions being checked themselves; aside from obvious things like moving the most likely case to the top and replacing simple conditionals by computation, there's nothing a compiler can do to make code better predictable),
Actually, compilers also remove conditional branches wherever they can (or even whole functions if possible) & also ensure automatic alignment of data, among other little things.
Understand that it's the accumulation of all these factors (and some others) that makes asm optimisations cumbersome & time consuming (and money consuming, if money is involved).
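The data-layout point in this exchange can be sketched as follows. Both layouts below hold the same particles; the type and function names are hypothetical. The compiler will align either one, but it will not turn the first into the second, so the choice (and the resulting cache behaviour) stays with the programmer.

```c
#include <assert.h>
#include <stddef.h>

/* Array of structures: summing only `mass` still drags x, y and z
   through the cache, since they share cache lines with it. */
typedef struct {
    float x, y, z;
    float mass;
} ParticleAoS;

/* Structure of arrays: each field is contiguous, so a pass over
   `mass` touches only the bytes it needs. */
typedef struct {
    float *x, *y, *z;
    float *mass;
    size_t count;
} ParticlesSoA;

float total_mass_aos(const ParticleAoS *p, size_t n)
{
    float sum = 0.0f;
    size_t i;
    assert(p != NULL);
    for (i = 0; i < n; ++i)
        sum += p[i].mass; /* stride of sizeof(ParticleAoS) bytes */
    return sum;
}

float total_mass_soa(const ParticlesSoA *p)
{
    float sum = 0.0f;
    size_t i;
    assert(p != NULL && p->mass != NULL);
    for (i = 0; i < p->count; ++i)
        sum += p->mass[i]; /* stride of sizeof(float) bytes */
    return sum;
}
```

Which layout wins depends on the access pattern: code that always uses all four fields of a particle together may be just as well off with the first form, which is why profiling comes before restructuring.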
i'm for the reasonable solution anyway: only call captain future when all other possible solutions have been tried.
and when there's no hope left :(
  
Ahh.. there are people who can do miracles in pure C (Dairy ;) and those who truly know how to optimize one single effect in asm. (Come on, tweaking the assembly code the compiler already generated won't give you much! But who thinks about organizing the data and registers in the best possible way, in order to write the most clever and optimal speedcode and get three times the speed of the compiler's code? I don't know about modern PCs, but on older machines it should surely work, and I doubt any compiler has the A.I. to do that.. ;P)
C and ASM rulez =)))
  
I know..
I have changed a bit..
I don't code Qbasic anymore (but Freebasic has a promising community ;)
C is for lazy people like me and I port my code. I code Java too ;)
And I still love assembly when I am not lazy..
  
Optimus: gcc on arm is mostly able to generate quite damn optimal code GIVEN GOOD C-CODE. and as opposed to what you state here, the code generated by the compiler IS usually a good place to start, if you want to maintain pure-assembly functions, that is. otherwise, you could simply clarify what you meant with a couple of inline-asm statements. remember, the compiler only does what you tell it to do. bad output from the compiler is usually a problem with the input, not the compiler.
  
optimus, just a thought, but maybe you should actually pick a platform and write some code for a production first which is worth optimising, and THEN worry about how you'll optimise it.
  
>bad output from the compiler is usually a problem with the input, not the compiler.
I agree with that.
But, at least if I am not mistaken, I still think that when most people claim that you can't get much more speed with asm than with C (other than some 10% after hard work, which is not worth it, as they say), they have only tried typical asm code, quite predictable and easy to read. That's not the limit. Some speedcode can become so insane and so different from your typical compiler code, so wild, so wicked, that you cannot believe it, and certainly your compiler can't. And yet someone can still come up with something faster! I don't know if it works today on PC, but not even on slow machines with ARM? I don't believe that; pity I don't have the time to prove it to myself right now, while we are talking..
  
Optimus: the optimizing rules, at least for arm, are quite straightforward. you can quite easily count the cycles spent, and the predictions are usually dead on, at least when i time my code. apart from that, you can do some pretty weird c-constructs that generate quite fast code as well. duff's device is a nice example of this. and as smash says, optimize on purpose.
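For anyone who hasn't seen it, a minimal sketch of Duff's device: a loop unrolled eight times in plain C, where the `switch` jumps into the middle of the `do`/`while` body to handle the leftover iterations. This is the memory-to-memory copy variant (Duff's original wrote to a single output register), and the name `duff_copy` is just chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `count` bytes from `from` to `to`, eight per loop iteration.
   The switch enters the unrolled body at the case matching
   count % 8, so the remainder is handled in the first pass. */
void duff_copy(char *to, const char *from, size_t count)
{
    size_t n;
    assert(to != NULL && from != NULL);
    if (count == 0)
        return;
    n = (count + 7) / 8; /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

On current compilers a plain loop or `memcpy` is usually at least as fast, so this is mostly interesting as an example of how far C's `switch` semantics can be bent, in line with the "weird c-constructs" remark above.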
  
I have tried unrolled code in C too (not in the way you can do it in assembly though, where it's much easier and more direct for this kind of stuff), but I will check out this Duff's device anyway. No point in commenting more though :P
  