OpenGL framework for 1k intro

category: code [glöplog]

I sort of agree. Well one thing I really wanted from this thread was to encourage discussion and sharing ideas, but also to inspire people to come up with new ideas. Already there is some great discussion and I have got some new ideas for myself, I am rewriting my music player and timing system now because of this thread.

Yzi, I think there are many ways to do the music and no single best option. I used tables because that was the first thing I thought of, the player code is very small and the tables compressed very well. But now after reading your ideas I want to experiment with storing only minimal note data in a table and using a binary step sequencer for note on/off. I think some of these ideas could be expanded to a really compact 4k synth too.

I will put my old Windows Visual C project up later since I changed my 1k to assembly now anyway. This is the simple part of the project and it should not be a secret. Really it's no different to the IQ framework which has already been around for years, except it drops the vertex shader and uses glCreateShaderProgramv. I think the real magic is in the music player and shader code, and that is something for each individual to make their own because there is no simple solution that fits every situation.

added on the 2014-08-22 19:00:13 by drift

So how big are your MIDI players and data tables, compressed vs uncompressed?

added on the 2014-08-23 08:50:29 by yzi

Well the intro I am working on now does not have very complex music because I put a very big shader in it.... but still I have 2 tables of music data:

Table one is 128 bytes uncompressed and 14 bytes compressed.
Table two is 131 bytes uncompressed and 16 bytes compressed.

Not sure on the player size exactly since the code is all one chunk but I can try and work it out later. You can see how well the data compresses though, actually after a certain amount it hardly adds any. I made the exit time of the intro shorter and deleted about 20% of the music data and it saved zero bytes somehow...
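To get a feel for why deleting repetitive music data can save next to nothing, here is a rough Python sketch that uses zlib as a stand-in for the real packer. The packers discussed in this thread work differently from deflate, so the absolute numbers will not match, and the 16-step pattern below is made up purely for illustration.

```python
import zlib

# Hypothetical 16-step note pattern (0 = blank step); not taken from any real intro.
PATTERN = bytes([60, 0, 63, 0, 67, 0, 63, 0, 58, 0, 62, 0, 65, 0, 62, 0])


def packed_size(data: bytes) -> int:
    """Size after deflate at maximum level (a stand-in for the real packer)."""
    return len(zlib.compress(data, 9))


full_table = PATTERN * 8           # 128 bytes of note data
trimmed_table = full_table[:102]   # roughly 20% of the data deleted

full_size = packed_size(full_table)
trimmed_size = packed_size(trimmed_table)
print(len(full_table), "bytes ->", full_size, "bytes packed")
print(len(trimmed_table), "bytes ->", trimmed_size, "bytes packed")
```

Once the compressor has seen the pattern, a longer repeat costs almost nothing extra, so shortening the repeat gives almost nothing back.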

I experimented with using a binary sequencer for note on/off. Even with very simple code that does a bit test on the sequence data, then a bit rotate, then a jump if the bit was zero, it ended up about equal to just storing blank notes in the table. I think if you made more complex music than mine it might be worth using the binary sequencer. My music has no drums for example.
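For anyone who has not seen the idea before, a minimal sketch of a binary step sequencer in Python follows. It only shows the data layout (packed notes plus one gate bit per step); the function name and the example notes are assumptions for illustration, and a real 1k player would do the same thing with a bit test, a rotate and a conditional jump in asm.

```python
def expand_sequence(notes, gate_bits, steps):
    """Rebuild the per-step note list from packed notes plus a gate bitmask.

    Bit i of gate_bits says whether step i plays the next stored note (1)
    or stays silent (0).
    """
    out = []
    remaining = iter(notes)
    for step in range(steps):
        if (gate_bits >> step) & 1:
            out.append(next(remaining))
        else:
            out.append(0)  # blank step, same as a zero entry in a plain table
    return out


# Hypothetical example data: 4 notes spread over 8 steps.
notes = [60, 63, 67, 63]
gate_bits = 0b01010101  # steps 0, 2, 4 and 6 are "note on"
table = expand_sequence(notes, gate_bits, 8)
print(table)
```

The plain table stores all 8 entries, the sequencer stores 4 notes plus 1 byte of gate bits. Whether that wins after compression is exactly the thing that has to be measured.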

added on the 2014-08-23 10:06:39 by drift

1KPack doesn't give a nice compression report like Crinkler, but I've deleted some code and recompressed to check.

For Dystaxia :
Compressed size with player and 2048 notes: 1024 bytes
Compressed size with player and 1 note: 929 bytes (-95 bytes)
Compressed size without player: 806 bytes (-123 bytes)

Take these numbers with a grain of salt, I just deleted everything sound-related and compiled without trying to keep the intro runnable.

added on the 2014-08-23 14:09:28 by Seven

VS project for people who wanted it:

OpenGL 1k Framework

added on the 2014-08-24 00:50:30 by drift

Thanks!

added on the 2014-08-24 14:41:14 by numtek

Wanted to post some more size optimising tips. Mostly it will be asm stuff since that's what I have been working with lately, but also some general tips that could apply to C/C++ too. Actually, after spending some time trying to save every byte possible, the best advice is just "Try everything". Finding the balance between uncompressed size and compressed size is a challenge since every case is unique. Literally changing the value of one byte of code/data or its position will have an effect on everything else when the compressor does its magic.

Variables and Data ==========
You absolutely have to experiment with how and where everything is allocated. Try each variable or data segment with and without static declaration. In asm where you have more control over how and where data is placed in the program it is not as bad but in C/C++ it can easily change the compressed size by 10 bytes or more.

Try every combination of data positions. Not only will it affect compression pattern matching in the data, but the patterns in the code segment will also change when the data addresses change.
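"Try every combination" can be automated. The sketch below brute-forces every ordering of a few data segments and keeps the one that packs smallest. It is only an approximation: zlib stands in for the real packer, the segment contents are invented, and in a real build the code bytes change too when addresses move, so the whole executable would have to be rebuilt and repacked for each ordering.

```python
import itertools
import zlib


def packed_size(blob: bytes) -> int:
    """Size after deflate at maximum level (a stand-in for the real packer)."""
    return len(zlib.compress(blob, 9))


def best_order(segments):
    """Try every ordering of the named segments; return (size, order) of the smallest."""
    best = None
    for order in itertools.permutations(segments):
        blob = b"".join(segments[name] for name in order)
        size = packed_size(blob)
        if best is None or size < best[0]:
            best = (size, order)
    return best


# Hypothetical data segments, invented for this example.
segments = {
    "notes": bytes([60, 0, 63, 0, 67, 0, 63, 0] * 4),
    "title": b"intro\x00",
    "shader": b"void main(){gl_FragColor=gl_Color;}\x00",
    "consts": bytes([0, 0, 128, 63, 0, 0, 0, 64]),
}

size, order = best_order(segments)
print(size, order)
```

With 4 segments there are only 24 orderings, so the search is instant. The count grows factorially, so for more segments a random or greedy search is the practical option.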

In asm where you have total control over the variables and data, try to reuse as much as possible. For example, when storing a string that needs to be terminated with a zero, place it in front of some data that you know starts with a zero and save that extra byte. You may get a lot of zeros at the start and end of some data, so just overlap those zeros where possible. Having said all that, I have also seen better compression by padding with zeros at the start and end of some data segments (more precisely, the compressed size went up when I found a way to delete some zero data; this sort of thing will make you crazy!)
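The zero-overlap trick is easy to show outside of asm. This is a small Python sketch, with a made-up helper name and made-up data, that places a second blob so its leading zeros share storage with the trailing zeros of the first one and reports where the second blob now starts.

```python
def overlap_zeros(first: bytes, second: bytes):
    """Lay out `second` after `first`, sharing trailing/leading zero bytes.

    Returns (blob, offset) where `offset` is where `second` starts inside blob.
    """
    trailing = len(first) - len(first.rstrip(b"\x00"))
    leading = len(second) - len(second.lstrip(b"\x00"))
    shared = min(trailing, leading)
    return first + second[shared:], len(first) - shared


# Hypothetical data: a zero-terminated string followed by the float 1.0
# stored little-endian (00 00 80 3F), which happens to start with zeros.
title = b"intro\x00"
one = b"\x00\x00\x80\x3f"
blob, offset = overlap_zeros(title, one)
print(len(title) + len(one), "->", len(blob), "bytes, float at offset", offset)
```

Both objects are still fully readable at their own addresses; the string terminator simply doubles as the first byte of the float.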

Again in asm experiment both with holding commonly reused data in registers and storing and accessing it from memory. Holding a register obviously is smaller in size before compression but again there are plenty of cases where the compressed size is smaller with using memory to hold the values you need.

Program Code ==========
There is only so much you can do in C/C++, but obviously try different orderings of your code where possible. Also try to make your variable sizes and types always match; this saves the compiler from generating nasty conversion code. Repeating something even if it is redundant is good for compression. A simple example in OpenGL: if you send your time to the shader through gl_Color but you only need 1 time variable, sending the same time data to all the colour channels will compress better than sending time to just one channel then pushing zeros to the rest.

Choose constants that compress well. Multipliers and conditions can be adjusted to find a number that saves a byte or more. Sometimes you might need to sacrifice some precision for this but it is a choice you make, size vs. visuals/sound/performance.
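Hunting for a constant that packs well can also be scripted. The following Python sketch scans values within a tolerance of the wanted one and keeps whichever gives the smallest packed size. The `build` function is a hypothetical stand-in for "rebuild the intro with this constant", zlib stands in for the real packer, and the surrounding data is invented.

```python
import struct
import zlib


def packed_size(blob: bytes) -> int:
    """Size after deflate at maximum level (a stand-in for the real packer)."""
    return len(zlib.compress(blob, 9))


def best_constant(build, wanted: float, tolerance: float, steps: int = 200):
    """Scan wanted +/- tolerance; return (size, value) of the smallest packed build."""
    best = (packed_size(build(wanted)), wanted)
    for i in range(steps + 1):
        value = wanted - tolerance + 2 * tolerance * i / steps
        size = packed_size(build(value))
        if size < best[0]:
            best = (size, value)
    return best


def build(speed: float) -> bytes:
    """Hypothetical program image: some other float constants plus `speed`."""
    other_data = struct.pack("<4f", 0.5, 1.0, 2.0, 0.25)
    return other_data + struct.pack("<f", speed) + b"\x00" * 8


size, value = best_constant(build, 0.53, 0.05)
print("wanted 0.53, picked", value, "->", size, "bytes packed")
```

The tolerance is the artistic decision: it says how much precision you are willing to trade for bytes.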

In asm try to use the same registers for everything. Obviously use EAX as much as possible since you are getting so many returns from function calls anyway. If you know 100% what value a function will return then it is possible to use this. For example, if you know a function will return zero and you need to push some zeros, you can just push EAX instead.

Sometimes you save some bytes by holding registers, sometimes not. Experiment with using the stack for certain cases to see if it helps. Also I have found for some reason that push 0 will compress better than pushing a zero'd register. Sometimes even without the overhead of having to zero the register first it is still smaller to push 0. I am not sure why but it seems that Crinkler really loves to compress push 0.

Experiment with condition testing and branching. There are often many ways to test for a condition, particularly if you choose certain conditions that work well with bitwise instructions. Try to make your conditions triggered by flags. Sometimes it will be better to do a decremental loop so your condition is zero. Obviously try to keep it so you can use SHORT branches, though this isn't always going to be possible. All this stuff is really important in the music code.

Try different size instructions. Even though you might only need to work with a byte it is sometimes smaller to use larger instructions. I found that it can be smaller just working with EAX rather than using AH and AL. Again I assume this is down to compression patterns, so obviously test every situation to see which works best.

One last thing with patterns: I noticed that sometimes it's better to push all your values in a block and then do all the function calls. Sometimes not. Experiment with this.

Shader Code ==========
Declare all your variables of the same types inline. You can reuse the same variable name more than once if it is inside a separate loop or function.

Use the very excellent Shader Minifier tool from Ctrl-Alt-Test. I just use the online one here. If you are hardcore optimising you need to do a lot of it by hand but the best thing that tool does is choose variable names. I know it tries to reuse the most common letters which helps with compression. I didn't think it would make much difference but even in a tiny shader it saved me 3 bytes only by changing the variable names and nothing else.

Minimise your calculations, I sometimes use the math website WolframAlpha to help find ways to express a formula in a smaller size. Obviously put all one-time calculations inline rather than using lots of separate variables.

Wherever possible try to inline and minimise everything inside a function or loop so that it is all on one line. This will save you a pair of curly braces.

Use similar pattern matching tricks as you do in code and other data. Try all combinations of variable orders to see which compresses best. Sometimes a constant may compress smaller by typing it out multiple times rather than declaring it as a #define or a variable.

Again same as I mentioned earlier try and choose constants that compress well by reusing the same values as much as possible and round your floats as much as possible. Use approximations in constants and calculations if it is smaller. Balance this size against aesthetics obviously.

==========
Well that's all for now. I am sure there are many things I am missing, but again just to make it clear the real trick is to try every possible combination and way of doing things. The pattern matching for the compressor seems to be much more important than the uncompressed size. Personally I find it very addictive; as a coder it is fun and good mental exercise, it's almost like playing a challenging game. I think I have saved all the bytes I can, but then I have an idea to try something else and save another byte or two. Do this many times and you end up with a good result.

Feel free to add your own size-coding advice.

added on the 2014-08-30 06:13:30 by drift

That looks hard to read, sorry, I'm not very good at formatting text. There are some good tips I think, so try to read it all if you can.

added on the 2014-08-30 06:16:03 by drift

Oh one other thing I thought I should mention regarding register use in asm. I am certainly not an expert at x86 code but I do know that there are special 1 byte opcodes for doing specific operations with specific registers. This could save some bytes, maybe... it would be possible to save some uncompressed bytes but I am not sure if it would compress much better.

Problem is, to get this benefit you need to use the registers correctly. So ESI and EDI need to be set up so you can make reads to and from memory cheaper. Then you have the problem of not being able to use those registers as cheap storage space. In a 1k I am not sure there is any size saving from being able to use a couple of LODS/STOS etc. if you then need to push and pop registers more often.

Of course there are lots of cheap opcodes you can still use like XCHG and many EAX specific operators have one byte opcodes.

added on the 2014-08-30 07:33:44 by drift

Some minor thread necromancy.

I'd like to comment on some points ts makes, since I've been investigating minimal program creation on *nix side. As noted earlier in this thread, I've implemented a tool for creating 'minimal' ELF32/ELF64 binaries for Linux and FreeBSD, but have unfortunately found that it doesn't really seem to be anywhere near good enough for 1k intros.

As things stand, dnload.py produces a "Hello World!" program in 413 bytes (FreeBSD-ia32). It is also possible to open an OpenGL window, compile shaders and have some very rudimentary time-based animation in 783 bytes. This might sound good enough, but it must be noted that neither of these examples contain any kind of audio. A minimal (but not optimized) raycaster that animates a pulsating ball and plays music from very short programs takes 1177 bytes, which is already way too much if one wants to make an actual intro.

I'd guess one of the essential differences here is the availability of General MIDI. Insofar as I know, there is no be-all-end-all MIDI playback interface available for FreeBSD/Linux like there is for OSX and Windows. One could assume random libraries from desktop environments being available, but it's already a stretch in addition to what is commonly considered to be de-facto available (which is pretty much nothing besides OpenGL, SDL, GLUT and whichever compressor the system seems to be using).

For reference, Amand Tihon had a similar project of his own, already discussed in the earlier thread. He had ported flow2 to Linux. I've modified his port to compile using dnload, available here: http://pastebin.com/cSZ8Xzg5

Using this 'modern' technology, what was once a 1k is now merely 811 bytes, far below what Tihon's BOLD produced. It could easily feature some kind of rudimentary audio, but is still significantly short of being able to deliver what we've come to expect from, say, ts or yzi in this very thread.

I wonder if it simply comes to going full ASM and forgoing C completely. The last time we made an intro in assembler was in '04 and I don't exactly miss those times.

ts

Quote:

In 1k it does not really matter if you have fancy compressor like crinkler or simple LZW/gzip, difference is not that much. Real enemy is the static overhead which comes from executable headers, decompressor, lib import, hashes, gl setup and such. In theory here linux/freebsd could shine since they have lzma installed by default plus they have SDL.

Modern Linux distributions actually do not seem to have LZMA installed, which is a shame. They of course have xz utilizing the same techniques, but the headers of the older .lzma format were smaller. This actually matters even in 4k intros. FreeBSD comes with both lzcat and xzcat included from get-go which is an advantage.
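The header difference between the two containers is easy to measure with Python's standard lzma module, which can write both the legacy .lzma format (`FORMAT_ALONE`) and .xz (`FORMAT_XZ`). This is only a sketch with a made-up payload and the module's default settings, so the exact gap will differ from what lzma/xz command-line tools with tuned options give.

```python
import lzma


def container_sizes(payload: bytes):
    """Return (legacy .lzma size, .xz size) for the same payload."""
    alone = lzma.compress(payload, format=lzma.FORMAT_ALONE)
    xz = lzma.compress(payload, format=lzma.FORMAT_XZ)
    return len(alone), len(xz)


# Hypothetical stand-in for a small binary.
payload = bytes(range(256)) * 3
alone_size, xz_size = container_sizes(payload)
print(".lzma:", alone_size, "bytes   .xz:", xz_size, "bytes")
```

At intro sizes the fixed cost of the container is a noticeable share of the file, which is why the choice of format matters at all.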

ts

Quote:

Shell dropping is not dead. at least not in *nix platforms. Our overhead for compression is 42 bytes. You just cant write a decent decompressor in that space. Also, with shell dropper you can compress the executable headers

Mind pasting your implementation? Ours currently looks like this:

Code:i=/tmp/i;tail -n+2 $0|lzcat>$i;chmod +x $i;$i;rm $i;exit

58 bytes, since it needs a newline.
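For readers new to shell dropping, this is roughly what a build step for such a file looks like, sketched in Python. The stub is the one posted above; the helper names and the fake binary are assumptions for illustration, and `extract_payload` only imitates in Python what `tail -n+2 $0|lzcat` does at run time.

```python
import lzma

# Stub as posted above; the newline separates the script from the packed binary.
STUB = b"i=/tmp/i;tail -n+2 $0|lzcat>$i;chmod +x $i;$i;rm $i;exit\n"


def build_dropper(binary: bytes) -> bytes:
    """Prepend the shell stub to the binary packed in legacy .lzma format."""
    return STUB + lzma.compress(binary, format=lzma.FORMAT_ALONE)


def extract_payload(dropper: bytes) -> bytes:
    """Imitate the stub: skip the first line, then decompress everything after it."""
    _, _, packed = dropper.partition(b"\n")
    return lzma.decompress(packed, format=lzma.FORMAT_ALONE)


# Hypothetical "executable", just some bytes for the round trip.
binary = b"\x7fELF" + bytes(100)
dropper = build_dropper(binary)
print(len(binary), "bytes in,", len(dropper), "bytes out")
```

Everything before the first newline is pure overhead, which is why the rest of this discussion is about shaving single characters off that line.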

ts

Quote:

I've been thinking about sharing our osx framework. On the other hand that would allow more people to participate in 1k compos. but on the hand I'm afraid that sharing such a thing where you can do "drop shader here, put midifile there" framework would halt all the development seen in now 1k's, all the crazy ideas would come to halt and future 1k's would be copies of each other. So I'm not yet convinced which way to go...

Some kind of "input shader here" -frameworks have already been done. For example this one by YOLP. I also kind of pasted one up there, along with the tools necessary. The thing you still have is a competitive advantage!

In any case, I'd honestly be interested in knowing how low your tool goes with the aforementioned flow2.

las

Quote:

To be able to use that you will need a "#version" pragma (e.g. "#version 430"). Not sure whether that's smaller but you can get rid of one more call + some strings with a little more shader code.

Thanks for the tip. Using #version 430, layout directive and glCreateShaderProgramv was ~20 bytes smaller in compressed binary size than going OpenGL2 -style.

added on the 2014-09-23 21:27:41 by Trilkk

I'll check the flow2 on osX. I'm interested in that as well.

Quote:

Mind pasting your implementation?

Your implementation looks like Marq's original one. It uses chmod and all :)

To get really short, you need to interleave the script and gzip headers (dunno what is available with other algos). The gzip header has 6 bytes of data that are not relevant (timestamp, OS, flags) and an optional null-terminated string.
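For reference, the fixed gzip member header is 10 bytes (RFC 1952): 2 magic bytes, the compression method, a flag byte, a 4-byte timestamp, an extra-flags byte and an OS byte. The last three fields add up to the 6 bytes mentioned here, and bit 3 of the flag byte announces the optional null-terminated file name. A small Python sketch that splits those fields (the helper name is made up):

```python
import gzip
import struct


def parse_gzip_header(data: bytes) -> dict:
    """Split the fixed 10-byte gzip member header (RFC 1952) into its fields."""
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<2sBBIBB", data[:10])
    return {
        "magic": magic,      # always 1f 8b
        "method": method,    # 8 = deflate
        "flags": flags,      # bit 3 (0x08) = a null-terminated file name follows
        "mtime": mtime,      # 4 bytes of timestamp
        "xfl": xfl,          # 1 byte of extra flags
        "os": os_id,         # 1 byte naming the OS that wrote the file
    }


packed = gzip.compress(b"hello", mtime=0)
header = parse_gzip_header(packed)
print(header)
```

Whether a given zcat really ignores the timestamp, extra-flags and OS bytes is something to test on the target system; the sketch only shows where they sit.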

Our short one is as follows (does not remove the binary from /tmp since it is not required by rules, that would add 5 bytes or so)

Code:


cp $0 /tmp/z;(sed 1d $0|zcat
<first four bytes of gzip header>)>$_;$_
)
<gzip data>

That is 42 bytes of overhead including the gzip header, or 32 if you count only the script.

If interleaving is not possible, I suppose you can use:

Code:


cp $0 /tmp/z;(sed 2d $0|zcat)>$_;$_
)

That would be 38 bytes. Still shorter, even if you add the 'rm'.

Also, obviously we remove the 4-byte CRC-32 at the end of the gzip. It just sits there wasting space.
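Per RFC 1952 a gzip member ends with an 8-byte trailer: the CRC-32 of the uncompressed data followed by its length modulo 2^32 (ISIZE). Neither is needed to reconstruct the data, as this Python sketch shows by inflating only the raw deflate body. It assumes a simple member with no optional header fields, and says nothing about how much truncation a particular zcat tolerates, which has to be tried on the target system.

```python
import gzip
import struct
import zlib


def split_gzip(data: bytes):
    """Split a simple gzip member (no optional header fields) into header, body, trailer."""
    return data[:10], data[10:-8], data[-8:]


payload = b"size coding " * 20
packed = gzip.compress(payload, mtime=0)
header, body, trailer = split_gzip(packed)

# Trailer layout: CRC-32 first, then the uncompressed length (ISIZE).
crc32, isize = struct.unpack("<II", trailer)

# wbits=-15 means "raw deflate stream, no header or trailer expected".
unpacked = zlib.decompressobj(wbits=-15).decompress(body)
print(len(packed), "bytes packed, trailer is", len(trailer), "of them")
```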

added on the 2014-09-24 08:04:55 by ts

Ok, speaking of ugly hacks, here it is:

flow2 for osx

651 bytes. I did not really do any optimization. And I did not work out the math of how the original used ModelViewMatrix for rotation. I did something similar-ish.

Probably it would be something around 512 bytes if done properly

added on the 2014-09-24 19:20:47 by ts

I seriously looked and played around with your example for hours and couldn't figure out how it could work even in theory because it simply did not make sense in SH.

...until I figured out that since you're coming from OSX, it's probably bash-specific syntax. Which turned out to be the case.

Unfortunately that trick cannot be expressed in plain Bourne shell, because it will actually expand $_ into "/bin/sh" and that'll be that. However, the sed trick is applicable and if we assume we won't have to care about any dirty error messages (as you don't seem to) nor remove the filedumped binary (ditto), I can push flow2 down to 797 bytes.

I'm guessing we're starting to hit OS limitations here. To clarify:
- FreeBSD has the disadvantage that an executable binary has to feature two symbols named __progname and environ. Not only do these need to feature in symtab, but the executable must also present a valid hash table for the dynamic linker to find them. This adds roughly 100 bytes of uncompressed binary size.
- Linux can present an empty symtab and omit hash from the binary, but an algorithm doing ELF import-by-hash has to cope with GNU hash tables. It seems libraries in some distributions do not necessarily provide a SYSV hash table at all. Unfortunately searching symbols from GNU hash tables or even figuring out the number of symbols present takes even more space than jumping through the FreeBSD hoops.
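For context, these are the two symbol-name hash functions behind the ELF hash table flavours mentioned in the list above, written out in Python. The functions themselves are the standard ones; the point of the sketch is only to show them side by side, and it does not model the table layouts. The SysV table header stores the symbol count directly (nchain), while the GNU table does not, which is where the extra code for counting symbols comes from.

```python
def sysv_hash(name: bytes) -> int:
    """ELF SysV symbol hash, as used by DT_HASH tables."""
    h = 0
    for c in name:
        h = ((h << 4) + c) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF
    return h


def gnu_hash(name: bytes) -> int:
    """GNU symbol hash, as used by DT_GNU_HASH tables."""
    h = 5381
    for c in name:
        h = (h * 33 + c) & 0xFFFFFFFF
    return h


for symbol in (b"exit", b"glClear"):
    print(symbol.decode(), hex(sysv_hash(symbol)), hex(gnu_hash(symbol)))
```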

My guess would be that with OSX and Mach-O, you don't have either of these issues?

added on the 2014-09-24 23:15:53 by Trilkk

Quote:

flow2 for osx 651 bytes.

Works fine on my Mac Mini, except the Esc key for exit, of course.
Why not publish your "put shader here" tool? ;) We released one to have fun playing with shaders. IMHO, such tools will not stop people who are interested in system code/optimization.

added on the 2014-09-25 12:11:27 by Manwe

The sed trick is actually something firehawk found out. That was awesome.

I suppose this is the smallest implementation that works in sh, unless you are lucky enough to have tac in the default install. That would make the sed go away.

Code:


HOME=/tmp/z;cp $0 ~;sed 2d $0|zcat>~;~
)

Moving the actual execution into the gzip header, it is 37 bytes. Quite acceptable :)

Quote:

- FreeBSD has the disadvantage that an executable binary has to feature two symbols named __progname and environ. Not only do these need to feature in symtab, but the executable must also present a valid hash table for the dynamic linker to find them. This adds roughly 100 bytes of uncompressed binary size.

We had similar stuff with osX 10.4, it needed to have environ defined. That was bad, fortunately modern osX does not have this limitation. I guess we are lucky

Quote:

- Linux can present an empty symtab and omit hash from the binary, but an algorithm doing ELF import-by-hash has to cope with GNU hash tables. It seems libraries in some distributions do not necessarily provide a SYSV hash table at all. Unfortunately searching symbols from GNU hash tables or even figuring out the number of symbols present takes even more space than jumping through the FreeBSD hoops.

Yes, I remember the idiocy when I was working to implement dl-loader in linux. I hated the fact that linux guys can't decide how their binaries should work :(

Quote:

My guess would be that with OSX and Mach-O, you don't have either of these issues?

Well, we do not have these issues.

We put the libraries needed into the Mach-O header as LC_LOAD_DYLIB load commands, where dyld happily loads them. We then browse them to get the nlist table as well as the string table, and that's that. We implement hashing ourselves and loop through the strings. This is about 160 bytes of code total.
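The "hash the names ourselves and loop through the strings" part can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the hash function is a generic multiply-by-33 one and not necessarily what Laturi uses, and the symbol list with its addresses is invented. Only the idea is from the post: the intro stores a 32-bit hash per imported function instead of its name, and finds the function by hashing every name in the library's symbol table.

```python
def name_hash(name: bytes) -> int:
    """Illustrative 32-bit name hash; an assumption, not the real tool's function."""
    h = 0
    for c in name:
        h = (h * 33 + c) & 0xFFFFFFFF
    return h


def resolve(symbols, wanted_hash):
    """Walk (name, address) pairs, as an nlist + string table walk would, and
    return the address of the first name whose hash matches."""
    for name, address in symbols:
        if name_hash(name) == wanted_hash:
            return address
    return None


# Hypothetical symbol table; Mach-O C symbols carry a leading underscore.
symbols = [(b"_glClear", 0x1000), (b"_glRects", 0x1040), (b"_glUseProgram", 0x1080)]

# In a real intro this constant is computed at build time and stored in the binary.
WANTED = name_hash(b"_glRects")
print(hex(WANTED), "->", hex(resolve(symbols, WANTED)))
```

Storing 4 bytes (or fewer, if a shorter hash is collision-free for the libraries in question) per import is what makes this cheaper than keeping the names around.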

Unfortunately osX has other burdens related to the Mach-O format and the loader. Apple has hardened the loader. It checks that all the stuff in the headers is present and that sections neither interleave nor are incomplete. Annoying. 200 bytes going poof.

Quote:

Why not publish your "put shader here" tool? ;) We released one to have fun playing with shaders. IMHO, such tools will not stop people who are interested in system code/optimization.

I released this earlier

There have been 3 people using it and only 1 prod made with it (apart from ours).

Not a very successful demotool I'm afraid. To make a proper demotool for 1k I would need to clean up the code, right now it is fugly. If there were someone actually making prods with it and willing to betatest and give feedback, even though the first release would be buggy, I would consider releasing it to a wider audience :)

I have believed since 2007 that *nix systems have a competitive advantage in 1k/4k intros, if the framework is done properly. It has just taken me quite a long time to write the damn thing :D

added on the 2014-09-25 16:54:16 by ts

ts: I'll try it.

added on the 2014-09-25 17:27:33 by neu / metoikos

ts: I have had OSX+Laturi on my demoscene/todo-list for a year already. Sadly, I still haven't had an opportunity to use my free time on figuring out a way to init OpenGL on OSX in as few bytes as possible in a non-deprecated way. I believe all fullscreen CGL/AGL stuff has been deprecated for years now, and the expected overhead of going the full obj-c cocoa way is frightening. GLUT seems to be the only option left.

added on the 2014-09-25 18:06:34 by provod

iirc __progname and environ symbols are needed either by libc or crt0. Shit doesnt happen when you link with gcc -nostartfiles -nostdlib.
btw we really need an executable packer on GNU/Linux, that trick with bash is crappy :(

added on the 2014-09-25 19:27:23 by stfsux

Manwe: the downside with "insert shader" is, it will reduce diversity and change the game. Then it becomes even more of a shader compo, instead of intro compo. So far it has been interesting to see all these different approaches to program structures. But then again, the progress in the 1k category hasn't been very fast. And this year's Assembly showed that Windows+Crinkler has too much overhead compared to TDA's Mac system, for 1k. There's no way around it.

I still have one reasonably nice unreleased 1k music routine for Windows, so maybe I'll stick it in a compofiller somewhere, but when I have time to return to 1ks more seriously, it will have to be Mac all the way.

added on the 2014-09-25 19:44:19 by yzi

yzi, not a "shader compo" only - don't forget about the music and the whole design, idea and concept. The demoscene is about art, after all. We can add some music and make a 4k intro with all those tricks; nobody limits us to 1k, actually.

added on the 2014-09-25 21:49:57 by Manwe

So far there have been lots of moving parts, lots of dimensions of expression as a mandatory thing, because everyone has had to come up with their own basic building blocks for some structural parts. I'm sure you can understand that this has resulted in some creativity that wouldn't have happened otherwise.

added on the 2014-09-25 22:37:32 by yzi

Quote:

The sed trick is actually something firehawk found out. That was awesome.

I suppose this is the smallest implementation that works in sh, unless you are lucky enough to have tac in the default install. That would make the sed go away.

Code:

HOME=/tmp/z;cp $0 ~;sed 2d $0|zcat>~;~
)

Actually, I tested this, and lzma/xz produce such a significant advantage that it's not worth using gzip, even if the headers were interleaved.

Unfortunately, investigating the file formats for xz (http://tukaani.org/xz/xz-file-format-1.0.4.txt) or lzma (http://svn.python.org/projects/external/xz-5.0.3/doc/lzma-file-format.txt) does not seem to allow for similar neat interleave tricks. However, .xz seems to have a stream footer - it is possible there could be something to gain from mangling the format like you do with the CRC32.

Also, when using just plain SH, it does not seem that a newline and a ) is necessary to halt execution (to parse error?). A plain newline will do. The shell will attempt to continue parsing the compressed LZMA stream, which will stop execution just as well - the error message will be uglier, but if we're stretching the limits that hardly matters. I modified my tool to allow for 'pretty' and normal filedump mode. The ugly version now looks like this:

Code:

HOME=/tmp/i;sed 1d $0|lzcat>~;chmod +x ~;~
<lzma data>

44 bytes with newline.

But I must thank you again: when cleaning up and exiting properly, the HOME and ~ trick saves one byte compared to i and $i, because the file name is used once more for the removal at the end (when just being dirty, both ways are the same length).
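The byte accounting behind that is simple enough to check mechanically: `HOME=` costs 3 characters more than `i=`, and every `~` saves 1 character over `$i`. A quick Python sketch that builds both variants and compares lengths (the helper is made up, and the "clean" variant here only appends the `rm`, which is enough to show the difference):

```python
def stub(var_name: str, ref: str, clean: bool) -> str:
    """Build a dropper stub naming the temp file via `i=`/`$i` or `HOME=`/`~`."""
    script = f"{var_name}=/tmp/i;sed 1d $0|lzcat>{ref};chmod +x {ref};{ref}"
    if clean:
        script += f";rm {ref}"  # one more use of the file name
    return script + "\n"


# Positive numbers mean the HOME/~ variant is shorter.
dirty_diff = len(stub("i", "$i", False)) - len(stub("HOME", "~", False))
clean_diff = len(stub("i", "$i", True)) - len(stub("HOME", "~", True))
print("dirty:", dirty_diff, "byte(s) saved   clean:", clean_diff, "byte(s) saved")
```

Three uses of the name break even, the fourth use is where the byte comes from.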

Quote:

I released this earlier

There have been 3 people using it and only 1 prod made with it (apart from ours).

Insofar as I know, no-one except us has used dnload.py either so ditto. In all honesty, putting it up into google code was just more convenient for me.

stfsux

Quote:

iirc __progname and environ symbols are needed either by libc or crt0. Shit doesnt happen when you link with gcc -nostartfiles -nostdlib.

Would be neat, but no. environ and __progname are required by FreeBSD libc.so and insofar as I can see, cannot be avoided - even if you build the whole binary byte by byte from the ground up (which we do).

Perhaps you're talking about Linux, where they indeed are not required in the dynamic linking process. However, as said, the advantage gained from omitting symtab, hash table and the symbols is more than offset by having to deal with GNU hash.

It would seem that for *nix size coding, OSX >> FreeBSD > Linux.

added on the 2014-09-25 22:39:42 by Trilkk

Quote:

I'll try it.

Quote:

I have had OSX+Laturi on my demoscene/todo-list for a year already. Sadly, I still haven't had an opportunity to use my free time on figuring out a way to init OpenGL on OSX in as few bytes as possible in a non-deprecated way. I believe all fullscreen CGL/AGL stuff has been deprecated for years now, and the expected overhead of going the full obj-c cocoa way is frightening. GLUT seems to be the only option left.

I have tried them all: GLUT is horror, AGL is plain bad. CGL was the good way earlier but now on the new ATIs deprecated APIs are finally dead (does not create framebuffer by default). So the solution is to use NSOpenGLView.

You can drop me an email and I'll send you a preliminary 1k framework, which has the hard work done, plus a new laturi which works with Mavericks.

Others can ping me as well

Quote:

So far there have been lots of moving parts, lots of dimensions of expression as a mandatory thing, because everyone has had to come up with their own basic building blocks for some structural parts. I'm sure you can understand that this has resulted in some creativity that wouldn't have happened otherwise.

True as this is, maybe it is still nice to share some basic stuff though.

Quote:

btw we really need an executable packer on GNU/Linux, that trick with bash is crappy :(

Well, the cab-dropper on windows was a crappy trick. That's probably what you were thinking of ;) On *nix systems the shell is a nice resource to be utilized. Nothing still beats shell dropping for the smallness of the decompression overhead.

That being said, I'm working on ppm-packer for laturi. I already have compressor and decompressor in C, and halfway done decompressor in asm. Now I just have to finish it. Still I plan to use shell dropping so I can compress the decompressor and binary headers. Best of both worlds...

Quote:

it would seem that for *nix size coding, OSX >> FreeBSD > Linux.

To me, it sounds like freebsd is better in 4k, and osX better in 1k. Linux being still the strange apple here. Next step for *nix sizecoding is to go against windows crinkler 4k's

added on the 2014-09-28 19:24:37 by ts

I don't think releasing framework code hurts too much, there has already been code around for years on that IN4k website for example. Personally when I first saw the TBC 1k intros I had no idea how they got so much in 1k and I thought there was no point even trying to make a 1k since no matter what I did it would never be as good. But after some time spent learning and experimenting with different ideas I came up with some code which I think is not too bad, probably going to release at FuckJS since there are no other parties with a 1k category unless I want to wait until some time next year. My shit is still not as good as what TBC did 5 years ago but at least I am trying to do something rather than complain and do nothing.

I think if we can help more people get that first step towards making a modern 1k with decent visuals and music then maybe we will see more releases and more compos and some new ideas.

That was my idea with this thread to share ideas, not give away every secret but just make the learning curve easier. Now I see the people who have made some of the best 1k intros in the last few years posting and sharing, I am very happy.

added on the 2014-09-29 05:32:36 by drift

ts: I've been unable to find any of your contacts. So I'm leaving my address here, in case you see this message: marflon@gmail.com.

added on the 2014-10-21 09:37:54 by provod
