Kuemmel information 1111 glöps
- general:
- level: user
- personal:
- first name: Michael
- last name: Kübel
- cdcs:
- cdc #1: Puls by Řrřola [web]
- cdc #2: Megapole by Red Sector Inc. [web]
- cdc #3: sp04 - Hello, Kevin - A Dental Journey by Spacepigs [web]
- cdc #4: 0b5vr GLSL Techno Live Set by 0b5vr
- cdc #5: vestige by erpholia
- 256b MS-Dos Space Fungus by Frigo
- ...using the findings on reducing the overhead of the clamping, I added a double-pixel plot via MOVQ, which gives 26 FPS for rcpps while still keeping the file size (rcpps at 254 Bytes, divps at 248 Bytes) :-) Find the code at the same link as above from my previous post.
- isok, added on the 2019-09-29 19:34:38
- 256b MS-Dos Space Fungus by Frigo
- 128-bit writes to video memory bring another speed improvement: from previously 23 FPS to 30 FPS on my laptop :-)
- isok, added on the 2019-09-28 20:51:09
- 256b MS-Dos Space Fungus by Frigo
- I couldn't let this go, so after extensive shrinking of the constants, optimizing far jumps, and some hints from TomCat and his FPU shrink, I'm down to 252 for the divps version and 258 for the rcpps one, including exit/textmode support :-)
Framerate is still 23 FPS for rcpps and about 21 FPS for divps. You can find it here - isok, added on the 2019-09-27 20:56:18
- 256b MS-Dos Space Fungus by Frigo
- @TomCat: To understand that, you have to look at the CPU architecture, for example Intel Skylake. Internally, a CPU has parallel execution ports for different instructions, and the scheduler tries to keep those ports/units as busy as possible.
For example, a MOVAPS can execute on ports 0, 1 and 5, a MULPS on ports 0 and 1, and a DIVPS only on port 0. So if consecutive instructions are independent (they don't need the previous result and don't modify registers the previous instruction uses), they can execute in parallel on those ports. Reordering code therefore sometimes helps, too. It's a bit of trial and error.
Which instruction runs on which ports can be found in Agner's manuals here.
Modern CPUs have even more internal helpers to speed up execution. If I remember correctly, there are many more registers internally, and they can be renamed to speed things up. But I'm not really an expert on those things.
Yes, maybe replacing the STOSD would also help, I didn't try that. Calculating 4 pixels, though, would be a lot of overhead, I guess... - isok, added on the 2019-09-23 08:45:10
- 256b MS-Dos Space Fungus by Frigo
- To speed things up even more using the out-of-order capabilities and the multiple execution ports of modern CPUs, I made a version with an inner loop that calculates locations x and x+1 at the same time, so that there are no directly dependent instructions next to each other.
Before:
Code:
kaliset_loop:
    movaps xmm2,xmm0            ;d = old p
    dpps   xmm2,xmm2,01111111b  ;d = dot(p,p) of first 3 floats and put result in all 4 floats
    andps  xmm0,xmm7            ;p = abs(p) by mask
    rcpps  xmm2,xmm2            ;reciprocal + multiply is faster than divps, accuracy seems okay
    mulps  xmm0,xmm2            ;p = abs(p)/dot(p,p)
    dec    bx                   ;reordered, may save some cycles (SSE ops leave the flags alone)
    subps  xmm0,xmm6            ;p = abs(p)/dot(p,p)-(1,1,0.1)*m
    addps  xmm1,xmm0            ;c+=p
    jnz    kaliset_loop
After:
Code:
kaliset_loop:
    movaps xmm2,xmm0            ;d1 = old p1
    movaps xmm5,xmm3            ;d2 = old p2
    dpps   xmm2,xmm2,01111111b  ;d1 = dot(p1,p1) of first 3 floats and put result in all 4 floats
    andps  xmm0,xmm7            ;abs(p1)
    dpps   xmm5,xmm5,01111111b  ;d2 = dot(p2,p2)
    andps  xmm3,xmm7            ;abs(p2)
    rcpps  xmm2,xmm2            ;reciprocal + multiply is faster than divps, accuracy seems okay
    dec    bx                   ;reordered, may save some cycles (SSE ops leave the flags alone)
    rcpps  xmm5,xmm5
    mulps  xmm0,xmm2            ;p1 = abs(p1)/dot(p1,p1)
    mulps  xmm3,xmm5            ;p2 = abs(p2)/dot(p2,p2)
    subps  xmm0,xmm6            ;p1 = abs(p1)/dot(p1,p1)-(1,1,0.1)*m
    subps  xmm3,xmm6            ;p2 = abs(p2)/dot(p2,p2)-(1,1,0.1)*m
    addps  xmm1,xmm0            ;c1+=p1
    addps  xmm4,xmm3            ;c2+=p2
    jnz    kaliset_loop
...it just takes a lot of bytes: I'm now back at around 279 bytes due to the x+1 preparation and the additional bytes for plotting. We can save some if we use divps instead of rcpps/mulps, and reach almost 256 without ESC/textmode.
But hey...speed is up from 13 FPS to 23 FPS (!) between those two SSE variants, so more than 4 times the FPU version. Link to the code is here. I guess this kind of lengthy speedcode optimization would be more useful for a 512 Byter :-) - isok, added on the 2019-09-22 23:42:53
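For readers who don't read SSE fluently, the before/after idea can be modelled in plain scalar C. This is only a sketch, not the intro's code: the names `vec3`, `kali_step`, `kali_single` and `kali_pair` are made up for illustration, it uses true division (as in the divps variant) rather than the rcpps approximation, and it says nothing about actual timings. It only shows that interleaving two pixels gives two independent dependency chains with the same per-pixel results.

```c
#include <math.h>

/* Hypothetical 3-component vector; the asm keeps this in one xmm register. */
typedef struct { float x, y, z; } vec3;

/* One Kaliset step as described in the asm comments:
   p = abs(p) / dot(p, p) - c
   True division is used here; rcpps would only approximate 1/dot. */
static vec3 kali_step(vec3 p, vec3 c) {
    float d = p.x * p.x + p.y * p.y + p.z * p.z;
    vec3 r;
    r.x = fabsf(p.x) / d - c.x;
    r.y = fabsf(p.y) / d - c.y;
    r.z = fabsf(p.z) / d - c.z;
    return r;
}

static vec3 add3(vec3 a, vec3 b) {
    vec3 r = { a.x + b.x, a.y + b.y, a.z + b.z };
    return r;
}

/* "Before": one pixel per loop, every operation waits on the previous one. */
static vec3 kali_single(vec3 p, vec3 c, int n) {
    vec3 acc = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < n; i++) {
        p = kali_step(p, c);   /* p = abs(p)/dot(p,p) - c */
        acc = add3(acc, p);    /* c += p in the asm comments */
    }
    return acc;
}

/* "After": pixels x and x+1 in the same loop. The two chains never read
   each other's values, so an out-of-order CPU is free to overlap them. */
static void kali_pair(vec3 p1, vec3 p2, vec3 c, int n, vec3 *acc1, vec3 *acc2) {
    vec3 a1 = { 0.0f, 0.0f, 0.0f };
    vec3 a2 = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < n; i++) {
        p1 = kali_step(p1, c);
        p2 = kali_step(p2, c);
        a1 = add3(a1, p1);
        a2 = add3(a2, p2);
    }
    *acc1 = a1;
    *acc2 = a2;
}
```

Calling `kali_pair` with two neighbouring start points should give the same accumulators as two separate `kali_single` calls; the cost, as in the asm, is the extra registers and code for the second chain.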
- 256b MS-Dos Space Fungus by Frigo
- Nice choice of a Kali-set, Frigo :-)
When I was looking through your well-commented source code and the shader code, I thought to myself that this kind of algorithm just 'cries' for an SSE implementation for speed and maybe even size...
...so to refresh my SSE knowledge I coded an SSE version of your intro. It requires a CPU with SSE4.1 and assembles with flat assembler (FASM). You can find it for download here.
Even with ESC/textmode support I'm down to 244 Bytes, with a huge speed bonus. Your version runs on my laptop at around 5 FPS, the SSE version at around 13 FPS :-)
So in special cases SSE can be beneficial for 256 Byte intros :-) - rulez, added on the 2019-09-20 22:26:54
- wild Animation/Video Who The Fun Is Bill Gates 4 by CaPaNÑa [web]
- Great idea...let's all not shave until April 2020, then we will look like that at the next Revision anyway and can take a real photo there :-)
- rulez, added on the 2019-09-16 19:28:07
- 256b MS-Dos DORIAN IS COMING by Astroidea [web]
- Nice idea...General MIDI is your friend :-) ...like Hellmood said, I also guess it's doable in 128 Bytes...
- rulez, added on the 2019-09-16 17:28:09
- 256b MS-Dos Sakura 桜 by Řrřola [web]
- I think, as clearing the screen at the start isn't needed, you could save 2 Bytes at the beginning with this hack that Hellmood once told me:
Code:
    sub al,-(0x13+0x80)  ;no screen clearing, and the code itself provides the address
    mov gs,word[si]      ;results in 0x6d2c which seems to work...
- isok, added on the 2019-09-16 14:23:59
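The numbers in that hack can be checked with a little arithmetic, sketched here in C. The assumptions are mine, not stated in the comment: AL is 0 and SI is 0x100 at .COM entry (commonly relied on in tiny intros, but not guaranteed), the `sub` is the first instruction of the file, and an `int 0x10` with AH=0 follows. The helper names are hypothetical.

```c
#include <stdint.h>

/* Opcode byte of "sub al, imm8" on x86. */
enum { OPCODE_SUB_AL_IMM8 = 0x2C };

/* The 8-bit immediate the assembler emits for -(0x13+0x80). */
static uint8_t hack_imm8(void) {
    return (uint8_t)(-(0x13 + 0x80));
}

/* AL after "sub al, imm8", for a given AL before the instruction. */
static uint8_t sub_al(uint8_t al_before, uint8_t imm8) {
    return (uint8_t)(al_before - imm8);
}

/* 16-bit little-endian word formed by two consecutive bytes in memory,
   which is what "mov gs,word[si]" reads when SI points at the sub. */
static uint16_t le_word(uint8_t lo, uint8_t hi) {
    return (uint16_t)(lo | (hi << 8));
}
```

Under those assumptions, `sub_al(0, hack_imm8())` is 0x93, i.e. mode 0x13 with bit 7 set, which asks the BIOS not to clear video memory. `le_word(OPCODE_SUB_AL_IMM8, hack_imm8())` is 0x6D2C, the segment value quoted in the comment, so the instruction's own bytes double as the data.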
- wild Animation/Video Demoscene tutorial by Nagz [web]
- ...should be on Wikipedia for the newbies
- rulez, added on the 2019-09-15 17:32:09
account created on the 2011-06-01 22:43:05
