Squeezing Pong into less than 8k instructions (or even less than 6k...)

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view

Squeezing Pong into less than 8k instructions (or even less than 6k...)

This post was updated on .
I want to share a code generation technique that allows to significantly reduce the code size of Hack programs. Having said that, there is non-surprisingly a tradeoff: the speed will also decrease (init time until Pong actually starts may take up to 18 seconds...).

I have achieved the following code sizes (for all versions the 2 Sys.wait calls in PongGame have been removed - same as in the official Pong.asm from project 06):

Pong (official OS and compiler): 7288 Pong_refOS.hack
Pong (official OS and own compiler/optimizer): 6568 Pong_refOSopt.hack
Pong (own OS and own compiler/optimizer): 5596 Pong_akOS.hack (actually much faster than the original because my Screen.drawRectangle is more efficient)

The code density for the above is between 1.85 ad 2.27 ASM instructions per VM instruction!

For Pong itself this is more a proof-of-concept since you can easily reduce its size below 32k with various ideas posted in this forum (i.e. dead code removal, global return function, ...).

However, for more complex programs this will not be sufficient. I have implemented nand2tetris on a FPGA and wanted to have something more impressive than Pong running on it. For example the raycaster presented here
(Current generated code size 24047, at first this seemed impossible since the program plus official OS has 28601 VM instructions (excluding labels) after dead code removal. But the code is full of constant array accesses (pre-computed trigonometry), which my optimizer reduced significantly, ending up with only 10855 VM instructions. Code density of 2.22 ASM/VM. I will publish it soon.)

The slower speed is not really an issue on the FPGA since it is running much faster than the official CPU emulator (which runs with JDK8 at about 1MHz).

How do I achieve this:
1. Threaded Code
2. Code deduplication
3. Improved VM code generation (for the optimized versions)
4. Lots of small nasty hacks exceeding the scope of this post...

1. Threaded code
This is the most important feature. The main idea is that VM commands are not generated inline but instead as internal subroutines:
// save return address
// push
//internal return

The trick here is how can we keep the call to the internal subroutine small. It is typically assumed that an internal call requires 4 ASM instructions:

If you cascade many of such calls you will notice, that all the return addresses are just 4 words apart. In other words D is increased by 4 for every internal call. Afterwards each of the internal subroutines will save D into some internal register (let's say R15). That sounds quite inefficient, doesn't it? What if we permanently use R15 as some kind of internal program counter? Instead of passing a return address to the internal routine, the program always "knows" where it is and where to continue.
This is how it looks like:
// no need to save a return address
// directly start with the push
// PC+=2
// increase R15 by 2 and jump to that address (which is the next call to an internal VM command)

// call to push local 5 no longer requires to pass a return address
// and so on...

So we have managed to reduce each call to a VM command to only 2 words! This is the main principle, though you can still improve a lot in order to reduce the overhead of the internal subroutines. Specifically I am using parametrized subroutines. I don't want to go too much into detail here, but you may figure out when analyzing the code in the CPU emulator.

The code for increasing R15 by 2 can be done with 4 ASM instructions at the end of each internal subroutine. I have opted for another solution, putting the code into another subroutine, leaving us with only 2 instructions to call that routine. With a nasty hack, the speed overhead for R15+=2 is only 2+3 cycles this way. So I am saving 2 instructions per occurrence and losing only 1 cycle on speed. That's why I thought it is worth it. You may want to figure out how the PC+2 routine works in the CPU emulator...

This approach will lead to a rather unusual distribution of instructions with nearly 50% A-Commands and 41% jumps(!):
The code is basically an endless cascade of jumps. You can think of threaded code as a Hack-based interpreter or emulator for VM code.

2. Code deduplication
The entropy of Hack code is generally quite low, with lots of instruction sequences repeating. This is true on both VM level and ASM level. The official OS contributes even more to this due to the charset bitmaps (my own OS uses a compressed charset requiring only 285 constants instead of 96*12=1152...).
My code generator is looking for such repeating sequences and putting them again in internal subroutines. You may look at some of the VM files and will for sure recognize typical sequences yourself. Depending on the sequence this will typically require an internal return address to be stored (I use R14).

3. Improved VM code:
This is more a topic for the compiler project. But just to outline some of the main things:
- improved array handling
  - avoid the temp 0 storage (a[expr1]=expr2 -> if expr1 is calculated AFTER expr2 there is no need to use temp)
  - prevent unnecessary update of pointer 1 (accessing the same array twice)
  - using that <const> instead of setting pointer 1 with adding a constant and then using that 0
- improved conditional jumps/flow
- handling of void functions (already suggested by some here, but I believe they missed the point that just removing the pop temp 0 will lead to a corrupted stack over time)
  - remove push constant 0 (this can generally be done safely since SP is restored based on ARG by RETURN routine)
  - separate global return routine for void functions, avoiding to leave a return value on the stack, and remove the pop temp 0 for DO statements (be careful, it is allowed to call a non-void function with do, just ignoring the return value...)

Final comment:
As mentioned this technique only makes sense if the program doesn't fit into the 32k. Otherwise a normal inline code generation approach will do better. Theoretically a mix of both approaches is possible to achieve the maximum speed while remaining below 32k (or for example opt for in-lining the increase of R15 instead of a subroutine as described above)

Is this the best we can do? Definitely not. There are still a few things that I haven't implemented. A simple thing to increase speed while maintaining the size is to change the convention for push and pop as suggested in another post (SP pointing to last stack element instead of pointing behind it). While this reduces the size of inline code, the threaded code size is more or less not impacted by it since the stack access is in subroutines.

The handling of the code deduplication is also not optimal and can be further improved, both to reduce size (however not much...) and improve speed.

The approach in my code is generally stack inefficient and accesses the stack with every single VM command. This approach could be changed into using the D register (which is typically what you do for speed optimized inline code). Doing this with threaded code is possible, but it will create much more overhead (prevents parametrization). Without having tested it, I would guess that the code size would be larger, but a bit faster.

I have also experimented with some compressed VM code interpreter, encoding 3 of the most common VM instructions into one word (5 bits each). While this is possible and I made a PoC with an interpreter kernel requiring 76 ASM instructions, it will further and significantly reduce speed and I don't think the additional size reduction would justify.

One thing that I have implemented, but which doesn't work with the official CPU emulator due to a bug discussed here is a feature that I call setOnJump. It makes use of the fact that a jump often goes to a label which is followed by @SP (or @LCL). Setting A to zero (which is what @SP does) can be achieved while jumping:
While this works in the Hardware Simulator and on real Hardware, the official CPU emulator wrongly updates A before executing the jump, jumping to address 0 (unless you install my patch as posted in the thread). So I have omitted this in the above versions. It will speed up the code by about 4%.
Another idea is to replace some Jack-based OS routines with ASM code. This is especially useful for Math.multiply and divide.

I hope the general concept is clear. Feel free to ask if I haven't made things clear enough :-)
Reply | Threaded
Open this post in threaded view

Re: Squeezing Pong into less than 8k instructions (or even less than 6k...)

This post was updated on .
Some further clarifications:
What makes the above code slower versus straightforward inline code:

1. Overhead for calling and returning to the internal subroutines
   Total 7 cycles
    - 2 for the call
    - 2 for the call to the increase R15/PC  routine
    - 3 for the actual increase and jump back to the program
2. The internal routines are by definition always generic
   Local optimization of VM commands is mostly not possible
   (i.e. treating pop local 8 differently from pop local 4)
3. All internal subroutines use the stack
   A speed optimized version needs much less stack accesses
4. Code deduplication is generating additional overhead

Code density:
The marginal code density for additional VM instructions is converging towards 2. At some point all required VM instructions are already covered by an internal subroutine, so just one call with length 2 has to be added for each additional VM instruction.
Generally the lower the entropy of the VM code, the better my VM translator can "compress" and make use of code deduplication. Therefore programs using the official OS can often achieve less than 2 ASM instructions per VM instruction. So lots of the entropy inefficiency is eliminated by the VM translator. However, better VM code (with higher entropy) will still result in smaller generated code, though with a slightly higher KPI ASM/VM instructions.

Raycaster Hackenstein 3D:
One thing I forgot to mention: I also use string compression for the string literals, encoding 2 or 3 characters in one constant. The effect is not material though. The main reason for reducing the VM code size remains the array handling. Instead of using String.appendChar, the VM Code will then use my own class String2.appendMultipleChars to decode. This has not been used for Pong, since there are only 2 small string literals (Score and Game over) and it doesn't make sense.
Another thing about H3D if you want to handle it in your VM translator yourself: it requires a correct implementation for signed integer comparisons. Which means you have to handle a signed overflow when subtracting 2 values for comparison:
 (Example: -20000<20000 would otherwise lead to -20000-20000=-40000 -> overflow -> 25536)

Since Pong requires only a fraction of the full charset, you could omit all other characters in Output.init :-) For my own OS this is not that easy because I work with a compressed data stream. But just for fun I created a reduced charset version of the official OS and combined it with the rest of my OS.
Result: 5078 ASM instructions, so technically even below 5k :-)
Reply | Threaded
Open this post in threaded view

Re: Squeezing Pong into less than 8k instructions (or even less than 6k...)

Really late to the show here, and I should be working, I'd just like to say, all of this is fantastic !

In many ways it's a testament to the authors, that a few folks (in relation to all the folks that are on the planet) are getting so much out of it. Perhaps not a surprise when one looks at the field of computer science, but I still think the work you've done is fantastic!

Reply | Threaded
Open this post in threaded view

Re: Squeezing Pong into less than 8k instructions (or even less than 6k...)

I have improved my threaded code generator a bit further: It is mainly linked to an optimization in the way parametrized internal subroutines are called.

With that I have achieved the following sizes translating Pong:
1. using official OS and compiler: from 7288 down to 6990 Pong_refOS_v2.hack
2. using official OS and own compiler/optimizer: from 6568 down to 6199  Pong_refOS_opt_v2.hack
3. using own OS and own compiler/optimizer: 5596 down to 5135  Pong_akOS_v2.hack
4. using truncated charset and own size optimized OS: from 5078 down to 4157 Pong_tiny.hack
(Up to here this is all purely based on translated Jack Code and VM files from the OS)

5. using truncated charset and some asm OS routines(new): 3921  Pong_tiny_asmlib.hack
(It includes asm versions of Math, Memory (both complete), Keyboard.keypressed and an internal routine to calculate the pixel address in the screen RAM)
In all versions, the 2 calls to Sys.wait have been removed (same as in the official ASM version from project 6).

How does it work:
Just to recap, the translated program with the threaded code looks like this:

// push constant 3
// push constant 4

And somewhere in the program you have the subroutines:
// jump to subroutine that will increase the address in R15 by 2
// and continue execution from that address
// Next internal subroutine ...

This can be improved with parametrization. The idea is that it's not necessary to create a new subroutine for neighboring constants (or neighboring locals, args,...). Instead it is possible to adjust D directly in the jump command to pass a parameter:
D=-1;JMP, D=0;JMP or D=1;JMP
By doing so only one subroutine is necessary for 3 neighboring constants:
D=D+A // D will be 3, 4 or 5 depending on D at the entry point.

This was already included in the previous versions. I have extended the concept a bit further: In my code, I ensure that D is always 1 before the next threaded subroutine is called (e.g. when it returns from the PC increase routine). With that guarantee, there are now 5 possible parameter combinations together with a jump:
D=!D, D=-1, D=0, D=1, D=D+1, ranging from D=-2 to D=2

As a consequence less internal subroutines are needed to cover the range of parameters. In the above example the same subroutine could now handle constants from 2-6 instead of only 3-5.

In principle it would be even better to guarantee that D is 2 instead of 1 at the respective points in the code. With that 7 values could be assigned to D at no extra costs (-3 to +3). While that is possible, it clashes with some other optimizations in my code, so for now I didn't implement it. Maybe I will find some time to work on that in the future. The effect will not be significant in most cases though due to diminishing returns.