I wanted to share my work in VM optimization because I've done some stuff that hasn't been discussed here.
There has been a lot of work on optimizing generated ASM files for size, in particular by cadet1620 who made a post detailing their impressive journey. Their implementation does a comprehensive replacement of repeated code for general subroutines at some cost to execution speed. Writing of the logic required to do this dynamically is in itself a feat.
My philosophy to optimization is not to reduce size as much as possible. Rather, it is to reduce size as much as possible without sacrificing speed. This is why my implementation does not use subroutines apart from single-level subroutines for call and return, which are used across all levels (because not using them is dumb). My focus is rather on memory-to-memory operations and small changes to the Hack VM contracts imposed in the course.
As a disclaimer, at the time of writing this I have only gone as far as chapter 8 in the course. I have not yet implemented the compiler or OS, so my only benchmarks will be the supplied OS and Pong files compiled with supplied JackCompiler. The listed line counts are for the generated output, which means labels are included.
Optimization level 0: 40815 lines
This is my output without any optimizations (other than the global call and return subroutines). I decided to include the global call and return by default because otherwise comparisons with the optimized outputs would become meaningless. Just know that the output with in-line calls and returns is upwards of 60k lines.
With dead code removal: 32999 lines
I added dead code removal as a separate optional flag to my translator so it can be applied to any optimization level. I don't really consider this an optimization since it doesn't change the way instructions are written, but rather fixes a flaw in the way the Hack OS STL is implemented. This allows even unoptimized Pong to run on the CPU emulator which is pretty cool.
Dead code removal stays active for the rest of the benchmarks below. The real fun begins now...
Optimization level 1: 21358 lines (35% reduction)
The level 0 translation is very verbose. It follows the course textbook to the letter. It only uses single-destination instructions, and all instructions of a same type follow the exact same pattern. It's how I implemented my first translation, back when I struggled to even implement the pop command.
After a while I realized I could make things more efficient by using multiple-destinations in one line. Then, I realized that the Hack CPU could do a bunch of other things I wasn't using, which could make my output better: things like "M=0" meant I didn't need to load zero onto A; "M=M+1" meant I didn't need to load and then add the one; "A=A+1" meant I could shave off a few lines when searching for addresses with a low index. This first level of optimization inolves two key observations; the first being that the translated output can be substantially optimized simply by exploring the operations made available by that modest CPU we built in Part 1. I call this "blind optimization", since instructions are optimized individually but are blind to each other.
This is really exciting when you first realize it, but I guess everyone naturally gets to at least this point when they realize their code doesn't fit the ROM. The course doesn't provide a VM translator, but if you look at Pong.asm from project 06 (Assembler) you will notice that their translator does these optimizations.
The second observation, which is a little more exciting to me since I haven't seen it mentioned, is that we should optimize the VM contract for push and not for pop.
An average program uses between 2 to 4 times as many pushes than pops, depending on how complex are its functions. This should be obvious given that we have several binary operations. The way the VM contract defines the stack, however, is optimized for pops. By default, the book determines that the stack pointer should always point to the next available register. This results in the following definitions for push and pop:
// get the value of x
// copy x into the stack
// get the address of y
// put the value from the stack into y
(I'm showing only the necessary to illustrate my point, to avoid spoiling anything for other students.)
You can see that pushes end up being longer than pops. Even if the difference is only 1 line, if you consider a program with ~2000 pushes and ~500 pops, that's an extra 1500 lines of overhead. Not only that, but this also means that every binary operation is forced to decrement SP twice in order to access its operands. Unary operations have to decrement it once. Both have to re-increment once again at the end.
The idea, then, is to change the stack pointer so that it points instead to the last occupied register. This inverts the push and pop, so that pops are now 1 line longer instead. And BAM you just shaved 1500 lines off Pong. Furthermore, binary operations now only have to decrement SP once to access both operands, and they no longer have to increment it again at the end. Unary operations don't need to move SP at all. This single change has the largest impact in code size (apart from dead code removal I guess).
A bonus consequence of the above is that you can pre-emptively load the result of an operation onto D, so that the next operation can avoid re-accessing a value that was just stored. If your binary operations assume the first operand is already in D, then all you need to do is decrement SP once and perform the operation. Unary operations this way get reduced to a single line.
This is the most beautiful step, albeit the most difficult and time-consuming: interpreted optimization. It should be clear even if you choose to ignore this step that many instructions end up doing a bunch of redundant things. This is a natural result of having such an elegantly simple VM implementation. If you Google how some other VM languages work, you'll see they all have a lot more instructions to deal with. The idea here is to make your own pseudo-VM language, which takes in the base Hack VM as input and returns something slightly closer to Assembly, so that we can minimize those redundancies.
I first read about this here in the forums, and I debated for a while whether it would be worth the time to do this. After having done it, I can tell you that it definitely isn't. But damn if I'm not proud of the result.
The cool thing about implementing this is that you can let your creativity run wild to create your own language. It took me months of revising to finally arrive at something that is efficient and also intuitive. If you like, you can take a look at my interpreter's output for Pong. I will not explain how the language works. Try to figure it out!
What I will mention is the key observation that can lead you to your own version of pseudo-VM: what cadet1620 calls "memory-to-memory" operations. Essentially, you should notice that none of the elementary arithmetic/logical operations even need to touch the stack; they can be carried out entirely using the A, D and M registers as long as you can find a way to tell them who they should operate on.
Another feature of my interpreter which I haven't seen mentioned is identifying true void functions. This would probably be better implement in an optimizing compiler, but like I said I haven't gone that far yet. My objective was to optimize the VM translation of Pong as much as possible, so I implemented this here. Basically, you can search through every function call and if that function returns a constant 0 and next command is a pop temp 0, then this is a void function, and these two operations are completely useless.
Disclaimer: although this version of my VM translator has passed all the tests from projects 07 and 08, it hasn't successfully translated Pong yet. I have temporarily given up debugging until I actually implement the OS, since it's hard to debug when you have no idea what those functions are supposed to do. Just take this step with a grain of salt.
Optimization level 3: (Next steps)
I haven't implemented this yet since I wanted to make sure the above step worked completely before moving on. But the next step will be relative indexing. The idea is very simple: consecutive commands that access the same memory segment don't need to reference the base pointer more than once. Essentially, this would replace something like:
load local i
add local j
load local i
add up/down (i-j)
This should still be fairly readable, but shaves at least two lines from the output (potentially more if i and j are both large indices). If i-j is large, however, there should be some logic in place to judge whether the change is worth.
This is all I have for now. I hope this helps someone on their optimization journey or inspires others to embark in theirs. Optimizing something that already works is a frustrating but rewarding task. I ended up adding some functions to my own C++ library on the way, and also learned to implement some new data structures, so it was quite worth it.
I have built an FPGA based version of the NandToTetris computer with a tiny keyboard and a 480 by 320 TFT screen. It's working great, but I can't run any Jack code on it because the VM translator generates such bloated code. Would you be willing to share your translator with me, so that I could do something interesting with my machine? I would love to be able to run Pong...