Generated code size

classic Classic list List threaded Threaded
31 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

Dano
This is similar to something I am playing with right now. I am going to extend Jack language to allow initializer for static const data. Huge portion of the "Output" class is dedicated to the simple task of coping with the Harvard architecture limitations.

More on the topic of vm2asm code generation, I assume we will soon reach the limits of squeezing the code. I think that improvement could only lie in extending, or even redefining the intermediate representation -- vm. I am researching some potential techniques, like redefining the true constant, introducing short circuit evaluation, "fastcall" calling convention using registers instead of stack, code inlining and lots of other interesting stuff, which goes far beyond the book.

The floating point code that was recently posted by Mark clearly shows that there _is_ a need for a better, optimizing toolchain.
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

cadet1620
Administrator
Also of interest might be this thread describing an architectural change that adds more registers to the CPU and helps alleviate the A-register bottleneck.  Might be interesting to combine it with my ROM access mod.

--Mark
dlk
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

dlk
This post was updated on .
In reply to this post by Dano
Hi Dano,

Inspired by your example and your suggestions, I made some enhancements to my VM translator.
I'm presently getting very similar results to yours: 17483 instructions for the OS, and 21889 instructions for the Pong example in project 11.

My VM translator uses an intermediate representation slightly closer to Hack than the original VM.
It's the intermediate representation that I optimize first, and later do as best as I can translating each
intermediate VM instruction into Hack assembly. I've put a printout of the intermediate VM produced for the Pong example, which should give some impression of the optimizations that I do at the IVM level (mostly similar to yours), at http://pastebin.com/gQsXrmAp .

Enjoy!

[This isn't in the scope of any exercise of the course, so I hope it's OK that I pasted this.]
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

Dano
You beat me, sir! But only by a bit.

What I like about your solution is that you have put some thoughts into it before writing the code and I bet your code is much cleaner than mine, which is just a dirty hack over a dirty hack.

If we are turning this into competition, we should perhaps start another thread in chapter 13 in order to not confuse another readers. Moderators, any thoughts?
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

cadet1620
Administrator
I agree that using an intermediate meta-language is inspired. I've been wondering about common code optimization. One of the compilers for a very small microprocessor I've used can do incredible common code optimization. It's a bit of a balancing act, though, since it requires a call to use the common code and the call stack on this processor is a very limited resource (32 levels).

Common code detection for Hack is tough because there are no relative jumps and you don't know if the @ argument is a code target or a constant. This meta language idea is perfect for this.  Instead of VM translating into Hack, translate into an assembly meta-language that includes relative jumps and calls and do the common code optimization in that language before generating the machine code.

I'm way too busy to look at this at the moment, but I'll be thinking about it.

[I'll see if there's a way to move this into project 13; it's definitely a better fit there.]

--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

David Bar
In reply to this post by cadet1620
Hi

I'm not sure if you guys did this already, but I didn't see you mentioned it in this thread.

I've implemented a simple way of removing dead code.
I first parse all the vm files, and during parsing keep tabs of declared functions, and how many times they were called. I also mark on each vm instruction to which function it belongs.

Before continuing to emitting the asm file, I look for functions without calls.
If there's such a function, I add it to unused functions list, and then go over the instructions of that unused function, look for "call"  inside it, and lower the reference count for the functions being called from the unused function.
This is an iterative process, as now other functions may have a ref count of 0, which also require reducing the ref count of other functions, and so on. I simply continue until I see I didn't add any new functions to the unused list at this round of the loop.

When done, I just need to remove the VM instructions belonging to functions in the unused list.

I'm too ashamed to give my exact line count for pong+OS, but my line count for the asm file (including comments and labels) was reduced by about 40% after the optimization.

For my crappy code, this was the difference between fitting into the ROM or not.

What this very simple dead code detection method doesn't handle are simple recursion and cases of two (or more functions) calling each other.
For recursion, this can be easily solved by not adding a ref count to a function if it's calling itself. For the A <--> B calls, I guess a proper graph algorithm is needed, which is beyond my capabilities.


I now have have to go and do actual optimizations to my VM compiler... :)
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

cadet1620
Administrator
Instead of identifying uncalled functions, how about starting with Sys.init and identifying called functions? Anything that isn't in the called function graph is dead code.

I haven't played with dead code removal at all; sounds like a good thing to do over the holidays.

--Mark
gav
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

gav
In reply to this post by cadet1620
cadet1620 wrote
As a first idea where to look to improve your VM translation, here is the breakdown of the number of various VM commands used in the OS.
@cadet1620

I found your list of OS VM command frequencies and was inspired to check the number of instructions needed for each command. The figures are specific to my own implementation of course. I should state that this is for a straight forward implementation based on the design suggestions in Chapters 7 and 8. No optimisation tricks.

This information was relatively easy to gather, as I had pushed the ASM implementation of each VM command into an array before writing to file.

Command Type       Count   Inst  Inst%  Inst/Comm
-----------------+------+------+------+----------
total               3927  37719  100.0        9.6
C_PUSH              2177  14793   39.2        6.8
  constant >1        950   5700   15.1        6.0
  indirect           577   5193   13.8        9.0
  constant 0,1       539   3234    8.6        6.0
  PUSHnPOP           132   2397    6.4       18.2
  direct             111    666    1.8        6.0
C_CALL               232  10168   27.0       43.8
C_ARITHMETIC         451   4495   11.9       10.0
  eq,gt,lt           122   2440    6.5       20.0
  add,sub,and,or     267   1869    5.0        7.0
  neg,not             62    186    0.5        3.0
C_POP                535   4457   11.8        8.3
  indirect           198   2772    7.3       14.0
  direct             337   1685    4.5        5.0
C_RETURN              59   2773    7.4       47.0
C_IF                  93    465    1.2        5.0
C_FUNCTION            58    294    0.8        5.1
C_GOTO               114    228    0.6        2.0
Init                   1     46    0.1       46.0
C_LABEL              207      0    0.0        0.0

A couple of notes:
 - This list is sorted numerically descending by total Instructions used per Command Type (the Inst column).
 - C_LABEL is a pseudo-command, it doesn't generate any instructions once assembled (I account for this).
 - The total number of instructions is indeed > 32K in this implementation.
 - Some command types have been broken down and grouped (like eq,gt,lt which all generate almost identical code).
 - PUSHnPOP is a count of how many times a push is immediately followed by a pop (only accounts for 6.4% of all instructions).
 - Clearly C_CALL with 43.8 Instructions per Command, and 27% of all instructions is the first optimisation candidate.
 - Although C_RETURN is expensive to implement at 47 Instructions per Command, it is only 7.4% of all instructions.
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

krinnert
In reply to this post by Dano
After a break of a few months I came back to do project 8.

It soon became clear -- after reading some forum posts -- that some VM translator optimization was needed if I ever wanted to compile everything to HACK and run it in the CPU emulator (don't want to miss out on that satisfaction!).

So I tried to do the usual stuff (ideas posted in this thread already):
- eliminate redundant SP updates and fetching operations (push/pop, push/binary op, push/unary op)
- merge if-goto with eq/gt/lt when possible
- outsource call and return to common code (assembly level function)

I have to confess I didn't think about the if-goto merges before reading the forum. :)

The outcome is:  20707 HACK instructions for the plain OS. Not too shabby, I think.

Right now I don't see how I can squeeze out more without breaking the VM contract. Sure enough, being in control of the whole compilation chain allows for further optimization. As usual, abstraction is selective ignorance.

Anyway, just wanted to post another optimization result for comparison. I'm looking forward to the remaining projects. Great course! Reminds me of a lot of things I sort of knew but never thought about in such clarity.
 
Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

cadet1620
Administrator
We had a big snow storm come through last weekend so I used that as an excuse to stay in all day. I finally got around to adding dead code analysis to my VM translator. Here's the improvement I got:

ProjectBeforeAfterReduction
Pong24304197274577-18.8%
Reflect26646242652381-8.9%
Float30021244275594-18.6%

Reflect is my game. Float is my floating point class and sample program. Pong and Reflect were translated using the supplied OS. Float was translated using my OS since it's too big using the supplied OS without dead code removal.

Here's the list of functions removed from Pong:

Keyboard.readChar
Keyboard.readInt
Keyboard.readLine
Math.max
Math.min
Math.sqrt
Memory.poke
Screen.drawCircle
Screen.drawConditional
Screen.drawHorizontal
Screen.drawLine
Screen.drawPixel
Screen.drawSymetric
String.dispose
String.doubleQuote
String.eraseLastChar
String.intValue
String.setCharAt
Sys.wait

--Mark

Reply | Threaded
Open this post in threaded view
|

Re: Generated code size

krinnert
I couldn't sleep last night, so I sought some more inspiration on optimizing the VM to ASM translation and found it. (I can only speculate what's the cause and what's the effect).

So now I have about the same numbers as cadet1620, which tells me it's time to stop and move on to project 9. (In fact I already started on a little game in jack; nice little language).

All programs are linked against the supplied OS vm files. Numbers are with (without) dead code removal. Correctness was confirmed by running all resulting hack files in the CPU emulator. Here are the numbers:

Pong: 19614 (23743)
Square: 16939 (21258)
Average: 16212 (20658)
ComplexArrays: 19131 (24531)
OS: 13904* (19296)

* admittedly, dead code removal for the naked OS doesn't make much sense.



12