Acedio wrote
Nice toolset!! That seems super handy for this kind of analysis. What kind of optimizations does your VM translator make? I thought it would be cool to make one that elided all the sequential push-pops to skip the stack, but I'm sure there are a lot more ways to minimize instruction count :)
 
I optimize
  o  call tree is traversed and uncalled functions are removed,
  o  all push/pop sequences into moves,
  o  all push/dyadic-math into TOS op memory,
  o  all not/not and neg/neg pairs are removed,
  o  common code from earlier comparison, call and return statements is used for later statements.
Using common code is particularly effective for call statements. The second and following calls to a particular function require only 4 instructions:
    @$RIP$1234
    D=A
    @$call$Class.fn$3      [decorated with number of arguments]
    0;JMP
    ($RIP$1234)
Because calls are so small, they are quite effective, if a bit slow, for initializing constant arrays.
I've made a start at optimizing Luts just to see if reducing it to fit in 32K is remotely possible; it's looking good so far.  The first half of init_delta_dist ends up looking like this:
  function void init_delta_dist(Array lut, Array low_bits) {
	let arr = low_bits;	// static class variables
	let arr_n = 0;
	do Luts.setArrCnt(43, 2);
	do Luts.setArrCnt(11, 3);
	do Luts.setArrCnt(5, 4);
	do Luts.setArrCnt(2, 5);
	do Luts.setArrCnt(2, 6);
	do Luts.setArr(8);
	do Luts.mirrorArr(64);	// take advantage of symmetry
	do Luts.mirrorArr(128); 
The lut array looks like it has lots of steadily increasing values so a setArrInc(count, value, inc) function might be useful. I'll likely play a bit more with it this weekend.
I think that I could probably rework LUT.cc to output optimized VM code instead of going through the Jack compiler. There's also opportunity to reduce some of the redundancy (e.g. I not needing the full 256 of cos_lut) at the expense of some CPU cycles.
 
I'm not sure that there's much to gain by having LUT.cc write VM code. I think the best gain would be to have it write calls as above. At the VM level it's going to be hard to beat push/push/call.
Since cosine is a full cycle, you can init the first quarter and then use loops to reflect the data into the other 3/4.
I'd never heard of CORDIC before, but I'm definitely going to have to take a stab at implementing it. I'm afraid it'll make me miss my shift operation even more, haha.
 
I really miss right shift in the Hack computer. My float code does right shifting by left rotating and masking.
Also a PITA is bitwise XOR ...
--Mark