Acedio wrote
Nice toolset!! That seems super handy for this kind of analysis. What kind of optimizations does your VM translator make? I thought it would be cool to make one that elided all the sequential push-pops to skip the stack, but I'm sure there are a lot more ways to minimize instruction count :)
I optimize
o call tree is traversed and uncalled functions are removed,
o all push/pop sequences into moves,
o all push/dyadic-math into TOS op memory,
o all not/not and neg/neg pairs are removed,
o common code from earlier comparison, call and return statements is used for later statements.
Using common code is particularly effective for call statements. The second and following calls to a particular function require only 4 instructions:
@$RIP$1234
D=A
@$call$Class.fn$3 [decorated with number of arguments]
0;JMP
($RIP$1234)
Because calls are so small, they are quite effective, if a bit slow, for initializing constant arrays.
I've made a start at optimizing Luts just to see if reducing it to fit in 32K is remotely possible; it's looking good so far. The first half of init_delta_dist ends up looking like this:
function void init_delta_dist(Array lut, Array low_bits) {
let arr = low_bits; // static class variables
let arr_n = 0;
do Luts.setArrCnt(43, 2);
do Luts.setArrCnt(11, 3);
do Luts.setArrCnt(5, 4);
do Luts.setArrCnt(2, 5);
do Luts.setArrCnt(2, 6);
do Luts.setArr(8);
do Luts.mirrorArr(64); // take advantage of symmetry
do Luts.mirrorArr(128);
The lut array looks like it has lots of steadily increasing values so a setArrInc(count, value, inc) function might be useful. I'll likely play a bit more with it this weekend.
I think that I could probably rework LUT.cc to output optimized VM code instead of going through the Jack compiler. There's also opportunity to reduce some of the redundancy (e.g. I not needing the full 256 of cos_lut) at the expense of some CPU cycles.
I'm not sure that there's much to gain by having LUT.cc write VM code. I think the best gain would be to have it write calls as above. At the VM level it's going to be hard to beat push/push/call.
Since cosine is a full cycle, you can init the first quarter and then use loops to reflect the data into the other 3/4.
I'd never heard of CORDIC before, but I'm definitely going to have to take a stab at implementing it. I'm afraid it'll make me miss my shift operation even more, haha.
I really miss right shift in the Hack computer. My float code does right shifting by left rotating and masking.
Also a PITA is bitwise XOR ...
--Mark