Raycaster on the Hack platform

classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

Raycaster on the Hack platform

Acedio
Hi everyone :) I recently completed work on my chapter 9 project, which was a pseudo-3D game that utilizes raycasting. You can see a video of it in action at https://www.youtube.com/watch?v=c-J7lwKWDN8 and check out the code at https://github.com/Acedio/nand2tetris/tree/master/09/Bichromia.

It was fun spending time trying to optimize the critical loops and work around some of the idiosyncrasies of the Hack platform. I would have killed for a fast shift operation! I was mostly successful at avoiding multiplication and division, but it was necessary at a few points.

I think, though this runs fine on the VM emulator, that the Hack platform likely would not have enough program memory to run the assembly output by the VM translator. It would have to be tuned a bit. I think the most bytes could be saved by optimizing the lookup table initialization routines, which have a lot of redundancy in them.

Looking forward to seeing what other folks come up with on this platform :)
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

ybakos




                                            WOW


Thank you for sharing this.
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

ivant
In reply to this post by Acedio
You can't fool us! You're actually John Carmack, right?

Great stuff!
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

Acedio
Thanks guys :D I'm definitely no John Carmack; he would have gone the whole nine yards and found a way to implement Quake!
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

cadet1620
Administrator
In reply to this post by Acedio
This is quite impressive!   It even runs fast enough to be playable on my ancient, slow, home PC.

Acedio wrote
I think, though this runs fine on the VM emulator, that the Hack platform likely would not have enough program memory to run the assembly output by the VM translator. It would have to be tuned a bit. I think the most bytes could be saved by optimizing the lookup table initialization routines, which have a lot of redundancy in them.
You're quite right about the generated code size.  My compiler and optimized VM translator end up generating 61K assembly instructions.  As you thought, most of the code is in constant initialization; 32K for your constants and another 8K for the OS's font table.
Looking forward to seeing what other folks come up with on this platform :)
I wrote Floating Point and Trig libraries.  They're too slow to be of any use, but they show that one doesn't need floating point hardware.  Here's the Trig post; it links to the Floating Point post.

--Mark

My tool chain chewing on your code...
[D:/TECS/projects/student/09/Acedio/Bichromia]
% hjc .
Processing directory .
Processing .\Display.jack
.\Display.jack(282):
     Unreferenced arg Display.draw_col.frame
Processing .\Fixed.jack
Processing .\Luts.jack
Processing .\Main.jack
Processing .\Map.jack
.\Map.jack(403):
     Unreferenced var Map.trace.bitmask
     Unreferenced var Map.trace.dist_shift
     Unreferenced var Map.trace.addend	 

[D:/TECS/projects/student/09/Acedio/Bichromia]
% cp OS/*.vm .

[D:/TECS/projects/student/09/Acedio/Bichromia]
% hvm -v- ../Bichromia
Output to file ../Bichromia\Bichromia.asm
Processing directory ../Bichromia
Analyzing ../Bichromia\Array.vm
Analyzing ../Bichromia\Display.vm
Analyzing ../Bichromia\Fixed.vm
Analyzing ../Bichromia\Keyboard.vm
Analyzing ../Bichromia\Luts.vm
Analyzing ../Bichromia\Main.vm
Analyzing ../Bichromia\Map.vm
Analyzing ../Bichromia\Math.vm
Analyzing ../Bichromia\Memory.vm
Analyzing ../Bichromia\Output.vm
Analyzing ../Bichromia\Screen.vm
Analyzing ../Bichromia\String.vm
Analyzing ../Bichromia\Sys.vm
Processing ../Bichromia\Array.vm
Processing ../Bichromia\Display.vm
Processing ../Bichromia\Fixed.vm
Processing ../Bichromia\Keyboard.vm
Processing ../Bichromia\Luts.vm
Processing ../Bichromia\Main.vm
Processing ../Bichromia\Map.vm
Processing ../Bichromia\Math.vm
Processing ../Bichromia\Memory.vm
Processing ../Bichromia\Output.vm
Processing ../Bichromia\Screen.vm
Processing ../Bichromia\String.vm
Processing ../Bichromia\Sys.vm
93 Functions, 28 unused.
Unused functions:
  Display.dispose
  Fixed.copy
  Keyboard.readChar
  Keyboard.readInt
  Keyboard.readLine
  Map.at
  Map.dispose
  Map.print
  Math.max
  Math.min
  Math.sqrt
  Memory.poke
  Screen.clearScreen
  Screen.drawCircle
  Screen.drawConditional
  Screen.drawHorizontal
  Screen.drawLine
  Screen.drawPixel
  Screen.drawRectangle
  Screen.drawSymetric
  Screen.setColor
  Screen.updateLocation
  String.dispose
  String.doubleQuote
  String.eraseLastChar
  String.intValue
  String.setCharAt
  Sys.wait
  
[D:/TECS/projects/student/09/Acedio/Bichromia]
% hasm -l- Bichromia.asm | sed -n -e"/^VM Fun/,$ p"
VM Function code size

   32 (0x0020) Array.dispose             13310 (0x33FE) Luts.init_delta_dist
  120 (0x0078) Array.new                  7881 (0x1EC9) Output.initMap
  411 (0x019B) Display.clear              7866 (0x1EBA) Luts.init_cos
  292 (0x0124) Display.draw               6642 (0x19F2) Luts.init_height_from_dist
 3614 (0x0E1E) Display.draw_col           4651 (0x122B) Luts.load_level
  111 (0x006F) Display.fill_down          3614 (0x0E1E) Display.draw_col
  159 (0x009F) Display.fill_down_alt      2161 (0x0871) Main.main
  111 (0x006F) Display.fill_up            2130 (0x0852) Map.trace
  159 (0x009F) Display.fill_up_alt        1646 (0x066E) Luts.init_x_angle
  401 (0x0191) Display.new                 865 (0x0361) Map.new
  165 (0x00A5) Fixed.Add                   790 (0x0316) Map.update_pos
   32 (0x0020) Fixed.dispose               660 (0x0294) Math.divide
   41 (0x0029) Fixed.new                   593 (0x0251) String.setInt
    6 (0x0006) Keyboard.init               521 (0x0209) Map.trace_screen
   18 (0x0012) Keyboard.keyPressed         504 (0x01F8) Memory.alloc
  427 (0x01AB) Luts.init_bits              427 (0x01AB) Luts.init_bits
 7866 (0x1EBA) Luts.init_cos               411 (0x019B) Display.clear
13310 (0x33FE) Luts.init_delta_dist        401 (0x0191) Display.new
 6642 (0x19F2) Luts.init_height_from_dist   398 (0x018E) Math.multiply
 1646 (0x066E) Luts.init_x_angle           372 (0x0174) Output.create
 4651 (0x122B) Luts.load_level             348 (0x015C) Memory.deAlloc
 2161 (0x0871) Main.main                   341 (0x0155) Map.tb_mul
   14 (0x000E) Map.get_dest_level          324 (0x0144) Output.createShiftedMap
   33 (0x0021) Map.get_dest_x              292 (0x0124) Display.draw
   33 (0x0021) Map.get_dest_y              264 (0x0108) Map.load_rows
   16 (0x0010) Map.get_secrets             264 (0x0108) Output.drawChar
  147 (0x0093) Map.grab_bits               250 (0x00FA) Output.moveCursor
  264 (0x0108) Map.load_rows               216 (0x00D8) Output.printChar
  865 (0x0361) Map.new                     197 (0x00C5) Math.init
   71 (0x0047) Map.set_tile                183 (0x00B7) Screen.init
  341 (0x0155) Map.tb_mul                  165 (0x00A5) Fixed.Add
 2130 (0x0852) Map.trace                   159 (0x009F) Display.fill_down_alt
  521 (0x0209) Map.trace_screen            159 (0x009F) Display.fill_up_alt
  790 (0x0316) Map.update_pos              147 (0x0093) Map.grab_bits
   47 (0x002F) Math.abs                    146 (0x0092) Output.backSpace
  660 (0x0294) Math.divide                 128 (0x0080) Sys.error
  197 (0x00C5) Math.init                   127 (0x007F) String.new
  398 (0x018E) Math.multiply               125 (0x007D) String.charAt
  504 (0x01F8) Memory.alloc                123 (0x007B) Output.getMap
  348 (0x015C) Memory.deAlloc              122 (0x007A) Output.printString
   59 (0x003B) Memory.init                 120 (0x0078) Array.new
   26 (0x001A) Memory.peek                 111 (0x006F) Display.fill_down
  146 (0x0092) Output.backSpace            111 (0x006F) Display.fill_up
  372 (0x0174) Output.create               111 (0x006F) Sys.init
  324 (0x0144) Output.createShiftedMap     110 (0x006E) String.appendChar
  264 (0x0108) Output.drawChar              71 (0x0047) Map.set_tile
  123 (0x007B) Output.getMap                71 (0x0047) Output.init
   71 (0x0047) Output.init                  60 (0x003C) Output.println
 7881 (0x1EC9) Output.initMap               59 (0x003B) Memory.init
  250 (0x00FA) Output.moveCursor            57 (0x0039) Output.printInt
  216 (0x00D8) Output.printChar             47 (0x002F) Math.abs
   57 (0x0039) Output.printInt              41 (0x0029) Fixed.new
  122 (0x007A) Output.printString           33 (0x0021) Map.get_dest_x
   60 (0x003C) Output.println               33 (0x0021) Map.get_dest_y
  183 (0x00B7) Screen.init                  32 (0x0020) Array.dispose
  110 (0x006E) String.appendChar            32 (0x0020) Fixed.dispose
    8 (0x0008) String.backSpace             26 (0x001A) Memory.peek
  125 (0x007D) String.charAt                18 (0x0012) Keyboard.keyPressed
   16 (0x0010) String.length                17 (0x0011) Sys.halt
  127 (0x007F) String.new                   16 (0x0010) Map.get_secrets
    8 (0x0008) String.newLine               16 (0x0010) String.length
  593 (0x0251) String.setInt                14 (0x000E) Map.get_dest_level
  128 (0x0080) Sys.error                     8 (0x0008) String.backSpace
   17 (0x0011) Sys.halt                      8 (0x0008) String.newLine
  111 (0x006F) Sys.init                      6 (0x0006) Keyboard.init

Code size = 61214 (0xEF1E)
Data size =    29 (0x001D)

[D:/TECS/projects/student/09/Acedio/Bichromia]
%

Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

Acedio
Nice toolset!! That seems super handy for this kind of analysis. What kind of optimizations does your VM translator make? I thought it would be cool to make one that elided all the sequential push-pops to skip the stack, but I'm sure there are a lot more ways to minimize instruction count :)

I think that I could probably rework LUT.cc to output optimized VM code instead of going through the Jack compiler. There's also opportunity to reduce some of the redundancy (e.g. I not needing the full 256 of cos_lut) at the expense of some CPU cycles.

I'd never heard of CORDIC before, but I'm definitely going to have to take a stab at implementing it. I'm afraid it'll make me miss my shift operation even more, haha.
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

ybakos
Or, hell, write in just assembly like the good ol "demo" days.
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

cadet1620
Administrator
ybakos wrote
Or, hell, write in just assembly like the good ol "demo" days.
I do have a Hack OS written entirely in Hack assembly language.  It needs my Macro assembler that also supports external linkage. Way too complex to talk about on the forum. Email me if you are interested in more details.

Sample source:
// Copyright (C) 2011 Mark Armbrust.  Permission granted for educational use.
//
// File MathAsm.asm -- Assembly language implementation of the
// Hack VM Math class

INCLUDE VmMacros.inc

// RAM variables used by Math class
DATA    Math$dwReg.h, Math$dwReg.l, Math$loop, Math$mask
DATA    Math$prod, Math$x, Math$y, Math.divide$rem

//////////////////////////////////////////////////////////////////
//                                                              //
//      function int max(int x, int y)                          //
//                                                              //
// Return maximum of x or y.                                    //
//                                                              //
//////////////////////////////////////////////////////////////////

(Math.max)
    D=arg1
    A=arg2

    D=D-A               // compare
    @Math.max.retArg1
    D;JGE

// Following code shared by Math.min

(Math.max.retArg2)
    D=arg2
    ReturnD

(Math.max.retArg1)
    D=arg1
    ReturnD

The magic is all in the macros. For instance the D=arg2 macro is:
MACRO   D=arg2    
    A=&arg2
    D=M
ENDM
This macro uses the A=&arg2 macro:
MACRO   A=&arg1    
    @ARG
    A=M
ENDM
The generated code for Max ends up looking like this:
   10        (Math.max)
                 D=arg1
             +1     A=&arg1
   10     2  +2     @ARG
   11  FC20  +2     A=M
   12  FC10  +1     D=M
                 A=arg2
             +1     A=&arg2
   13     2  +2     @ARG
   14  FDE0  +2     A=M+1
   15  FC20  +1     A=M
             
   16  E4D0      D=D-A               // compare
   17    24      @Math.max.retArg1
   18  E303      D;JGE
             
             // Following code shared by Math.min
             
   19        (Math.max.retArg2)
                 D=arg2
             +1     A=&arg2
   19     2  +2     @ARG
   20  FDE0  +2     A=M+1
   21  FC10  +1     D=M
                 ReturnD
   22  ----  +1     @Vm..fnReturnD
   23  EA87  +1     0;JMP
             
   24        (Math.max.retArg1)
                 D=arg1
             +1     A=&arg1
   24     2  +2     @ARG
   25  FC20  +2     A=M
   26  FC10  +1     D=M
                 ReturnD
   27  ----  +1     @Vm..fnReturnD
   28  EA87  +1     0;JMP
Vm..fnReturnD is an EXTERN linkage to a routine that executes a VM return command.

--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

cadet1620
Administrator
In reply to this post by Acedio
Acedio wrote
Nice toolset!! That seems super handy for this kind of analysis. What kind of optimizations does your VM translator make? I thought it would be cool to make one that elided all the sequential push-pops to skip the stack, but I'm sure there are a lot more ways to minimize instruction count :)
I optimize
  o  call tree is traversed and uncalled functions are removed,
  o  all push/pop sequences into moves,
  o  all push/dyadic-math into TOS op memory,
  o  all not/not and neg/neg pairs are removed,
  o  common code from earlier comparison, call and return statements is used for later statements.

Using common code is particularly effective for call statements. The second and following calls to a particular function require only 4 instructions:
    @$RIP$1234
    D=A
    @$call$Class.fn$3      [decorated with number of arguments]
    0;JMP
    ($RIP$1234)

Because calls are so small, they are quite effective, if a bit slow, for initializing constant arrays.

I've made a start at optimizing Luts just to see if reducing it to fit in 32K is remotely possible; it's looking good so far.  The first half of init_delta_dist ends up looking like this:
  function void init_delta_dist(Array lut, Array low_bits) {
	let arr = low_bits;	// static class variables
	let arr_n = 0;
	do Luts.setArrCnt(43, 2);
	do Luts.setArrCnt(11, 3);
	do Luts.setArrCnt(5, 4);
	do Luts.setArrCnt(2, 5);
	do Luts.setArrCnt(2, 6);
	do Luts.setArr(8);
	do Luts.mirrorArr(64);	// take advantage of symmetry
	do Luts.mirrorArr(128); 
The lut array looks like it has lots of steadily increasing values so a setArrInc(count, value, inc) function might be useful. I'll likely play a bit more with it this weekend.
I think that I could probably rework LUT.cc to output optimized VM code instead of going through the Jack compiler. There's also opportunity to reduce some of the redundancy (e.g. I not needing the full 256 of cos_lut) at the expense of some CPU cycles.
I'm not sure that there's much to gain by having LUT.cc write VM code. I think the best gain would be to have it write calls as above. At the VM level it's going to be hard to beat push/push/call.

Since cosine is a full cycle, you can init the first quarter and then use loops to reflect the data into the other 3/4.
I'd never heard of CORDIC before, but I'm definitely going to have to take a stab at implementing it. I'm afraid it'll make me miss my shift operation even more, haha.
I really miss right shift in the Hack computer. My float code does right shifting by left rotating and masking.
Also a PITA is bitwise XOR ...

--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

cadet1620
Administrator
Running on both CPUEmulator and VMEmulator



It runs a bit slow to be playable on the CPU emulator, but I was able to navigate into the maze without too much trouble. It is running slow enough that it wasn't too hard to get the shot with the door half closed.

Bichromia.hack is 31489 instructions.

Here's my Luts.jack.  The Array setting code was generated by a Python program that parsed your Luts.jack and output the calls.  (It seemed easier to write a new Python program that to port your .cc code to the ancient MS C++ compiler I have on my home PC.)

--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

Acedio
This post was updated on .
cadet1620 wrote
Bichromia.hack is 31489 instructions.
Nice!! Was that all from just condensing Luts.jack and removing dead code? Proceduralizing (if that's a word, ha) the LUT initialization was a good idea.

I tried running it on the CPU Emulator and, while slow, it was indeed playable. Increasing the player move/turn speed would definitely put it more into the realm of enjoyability ;)

cadet1620 wrote
Here's my Luts.jack.  The Array setting code was generated by a Python program that parsed your Luts.jack and output the calls.  (It seemed easier to write a new Python program that to port your .cc code to the ancient MS C++ compiler I have on my home PC.)

--Mark
Yep, C++11 features are used for a couple things (class enums and the for-each type loop) in LUT.cc. It used to be C before I started missing the standard library :)

I'm kind of toying with playing around with the NES a bit after my experience with Hack. Know of folks that have written Hack VMs for other platforms?

cadet1620 wrote
My float code does right shifting by left rotating and masking.
How does this work? I ended up right shifting by looping through and testing bits, haha.

Thanks for all your feedback on this, Mark :)

--Josh
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

cadet1620
Administrator
Acedio wrote
I'm kind of toying with playing around with the NES a bit after my experience with Hack. Know of folks that have written Hack VMs for other platforms?
Here's the only one I recall seeing. VM translator generate x86 windows assembly

cadet1620 wrote
My float code does right shifting by left rotating and masking.
How does this work? I ended up right shifting by looping through and testing bits, haha.
    function int ROL1(int x) {
      // rotate left 1 bit
      var int c;
      if (x < 0) {
        let c = 1;
      }
      return x+x+c;
    }

    function int SHR(int x, int n) {
      // unsigned shift right n bits
      // [needs bounds checking on 'n'.]
      let n = 16-n;
      while (n > 0) {
        let x = ROL1(x);
        lte n = n-1;
        }
      return x & ~leftMask(n);    // (leftMask returns word with n leftmost bits set)
    }
Signed right shift needs to save the sign bit, then SHR, then if the sign bit was set was set or the result with leftMask(n) to properly extend the sign bit.

Your bit-testing loop is probably a bit faster. In my Float code I'm doing this with 32-bit double words so rotate can optimize ROL(16) into a word swap.

--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

jamesleibert
In reply to this post by Acedio
I also did a raycaster for my Project 9 on Coursera. The video is at: https://youtu.be/inFJ5EyOhpM
Definitely agree with the comment about lack of a right shift operator! It would make a huge difference to add it to the platform.
I'm keen to know how you managed to eliminate the integer overflows in calculating the wall distances - it took more time than all the rest of the programming combined!
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

ybakos
WWWWWWOOOOOOWWW!
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

cadet1620
Administrator
In reply to this post by jamesleibert
I need to do an optimization pass on my Float code. Left rotation can be improved a bit by removing the if and using the fact that (x < 0) is either 0 or -1.
function int ROL1(int x) {
    // rotate left 1 bit
    return (x + x) - (x < 0);
}
And since ROL1 is now just a single line, it should be used inline in SHR instead of calling ROL1() which is a slow operation in Hack.


Reliably detecting integer overflows in Jack is a PITA. One thing that can help is this trick to do unsigned compares in Jack.
if ((x+INT_MIN) > (y+INT_MIN)) {
    // unsigned x > unsigned y
INT_MIN is a class static variable initialized to -32768.

This works by transforming the range of the unsigned values from [0, 65535] to [-32768, 32767] so Jack's signed compare will return the correct result.

--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

jamesleibert
Perfect! It's hard to imagine beating that.

I was looking at something like this for SHR(x, n):

let i = 1;
let j = twoToThe[n];
let k = 16-n;
let sum = 0;
while (k > 0) {
    if (~((x & j) = 0)) {
        let sum = sum + i;
    }
    let i = i + i;
    let j = j + j;
    let k = k - 1;
}
return k;

Your version is more elegant and faster - I will use that!

Unsigned compare is neat too. It will save a little bit of code.

The main problem I've been having with integer overflow is predicting whether a multiplication will result in an overflow before you do it.

Hacker's Delight suggests either doing the division, or alternatively using Number of Leading Zeros (NLZ) to identify when overflow is a risk. I originally tried the NLZ approach, but there were too many edge cases, and the NLZ overhead was too big.

The solution I ended up with has a table lookup for int(32767 / n) for the values of n that could conceivably lead to overflow (fortunately there are only 82 in total (when int(64*tan(2*pi*i/4096)) > 512 for 0 <= i <= 1024). It's all rather ugly and adds yet another table to an already bloated initialisation.
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

cadet1620
Administrator
[FYI, I was one of your reviewers on Coursera.]

Are you looking for signed or unsigned multiplication overflow detection?

The cleanest way to detect the overflow would probably be to write your own multiply() that sets a class static  boolean mult_overflow. In the VM emulator, this will likely be slower than the table lookup and Jack "*" which calls the built-in Math.multiply() which does the multiply in Java code.

In project 12 -- writing the OS -- you will be writing Math.multiply() that is used for Jack "*" operator, so you can get a jump on that.

--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

jamesleibert
In reply to this post by jamesleibert
I've put the source code on git hub.
https://github.com/QuesterZen/hackenstein3D

Comments, suggestions and improvements are welcome.

I plan to add some further optimisations suggested by Mark Armbrust, which should improve the frame rate and make it more playable.

I'll also try to track down the little rendering glitch that sometimes appears, I'm struggling to identify what's causing it.
Reply | Threaded
Open this post in threaded view
|

Re: Raycaster on the Hack platform

jamesleibert
In reply to this post by cadet1620
Thanks for reviewing! I completed the course last week (the project grade was the final requirement). I've already improved the program a lot since the version you saw - it's more than twice as fast and a lot less glitchy.

The OS should definitely have an overflow flag for multiply and divide!

I think all I'm left with now is 4 multiplications for each of the 256 rays (one for each of the vertical and horizontal intercepts and two for the x-axis component of the intercept rotation) plus approximately 8 right-shift-6's (which I will change to your method and hopefully get a bit more speed as a result).