As the title says, I made a raytracer (more precisely, a raymarcher) in Jack.
I know that a similar program already exists (check the "Cool Stuff" on the main N2T page), but I wanted to make my own, and in a different way.
In a nutshell, I implemented my own version of the IEEE 754 floating-point representation and many of the required operators (addition, multiplication, square root, etc.), and then built a signed distance function driven raymarcher around it, with ordered dithering at the end to convert the pixel intensities to binary.
The entire render took a little less than 10 hours. :D I also have two Python implementations of the same setup: one uses built-in floats, the other uses my custom float type.
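The ordered-dithering step at the end can be sketched in a few lines of Python. This is just an illustration with a 4x4 Bayer matrix; the actual threshold map and intensity scale used in the project aren't specified here, so treat those as assumptions:

```python
# Classic 4x4 Bayer threshold matrix (values 0..15). This is an
# assumption for illustration; the project may use a different map.
BAYER_4X4 = [
    [ 0,  8,  2, 10],
    [12,  4, 14,  6],
    [ 3, 11,  1,  9],
    [15,  7, 13,  5],
]

def dither(intensity, x, y):
    """Map a pixel intensity in [0, 1] to a binary pixel (0 or 1)."""
    threshold = (BAYER_4X4[y % 4][x % 4] + 0.5) / 16.0
    return 1 if intensity > threshold else 0

# A mid-gray region becomes a regular pattern of on/off pixels:
row = [dither(0.5, x, 0) for x in range(8)]
```

Because the threshold varies per screen position, areas of constant intensity turn into stable patterns instead of solid black or white, which is what makes grayscale shading readable on the binary Hack screen.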
There you go:
It requires much less ROM space than I expected. Running it through my toolchain I came up with the following results:
1. Size optimized (significantly slower than the speed version)
a. Purely VM-based (no ASM OS replacements):
r_min.hack 7426 instructions
b. ASM replacements for mul/div/Memory
r_min_asm.hack 7090 instructions
2. Speed optimized
a. Compatible with official emulator (including ASM OS replacements)
r_asm.hack 17101 instructions
b. FPGA version (slightly faster, but doesn't work in the (unpatched) official emulator due to the A=0;JMP bug, where the emulator falsely jumps to address 0 instead of the address previously loaded into A. It does work in the Hardware Simulator, though you will need a lot of time :-) )
r_soj.hack 16985 instructions
I haven't tested yet how long it will take to render the full screen, but you can see the pixels slowly being generated.
Actually, I forgot to switch on the strength reduction that replaces multiplication/division by powers of 2 with SHL and SAR (not 100% accurate, because DIV and SAR round negative values differently, but it should be fine in this context). This version also has a more aggressive code elimination switched on, which ultimately removes the whole Output class (the latter option was already active in the size-optimized versions above).
Again 2 versions:
1. Speed optimized for official emulator
r_shift.hack 13577 instructions
2. FPGA version which would require the bugfix in the emulator to run
r_shift_soj.hack 13423 instructions
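The rounding caveat mentioned above can be demonstrated in a few lines of Python, assuming DIV truncates toward zero (as the usual OS Math.divide does) while an arithmetic shift right floors:

```python
def div_trunc(a, b):
    """Division rounding toward zero (assumed behavior of DIV)."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

def sar(a, n):
    """Arithmetic shift right; Python's >> already floors for negatives."""
    return a >> n

# For non-negative values the shift is an exact replacement...
assert div_trunc(20, 4) == sar(20, 2) == 5
# ...but for odd negative values the two round differently:
# div_trunc(-5, 2) == -2 (toward zero), sar(-5, 1) == -3 (floor)
```

In a raymarcher that mostly divides intermediate magnitudes, an off-by-one in the least significant bit of the occasional negative quotient is visually irrelevant, which is presumably why the substitution is acceptable here.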
Thank you for the analysis! I am surprised by the relatively small numbers too. :)
I didn't do any ROM or RAM usage tests. I assumed that if I ever overstepped my boundaries, the simulator would inform me (or something would go horribly wrong visually).
Speaking of RAM, I was very casual about creating static variables. A lot of temporary variables that required memory allocation were allocated at init time, and thus defined as static variables at class level. This is quite wasteful, because many of them are really only needed for one or a few instructions and could be reused elsewhere in other parts. On the other hand, I didn't want to deal with possible memory collisions.
Another wasteful part is how I represent float numbers. I allocate an entire 16-bit integer for a single sign bit, and another integer for the exponent, which never grows beyond the hundreds in magnitude. The two could be stored in a single int, but then you'd need extra instructions to separate them in every float-related calculation.
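For illustration, packing the sign into the top bit of the exponent word could look like this in Python (the field layout and bias are hypothetical, not the representation actually used):

```python
SIGN_BIT = 15     # hypothetical layout: sign in bit 15 of the word
EXP_BIAS = 16384  # bias so a signed exponent fits in the low 15 bits

def pack(sign, exponent):
    """Pack a sign (0/1) and a signed exponent into one 16-bit word."""
    return (sign << SIGN_BIT) | ((exponent + EXP_BIAS) & 0x7FFF)

def unpack(word):
    """The extra instructions needed in every float operation."""
    sign = (word >> SIGN_BIT) & 1
    exponent = (word & 0x7FFF) - EXP_BIAS
    return sign, exponent
```

The unpack step is exactly the per-operation cost mentioned above: two shifts/masks and a subtraction before every add or multiply, which is why keeping the fields in separate words can be the faster choice on Hack.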
I did a quick investigation with cProfile on the Python code with my custom float design. It seems that float addition was the most frequent function call (79,585,262 times!), followed by multiplication (39,208,095 times), subtraction (27,256,770 times) and division (16,984,750 times). This does not count Python integer multiplications though.
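For reference, per-function call counts like these come straight out of cProfile; a minimal invocation looks like this (fadd and render are stand-ins for the actual float routines and render loop):

```python
import cProfile
import pstats

def fadd(a, b):   # stand-in for the custom float addition
    return a + b

def render():     # stand-in for the actual render loop
    s = 0.0
    for _ in range(1000):
        s = fadd(s, 1.0)
    return s

profiler = cProfile.Profile()
profiler.enable()
render()
profiler.disable()

# Sort by call count to see which float operator dominates.
stats = pstats.Stats(profiler)
stats.sort_stats("ncalls").print_stats(5)
```

Sorting by "ncalls" rather than cumulative time is the right view here, since on Hack every call has roughly the same fixed overhead and the call counts translate directly into instruction budget.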
memory_profiler showed me this plot:
Yeah, constant. :D The minor ups and downs are, I assume, because of Python's own variables, not my manual allocs.
I think the RAM usage on the heap is negligible (there are not too many objects). It's about 2k words plus whatever the OS implementation requires. For statics there are about 100 slots left (again depending on the OS implementation).
I quickly tested the shift version with my own emulator. Looks equivalent, though I didn't compare it pixel by pixel ;-)
On a side note: the assumption that the VM emulator will inform you about boundaries is not entirely true.
Regarding ROM usage, the VM emulator basically has no real boundaries. Since the VM files don't rely on any absolute addressing and all jumps are referenced by name, it will execute programs that are too big to be translated into Hack machine code, even if the addresses in the program code would exceed 32k.
Secondly, the RAM usage may be understated if you execute the program without the OS class files in the directory. In that case the emulator uses its (much faster) internal OS routines, which don't occupy RAM for their internal data (e.g. the Output class doesn't store the charset in heap RAM) and also use a slightly different memory structure in some cases:
Strings are stored as one Array containing the capacity, the length and the actual characters. The OS implementation (and all implementations that stick 100% to the API conventions) will instead create two heap objects: 1. the String object itself, containing the capacity, the length and a pointer to an array of the characters, and 2. that array. This is because the constructor will call Memory.alloc with the object size calculated at compile time (typically 3), while inside the constructor, the Jack code will dynamically call Array.new (or Memory.alloc) to create an array of the required size at runtime.
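The difference in heap cost can be sketched as a simple word count (the sizes follow the description above; any per-block allocator header overhead is left out, so treat it as a lower bound on the saving):

```python
def words_two_objects(max_len):
    """Standard OS layout: a 3-word String object (capacity, length,
    pointer) plus a separate character array of max_len words."""
    string_object = 3
    char_array = max_len
    return string_object + char_array

def words_one_array(max_len):
    """Emulator-style layout: one array holding capacity, length and
    the characters back to back."""
    return 2 + max_len

# The single-array layout saves the pointer word per string, plus
# whatever per-block overhead the allocator charges for the second object.
saving = words_two_objects(10) - words_one_array(10)
```

So the direct saving is small per string, but with an allocator that prepends a size word (or more) to every block, halving the object count per string adds up when many strings are alive at once.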
Not recommended, but possible for your own OS implementation: this behavior could be emulated by using a static function as a fake constructor (function String.new instead of constructor). That would prevent the compiler from generating the object allocation on the heap, and the function would create and return one Array containing all the data as described above. It saves a few words of heap RAM when using lots of strings...
Yeah, I suspected that in terms of ROM the emulator has no boundaries. Otherwise the OS itself would occupy most of it (and it would make no sense to use an internal OS that requires no ROM on the Hack, yet constrain your program to at most 32k instructions).
Hm, I never knew how the emulator works, thanks for the info!
That green image looks awesome. :) What was the runtime approximately?
With my own emulator @200MHz it took around 16 minutes with the FPGA shift version.
The official CPU emulator will take quite long; I can only estimate based on a comparison with the VM emulator. The shift version is about 6x slower, the FPGA version on the patched emulator around a factor of 5.5. That would bring your 10 hours on the VM emulator up to about 55 hours...
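The extrapolation is simple scaling, using the numbers from the posts above:

```python
vm_hours = 10        # measured render time on the VM emulator
shift_factor = 6.0   # shift version: ~6x slower on the CPU emulator
fpga_factor = 5.5    # FPGA version on the patched emulator

cpu_hours_shift = vm_hours * shift_factor  # ~60 hours estimated
cpu_hours_fpga = vm_hours * fpga_factor    # ~55 hours estimated
```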
I once estimated that the CPU emulator on a modern machine using JDK 8 runs at about 1 MHz, which is in line with the above numbers. The actual speed of the official emulators depends on the machine and the JDK; there is no synchronization with the system clock.
This puts an interesting perspective on things when you consider that the Hack running at 1 MHz is actually roughly comparable to the computing power of machines from the early- to mid-1970s such as the original PDP-11/20.
These kinds of times were pretty typical for this kind of work back then. It makes what Industrial Light and Magic did with Star Wars at about that same time all the more astonishing.
It shouldn't be critical to make the emulator available. However, it is a bit of a work in progress that I stopped working on about 2 years ago when I moved to an FPGA implementation of Hack. You may have seen in the screenshot that the emulator runs in "interpreter mode". In principle it also supports dynamic compilation of ROMs and would then run much faster (I achieved up to 2 GHz on my machine, which was running at 3.6-4 GHz). But the compilation relies on a static analysis of jump targets (e.g. it assumes that indirect jumps like ...A=M, 0;JMP ... will target an instruction following a conditional or unconditional jump). While at that time this worked fine for the ROMs created by my toolchain (which only used indirect jumps for call/return), it fails with most of my current ROMs (i.e. calculated indirect jump targets for very fast shift routines). As a consequence I had to deactivate this feature for the time being. Probably I should fix this and then release it... This could be done either by catching jumps to unknown locations and triggering a recompilation while temporarily switching back to interpreted mode, or by assuming that jumps can go to any target, which would make the emulation slower. Let's see if I find time for this. The current jar file with pure interpretation should be no problem though.
As much as I would like to release the optimizing toolchain as well, I believe this would go against the idea of the nand2tetris project. Developing a toolchain/VM translator and thinking about some basic optimizations is an essential part of the learning process. However, I'm always happy to provide translated binaries generated by my toolchain for interesting projects. These can serve both as a benchmark and as some extra fun for those who have implemented Hack on an FPGA board.
Inspired by your program, I had another look at my emulator and added a quick and dirty fix to the dynamic compilation (it now assumes that all addresses can be jump targets, which is not the ideal solution, but the easiest way). As a result, the compilation mode works again with all ROMs at the cost of slower execution, but it is still much faster than interpretation. Your raytracer now ran at about 800 MHz-1 GHz and completed in only 3:45 minutes. I'll try to document it a bit and release it on GitHub next week or so.