Small Optimizations for Screen.jack

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Small Optimizations for Screen.jack

parkoo
I know that optimization is technically out of the scope of project 12, but it's frustrating when you load up Pong with your version of the OS and it runs sooooo slooowwwlllyyyyy. Here are a couple tricks that helped me speed up my version of the screen driver:

1. I needed to calculate x % 16 several times to find the position of a pixel within a word. NaƏvely, I did it like this:
 
   x % 16 := x - (16 * (x / 16))

Because division truncates on hack, this returns the difference between x and the next-lowest multiple of 16, aka x % 16. This has the advantage of working for any modulo divisor, but it's very slow: calls to the multiply and divide functions have an enormous amount of overhead and should be avoided. Unfortunately, this computation needs to be done for almost every operation in Screen.jack. The tricky way to get around it is to use a mask, like this:
   
    x % 16 := x & 15

This works because we want a value between 0 and 15, which, conveniently, is represented by the last four bits of a binary number. This is very fast (and is a basic instruction for Hack's ALU), but it only works for powers of two. It might help intuition to compare it to the decimal equivalent. Suppose that we want x % 10 in decimal. The easiest way to do this is just to look at the last digit:

    9 % 10 = 9;  46 % 10 = 6; 387 % 10 = 7; etc.

Similarly, modulo 100 looks at the last two digits:

    165 % 100 = 65; 6726 % 100 = 26; etc.

So the rule is, in base b, x modulo b^n is the same as the least significant n digits of x. In base 2, as above, we want modulo 16, aka 2^4. Using 15d = 1111b as a mask means that we cut off all but the least significant four bits (digits) of the dividend, effectively taking the modulo.

2. The second trick to avoid multiplication and division calls is to use a look-up table for screen memory addresses instead of computing them live. This was inspired by Michael Abrash's Graphics Programming Black Book. The calculation for the word in memory in which pixel (x,y) falls, where 0 <= x <= 511 and  0 <= y <= 255, is:

    address = screen_base + (x / 16) + (y * 32)

Since this computation happens all the time in the screen module of the OS, and it requires a call to Math.multiply and Math.divide every time, it's a tremendous drag on performance. It's much better to pre-compute the values in the initialization function for the screen driver and store them in an array so that they can be looked up when they're needed.

At first, I thought of making a two dimensional array with one entry for each pixel, but this would have 256 * 512 entries,  which is 2^16, or 131072. Way too many! An alternative is to have two arrays, one for y and one for x. The y array stores the "segment" of the address, and the x array stores the "offset." Each row of pixels is 32 words wide, and each word contains 16 pixels. The y array has 256 entries, the multiples of 32 from 0 * 32 to 255 * 32:

    0, 32, 64, 96,..., 8096, 8123, 8160.

The x array has 512 entries: 16 repetitions of 0, followed by 16 repetitions of 1, and so on up to 31. Since every 16 pixels share an address, we want the integer part of x / 16 from x = 0 to x = 511. The array looks like this:
 
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31.

With two arrays we only take up 768 words of memory, fewer than the bitmaps in Output.jack. Much better than 131072! Now, to find the address of point (x,y), all we have to do is add the array entries:

     x_array[x] + y_array[y]

All good? Well actually, no: we forgot the screen base offset in memory; Hack's screen's memory map begins at RAM[16384]. You could add 16384 (aka 0x4000) to the address every time, but I found it easier just to add it to each entry of one of the arrays. (I added it to the y array, but you could add it to the x array instead. Only add it to one array, though!)
   
With these improvements, particularly the look-up table for the memory map, I saw a massive improvement in my OS's performance on the pong game: it was so fast that I had to turn down the VM emulator's speed to make it playable! I'm sure that these improvements are obvious to people who are experienced with optimization and software development, but I was amazed at how much of a difference such simple changes made. In what ways did you get your OS to run a little bit more smoothly?
Reply | Threaded
Open this post in threaded view
|

Re: Small Optimizations for Screen.jack

Lozminda
I stopped at chapter 9 (not because I wanted to, but other projects), anyway reading your post, very cool. If I ever get the chapter 12, I'll bare your optimisations in mind.

I know from other languages % and divide are notoriously slow, anything you can do not to do them is a could idea. And masking bits seems to be the gold standard in speed !

Good stuff !

Lozminda