[Edited to correct link.]
Modern hardware compilers are good at optimizing away unused and unnecessary logic. For example Mux16(a=false, b=whatever, ...) will most likely be reduced to 16 And gates. Unused bits in an adder will also be eliminated. This means that you can use Add16(a[0..3]=..., b[0..3]=..., out[0..3]=...); when you need a 4 bit adder and assume that the logic for the remaining 12 bits will not be generated. This will let you focus on the overall structure of your multiplier rather than the busy-work of writing a bunch of various width adders.
Also note that there is duplication in each stage of the multiplier. You can combine the Ands and the Adder into a new part and use that new part in each stage.
There has been a lot of research on optimizing parallel multipliers. These will get you started it you want to look into it
https://en.wikipedia.org/wiki/Wallace_tree https://en.wikipedia.org/wiki/Dadda_multiplierNote that most hardware multipliers generate the double length result. For Hack, that would mean you would need to add a register to hold the high result and instructions to access it, or have 2 instructions: one that returned the low word and one that returned the high word.
Another application of parallel addition similar to that used in multiplication is bit counting, sometimes called POPCNT (population count). This circuit/instruction computes the number of one bits in its input.
The obvious implementation is to build a tree of adders. For 16-bit count:
8 1-bit adders generate 8 2-bit numbers that need to be summed.
4 2-bit adders generate 3 3-bit numbers that need to be summed.
2 3-bit adders generate 2 4-bit numbers that need to be summed.
1 4-bit adder generates the final 5-bit population count.
That requires 11 full-adders and 15 half-adders
My size-optimized 16-bit popcnt uses 11 full-adders and 4 half-adders.
--Mark