It is important to note the difference in scale. A top-end processor these days can have more than 10 billion (10^10) transistors. A naive implementation of the Hack CPU can be built with fewer than 5000 (1186 Nands; at roughly 4 transistors per CMOS Nand, that's about 4750). For comparison, the Intel 8080, a much more complex CPU albeit only 8 bits wide, had 3500 transistors.
For nand2tetris, the hardware is on a very small scale. There are few opportunities for optimization at this level.
invin wrote
1) How can you generalize this result without looking at the instructions to be executed?
The hardware is always there; as long as it has power, it is always computing something, possibly garbage values, as the ALU in the Hack CPU does during A-instructions [1].
Turning power on and off is a very slow operation, and it is especially tricky for sequential circuits because they generally need a power-on reset so that they come up in a known state. Power control like this is generally done only for portions of a processor that are not tightly interconnected; for instance, integrated peripherals can have simple enough interfaces to the processor's I/O system and memory to be controlled this way.
In practice, then, the trade-off is choosing between duplicating sub-circuits and sharing them.
As Ivan pointed out, wiring is a problem. In software, a function can be called from 100 places in the code with no additional cost for the function's code itself; there is just the incremental cost of each call.
In hardware, the inputs from each "caller" need wires routed to the shared circuit, through some additional circuit that controls which of those wires is logically connected (a mux, 3-state drivers, etc.). The shared circuit's outputs also need to be larger so they can supply more power to drive the additional load. More output wiring also means more capacitance, which results in slower operation.
These trade-offs usually make it impractical to share circuitry as small as adders and Ands.
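To put a rough number on that, here is a sketch in course HDL (the chip name SharedAdd16 and its pin names are made up for illustration; they are not part of the standard Hack chip set). Two "callers" share a single Add16, and the price of admission is a pair of Mux16s plus a select signal; depending on how you build them, those two Mux16s are already a sizable fraction of the Nands in a second Add16, before counting the extra routing and loading.

    // Chip and pin names invented for illustration; not part of the
    // standard Hack chip set.
    CHIP SharedAdd16 {
        IN  aX[16], aY[16],   // operands from "caller" A
            bX[16], bY[16],   // operands from "caller" B
            sel;              // 0 = serve caller A, 1 = serve caller B
        OUT sum[16];

        PARTS:
        // The two Mux16s are the overhead paid for sharing one Add16.
        Mux16(a=aX, b=bX, sel=sel, out=x);
        Mux16(a=aY, b=bY, sel=sel, out=y);
        Add16(a=x, b=y, out=sum);
    }

Note also that the two callers can no longer add at the same time; sharing serializes them, which is yet another cost that duplication avoids.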
2) Processors these days provide ILP. And Superscalars also do "out of order execution".
ILP and reordering aren't new concepts. Back in the '80s there were mainframes that had multiple functional units and could execute pipelined instructions without stalling if there were no register/computation conflicts. The compilers did the instruction reordering and could flip branch conditions to favor execution speed. (Earliest form of speculative execution?)
Doing the reordering on the fly in hardware is impressive. Speculative execution is even more so; I understand the high-level concepts but haven't read at all about the implementations.
Both the hardware/software and hardware-only approaches are continuing attempts to have the hardware spend as little time as possible computing garbage.
_______________
[1] You can re-purpose the ALU during A-instructions to eliminate the A register input mux. Force the ALU to compute "y" during A-instructions and send the A-instruction value to the ALU y input through a 3-way mux selecting between A register, "inM" and "instruction".
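For the curious, here is roughly what that looks like as a fragment of CPU.hdl (just the data path around the ALU and A register; the PC, jump, writeM, outM and addressM wiring is omitted, and internal pin names like aInstr, yIn, Dout and ALUout are my own). The idea is to force the ALU control bits to the "out = y" function (zx=1, nx=1, zy=0, ny=0, f=0, no=0) whenever instruction[15] is 0, so the A-instruction value flows through the ALU into the A register and the usual Mux16 on the A register's input is no longer needed.

    // Fragment of a hypothetical CPU.hdl PARTS section.
    // Dout comes from the D register; PC, jump, writeM and addressM
    // wiring (and the zr/ng outputs) are omitted.

    Not(in=instruction[15], out=aInstr);      // 1 = A-instruction

    // 3-way selection for the ALU y input: A register, inM, or the
    // instruction itself during an A-instruction.
    Mux16(a=Aout, b=inM, sel=instruction[12], out=AorM);
    Mux16(a=AorM, b=instruction, sel=aInstr, out=yIn);

    // Force the "out = y" function during A-instructions; pass the
    // instruction's c-bits through unchanged during C-instructions.
    Or(a=instruction[11], b=aInstr, out=zxBit);
    Or(a=instruction[10], b=aInstr, out=nxBit);
    And(a=instruction[9], b=instruction[15], out=zyBit);
    And(a=instruction[8], b=instruction[15], out=nyBit);
    And(a=instruction[7], b=instruction[15], out=fBit);
    And(a=instruction[6], b=instruction[15], out=noBit);

    ALU(x=Dout, y=yIn, zx=zxBit, nx=nxBit, zy=zyBit, ny=nyBit,
        f=fBit, no=noBit, out=ALUout);

    // The A register now loads straight from the ALU output;
    // the original Mux16 on its input is gone.
    And(a=instruction[15], b=instruction[5], out=destA);
    Or(a=aInstr, b=destA, out=loadA);
    ARegister(in=ALUout, load=loadA, out=Aout);

Whether this nets out to fewer Nands depends on how you build the 3-way mux; either way, the ALU is doing something useful during A-instructions instead of computing garbage.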