I've only tested it up to project 2.
I don't know what you mean by the implied parallelism handling -- it just looks at each chip and recursively looks up its parts until it reaches a Nand gate, adding 1 to the count for each Nand it finds.
I ignore the [x..y] ranges as they don't matter -- the count just goes over the parts.
For example, Not16 would have a parts list of 16 Nots, each of which has a parts list of 1 Nand. So Not16 is 16.
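Roughly, the logic looks like the sketch below. This is not the actual counter.py, just a minimal illustration of the same idea. The names parse_parts and count_nands and the regexes are made up for this example, and it assumes that every chip used has its own .hdl file on the command line and that Nand is the only primitive.

```python
import re
import sys
from pathlib import Path


def strip_comments(text):
    """Remove /* ... */ and // ... comments from HDL source."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", text)


def parse_parts(hdl_text):
    """Return the chip names in the PARTS: section, one entry per instance."""
    _, sep, parts = strip_comments(hdl_text).partition("PARTS:")
    if not sep:
        return []  # e.g. a builtin chip with no PARTS: section
    # A part looks like `Name(pin=..., pin=...);`. Only the name matters,
    # so pin connections and [x..y] ranges are never looked at.
    return re.findall(r"([A-Za-z_]\w*)\s*\(", parts)


def count_nands(chip, parts_of, cache=None):
    """Recursively count the Nand gates that make up `chip`."""
    if cache is None:
        cache = {}
    if chip == "Nand":
        return 1
    if chip not in cache:
        # A chip missing from parts_of raises KeyError (its .hdl was not given).
        cache[chip] = sum(count_nands(p, parts_of, cache) for p in parts_of[chip])
    return cache[chip]


if __name__ == "__main__":
    # Map chip name (file stem) to its parts list for every file given.
    parts_of = {Path(f).stem: parse_parts(Path(f).read_text()) for f in sys.argv[1:]}
    for chip in parts_of:
        print(f"{chip}: {count_nands(chip, parts_of)}")
```

The cache is only there so that a chip like Mux16 is not re-expanded every time a bigger chip uses it; the counts come out the same without it.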
Here are my counts:
$ ./counter.py *.hdl
And: 2
Mux16: 64
And16: 32
Or8Way: 21
Mux4Way16: 192
Mux8Way16: 448
DMux8Way: 35
Nand: 1
DMux4Way: 15
Not: 1
Not16: 16
HalfAdder: 7
Add16: 262
Or16: 48
FullAdder: 17
Xor: 5
Or: 3
ALU: 772
Inc16: 262
Mux: 4
DMux: 5
I was thinking of trying to do a more efficient Inc16 per the lecture, and was curious whether I could do it with fewer gates -- but obviously I needed a tool to count them first!