Efficiency is not well correlated with the number of parts involved, especially in Nand2Tetris.
For example, the Not gate is implemented using a Nand gate. That is nice from the perspective of building everything from a single starting primitive, but it is not how a Not gate is really built. In CMOS, the Nand-based Not uses four transistors, whereas a real inverter uses just two. It gets far worse for more complex gates. N2T doesn't have the student build a Nor gate (only because the project doesn't happen to require one), but if it did, the result would be an Or gate followed by a Not gate. The Or gate, in turn, is likely implemented using a Nand gate and two Not gates. The end result is a gate built from four Nand gates, taking 16 transistors and 3 prop delays, when a real implementation would use four transistors and be three times as fast.
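The Nor numbers above can be checked with a quick tally. This is only a sketch: the `Netlist` and `Signal` names are made up for illustration, and it assumes a 2-input CMOS Nand costs four transistors and that Or is built as a Nand fed by two Nots, as described above.

```python
from dataclasses import dataclass

TRANSISTORS_PER_NAND = 4  # assumption: 2-input CMOS Nand


@dataclass
class Signal:
    """Hypothetical wire that remembers how many Nand delays precede it."""
    depth: int = 0


class Netlist:
    """Hypothetical helper that counts every Nand gate instantiated."""

    def __init__(self):
        self.nands = 0

    def nand(self, a, b):
        self.nands += 1
        # The output settles one Nand delay after the slower input.
        return Signal(max(a.depth, b.depth) + 1)

    def not_(self, a):
        # N2T-style Not: a Nand with both inputs tied together.
        return self.nand(a, a)

    def or_(self, a, b):
        # N2T-style Or: Nand of the two inverted inputs.
        return self.nand(self.not_(a), self.not_(b))

    def nor(self, a, b):
        # Nor as a student would likely build it: Or followed by Not.
        return self.not_(self.or_(a, b))


net = Netlist()
out = net.nor(Signal(), Signal())
nor_transistors = net.nands * TRANSISTORS_PER_NAND
nor_delays = out.depth
print(nor_transistors, nor_delays)  # 16 3
```

The same helper could be pointed at other chips, but the totals depend entirely on how each helper chip is assumed to be built.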
Another shining example is the Xor gate. Using the intuitive implementation presented in the text, it would use 40 transistors and have 5 prop delays (a prop delay is the amount of time it takes a signal to propagate through a single Nand gate), while the usual implementation uses only 16 transistors and has 3 prop delays.
In the case of the part reductions discussed here, the helper chips can significantly reduce the overall transistor count and really speed things up in a real implementation.
The combination of a Not and a Mux to selectively invert a signal can be replaced by a single Xor gate. An Xor gate can be implemented using four Nand gates (for a total of 16 transistors) and has 3 prop delays. The Not/Mux combination, as most N2T students would likely implement it, uses 10 Nand gates (40 transistors) and has 6 prop delays.
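To see that the two circuits really are interchangeable, here is a small truth-table check. It is a sketch only: the function names are illustrative, and the Mux shown is one plausible student implementation (Or of two Ands), not necessarily the one any given student would write. It checks behaviour, not gate counts.

```python
from itertools import product


def nand(a, b):
    return 1 - (a & b)


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def mux(a, b, sel):
    # Assumed student-style Mux: picks a when sel == 0, b when sel == 1.
    return or_(and_(a, not_(sel)), and_(b, sel))


def invert_with_mux(x, invert):
    # Not/Mux combination: pass x through, or its complement when invert == 1.
    return mux(x, not_(x), invert)


def xor4(a, b):
    # Four-Nand Xor; the longest path runs through three Nands.
    n1 = nand(a, b)
    return nand(nand(a, n1), nand(b, n1))


equivalent = all(
    invert_with_mux(x, s) == xor4(x, s)
    for x, s in product((0, 1), repeat=2)
)
print(equivalent)  # True
```

Since the outputs match on all four input combinations, the Xor can be dropped in wherever the Not/Mux pair is used purely as a conditional inverter.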