I wish I could help you with the slowness, but I have zero experience with Verilog or FPGAs in general. There are a few other posts about similar projects in the forum. I'm not sure whether any of them include source code, but it might pay off to read through them.
One of the peculiarities of HACK's architecture is its very small number of CPU registers: just A, D and PC. This means that almost every other executed instruction is a memory access. Perhaps you can optimize for this in your implementation.
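To put a rough number on that for your own programs, here is a quick Python sketch. It is my own throwaway helper, not part of any official tooling, and the name `memory_access_ratio` is made up for illustration. It assumes standard Hack assembly syntax (`@value` A-instructions, `dest=comp;jump` C-instructions, `(LABEL)` pseudo-instructions, `//` comments) and counts an instruction as a memory access whenever `M` appears in its dest or comp part. Note that it counts static instructions in the listing, not executed ones, so treat the result as an estimate.

```python
def memory_access_ratio(asm):
    """Return (memory_instructions, total_instructions) for Hack assembly text.

    Static count only: loops are not unrolled, so this approximates
    the mix of instructions that are actually executed.
    """
    total = 0
    mem = 0
    for raw in asm.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("//", 1)[0].strip()
        # Skip blank lines and (LABEL) pseudo-instructions.
        if not line or line.startswith("("):
            continue
        total += 1
        # A-instructions only load the A register; no RAM access.
        if line.startswith("@"):
            continue
        # C-instruction: look only at dest=comp, because the jump
        # mnemonic JMP contains an 'M' that is not a memory reference.
        dest_comp = line.split(";", 1)[0]
        if "M" in dest_comp:
            mem += 1
    return mem, total


# Hypothetical example: sum = sum + i
SAMPLE = """
@i
D=M      // read RAM[i]
@sum
M=D+M    // read and write RAM[sum]
"""

print(memory_access_ratio(SAMPLE))  # (2, 4)
```

Even in this tiny snippet half of the instructions touch RAM, so if your memory needs more than one clock per access, that is probably the first place I would look.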