I'm getting started on the compiler and want some input on how to implement the tokenizer/lexical analyzer. I plan to write it in Java and am trying to decide whether to (a) use JLex, (b) use Java regex with Matcher, or (c) hand-construct a finite-state machine to recognize the tokens.
Which of these would be best, or are there better options?
Unless you have built a tokenizer before, I recommend starting simple, with either brute-force string matching (ugly) or regular expressions (less ugly).
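To make the regex option concrete, here is a minimal sketch of a regex-based tokenizer using java.util.regex. The class name, token classes, and symbol set are illustrative assumptions, not code from the course materials:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one regex alternative per token class.
public class RegexTokenizer {
    // Whitespace is matched so it can be skipped; order of alternatives matters.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s+"                                      // whitespace (discarded)
        + "|\\d+"                                   // integer constants
        + "|[A-Za-z_]\\w*"                          // identifiers and keywords
        + "|[{}()\\[\\].,;+\\-*/&|<>=~]");          // single-character symbols

    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        int pos = 0;
        while (m.find(pos)) {
            // A gap means the input contains a character no rule matches.
            if (m.start() != pos) {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
            if (!m.group().trim().isEmpty()) {
                tokens.add(m.group());              // keep everything but whitespace
            }
            pos = m.end();
        }
        return tokens;
    }
}
```

A call like `RegexTokenizer.tokenize("let x = x + 1;")` would yield the tokens `let`, `x`, `=`, `x`, `+`, `1`, `;`.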
Thanks for the suggestion. I used regex/Matcher for the assembler project (Project 6), and for this one I just downloaded JFlex and have been going through its sample code. From what I can see, JFlex uses regular expressions but adds automated infrastructure for generating the tokens, so I'm going to go this route -- plus it's an additional learning opportunity (JFlex).
My tokenizer reads the file a line at a time into a buffer, then parses the line a character at a time. I wasn't particularly worried about efficiency, since we aren't trying to compile projects with millions of lines of source code.
The code is organized into multiple routines, each handling a particular kind of token. Parse() skips spaces, looks at the current character in the line, and determines which token it starts; it then calls the appropriate ParseXxx() routine to finish parsing that token.
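That dispatch structure might look roughly like the sketch below. All names here (LineTokenizer, parseNumber, parseWord, parseSymbol) are my own guesses at the shape of the code, not the actual routines:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a per-line, character-at-a-time tokenizer
// with one parseXxx() routine per token class.
public class LineTokenizer {
    private String line;
    private int pos;

    public List<String> parse(String input) {
        line = input;
        pos = 0;
        List<String> tokens = new ArrayList<>();
        while (pos < line.length()) {
            char c = line.charAt(pos);
            if (Character.isWhitespace(c)) { pos++; continue; }   // skip spaces
            if (Character.isDigit(c)) {
                tokens.add(parseNumber());                        // integer constant
            } else if (Character.isLetter(c) || c == '_') {
                tokens.add(parseWord());                          // identifier/keyword
            } else {
                tokens.add(parseSymbol());                        // punctuation
            }
        }
        return tokens;
    }

    private String parseNumber() {         // consume a run of digits
        int start = pos;
        while (pos < line.length() && Character.isDigit(line.charAt(pos))) pos++;
        return line.substring(start, pos);
    }

    private String parseWord() {           // consume letters, digits, underscores
        int start = pos;
        while (pos < line.length()
               && (Character.isLetterOrDigit(line.charAt(pos)) || line.charAt(pos) == '_')) pos++;
        return line.substring(start, pos);
    }

    private String parseSymbol() {         // a single-character symbol token
        return String.valueOf(line.charAt(pos++));
    }
}
```

The nice property of this structure is that each parseXxx() routine owns exactly one token class, so adding a new token type (say, string constants) means adding one branch to the dispatch loop and one new routine.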