Suggested method to implement Tokenizer?

sakumar
I'm getting started on the compiler and want some input on how to implement the Tokenizer/Lexical-Analyser. I plan to write it in Java and am trying to decide whether to (a) use JLex, (b) use Java regex with Matcher, or (c) hand-construct a finite state machine to recognize the tokens.

Which of these would be best or are there better options?

Thanks.

Re: Suggested method to implement Tokenizer?

ybakos
Unless you have built a tokenizer before, I recommend starting simple, with either brute-force string matching (ugly) or regular expressions (less ugly).
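
For readers who want to see what the regular-expression route might look like in Java, here is a minimal sketch, not anyone's actual project code. The class name `RegexJackTokenizer`, the group names, and the "type value" output strings are assumptions made for illustration; the keyword and symbol sets are those of the Jack language. One combined pattern lists every token class as a named alternative, and `Matcher.lookingAt()` is applied repeatedly from the current offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of a regex-based Jack tokenizer.
public class RegexJackTokenizer {
    // One alternative per token class. Whitespace and comments come first so
    // that "//" and "/*" are skipped before "/" can match as a symbol.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<skip>\\s+|//[^\\n]*|/\\*[\\s\\S]*?\\*/)"
        + "|(?<int>\\d+)"
        + "|(?<string>\"[^\"\\n]*\")"
        + "|(?<word>[A-Za-z_]\\w*)"
        + "|(?<symbol>[{}()\\[\\].,;+\\-*/&|<>=~])");

    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "class", "constructor", "function", "method", "field", "static",
        "var", "int", "char", "boolean", "void", "true", "false", "null",
        "this", "let", "do", "if", "else", "while", "return"));

    // Returns tokens as "type value" strings (an illustrative format only).
    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        int pos = 0;
        while (pos < source.length()) {
            m.region(pos, source.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException(
                    "Unexpected character at offset " + pos);
            }
            String text = m.group();
            if (m.group("int") != null) {
                tokens.add("integerConstant " + text);
            } else if (m.group("string") != null) {
                // Drop the surrounding double quotes.
                tokens.add("stringConstant "
                    + text.substring(1, text.length() - 1));
            } else if (m.group("word") != null) {
                tokens.add((KEYWORDS.contains(text) ? "keyword " : "identifier ")
                    + text);
            } else if (m.group("symbol") != null) {
                tokens.add("symbol " + text);
            }
            // The "skip" group (whitespace, comments) produces no token.
            pos = m.end();
        }
        return tokens;
    }
}
```

Calling `RegexJackTokenizer.tokenize("let x = 5; // hi")` should yield five tokens: a keyword, an identifier, a symbol, an integer constant, and a symbol. The sketch takes the whole file as one string, which keeps multi-line `/* ... */` comments simple to match.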

Re: Suggested method to implement Tokenizer?

sakumar
ybakos wrote
Unless you have built a tokenizer before, I recommend starting simple, with either brute-force string matching (ugly) or regular expressions (less ugly).
Thanks for the suggestion. I used Regex/Matcher for the Assembler project (Project 6). For this one I just downloaded JFlex and have been going through its sample code. From what I can see, JFlex uses regular expressions but adds automated infrastructure to generate tokens, so I am going to go this route. Learning JFlex is also an additional learning opportunity.

Re: Suggested method to implement Tokenizer?

peterxu422
Would it be inefficient to read the .jack file one character at a time into a char buffer and check for characters that terminate a token (e.g. a space or certain symbols)?

Re: Suggested method to implement Tokenizer?

peterxu422
In reply to this post by ybakos
Is there a least ugly method?

Re: Suggested method to implement Tokenizer?

cadet1620
Administrator
My tokenizer reads the file a line at a time into a buffer, and then parses the line a character at a time. I wasn't particularly worried about efficiency since we aren't trying to compile projects with millions of lines of source code.

The code is organized into multiple routines that each handle a particular token. Parse() skips spaces and then looks at the current character in the line and determines what token it starts. It then calls the appropriate ParseXxx() function to complete the token's parsing.

--Mark
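
The structure Mark describes could be sketched in Java roughly as follows. This is not his code: the class name `LineTokenizer`, the `parseXxx` method names, and the decision to return raw token text are assumptions for illustration. To stay short, the sketch handles only `//` comments; `/* ... */` comments, which can span lines, are left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; a sketch of a line-at-a-time, hand-written scanner.
public class LineTokenizer {
    private static final String SYMBOLS = "{}()[].,;+-*/&|<>=~";
    private final List<String> tokens = new ArrayList<>();
    private String line;  // the line currently being scanned
    private int pos;      // index of the current character in line

    // Reads the input one line at a time and scans each line in turn.
    public static List<String> tokenize(BufferedReader in) throws IOException {
        LineTokenizer t = new LineTokenizer();
        String next;
        while ((next = in.readLine()) != null) {
            t.parseLine(next);
        }
        return t.tokens;
    }

    // Convenience overload for in-memory source text.
    public static List<String> tokenize(String source) {
        try {
            return tokenize(new BufferedReader(new StringReader(source)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Skips spaces, looks at the current character, and dispatches to the
    // routine that completes the token starting there.
    private void parseLine(String text) {
        line = text;
        pos = 0;
        while (true) {
            while (pos < line.length()
                    && Character.isWhitespace(line.charAt(pos))) {
                pos++;
            }
            if (pos >= line.length() || line.startsWith("//", pos)) {
                return;  // end of line, or the rest is a comment
            }
            char c = line.charAt(pos);
            if (Character.isDigit(c)) parseNumber();
            else if (Character.isLetter(c) || c == '_') parseWord();
            else if (c == '"') parseString();
            else if (SYMBOLS.indexOf(c) >= 0) parseSymbol();
            else throw new IllegalArgumentException(
                "Unexpected character '" + c + "'");
        }
    }

    private void parseNumber() {
        int start = pos;
        while (pos < line.length() && Character.isDigit(line.charAt(pos))) pos++;
        tokens.add(line.substring(start, pos));
    }

    // Keywords and identifiers share this routine; the caller can classify
    // the word afterwards.
    private void parseWord() {
        int start = pos;
        while (pos < line.length()
                && (Character.isLetterOrDigit(line.charAt(pos))
                    || line.charAt(pos) == '_')) {
            pos++;
        }
        tokens.add(line.substring(start, pos));
    }

    // Keeps the surrounding quotes so string constants stay recognizable.
    private void parseString() {
        int end = line.indexOf('"', pos + 1);
        if (end < 0) {
            throw new IllegalArgumentException("Unterminated string constant");
        }
        tokens.add(line.substring(pos, end + 1));
        pos = end + 1;
    }

    private void parseSymbol() {
        tokens.add(String.valueOf(line.charAt(pos)));
        pos++;
    }
}
```

Because every routine advances `pos` by at least one character, the loop in `parseLine` always terminates. Wrapping a `FileReader` in the `BufferedReader` would give the file-based behaviour described in the post.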

Re: Suggested method to implement Tokenizer?

sakumar
In reply to this post by peterxu422
peterxu422 wrote
Is there a least ugly method?
Well, I ended up using JFlex. This tool (more or less) takes a lexical specification, which you write as a .flex file, and automatically generates a Java file that is the Tokenizer for Jack.

It's not ugly but does require you to learn an additional tool.