Are there sufficient markers between the tokens for simple token recognition?

Rather Iffy
The fact that the JackTokenizer API has a method hasMoreTokens suggests to me that it must somehow be possible to detect a token in general, i.e., without determining its type at the same time.

The only way I can see this being possible is if there are enough markers (blanks, end-of-line characters, punctuation, etc.) in the Jack language that a string between them always qualifies as a token. For example, in let x=x+1; the symbols =, + and ; both separate the other tokens and are tokens themselves.

Is that the case? Or must I prepare to write something more sophisticated, with some backtracking and parallelism?

Re: Are there sufficient markers between the tokens for simple token recognition?

ybakos
Breaking a .jack file into tokens is really decoupled from determining the type of each token. In other words, creating a list of tokens is easy (it can even be done with one gnarly regular expression), while determining token types requires comparing the tokens against parts of the grammar.
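
A sketch of the kind of "gnarly" regular expression ybakos describes, in Python (illustrative only, not his actual expression; the symbol set is taken from the Jack grammar):

    import re

    # One alternation that matches any Jack token.  Comments are listed
    # first so that "//" is not read as two "/" symbols; whitespace between
    # tokens matches no alternative, so finditer() simply skips over it.
    TOKEN_RE = re.compile(r'''
        //[^\n]*                  |  # line comment, discarded below
        /\*.*?\*/                 |  # block comment, discarded below
        [{}()\[\].,;+\-*/&|<>=~]  |  # any Jack symbol
        \d+                       |  # integer constant
        "[^"\n]*"                 |  # string constant
        [A-Za-z_]\w*                 # identifier or keyword
    ''', re.VERBOSE | re.DOTALL)

    def tokens(source):
        for match in TOKEN_RE.finditer(source):
            text = match.group()
            if not text.startswith(('//', '/*')):
                yield text

    print(list(tokens('let x=x+1; // increment')))
    # -> ['let', 'x', '=', 'x', '+', '1', ';']

Note that this produces the list of tokens without classifying anything; the types come later.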

Re: Are there sufficient markers between the tokens for simple token recognition?

cadet1620 (Administrator)
In reply to this post by Rather Iffy
hasMoreTokens() needs to scan across whitespace and comments and return true when it finds something non-whitespace.

My compiler, in fact, doesn't have hasMoreTokens(). I have advance() return whether it found something or hit end-of-file so I can simply say
    while (tokenizer.advance()) {
        // Process the token.
    }

--Mark
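
For illustration, here is a minimal Python sketch of this advance()-returns-a-result pattern (hypothetical code, not Mark's compiler; TOKEN_RE is the expression sketched in the previous reply):

    class Tokenizer:
        def __init__(self, source):
            self.source = source
            self.pos = 0
            self.token = None

        def _skip_whitespace_and_comments(self):
            src = self.source
            while self.pos < len(src):
                if src[self.pos].isspace():
                    self.pos += 1
                elif src.startswith('//', self.pos):       # line comment
                    nl = src.find('\n', self.pos)
                    self.pos = len(src) if nl == -1 else nl + 1
                elif src.startswith('/*', self.pos):       # block comment
                    end = src.find('*/', self.pos + 2)
                    self.pos = len(src) if end == -1 else end + 2
                else:
                    return

        def advance(self):
            """Return True if a token was found, False at end of file."""
            self._skip_whitespace_and_comments()
            if self.pos >= len(self.source):
                return False
            m = TOKEN_RE.match(self.source, self.pos)
            self.token = m.group()
            self.pos = m.end()
            return True

    t = Tokenizer('let x = 1;  // done')
    while t.advance():
        print(t.token)            # let, x, =, 1, ;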

Re: Are there sufficient markers between the tokens for simple token recognition?

Rather Iffy
In reply to this post by Rather Iffy
I got it!
Thanks for the tip.

I found a tokenizer in the Python re module documentation. It really is a programming pearl, less than 50 lines of code.

By modifying it a little, it can now produce the same token stream, wrapped in XML tags, as ExpressionlessSquare/MainT.xml.

The gnarly regular expression came to me very naturally while studying and testing the tokenizer.

The modified tokenizer that I've built determines the token type without querying a parser.
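
A sketch of what such an adaptation might look like (illustrative, not Rather Iffy's actual code; it follows the named-group technique from the re module documentation, where match.lastgroup yields the token type, with the keyword list from the Jack grammar and the tag names used in MainT.xml):

    import re

    KEYWORDS = {'class', 'constructor', 'function', 'method', 'field',
                'static', 'var', 'int', 'char', 'boolean', 'void', 'true',
                'false', 'null', 'this', 'let', 'do', 'if', 'else',
                'while', 'return'}

    # Each named group doubles as the token type, so no parser is consulted.
    TOKEN_SPEC = [
        ('comment',         r'//[^\n]*|/\*.*?\*/'),
        ('integerConstant', r'\d+'),
        ('stringConstant',  r'"[^"\n]*"'),
        ('identifier',      r'[A-Za-z_]\w*'),
        ('symbol',          r'[{}()\[\].,;+\-*/&|<>=~]'),
    ]
    MASTER_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC),
                           re.DOTALL)

    ESCAPES = {'<': '&lt;', '>': '&gt;', '&': '&amp;'}   # XML-escape symbols

    def tokenize(source):
        for mo in MASTER_RE.finditer(source):
            kind, value = mo.lastgroup, mo.group()
            if kind == 'comment':
                continue
            if kind == 'identifier' and value in KEYWORDS:
                kind = 'keyword'
            elif kind == 'stringConstant':
                value = value[1:-1]               # strip the quotes
            yield kind, value

    print('<tokens>')
    for kind, value in tokenize('let s = "hi"; // demo'):
        print('<%s> %s </%s>' % (kind, ESCAPES.get(value, value), kind))
    print('</tokens>')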

Now I am going to figure out how to refactor the code to conform to the JackTokenizer API.