Are there sufficient markers between the tokens for simple token recognition?

Rather Iffy
The fact that the JackTokenizer API has a method hasMoreTokens suggests to me that it must somehow be possible to detect a token in general, i.e., without determining its type at the same time.

The only way I can see this being possible is if there are enough markers (blanks, end-of-line characters, punctuation, etc.) in the Jack language that a string between them always qualifies as a token. For example, in let x=x+1; the symbols =, + and ; both separate the other tokens and are tokens themselves.

Is that the case? Or must I prepare to write something more sophisticated, with some backtracking and parallelism?

Re: Are there sufficient markers between the tokens for simple token recognition?

ybakos
Breaking a .jack file into tokens is really decoupled from determining the type of each token. In other words, creating a list of tokens is easy (it can even be done with one gnarly regular expression), while determining token types requires comparing the tokens against parts of the grammar.
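
A sketch of the kind of "gnarly" regular expression ybakos describes, in Python (illustrative only, not his actual expression; the symbol set is taken from the Jack grammar):

    import re

    # One alternation that matches any Jack token.  Comments are listed
    # first so that "//" is not read as two "/" symbols; whitespace between
    # tokens matches no alternative, so finditer() simply skips over it.
    TOKEN_RE = re.compile(r'''
        //[^\n]*                  |  # line comment, discarded below
        /\*.*?\*/                 |  # block comment, discarded below
        [{}()\[\].,;+\-*/&|<>=~]  |  # any Jack symbol
        \d+                       |  # integer constant
        "[^"\n]*"                 |  # string constant
        [A-Za-z_]\w*                 # identifier or keyword
    ''', re.VERBOSE | re.DOTALL)

    def tokens(source):
        for match in TOKEN_RE.finditer(source):
            text = match.group()
            if not text.startswith(('//', '/*')):
                yield text

    print(list(tokens('let x=x+1; // increment')))
    # -> ['let', 'x', '=', 'x', '+', '1', ';']

Note that this produces the list of tokens without classifying anything; the types come later.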

Re: Are there sufficient markers between the tokens for simple token recognition?

cadet1620 (Administrator)
In reply to this post by Rather Iffy
hasMoreTokens() needs to scan across whitespace and comments and return true when it finds something non-whitespace.

My compiler, in fact, doesn't have hasMoreTokens(). I have advance() return whether it found something or hit end-of-file so I can simply say
    while (tokenizer.advance()) {
        // Process the token.
    }

--Mark
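
For illustration, here is a minimal Python sketch of this advance()-returns-a-result pattern (hypothetical code, not Mark's compiler; TOKEN_RE is the expression sketched in the previous reply):

    class Tokenizer:
        def __init__(self, source):
            self.source = source
            self.pos = 0
            self.token = None

        def _skip_whitespace_and_comments(self):
            src = self.source
            while self.pos < len(src):
                if src[self.pos].isspace():
                    self.pos += 1
                elif src.startswith('//', self.pos):       # line comment
                    nl = src.find('\n', self.pos)
                    self.pos = len(src) if nl == -1 else nl + 1
                elif src.startswith('/*', self.pos):       # block comment
                    end = src.find('*/', self.pos + 2)
                    self.pos = len(src) if end == -1 else end + 2
                else:
                    return

        def advance(self):
            """Return True if a token was found, False at end of file."""
            self._skip_whitespace_and_comments()
            if self.pos >= len(self.source):
                return False
            m = TOKEN_RE.match(self.source, self.pos)
            self.token = m.group()
            self.pos = m.end()
            return True

    t = Tokenizer('let x = 1;  // done')
    while t.advance():
        print(t.token)            # let, x, =, 1, ;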

Re: Are there sufficient markers between the tokens for simple token recognition?

Rather Iffy
In reply to this post by Rather Iffy
I got it!
Thanks for the tip.

I found a tokenizer in the Python re module documentation. It really is a programming pearl, less than 50 lines of code.

By modifying it a little, it can now produce the same token stream, wrapped in XML tags, as ExpressionlessSquare/MainT.xml.

The gnarly regular expression came to me very naturally while studying and testing the tokenizer.

The modified tokenizer that I've built determines the token type without querying a parser.
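
A sketch of what such an adaptation might look like (illustrative, not Rather Iffy's actual code; it follows the named-group technique from the re module documentation, where match.lastgroup yields the token type, with the keyword list from the Jack grammar and the tag names used in MainT.xml):

    import re

    KEYWORDS = {'class', 'constructor', 'function', 'method', 'field',
                'static', 'var', 'int', 'char', 'boolean', 'void', 'true',
                'false', 'null', 'this', 'let', 'do', 'if', 'else',
                'while', 'return'}

    # Each named group doubles as the token type, so no parser is consulted.
    TOKEN_SPEC = [
        ('comment',         r'//[^\n]*|/\*.*?\*/'),
        ('integerConstant', r'\d+'),
        ('stringConstant',  r'"[^"\n]*"'),
        ('identifier',      r'[A-Za-z_]\w*'),
        ('symbol',          r'[{}()\[\].,;+\-*/&|<>=~]'),
    ]
    MASTER_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC),
                           re.DOTALL)

    ESCAPES = {'<': '&lt;', '>': '&gt;', '&': '&amp;'}   # XML-escape symbols

    def tokenize(source):
        for mo in MASTER_RE.finditer(source):
            kind, value = mo.lastgroup, mo.group()
            if kind == 'comment':
                continue
            if kind == 'identifier' and value in KEYWORDS:
                kind = 'keyword'
            elif kind == 'stringConstant':
                value = value[1:-1]               # strip the quotes
            yield kind, value

    print('<tokens>')
    for kind, value in tokenize('let s = "hi"; // demo'):
        print('<%s> %s </%s>' % (kind, ESCAPES.get(value, value), kind))
    print('</tokens>')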

Now I am going to figure out how to refactor the code to conform to the JackTokenizer API.