I am trying to split Jack programs into token lists by whitespace and symbols. I use the re module to specify several symbols as separators. However, I cannot include those symbols themselves in the token list.
I found some websites that teach the method, but they were about JavaScript, not the Python I am using. I attached my py file for the test if necessary. split.py
Could you give me any advice on how to do that? Or is there any other appropriate way to tokenize the programs?
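For context, a minimal sketch of the kind of attempt I mean (the pattern here is illustrative, not the exact one in my attached split.py): the symbols serve only as separators, so they vanish from the result.

```python
import re

# Sketch of splitting on whitespace and a subset of Jack's symbols.
# Because the symbols are used purely as separators, they are lost.
code = "let b = Fraction(1,2);"
tokens = [t for t in re.split(r"[(){}\[\].,;+\-*/&|<>=~\s]+", code) if t]
print(tokens)  # ['let', 'b', 'Fraction', '1', '2']  -- the symbols are gone
```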
From a pedagogical standpoint, I think it is better NOT to use regular expressions. Do it yourself. You will learn a lot more -- and it will make the use of regular expressions easier to understand down the road.
Walk through the file character by character. At any given point you have a partial token and a list of possible token types consistent with it. See if the new character can extend any of the possible token types; if it can, append it to the partial token. If not, then your partial token is complete: add it to your token list, and the character you just read becomes the start of the next partial token. You can deal with whitespace explicitly, or add a "whitespace" token type so that all processing is the same, except that you simply don't add whitespace tokens to your token list.
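To make that concrete, here is a minimal Python sketch of the loop, with a deliberately simplified set of token types (word, integer, single-character symbol, whitespace); the helper name `can_extend` is mine, not from any library:

```python
# Sketch of character-by-character scanning with "extend as far as possible".
SYMBOLS = set("(){}[].,;+-*/&|<>=~")

def can_extend(partial, ch):
    """Could partial + ch still be the start of some valid token?"""
    candidate = partial + ch
    if candidate.isidentifier():   # identifier or keyword prefix
        return True
    if candidate.isdigit():        # integer constant prefix
        return True
    if candidate in SYMBOLS:       # single-character symbol (partial was empty)
        return True
    if candidate.isspace():        # whitespace treated as its own token type
        return True
    return False

def tokenize(text):
    tokens, partial = [], ""
    for ch in text:
        if can_extend(partial, ch):
            partial += ch
        else:
            if partial and not partial.isspace():  # drop whitespace tokens
                tokens.append(partial)
            partial = ch
    if partial and not partial.isspace():
        tokens.append(partial)
    return tokens

print(tokenize("let b = Fraction(1,2);"))
# ['let', 'b', '=', 'Fraction', '(', '1', ',', '2', ')', ';']
```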
Thank you for your reply. I could not fully understand the advice, so below is the method for my tokenizer that came to mind.
An example code is 'let b = Fraction();'.
1. Split the code by whitespace ([let, b, =, Fraction();] is generated.)
2. Extract each element
3. If the element has any symbol, split it by the symbol to generate a new list ([Fraction, );] is generated.)
4. If the element has no symbol, append it to the token list
I intended to apply a recursive function to the list in step 3, but it did not work. I attached my test Python file, so could you give me any hint to fix the program? I read about recursive structures, but I could not figure it out.
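What I was aiming for is roughly the following sketch (the names are placeholders and my attached file differs): the recursion in step 3 splits at the first symbol and keeps each symbol as its own token instead of discarding it.

```python
# Sketch of steps 1-4: split on whitespace, then recursively split each
# piece at the first symbol found, keeping the symbol as a token.
SYMBOLS = "(){}[].,;+-*/&|<>=~"

def split_element(elem):
    """Recursively split elem at the first symbol, keeping the symbol."""
    for i, ch in enumerate(elem):
        if ch in SYMBOLS:
            before = [elem[:i]] if elem[:i] else []
            return before + [ch] + split_element(elem[i + 1:])
    return [elem] if elem else []

def tokenize(code):
    tokens = []
    for word in code.split():        # step 1: split by whitespace
        tokens.extend(split_element(word))  # steps 2-4
    return tokens

print(tokenize("let b = Fraction(1,2);"))
# ['let', 'b', '=', 'Fraction', '(', '1', ',', '2', ')', ';']
```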
Start off with an empty token and extend it as far as you can.
let b = Fraction();
The next character is 'l'. If this is added to the token as we have it so far, can it possibly be the start of a valid token? Yes, token = 'l' can be the start of an identifier or a keyword.
We keep going, adding characters until we can't extend it any further.
'le' can be the start of a valid token.
'let' can be the start of a valid token.
'let<space>' can't be the start of a valid token, so we do not add the space to it. Instead, we add the last thing we had, 'let', to the token list.
Then we start a new token with the character we couldn't add, which is the space. Or, knowing that whitespace is ignored, we can start a new empty token and keep skipping any whitespace until we get to something that isn't whitespace. Either way, we eventually end up with a token 'b' that can't be extended by the character that follows it, so that is our next token that we add to the token list.
Notice that this approach works even when whitespace that isn't strictly required is left out. When we get to 'b', the '=' can't extend it because 'b=' can't be the start of a valid token. So we add 'b' to the token list and start the next token with '='. Since '=F' can't be the start of a valid token, our next token added to the list is '=' and we start again with 'F'.
While Jack was carefully designed so that there are no multi-character symbol tokens, such as '>=', this approach will work fine even if your language includes them. Do you see where your approach of splitting on every symbol wouldn't?
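For illustration, here is a sketch where the symbol set includes hypothetical two-character symbols like '>=' and '<=' (not real Jack symbols). The "could this still be a prefix of a valid token?" test lets the scanner keep '>=' together, whereas splitting on each symbol character would tear it into '>' and '='.

```python
# ">=" and "<=" are hypothetical additions, just to show longest-match scanning.
SYMBOLS = {"(", ")", ";", "=", ">", "<", ">=", "<=", "+"}

def is_token_prefix(s):
    """Could s be the start of some valid token?"""
    return (s.isidentifier() or s.isdigit() or s.isspace()
            or any(sym.startswith(s) for sym in SYMBOLS))

def tokenize(text):
    tokens, partial = [], ""
    for ch in text:
        if partial and not is_token_prefix(partial + ch):
            if not partial.isspace():   # whitespace tokens are dropped
                tokens.append(partial)
            partial = ""
        partial += ch
    if partial and not partial.isspace():
        tokens.append(partial)
    return tokens

print(tokenize("if (x>=10)"))  # ['if', '(', 'x', '>=', '10', ')']
```

Because '>=' is a prefix of a token in SYMBOLS, the '=' extends the '>' instead of terminating it; a split-on-every-symbol tokenizer has no way to make that decision.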