Somehow vague for tokenizing the set of character files?

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Somehow vague for tokenizing the set of character files?

Henoktes722
I'm some how vague of getting the atoms, tokens, from the .jack file. I mean I was assuming to split with space, but that is not the correct one. So what is the correct way to get the tokens, from a set of character files. Like when there is a dot, curly brace and comma?  

I know what must be the tokenized program of this file, but I want to know how to split, get, each tokens?
I'm using javascript.

method void draw(){
      do Screen.setColor(x);
      do Screen.drawRectangle(x, y, x, y);
      return;
   }
Reply | Threaded
Open this post in threaded view
|

Re: Somehow vague for tokenizing the set of character files?

WBahn
Administrator
In general you need to be keeping track of what possible types of tokens you might already have a piece of and then answering the question of which of those types the next character can extend. Once you get to a point where the next character can't extend any of them, you are done with that token and can output it and then start on the next token.

The Jack language was designed to make this pretty easy, which is why there are no multi-character operators (such as '<=') that you have to recognize.
Reply | Threaded
Open this post in threaded view
|

Re: Somehow vague for tokenizing the set of character files?

Henoktes722
This post was updated on .
Sorry WBahn, what do you mean when you say extend?
Can you show me by example if possible?
Reply | Threaded
Open this post in threaded view
|

Re: Somehow vague for tokenizing the set of character files?

WBahn
Administrator
Henoktes722 wrote
Sorry WBahn, what do you mean when you say extend?
Can you show me by example if possible?
Imagine you are reading the input file one character at a time. After reading each character, you have to decide what to do before reading the next character. As you read the file you build up the tokens character by character as far as you can, so you are extending the tokens as you go for as long as you are able to do so. When you can't extend the current token with the next character, you are done with that token and so you process it and then start a new token.

So say you have the following in your file

   method void Fred(int x)
   {
      if (x < 5)
         return true;
      else
         return false;
   }

You will first see the 'm' and this is consistent with being either a keyword or an indentifier.

You then read an 'e' and if this extends the token to "me" this is still consistent with both a keyword and an identifier. You keep reading characters until after you have read the 'd' and now your token is at "method" and it still consistent with being either a keyword or an identifier. When you read the next character, the space character, this cannot be used to extend either a keyword or an identifier, so this token is complete and because keywords take precedence over identifiers, you accept this token as being a keyword and now you start processing the rest of the input starting where you left off, which is the space character. Since whitespace is ignore, you throw that away and move on to the 'v' and start the process again. You will eventually recognize the "void" as a keyword and then "Fred" as an identifier when you can't extend it with the '(' character. The '(' character will be recognized as a symbol when you can't extend it with the 'i' that follows it. This continues until you get to the end of the file.

The definition of the lexical elements in Jack has a slight oversight in it -- the definition of an identifier should have explicitly excluded the keywords. But most people intuitively understand that this is what was intended.
Reply | Threaded
Open this post in threaded view
|

Re: Somehow vague for tokenizing the set of character files?

Henoktes722
Thank you really great answer.

But we may have many options that we can't extend in addition to space and '('. Like in this example

x<5

we can't extend operators like '<' in this example, so if that is the case won't have many options that we can't extend?
Reply | Threaded
Open this post in threaded view
|

Re: Somehow vague for tokenizing the set of character files?

WBahn
Administrator
Henoktes722 wrote
Thank you really great answer.

But we may have many options that we can't extend in addition to space and '('. Like in this example

x<5

we can't extend operators like '<' in this example, so if that is the case won't have many options that we can't extend?
For each type of token (and for each partial string at a given point) we have a specific set of characters that can be used to extend that token. Any other character can't.

So in the case of x<5 we start out with "" (i.e., and empty string) which is compatible with any type of token. We then read 'x' and that is only compatible identifiers (since no keyword starts with an 'x'). So now we have "x" and next we read '<'. This can't be used to extend an identifier, so we have found the first token

identifier: "x"

Now start our next token with the character that couldn't be used, namely "<". This is only compatible with a symbol token. The next character, '5', can't extend this so we have

symbol: "<"

Now start our next token with the character that couldn't be used, namely "5". This is only compatible with an integerConstant (and this is why identifiers can't start with a digit -- we want to limit the number of possible types as quickly as possible). The next character is the end-of-line character, which can't extend an integerConstant, so we have

integerConstant: "5"

Now start our next token with the character that couldn't be read, namely the end-of-line character. No token can start with this, but we are allowed to discard it and move on to the next character.

Just keep going like this until you get to the end of the file.
Reply | Threaded
Open this post in threaded view
|

Re: Somehow vague for tokenizing the set of character files?

Henoktes722
Sure I will do it. So you basically told me to use DFA, Deterministic Finite Automata.
I basically use it for removing comments in C language and it just works like your algorithm.
Reply | Threaded
Open this post in threaded view
|

Re: Somehow vague for tokenizing the set of character files?

WBahn
Administrator
Yep -- the rules that define the tokens are regular expressions and so a DFA can be used to tokenize a source code file. The rules that define the language syntax are (for the most part) context-free, so we can use a PDA to parse the token stream.
Reply | Threaded
Open this post in threaded view
|

Re: Somehow vague for tokenizing the set of character files?

Henoktes722
Thanks, I did the tokenizer part. How can I do the parser? Can you explain it in the way like you expressed the tokenizer?
Reply | Threaded
Open this post in threaded view
|

Re: Somehow vague for tokenizing the set of character files?

WBahn
Administrator
Henoktes722 wrote
Thanks, I did the tokenizer part. How can I do the parser? Can you explain it in the way like you expressed the tokenizer?
I'm not going to go through and explain how to write a parser step by step. That's what YOU'RE supposed to do. I will try to help you get past specific stumbling points as they come up.