I decided to test the Jack Compiler a bit and see what kinds of input it would consider valid so that my compiler would be able to compile all code created for Jack. What I've run into is a bit of confusion about how to separate tokens. My initial instinct was that every token would be separated by some kind of white space. I thought this was a safe assumption because all of my code up to that point was written with readability in mind, but consider for example the following single line of code:
function int increment(int num){var int returnNum; let returnNum=num; let returnNum=returnNum+1;return returnNum;}
This is successfully compiled by the Jack Compiler, but only certain parts of the statement actually need to be separated by white space: specifically, tokens that begin with alphanumeric characters. The textbook says:
The tokens may be separated by an arbitrary number of space characters, newline characters and
comments, which are ignored.
But obviously keywords must be separated by white space lest they confuse the compiler. In some cases zero white space is sufficient while in others there must be a minimum of 1 white space character. How can I be certain which is which without having to make my own assumptions? I feel like there's something very basic that I've missed.
EDIT:
Is it more correct to say that tokens adjacent to symbols don't require that they be separated by white space, while tokens not adjacent to symbols do?
For that matter I've run into another issue. I was originally going to tokenize one line at a time, but I've since discovered that this is valid input:
while
(number
<
10)
Would it make more sense to replace all newlines with spaces in order to make reading and searching the code for the next white space or symbol easier? Does this become incredibly inefficient when Jack projects start to get too large?
|
Administrator
|
I think that the only place where whitespace (space, tab, newline, /*...*/) is required is between keywords, identifiers and numbers. This monster compiles to identical .vm files with JackCompiler and both versions of my compilers.
[c:/TECS/projects/bin] % cat minsp.jack
class minsp{function int increment(int num){var int returnNum;let returnNum=num;let returnNum=returnNum+1;return returnNum;}
function
int
decrement(int
num){var
int
returnNum;let
returnNum=num;let
returnNum=returnNum-1;return
returnNum;}
function/**/int/**/zero(){return/**/0;}}
[c:/TECS/projects/bin] % JackCompiler minsp.jack ; cp minsp.vm 1
Compiling "c:\TECS\projects\bin\minsp.jack"
[c:/TECS/projects/bin] % hjc minsp.jack ; cp minsp.vm 2
Processing minsp.jack
[c:/TECS/projects/bin] % rexjc/hjc minsp.jack ; cp minsp.vm 3
Processing minsp.jack
[c:/TECS/projects/bin] % md5 1 2 3
3DDBEA0372F87915A2A1D5C1DA0A525D 1
3DDBEA0372F87915A2A1D5C1DA0A525D 2
3DDBEA0372F87915A2A1D5C1DA0A525D 3
[c:/TECS/projects/bin] %
--Mark
As an interesting aside, my compiler which I've been using for years with my own and students' Jack code had a tokenizer bug that I just found! The sequence
/* Some comment
that has // as part of its text. */
do something();
miscompiled because it stripped the */ as part of an end-of-line comment so the following lines were discarded as continuations of the multi-line comment.
Interestingly, if the comment opened and closed on the same line, it was OK with //.
/* Some comment that has // as part of its text. */
do something();
|
|
cadet1620 wrote
I think that the only place where whitespace (space, tab, newline, /*...*/) is required is between keywords, identifiers and numbers.
That basically sums up the "rule": there needs to be a space between keywords, identifiers and literals.
|
Administrator
|
ybakos wrote
cadet1620 wrote
I think that the only place where whitespace (space, tab, newline, /*...*/) is required is between keywords, identifiers and numbers.
That basically sums up the "rule": there needs to be a space between keywords, identifiers and literals.
To be picky, numeric literals. String literals are OK:
return"this";
doesn't need spaces.
--Mark
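The rule settled on above can be sketched as a small check. This is only an illustration, assuming the tokens are already known and that a string constant keeps its surrounding quotes; `needs_whitespace` is a hypothetical helper name, not part of any supplied tool.

```python
import re

# Characters that can appear in keywords, identifiers and integer constants.
_WORD_CHAR = re.compile(r"[A-Za-z0-9_]")


def needs_whitespace(left: str, right: str) -> bool:
    """Return True if two adjacent tokens would merge without a separator.

    A separator (space, tab, newline or comment) is only needed when the
    last character of `left` and the first character of `right` are both
    word characters. Symbols and quoted strings delimit themselves.
    """
    return bool(_WORD_CHAR.fullmatch(left[-1]) and _WORD_CHAR.fullmatch(right[0]))


print(needs_whitespace("return", "returnNum"))  # keyword then identifier
print(needs_whitespace("return", '"this"'))     # keyword then string constant
print(needs_whitespace("returnNum", "="))       # identifier then symbol
```

The first call reports that a separator is required; the other two report that none is needed.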
|
|
cadet1620 wrote
As an interesting aside, my compiler which I've been using for years with my own and students' Jack code had a tokenizer bug that I just found! The sequence
/* Some comment
that has // as part of its text. */
do something();
miscompiled because it stripped the */ as part of an end-of-line comment so the following lines were discarded as continuations of the multi-line comment.
Interestingly, if the comment opened and closed on the same line, it was OK with //.
/* Some comment that has // as part of its text. */
do something();
I've actually been banging my head against the wall trying to figure out how I'm going to strip away comments from my Jack files before I tokenize them. My current plan is to read the entire Jack program into a single string. I will process that string by replacing everything between /* and */ with a space. Following that I will replace all text between // and a newline character with a space. Now to figure out how to handle /*/. I think that if I take the index of /* and make sure it checks for */ starting at that index + 2 I should be able to avoid any issues.
Are /** API comments */ ever handled differently than /* comments until closing*/ in the Jack Compiler? Are API comments ever treated differently in this course?
As an aside, my original idea for this had me reading the text line by line and removing the // comments first but the problem with doing that is exactly the same issue that you've run into in your compiler. By removing those comments first you can run into an issue where you comment away */. I'd imagine in your compiler the check for a comment is still being performed line by line even though you're still in a comment. I had a large and convoluted way to check for that but felt like my solution wasn't elegant enough.
|
|
sgaweda wrote
cadet1620 wrote
As an interesting aside, my compiler which I've been using for years with my own and students' Jack code had a tokenizer bug that I just found! The sequence
/* Some comment
that has // as part of its text. */
do something();
miscompiled because it stripped the */ as part of an end-of-line comment so the following lines were discarded as continuations of the multi-line comment.
Interestingly, if the comment opened and closed on the same line, it was OK with //.
/* Some comment that has // as part of its text. */
do something();
I've actually been banging my head against the wall trying to figure out how I'm going to strip away comments from my Jack files before I tokenize them. My current plan is to read the entire Jack program into a single string. I will process that string by replacing everything between /* and */ with a space. Following that I will replace all text between // and a newline character with a space. Now to figure out how to handle /*/. I think that if I take the index of /* and make sure it checks for */ starting at that index + 2 I should be able to avoid any issues.
do Output.printString("Am I a /* comment? */");
Instead, try to "be" the tokenizer. You're looking at the current character (and probably one ahead) and you're moving forward. If you see a quote, you're in string mode until you see a closing quote. If you see a slash, you may be entering comment mode (but not if you're in string mode). What is the next character?
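One possible shape for that forward scan is sketched below, for readers who do not mind a spoiler. It assumes the whole file has been read into one string and that each comment should be replaced by a single space; `strip_comments` is a hypothetical name, and the search for `*/` starts two characters past the `/*` so that `/*/` does not close itself.

```python
def strip_comments(source: str) -> str:
    """Replace every comment with one space, leaving string constants alone."""
    out = []
    i, n = 0, len(source)
    while i < n:
        if source[i] == '"':
            # String mode: copy everything through the closing quote.
            end = source.find('"', i + 1)
            if end == -1:          # unterminated string: copy the rest as-is
                end = n - 1
            out.append(source[i:end + 1])
            i = end + 1
        elif source.startswith("//", i):
            # End-of-line comment: skip up to, but not including, the newline.
            end = source.find("\n", i)
            out.append(" ")
            i = n if end == -1 else end
        elif source.startswith("/*", i):
            # Block comment: ends at the first "*/" found after the opener.
            end = source.find("*/", i + 2)
            if end == -1:
                raise SyntaxError("unterminated /* comment")
            out.append(" ")
            i = end + 2
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


line = 'do Output.printString("Am I a /* comment? */");'
print(strip_comments(line) == line)  # the string constant survives untouched
```

Because each branch is chosen by what the scanner sees first, a `//` inside a block comment and a `/*` inside a string are never examined as comment openers.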
Are /** API comments */ ever handled differently than /* comments until closing*/ in the Jack Compiler? Are API comments ever treated differently in this course?
No, at least not by any tool.
As an aside, my original idea for this had me reading the text line by line and removing the // comments first but the problem with doing that is exactly the same issue that you've run into in your compiler. By removing those comments first you can run into an issue where you comment away */. I'd imagine in your compiler the check for a comment is still being performed line by line even though you're still in a comment. I had a large and convoluted way to check for that but felt like my solution wasn't elegant enough.
|
|
ivant wrote
Instead, try to "be" the tokenizer. You're looking at the current character (and probably one ahead) and you're moving forward. If you see a quote, you're in string mode until you see a closing quote. If you see a slash, you may be entering comment mode (but not if you're in string mode). What is the next character?
Ok. I think I can sort of make sense of that, but does that mean I'm basically building my tokens one character at a time? If I have a Lexical Element I'll need to build it until something (white space or symbol) tells me to stop?
|
Administrator
|
ivant wrote
Are /** API comments */ ever handled differently than /* comments until closing*/ in the Jack Compiler? Are API comments ever treated differently in this course?
No, at least not by any tool.
One company I worked for had an automatic documentation generating tool that read comments and reformatted them based on what came after the /*. There were various indicators to mean module comments, function operation comments, function arguments, data structures, etc. It was a major PITA!
--Mark
|
|
sgaweda wrote
ivant wrote
Instead, try to "be" the tokenizer. You're looking at the current character (and probably one ahead) and you're moving forward. If you see a quote, you're in string mode until you see a closing quote. If you see a slash, you may be entering comment mode (but not if you're in string mode). What is the next character?
Ok. I think I can sort of make sense of that, but does that mean I'm basically building my tokens one character at a time? If I have a Lexical Element I'll need to build it until something (white space or symbol) tells me to stop?
Right.
BTW, I'm not trying to be sly here. I'm trying to give you hints without too many "spoilers", but I have very little experience with teaching.
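For anyone who does want a spoiler, here is a rough sketch of building tokens one character at a time. It assumes comments have already been removed, uses the symbol list from the book's Jack grammar, and does not classify or validate the tokens; `tokenize` and `SYMBOLS` are names made up for this illustration.

```python
SYMBOLS = set("{}()[].,;+-*/&|<>=~")


def tokenize(text: str) -> list:
    """Split comment-free Jack source into raw token strings."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1                          # separators are skipped
        elif ch in SYMBOLS:
            tokens.append(ch)               # every symbol is one character
            i += 1
        elif ch == '"':
            end = text.index('"', i + 1)    # ValueError if the string never closes
            tokens.append(text[i:end + 1])
            i = end + 1
        else:
            # Keyword, identifier or number: grow it until white space,
            # a symbol or a quote says stop.
            start = i
            while i < n and not text[i].isspace() and text[i] not in SYMBOLS and text[i] != '"':
                i += 1
            tokens.append(text[start:i])
    return tokens


print(tokenize("while\n(number\n<\n10)"))
```

Since newlines are treated like any other white space, the split-across-lines `while` example from the first post needs no special handling.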
|
Administrator
|
sgaweda wrote
Ok. I think I can sort of make sense of that, but does that mean I'm basically building my tokens one character at a time? If I have a Lexical Element I'll need to build it until something (white space or symbol) tells me to stop?
Yes, that's essentially what the _ParseInt(), _ParseIdent(), etc. functions do in the Python snippet I mailed you. By the way, that tokenizer chokes on Ivan's
do Output.printString("Am I a /* comment? */");
I had to move the comment handling from the outer line reading loop into the parsing loop; adding _ParseCommentEOL() and _ParseCommentInLine() that get called when the parser encounters comments.
Because I'm processing the file a line at a time, there is a bit of complication in the parsing loop; it needs an 'inComment' variable that gets set when _ParseCommentInLine() doesn't find the '*/'.
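A line-at-a-time version of this idea might look roughly like the sketch below. It is not the tokenizer described above, only an illustration of carrying an `in_comment` flag from one line to the next; `strip_comments_by_line` is a hypothetical name, and the input is assumed to be lines without their trailing newlines (for example from `str.splitlines()`).

```python
def strip_comments_by_line(lines):
    """Yield (line_number, text) pairs with comments removed from each line."""
    in_comment = False                      # True while inside an unclosed /* */
    for number, line in enumerate(lines, start=1):
        out = []
        i = 0
        while i < len(line):
            if in_comment:
                end = line.find("*/", i)
                if end == -1:
                    i = len(line)           # comment continues on the next line
                else:
                    in_comment = False
                    out.append(" ")
                    i = end + 2
            elif line.startswith("//", i):
                i = len(line)               # rest of the line is a comment
            elif line.startswith("/*", i):
                in_comment = True
                i += 2
            elif line[i] == '"':
                end = line.find('"', i + 1)
                end = len(line) - 1 if end == -1 else end
                out.append(line[i:end + 1]) # copy the string constant verbatim
                i = end + 1
            else:
                out.append(line[i])
                i += 1
        yield number, "".join(out)


source = ["/* Some comment", "that has // as part of its text. */", "do something();"]
for number, text in strip_comments_by_line(source):
    print(number, repr(text))
```

Keeping the original line numbers alongside the stripped text leaves room for reporting errors against the raw source line.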
--Mark
|
Administrator
|
sgaweda wrote
Would it make more sense to replace all newlines with spaces in order to make reading and searching the code for the next white space or symbol easier? Does this become incredibly inefficient when Jack projects start to get too large?
My largest Jack file is 52K which wouldn't be a memory or speed problem on any modern computer. Don't forget to translate \t (tab) to spaces, and on Windows you might also need to translate \r (return).
I read the files by lines so that I can include the raw source lines in error messages and optionally as debugging comments in the VM output. For example:
[c:/TECS/projects/bin] % hjc Main.jack
Processing Main.jack
Main.jack(107): while (~t>360) {
Warning: undefined precedence "~" followed by ">".
--Mark
|
|
The book doesn't specify how to handle nested comments. I.e.
/* is /* this */ valid? */
The book just states that
Comments are of the standard formats /* comment until closing */, /** API comment */, and // comment to end of line.
So, a conforming lexer may interpret the above as the comment /* is /* this */ followed by identifier valid, then ? (which I think is invalid punctuation in Jack), and then a closing comment, which is an error.
Handling nested comments is hard. If I remember correctly, for many years the C and C++ compilers didn't handle them. Nested comments are handy when you want to comment out a piece of code, e.g.:
do Output.printString("Hello, ");
/* I'm very important explanation about the following line */
do Output.printString("World?");
So if you just enclose it in comments it will look like:
/*
do Output.printString("Hello, ");
/* I'm very important explanation about the following line */
do Output.printString("World?");
*/
And, as we saw, that will probably fail. A rule of thumb is to use EOL comments instead, because they can be nested in multi-line comments:
/*
do Output.printString("Hello, ");
// I'm very important explanation about the following line
do Output.printString("World?");
*/
BTW, here is an interesting corner case, which I'm not sure how to handle:
/* I'm a multi-line
// comment */
|
Administrator
|
Allowing nested comments means parsing the content of comments, which opens Pandora's Box!
/* Save this code for later:
let str = "Comments end with */";
do Output.printString(str);
// Free the memory used by the ... end with */ message.
do str.dispose();
*/
I experimentally determined that JackCompiler terminates /* comments with the first occurrence of */.
/* and // are ignored if they occur in comments.
And then there's Pascal with two types of block comments, { } and (* *).
Turbo/Borland/Delphi Pascal allows nesting of alternating types so you can use { } to comment out code that includes (* *) comments.
Standard Pascal (ISO/IEC 10206:1990) treats { and (* as lexically identical and disallows } or *) in the comment text.
Where a commentary shall be any sequence of characters and separations of lines, containing neither } nor *), the construct
( '{' | '(*' ) commentary ( '*)' | '}' )
shall be a comment if neither the { nor the (* occurs within a character-string or within a commentary.
NOTES
1 A comment may thus commence with { and end with *), or commence with (* and end with }.
2 The sequence (*) cannot occur in a commentary even though the sequence {) can.
Yuch, what was ISO thinking!?
--Mark
|
|
ivant wrote
The book doesn't specify how to handle nested comments. I.e.
/* is /* this */ valid? */
The book just states that
Comments are of the standard formats /* comment until closing */, /** API comment */, and // comment to end of line.
So, a conforming lexer may interpret the above as the comment /* is /* this */ followed by identifier valid, then ? (which I think is invalid punctuation in Jack), and then a closing comment, which is an error.
Handling nested comments is hard. If I remember correctly, for many years the C and C++ compilers didn't handle them. Nested comments are handy when you want to comment out a piece of code, e.g.:
do Output.printString("Hello, ");
/* I'm very important explanation about the following line */
do Output.printString("World?");
So if you just enclose it in comments it will look like:
/*
do Output.printString("Hello, ");
/* I'm very important explanation about the following line */
do Output.printString("World?");
*/
And, as we saw, that will probably fail. A rule of thumb is to use EOL comments instead, because they can be nested in multi-line comments:
/*
do Output.printString("Hello, ");
// I'm very important explanation about the following line
do Output.printString("World?");
*/
I suppose there's a simple way to handle this, and that's to keep looking for leading /* comment identifiers even while in a comment and to increment a tracking variable (let's call it commentNestLevel) that lets you know how deeply you are nested within a comment. You would then decrement commentNestLevel whenever a matching */ comment closure is found and terminate comment tracking when commentNestLevel is back to being 0.
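A minimal sketch of that counter, assuming the whole source is in one string and the scanner is positioned on an opening /*. The name `skip_nested_comment` is made up for this example. Note that it counts every /* and */ it meets, so a */ inside a quoted string or a // line within the commented-out code still closes a level, which is the problem raised earlier in the thread.

```python
def skip_nested_comment(source: str, start: int) -> int:
    """Return the index just past the */ that matches the /* at `start`."""
    assert source.startswith("/*", start)
    comment_nest_level = 0
    i = start
    while i < len(source):
        if source.startswith("/*", i):
            comment_nest_level += 1         # one level deeper
            i += 2
        elif source.startswith("*/", i):
            comment_nest_level -= 1         # one level closed
            i += 2
            if comment_nest_level == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated /* comment")


text = "/* is /* this */ valid? */ do something();"
print(text[skip_nested_comment(text, 0):])
```

With nesting enabled, the whole of `/* is /* this */ valid? */` is consumed as one comment and only the statement after it remains.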
ivant wrote
BTW, here is an interesting corner case, which I'm not sure how to handle:
/* I'm a multi-line
// comment */
I'd imagine this is trivial if your compiler doesn't waste time on // comment identifiers while already in a comment. That's assuming that you want to give precedence to comments until closure over comments until end of line.
If there's something I'm missing here I'd like to know. I've been on hiatus for a while because life took over but I'm back to working on this chapter and hopefully finishing the book soon.
|
Administrator
|
The supplied Jack compiler treats "/*" as the beginning of a comment. All text following the "/*" is part of the comment and is not parsed, including "//" and "/*". The comment ends with the first occurrence of "*/".
Similarly, it treats "//" as the beginning of a comment that ends with the first occurrence of \n (newline). The contents of the comment, including "/*", are not parsed.
This is the closest thing that we have for a definitive specification for how comments are to be handled in Jack.
--Mark
|
|