Really struggling with the compilation engine

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Really struggling with the compilation engine

riverfish
I implemented the Syntax Analyzer no problem. But I'm trying to figure out the best way to implement the compilation engine.

What I've been trying to do is compile the class, then within the compileClass function, call compileClassVarDec and compileSubroutineDec, then within that call compileSubroutineBody and  keep calling into you've reached a terminal end.

That strategy seemed to be working well until I got to the statements part.

I'm having lots of trouble here because it's possible to nest statements. So trying to find when a block of statements starts and finishes (i.e. between {    }) becomes hard when you have nested statements which also have {       }.

It got to the point where I was having to use a lot of complex nested logic to skip certain sub statements in order to reach the end of the main statement block.

Which got me thinking, is there a simpler way to do this?

I'm finding this project 10x harder than all the previous ones. I suspect my approach may be wrong here.
Reply | Threaded
Open this post in threaded view
|

Re: Really struggling with the compilation engine

riverfish
Responding to my own thread cos this is really doing my head in, but just had an interesting thought!!


Is there any point to the JackTokenizer at all?

I feel it'd be easier to feed an entire .jack file as one massive string to the Compiler and then using regex to match different parts like statements, ifStatements, whileStatements, subroutineBody, etc etc.

And then within each subsection you can compile straight to XML. Thoughts about this??
Reply | Threaded
Open this post in threaded view
|

Re: Really struggling with the compilation engine

ivant
If you successfully implemented the syntax analyzer, you've already solved this problem. The code generation "just" generates relevant code instead of XML.

In any case, don't use regular expressions to parse arbitrary nested structures. They really aren't good for that.
Reply | Threaded
Open this post in threaded view
|

Re: Really struggling with the compilation engine

riverfish
Not really catching your drift. Maybe I rephrased it wrong. I meant to say I've finished the Tokenizer part, but now struggling with the Parser bit.
Reply | Threaded
Open this post in threaded view
|

Re: Really struggling with the compilation engine

cadet1620
Administrator
This is the beauty of the recursive descent parser/compiler. It takes care of the nesting structure without needing to do any parsing for balanced delimiters. Consider how hard it would be to parse something like
{ do Output.printString("}); do x}");}
using a linear parser like regex.

You should have functions in your JackAnalyzer for each production in the syntax definition.

The call tree for parsing statement-list should look like
CompileStatementList
  CompileStatement    simple statement like 'let'
    CompileLet
      *
  CompileStatement    statement with nested statement-list
    CompileIf
      CompileExpression
        *
      CompileStatementList
        CompileStatement
          *
        ...
  CompileStatement
    *
  ...

*  Deeper recursive calls.

You can add more CompileXxx that handle common sequences. For instance I have CompileStatementBlock that deals with "{ statement-list }" in if and while statements.
def _CompileStatementBlock(self):
    """
    Compiles <statement-block> := '{' <statement-list> '}'

    ENTRY: Tokenizer positioned on '{'.
    EXIT:  Tokenizer positioned after '}'.
    """
    self._ExpectSymbol('{')
    self._WriteXml('symbol', self.tokenizer.Symbol())
    self._NextToken()

    self._CompileStatementList()

    self._ExpectSymbol('}')
    self._WriteXml('symbol', self.tokenizer.Symbol())
    self._NextToken()

--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Really struggling with the compilation engine

riverfish
Cadet I'm curious to see how you implemented this part:
    ENTRY: Tokenizer positioned on '{'.
    EXIT:  Tokenizer positioned after '}'

I get the general gist of how it should work, my problem is trying to determine the start and end of a block of statements when statements are nested within each other.

Take for example, this function:

 method void run() {
      var char key;
      var boolean exit;

      let exit = key;
      while (exit) {
         while (key) {
            let key = key;
            do moveSquare();
         }

         if (key) { let exit = exit; }
         if (key) { do square.decSize(); }
         if (key) { do square.incSize(); }
         if (key) { let direction = exit; }
         if (key) { let direction = key; }
         if (key) { let direction = square; }
         if (key) { let direction = direction; }

         while (key) {
            let key = key;
            do moveSquare();
         }
      }
      return;
    }


If we take a statement block as anything after a '{' and before a '}', then when the compileStatements function meets the first while statement, it should call compileStatements again on everything between '{' and '}' like so:


while (exit) {
<-- call compileStatements on all code below this line -->
   while (key) {
            let key = key;
            do moveSquare();
         }

         if (key) { let exit = exit; }
         if (key) { do square.decSize(); }
         if (key) { do square.incSize(); }
         if (key) { let direction = exit; }
         if (key) { let direction = key; }
         if (key) { let direction = square; }
         if (key) { let direction = direction; }

         while (key) {
            let key = key;
            do moveSquare();
         }
      }
      return;
<----- --------------- --------->
}

The problem I'm finding though is that if you determine a new <statements> block as the first occurrence of '}' (closing parens) after a '{' (opening parens) then you will meet a problem when you try to compile the above code because it will only compile up to the first '}' and not to it's correct corresponding closing parens.

while (exit) {
<-- call compileStatements -->
while (key) {
            let key = key;
            do moveSquare();
         }
<--up to this line because it meets a closing parens ('}') --->
         if (key) { let exit = exit; }
         if (key) { do square.decSize(); }
         if (key) { do square.incSize(); }
         if (key) { let direction = exit; }
         if (key) { let direction = key; }
         if (key) { let direction = square; }
         if (key) { let direction = direction; }

         while (key) {
            let key = key;
            do moveSquare();
         }
      }
      return;
}


Therefore anything from if (key) { let exit = exit; } and below is not called by compileStatements.

Reply | Threaded
Open this post in threaded view
|

Re: Really struggling with the compilation engine

cadet1620
Administrator
[Note: my <statement-list> == the course's <statements>. I have been burnt way, way, way too many times over my career by symbol names that only vary by a trailing 's'!]

riverfish wrote
Cadet I'm curious to see how you implemented this part:
    ENTRY: Tokenizer positioned on '{'.
    EXIT:  Tokenizer positioned after '}'
This an entry/exit requirement for CompileStatementBlock.
ENTRY: says that on entry the tokenizer is expected to be on a '{' token.
EXIT: guarantees that when CompileStatemenBlock returns it has just seen its closing '}' token and has advanced the tokenizer.

Here's a bit of CompileIf:
self._ExpectSymbol('(')
self._WriteXml('symbol', self.tokenizer.Symbol())
self._NextToken()

self._CompileExpression()

self._ExpectSymbol(')')
self._WriteXml('symbol', self.tokenizer.Symbol())
self._NextToken()

self._CompileStatementBlock()

hasElse = self.tokenizer.TokenType() == TK_KEYWORD and \
          self.tokenizer.Keyword() == KW_ELSE

my problem is trying to determine the start and end of a block of statements when statements are nested within each other.
CompileStatementList only parses statements and they all start with a keyword. When it sees the keyword it calls CompileStatement which calls the appropriate CompileXxx which parses the entire statement including the final ';' or '}'. Then CompileStatementList checks to see if there is another keyword. If it finds something that isn't a keyword it returns so that whoever called it can deal with that token. CompileStatement and CompileStatementList never see '{' or '}' except as "something that isn't a statement keyword".

Assuming that you don't have CompileStatementBlock, and the { } are handled in CompileWhile and CompileIf, here's what's handled at each recursion level and how it matches up the { }.

while (true) { while (false) { if (true) { return; } } }

CompileStatementList
  CompileStatement
    CompileWhile
      CompileExpression
        ...
    CompileWhile
      CompileStatementList
        CompileStatement
          CompileWhile
            CompileExpression
              ...
          CompileWhile
            CompileStatementList
              CompileStatement
                CompileIf
                  CompileExpression
                    ...
                CompileIf
                  CompileStatementList
                    CompileStatement
                      Compilereturn
                CompileIf
          CompileWhile
    CompileWhile


--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Really struggling with the compilation engine

riverfish
Hey Cadet,

I have something similar to CompileStatementBlock.

My problem lies in trying to determine when to return CompileStatementBlock.

This an entry/exit requirement for CompileStatementBlock.
ENTRY: says that on entry the tokenizer is expected to be on a '{' token.
EXIT: guarantees that when CompileStatemenBlock returns it has just seen its closing '}' token and has advanced the tokenizer.
This is the part that I'm struggling with - how does CompileStatementBlock know it has seen its closing '}'?

Reply | Threaded
Open this post in threaded view
|

Re: Really struggling with the compilation engine

cadet1620
Administrator
riverfish wrote
Hey Cadet,

I have something similar to CompileStatementBlock.

My problem lies in trying to determine when to return CompileStatementBlock.

This is the part that I'm struggling with - how does CompileStatementBlock know it has seen its closing '}'?
def _CompileStatementBlock(self):
    """
    Compiles <statement-block> := '{' <statement-list> '}'
    
    Nested Scope option inserts <var-dec>* after '{'

    ENTRY: Tokenizer positioned on '{'.
    EXIT:  Tokenizer positioned after '}'.
    """
    self._ExpectSymbol('{')
    self._WriteXml('symbol', self.tokenizer.Symbol())
    self._NextToken()

    if self.options['scopeExt']:
        # enter the subscope
        self.symbolTable = self.symbolTable.Subscope()
        
        self._CompileScopeDec(zero=True)
    self._CompileStatementList()
    
    self._ExpectSymbol('}')
    if self.options['scopeExt']:
        # exit the subscope
        self._WarnUnrefs('subscope')
        self.symbolTable = self.symbolTable.Superscope()

    self._WriteXml('symbol', self.tokenizer.Symbol())
    self._NextToken()
    
        
def _ExpectSymbol(self, symbols):
    """
    Parse the current token.  It is expected to be one of 'symbols'.
    'symbols' is a string of one or more legal symbols.

    Returns the symbol parsed or raises an error.
    """
    if not self.tokenizer.TokenType() == TK_SYMBOL:
        self._RaiseError('Expected '+self._SymbolStr(symbols)+', got '+
                         self.tokenizer.TokenTypeStr())
    if self.tokenizer.Symbol() in symbols:
        return self.tokenizer.Symbol()
    self._RaiseError('Expected '+self._SymbolStr(symbols)+', got '+
                     self._SymbolStr(self.tokenizer.Symbol()))

The ExpectSymbol calls are where CompileStatementBlock handles '{' and '}'. If the current token is not the expected token, it raises an exception which prints an error message and aborts the compile.

(I have similar ExpectKeyword and ExpectIdentifier routines, also IsSymbol, etc. routines that return true/fasle instead of raising errors.)

After CompileIf has called CompileExpression, the current token will be ')'. It verifies this with ExpectSymbol, advances the tokenizer and verifies that the new token is '{' and then calls CompileStatementBlock.

CompileStatementBlock verifies the '{', advances the tokenizer and calls CompileStaementList.

CompileStatementList process all the statements in the block, recursively calling CompileXxx routines, potentially including CompileStatementList itself, that parse everything including matching '{' and '}'.

When CompileStatementList returns to CompileStatementBlock, the tokenizer will be on the '}' that matches the '{' that CompileStatementBlock started with. CompileStatementBlock verifies that it is there, advances the tokenizer and returns.

This is what makes recursive parsing work. It "magically" handles nested structures.


Perhaps it will help you understand recursive parsing by turning the problem inside out. Consider recursive printing. Here is a program that prints nested "hello world" messages. (Python 3; end='' suppresses new-lines.)
def hello_world(n):
    print('hello ', end='')
    if n > 1:
        print('{ ', end='')
        hello_world(n-1)
        print('} ', end='')
    print('world! ', end='')


hello_world(1)
print()
hello_world(3)

# Prints
# hello world!
# hello { hello { hello world! } world! } world!

I think it might be better to continue this by email.  I can share more code, you can send me your code to look at, etc.

--Mark