1) You dont want to eliminate whitespace (spaces, tabs, newlines) before tokenizing. You do want to skip whitespace in Tokenizer.advance().
For the sequence "static boolean test;" your tokenizer should return
keyword "static"
keyword "boolean"
identifier "test"
symbol ";"
2)
In my compiler, Analyser creates a new CompilationEngine, passing it the names of the source and VM files, and optionally the XML file. CompilationEngine creates a new Tokenizer passing it the name of the source file.
Tokenizer open the file and reads all the lines into a string array. advance does something like
while not end-of lines
if current line.length == 0
current line = next line
skipWhitespace() // trims whitespace from front of currentLine
if currentLine[0] is alpha or '_'
parseKeywordOrIdent(); // parses name, sets up Tokenizer return values, removes from currentLine
else if currentLine[0] is numeric
parseNumber();
...
3) I don't use regex. Python 2.7 if fine. My compiler runs without modification on 2.7 and 3.x
4) Array is an identifier, as are all other class and variable names.
--Mark