Tokenizer XML output - Why the extra spaces?

Gerrit0
After I finished the tokenizer and compared my output to the reference XML, I realized something strange. The tokenizer output provided includes an extra space before and after each symbol:
<tokens>
<keyword> class </keyword>
<identifier> Main </identifier>
...
Why does it do this? Yes, XML doesn't care about whitespace between tags, which enables more human-readable documents, but now when reading in the file I have to trim the text content of each token. Notably, I can't just call .trim(), .strip(), or the equivalent in whatever language I'm using, because string constants might contain trailing spaces, and a plain trim would remove those along with the padding. To parse the provided output I have to remove exactly one space from each end of the token text to get the content.
<stringConstant> HOW MANY NUMBERS?  </stringConstant>
(Note two spaces after the question mark.)
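A minimal sketch of the difference, assuming the element's text content has already been read into a string. The helper name strip_one_space is hypothetical and not part of any course material:

```python
def strip_one_space(text: str) -> str:
    """Remove exactly one leading and one trailing space, if present."""
    if text.startswith(" "):
        text = text[1:]
    if text.endswith(" "):
        text = text[:-1]
    return text

# Text content of the stringConstant element shown above.
raw = " HOW MANY NUMBERS?  "

print(repr(raw.strip()))           # 'HOW MANY NUMBERS?'  (the string's own trailing space is lost)
print(repr(strip_one_space(raw)))  # 'HOW MANY NUMBERS? ' (the string's own trailing space is kept)
```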

Re: Tokenizer XML output - Why the extra spaces?

code2win
I used a regular expression with lookahead and lookbehind to extract all values from the tags, including strings. Notice how I include the space in both the lookbehind and the lookahead: (lookbehind[space]) and ([space]lookahead).
In Python, using the 're' module from the standard library:
string_value = re.search(r"(?<=<stringConstant> ).+(?= </stringConstant>)", xml_entry)[0]
In this case only 1 space will be discarded from the beginning and end of the enclosed content. But you're right, you can't just call a trim on the edges, and you certainly can't strip a string of all its whitespace.
'.+' here means at least one character, preceded by the opening tag plus exactly one space and followed by exactly one space plus the closing tag. Any other spaces will be included in the extraction, since they belong to '.+' (which also matches whitespace characters).
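As a rough illustration of that behavior, here is the same pattern applied to the string constant from the first post. The variable names are only for this sketch, and it assumes one token per line:

```python
import re

# Pattern from the post: one space after the opening tag, one before the closing tag.
STRING_CONSTANT = re.compile(r"(?<=<stringConstant> ).+(?= </stringConstant>)")

xml_entry = "<stringConstant> HOW MANY NUMBERS?  </stringConstant>"

match = STRING_CONSTANT.search(xml_entry)
value = match[0] if match else None  # search() returns None when nothing matches
print(repr(value))  # 'HOW MANY NUMBERS? '
```

One thing to watch for: because '.+' needs at least one character, an empty string constant (written as two spaces between the tags) would not match at all, so the None check matters.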
Now might be a good time to learn regex (regular expressions) if you don't already know them. This part of the course is where I picked them up.
Also, if you are using an API to parse the XML, there may be some options to tinker with instead of doing what I did. I preferred a non-robust solution of just parsing the strings myself (using regex). And if you already have something in place to extract the XML entry, you could use the lookahead and lookbehind to get rid of exactly one space on each side:
"(?<= ).+(?= )"

Re: Tokenizer XML output - Why the extra spaces?

Gerrit0
code2win wrote
string_value = re.search(r"(?<= ).+(?= )", xml_entry)[0]
This is horrible. Python has a built-in XML parser. You should use it (or some library). Even ignoring this, regex is incredible overkill for stripping off two characters. I can basically guarantee it would have been faster to use the parser when initially developing.
Also, it would prevent the bug that you almost certainly have in your regex solution. The XML could be presented all on one line: <stringConstant> a </stringConstant><stringConstant> b </stringConstant> is perfectly valid. Unless you extract xml_entry by looking for opening/closing tags (in which case the regex also doesn't make sense: why not just get the text directly using indices? I'm willing to bet you are looping over lines), your regex will report string_value = "a </stringConstant><stringConstant> b". This is because the .+ match in your regex is greedy. Simply changing it to .+? fixes this.
import xml.etree.ElementTree as etree
from os.path import join, dirname

with open(join(dirname(__file__), './MainT.xml')) as f:
    tree = etree.parse(f)

for el in tree.getroot():
    text = el.text[1:-1] # <-- Just strip off the spaces! No need for regex.
    print(f'{el.tag} -> "{text}"')
This will print out something like:
keyword -> "class"
identifier -> "Main"
symbol -> "{"
keyword -> "function"
keyword -> "void"
identifier -> "main"
symbol -> "("
...
I already have a working parser, I was just curious why the authors decided to include the rather arbitrary spacing around the value of each token instead of directly outputting it.
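For anyone who wants to see the single-line case from earlier in this post in action, here is a small sketch. It uses the tag-anchored pattern from the first reply, and the variable names are made up for the example:

```python
import re

# Two tokens on one line, which is still well-formed XML.
line = "<stringConstant> a </stringConstant><stringConstant> b </stringConstant>"

greedy = r"(?<=<stringConstant> ).+(?= </stringConstant>)"
lazy = r"(?<=<stringConstant> ).+?(?= </stringConstant>)"

# Greedy '.+' runs to the last closing tag on the line.
print(re.search(greedy, line)[0])  # a </stringConstant><stringConstant> b

# Lazy '.+?' stops at the first closing tag, so each token is found separately.
print(re.findall(lazy, line))      # ['a', 'b']
```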

Re: Tokenizer XML output - Why the extra spaces?

code2win
Oh, the humanity!
Yeah, guys; what's with those spaces?!
No there's no bugs. It worked without a hitch. Project is done.
That's a cool solution. Props.

Re: Tokenizer XML output - Why the extra spaces?

WBahn
Administrator
In reply to this post by Gerrit0
Gerrit0 wrote
After I finished the tokenizer and compared my output to the reference XML, I realized something strange. The tokenizer output provided includes an extra space before and after each symbol:
<tokens>
<keyword> class </keyword>
<identifier> Main </identifier>
...
Why does it do this? Yes, XML doesn't care about whitespace between tags, which enables more human-readable documents, but now when reading in the file I have to trim the text content of each token. Notably, I can't just call .trim(), .strip(), or the equivalent in whatever language I'm using, because string constants might contain trailing spaces, and a plain trim would remove those along with the padding. To parse the provided output I have to remove exactly one space from each end of the token text to get the content.
<stringConstant> HOW MANY NUMBERS?  </stringConstant>
(Note two spaces after the question mark.)
Keep in mind that you don't have to write a program that reads the XML file. That output is just there to give you something to look at to see if your parser is parsing correctly. When you write the code generator, you can think of it as replacing the XML output with the VM output (or you can leave the XML output in place, since they don't interact and having that output can be useful for debugging).
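If the goal is only to produce output that can be compared against the reference files, a tokenizer might write each token along these lines. This is a sketch, not the course's own code: the function name token_line is hypothetical, and it assumes the reference files pad each value with one space and escape XML special characters.

```python
from xml.sax.saxutils import escape

def token_line(tag: str, value: str) -> str:
    """Format one token with a single space on each side of the value."""
    # escape() replaces &, < and > with their XML entities.
    return f"<{tag}> {escape(value)} </{tag}>"

print(token_line("keyword", "class"))  # <keyword> class </keyword>
print(token_line("symbol", "<"))       # <symbol> &lt; </symbol>
```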

Re: Tokenizer XML output - Why the extra spaces?

code2win
Yeah, keeping the XML around was really nice for the rest of the process.