code2win wrote
string_value = re.search(r"(?<= ).+(?= )", xml_entry)[0]
This is horrible. Python has a built-in XML parser. You should use it (or some lib). Even ignoring this, regex is complete overkill for stripping off two characters. I can basically guarantee it would have been faster to use the parser when initially developing.
Also - it would prevent the bug that you almost certainly have in your regex solution. The XML could be presented all in one line,
<stringConstant> a </stringConstant><stringConstant> b </stringConstant>
is perfectly valid... but unless you extract xml_entry by looking for opening/closing tags (in which case the regex also doesn't make sense - why not just get the text directly using indices? I'm willing to bet you are looping over lines), your regex will report
string_value = "a </stringConstant><stringConstant> b"
This is because the ".+" match in your regex is greedy. Simply changing it to ".+?" fixes this.
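A quick sketch to check the greedy vs. lazy behaviour, using only the standard re module. The variable names (xml_line, spaced, and so on) are just illustrative and not from the original code. Note the last case: the lazy version stops at the first space, so it only rescues this particular input and will truncate a string constant that itself contains spaces.

```python
import re

# The one-line input from the example above.
xml_line = "<stringConstant> a </stringConstant><stringConstant> b </stringConstant>"

# Greedy: .+ runs to the LAST position that is followed by a space.
greedy = re.search(r"(?<= ).+(?= )", xml_line)[0]

# Lazy: .+? stops at the FIRST position that is followed by a space.
lazy = re.search(r"(?<= ).+?(?= )", xml_line)[0]

# Caveat: a value with an inner space gets cut short by the lazy pattern.
spaced = "<stringConstant> hello world </stringConstant>"
lazy_spaced = re.search(r"(?<= ).+?(?= )", spaced)[0]

print(repr(greedy))
print(repr(lazy))
print(repr(lazy_spaced))
```

So the lazy quantifier avoids the cross-element match here, but it is not a general replacement for actually parsing the XML.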
import xml.etree.ElementTree as etree
from os.path import join, dirname

# Parse the token file that sits next to this script.
with open(join(dirname(__file__), './MainT.xml')) as f:
    tree = etree.parse(f)

# Each child of the root is one token element.
for el in tree.getroot():
    text = el.text[1:-1]  # <-- Just strip off the spaces! No need for regex.
    print(f'{el.tag} -> "{text}"')
This will print out something like:
keyword -> "class"
identifier -> "Main"
symbol -> "{"
keyword -> "function"
keyword -> "void"
identifier -> "main"
symbol -> "("
...
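For what it's worth, the same approach also copes with the everything-on-one-line case from earlier. This is only a sketch: the <tokens> wrapper element and the sample values are my assumption for illustration, not taken from a real file.

```python
import xml.etree.ElementTree as etree

# Hypothetical one-line input; the <tokens> root element is assumed here.
one_line = (
    "<tokens>"
    "<stringConstant> a </stringConstant>"
    "<stringConstant> hello world </stringConstant>"
    "</tokens>"
)

root = etree.fromstring(one_line)

# Drop exactly one leading and one trailing space from each token's text.
values = [el.text[1:-1] for el in root]
print(values)
```

Slicing off one character at each end keeps any spaces inside the value, which the regex versions cannot do reliably.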
I already have a working parser, I was just curious why the authors decided to include the rather arbitrary spacing around the value of each token instead of directly outputting it.