Xerces v Flex
What is the fastest way to process and XML file? I was faced with this question when I recently wanted to process a 452GiB XML file; for this amount of data speed matters. Some obvious choices were XML libraries, hand-crafted code, and lexical analyzer generators.
I dismissed hand-crafted code, because the processing wasn't trivial, and reasoned I had better things to do with my time than reinventing the wheel. Lexical analyzer generators, like lex used to have a rotten reputation regarding their time performance. Programmers often wrote that they hand-crafted a lexical analyzer, for the sake of efficiency. However, I thought that flex advertised as a fast lexical analyzer generator would generate, well, fast code. Because I also had to process the text data inside the XML tags in a non-trivial way I decided to adopt flex both for processing the XML file and the text data.
Here are some representative flex rules I used:
%x TEXT
%%
"<title>"[^<:]*"</title>" title(yytext);
"<timestamp>"[^<]*"</timestamp>" timestamp(yytext);
"<username>"[^<]*"</username>" username(yytext);
"<ip>" ip();
"<page>" page_begin();
"<text xml:space=\"preserve\">" BEGIN(TEXT);
<TEXT>"</text>" BEGIN(INITIAL);
"</page>" page_end();
"<restrictions>" restrictions();
BEGIN(...)
)
for creating a specialized text scanner inside the XML scanner.
A golden rule of code optimization is to measure before arriving to conclusions. Although I adopted the flex approach to simplify development, I was curious to see how the time performance of this approach compared with the performance of an XML parsing library. I reasoned that flex's clever generated code optimized for a specific set of XML tags would surely outperform a general-purpose XML parser that would at the very least also have to check that the document was well-formed. I downloaded the apache Xerces-C++ XML Parser and compared its performance against the flex code by running the SAXCount sample (with validation disabled) against the first five million lines of my document (processing about 150000 XML tags).
The results surprised me. All three flex performance times
real 23.411s user 11.232s sys 1.028swere inferior to those of Xerces.
real 18.303s user 8.355s sys 0.523sMy conclusion is that Xerces is very good, and that flex is not really that fast; clearly there is ample scope for developing lexical analyzers that create faster code. Hats off to the Xerces team! Read and post comments