customr

the customr development blog - notes and musings on web development

Consuming XML, fast, with PHP and XMLReader

Sunday, May 3rd, 2009

Let’s face it, XML isn’t the lightest of data serialisation formats out there. Consider and compare this:

<alternate_description>something else</alternate_description>

against this, in JSON:

{ "alternate_description": "something else" }

Those repetitive XML tags are really just extra bytes to download and parse. Unfortunately, we sometimes have to consume huge gobs of XML for a project, and for that we have XMLReader, the lesser-known cousin of SimpleXML.

Unlike SimpleXML, which loads the entire document into memory before making it available for parsing, XMLReader “acts as a cursor going forward on the document stream and stopping at each node on the way” (php.net/xmlreader). It is kind of like a line-by-line CSV parser, but acting on the nodes of an XML document.
Choosing the right XML parser for the job is very important: the wrong choice can lead to avoidable performance issues on your server.
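
To make the cursor idea concrete, here is a minimal sketch. The inline sample data is made up for illustration; for a real file you would call open() with a path instead of XML().

```php
<?php
// Hypothetical sample data, shaped like the document described in this post.
$xml = '<persons>'
     . '<person><name>John</name></person>'
     . '<person><name>Jane</name></person>'
     . '</persons>';

$reader = new XMLReader();
$reader->XML($xml); // for a file on disk, use $reader->open($path) instead

$names = array();

// read() moves the cursor forward one node at a time.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
        // readString() returns the text content of the current element.
        $names[] = $reader->readString();
    }
}
$reader->close();

print_r($names);
```

The loop sees every node type (elements, text, end tags), which is why it filters on nodeType before looking at the name.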

To illustrate this, I pointed both SimpleXML and XMLReader at the same 190MB XML document via a PHP shell script, ran two tests on each extension and profiled the results. Test one found a node at the start of the file; test two found a node at the end of the file.

The XML document in question is a standard XML document containing 21467 records. It looks something like this:

<persons>
   <person>
       <name>John</name>
       <!-- other nodes -->
   </person>
   <!-- 21466 more person nodes -->
</persons>

Peak memory usage is measured by the “top” command (%MEM).
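
The test script itself isn't reproduced here, but a lookup along these lines could be sketched as follows. The function names and the tiny stand-in file are assumptions for illustration, and timing is done with microtime() only; memory was measured externally with top, as noted above.

```php
<?php
// Hypothetical helpers; this is not the script used to produce the results.

// SimpleXML: parse the whole file first, then search the resulting tree.
function findWithSimpleXml($path, $wanted)
{
    $doc = simplexml_load_file($path);
    $position = 0;
    foreach ($doc->person as $person) {
        $position++;
        if ((string) $person->name === $wanted) {
            return $position;
        }
    }
    return null;
}

// XMLReader: stream the file and stop at the first match.
function findWithXmlReader($path, $wanted)
{
    $reader = new XMLReader();
    $reader->open($path);
    $position = 0;
    $found = null;
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        if ($reader->name === 'person') {
            $position++;
        } elseif ($reader->name === 'name' && $reader->readString() === $wanted) {
            $found = $position;
            break;
        }
    }
    $reader->close();
    return $found;
}

// Small stand-in file so the sketch runs anywhere.
$path = tempnam(sys_get_temp_dir(), 'persons');
file_put_contents(
    $path,
    '<persons><person><name>John</name></person><person><name>Jane</name></person></persons>'
);

$results = array();
foreach (array('findWithSimpleXml', 'findWithXmlReader') as $finder) {
    $start = microtime(true);
    $results[$finder] = $finder($path, 'Jane');
    printf("%s: record %d in %.5f seconds\n", $finder, $results[$finder], microtime(true) - $start);
}
unlink($path);
```

Both functions return the 1-based position of the matching record, so the two approaches can be checked against each other before comparing timings.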

SimpleXML

Test One:
Nodes: 1
Peak Memory Usage: 18%
Processed 190MB of XML in 3.14164 seconds

Test Two:
Nodes: 21467
Peak Memory Usage: 18%
Processed 190MB of XML in 3.20796 seconds

XMLReader

Test One:
Nodes: 1
Peak Memory Usage: 0.3%
Processed 190MB of XML in 0.00128 seconds

Test Two:
Nodes: 21467
Peak Memory Usage: 0.7%
Processed 190MB of XML in 16.4478 seconds

These results really give an indication of the different uses of both extensions.

XMLReader flew through finding the first element in no time at all, while SimpleXML took about the same time to find the first and the last element. The big difference is memory: XMLReader peaked at 0.3% to 0.7% against SimpleXML's 18%, roughly 25 to 60 times less.
Understandably, XMLReader took a lot longer to find the last node, as it had to process each node in the document until it found a match. A seek() method on the XMLReader class would obviously be useful here to skip unwanted nodes.
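
There is no seek(), but XMLReader does have a next() method that moves to the following node without stepping into the current node's children. A minimal sketch, assuming made-up sample data (the address and city elements are not part of the test document): it hops from one person element to the next instead of calling read() on every child. The parser still has to get through the bytes in between, so I'd treat any speed-up as something to measure rather than assume.

```php
<?php
// Hypothetical sample data with some nested nodes we want to skip over.
$xml = '<persons>'
     . '<person><name>John</name><address><city>Leeds</city></address></person>'
     . '<person><name>Jane</name><address><city>York</city></address></person>'
     . '</persons>';

$reader = new XMLReader();
$reader->XML($xml);

// Walk down to the first <person> element with read().
$onPerson = false;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'person') {
        $onPerson = true;
        break;
    }
}

$count = 0;
while ($onPerson) {
    $count++;
    // next('person') skips this record's subtree and stops at the next
    // node named "person"; it returns false when there are none left.
    $onPerson = $reader->next('person');
}
$reader->close();

echo $count, " person records\n";
```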

Use cases

For simple parsing such as RSS feed handling and small XML documents, SimpleXML is definitely the way to go. Its easy access to document nodes is a great advantage.
For larger document importing, XMLReader wins hands-down due to its ability to read the document node by node with limited impact on system memory. In fact, you can parse XML documents with XMLReader that are larger than the available system memory.

One final tip: avoid building large data structures while processing large XML documents with XMLReader, as it defeats the purpose of using XMLReader in the first place. Just grab the data needed to perform an operation and skip to the next iteration.
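
One way to follow that tip is to handle a single record at a time and keep only running results. The sketch below is an assumption-laden illustration, not a recipe: eachPerson() is a hypothetical helper and the sample data is made up. It turns each person into a small SimpleXML object via readOuterXml(), passes it to a callback, and then moves on.

```php
<?php
// Hypothetical helper: hand each <person> record to $handle, one at a time.
function eachPerson(XMLReader $reader, $handle)
{
    // Move to the first <person> element.
    $more = false;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'person') {
            $more = true;
            break;
        }
    }
    while ($more) {
        // Only this one record is turned into a SimpleXML object.
        $handle(simplexml_load_string($reader->readOuterXml()));
        $more = $reader->next('person');
    }
}

// Hypothetical sample data.
$xml = '<persons>'
     . '<person><name>John</name></person>'
     . '<person><name>Alexandra</name></person>'
     . '<person><name>Jane</name></person>'
     . '</persons>';

$reader = new XMLReader();
$reader->XML($xml);

$count = 0;
$longest = '';
eachPerson($reader, function ($person) use (&$count, &$longest) {
    // Keep running values rather than an array of every record.
    $count++;
    $name = (string) $person->name;
    if (strlen($name) > strlen($longest)) {
        $longest = $name;
    }
});
$reader->close();

echo "$count records, longest name: $longest\n";
```

Each record gets SimpleXML's convenient property access, while memory use stays tied to the size of one record rather than the whole file.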

Other Resources