How to fully parse everything in an XML document?

  parsing, php, xml, xml-parsing

This question has been asked many times, but unfortunately none of the posted answers work for me.

I am trying to parse custom XML for documentation that has its own DTD and such. My goal is to generate HTML documentation from the XML markup of the documentation. The XML is given and cannot be modified, for all practical purposes.

Generating the HTML is easy – getting the XML into a program so that I can work with it seems to be the challenging part here. I’ve tried many different techniques, and they all seem to fail in some case or another.

  • PHP’s SimpleXML parser does not natively expose child attributes (and a lot of other things)
  • SimpleXML combined with json_encode/json_decode cannot handle child nodes that contain attributes
  • This solution I’ve found is the only one that can handle child nodes with attributes, but it doesn’t honor CDATA and basically butchers the entire file
  • Simply casting to an array seems reasonable, but it also fails to handle child nodes that contain attributes
  • DOMDocument gets totally confused and just generates a bunch of formatted plain text
  • Other general issues include taking child nodes out of context inappropriately. Using CDATA mostly helps with this, but the solutions that handle CDATA correctly don’t handle the other cases.

I was intending to parse the XML into an array, which is theoretically possible, but so far I have not been able to do this successfully.

The XML is 32,000 lines, approximately. The requirement is that I need to capture everything. This includes all attributes of all nodes and all content of all nodes. This includes capturing CDATA literally. Surprisingly, every major parsing solution excludes something.

Short of writing a custom program specifically to parse this particular XML, is there a solution or way to reliably capture everything into an array (or some mechanism that would allow iterating through the whole thing)?
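For illustration, here is a minimal sketch of one way to "capture everything" with PHP's bundled DOM extension: walk the tree recursively and store every node, in document order, in a nested array. The helper name nodeToArray and the array layout (type, name, attributes, children) are assumptions made for this sketch, not something taken from an existing library.

```php
<?php
// Hypothetical helper (not from the question): convert a DOM node into a nested
// array that keeps element names, all attributes, text, and CDATA.
function nodeToArray(DOMNode $node): array
{
    // DOMCdataSection extends DOMText, so it has to be tested first.
    if ($node instanceof DOMCdataSection) {
        return ['type' => 'cdata', 'value' => $node->data];
    }
    if ($node instanceof DOMText) {
        return ['type' => 'text', 'value' => $node->data];
    }
    if (!($node instanceof DOMElement)) {
        // Comments, processing instructions, etc.: keep the raw value.
        return ['type' => 'other', 'name' => $node->nodeName, 'value' => (string) $node->nodeValue];
    }

    $item = ['type' => 'element', 'name' => $node->nodeName, 'attributes' => [], 'children' => []];
    foreach ($node->attributes as $attr) {
        $item['attributes'][$attr->nodeName] = $attr->nodeValue;
    }
    // Children stay in an ordered list, so mixed content keeps its order.
    foreach ($node->childNodes as $child) {
        $item['children'][] = nodeToArray($child);
    }
    return $item;
}

// Small stand-in document; the real file would be read with $doc->load($path).
$xml = <<<'XML'
<variable name="RELOADSTATUS">
    <value name="SUCCESS">Specified module(s) reloaded successfully.</value>
    <value name="FAILURE"><![CDATA[Some <literal>modules</literal> failed.]]></value>
</variable>
XML;

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false; // drop whitespace-only text nodes between elements
$doc->loadXML($xml);

$tree = nodeToArray($doc->documentElement);
echo $tree['children'][0]['attributes']['name'], "\n";
echo $tree['children'][1]['children'][0]['type'], "\n";
```

Storing children as an ordered list, rather than keying them by tag name, is what lets repeated tags and text mixed with inline elements survive the conversion.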

Here is the full XML file for reference:

I’ll point out a few things:

  • I’m preprocessing the file by adding CDATA around certain tags:
$xmlFile = str_replace("<literal>", "<![CDATA[<literal>", $xmlFile);
$xmlFile = str_replace("</literal>", "</literal>]]>", $xmlFile);
$xmlFile = str_replace("<replaceable>", "<![CDATA[<replaceable>", $xmlFile);
$xmlFile = str_replace("</replaceable>", "</replaceable>]]>", $xmlFile);

This is because the end goal is simply to replace these with <span> or <b> or <code> or something like that, and I don’t want these particular nodes parsed as XML. Easy enough. That also requires that CDATA be honored, however.
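As a rough check of whether CDATA is honored, the sketch below applies the same str_replace preprocessing to a one-line sample and inspects the node types that DOMDocument reports. The sample string and the $parts array are made up for this illustration; the assumption is that the document is loaded without the LIBXML_NOCDATA flag, which would merge CDATA into ordinary text.

```php
<?php
// Sample input, shortened from the documentation XML in the question.
$xmlFile = '<para>Use <literal>chan_iax2</literal> to reload IAX2.</para>';

// Same preprocessing as in the question: hide <literal> from the parser.
$xmlFile = str_replace("<literal>", "<![CDATA[<literal>", $xmlFile);
$xmlFile = str_replace("</literal>", "</literal>]]>", $xmlFile);

$doc = new DOMDocument();
$doc->loadXML($xmlFile); // no LIBXML_NOCDATA, so CDATA sections stay separate nodes

$parts = [];
foreach ($doc->documentElement->childNodes as $child) {
    // XML_CDATA_SECTION_NODE marks content that arrived inside <![CDATA[ ... ]]>.
    $kind = $child->nodeType === XML_CDATA_SECTION_NODE ? 'cdata' : 'text';
    $parts[] = [$kind, $child->nodeValue];
}

foreach ($parts as [$kind, $value]) {
    echo $kind, ': ', $value, "\n";
}
```

If the CDATA part shows up as its own node with the tags intact, the later replacement with <span>, <b>, or <code> can be done on that string alone.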

  • Here is an example of XML that usually fails to parse properly in most solutions:
<application name="Reload" language="en_US">
    <synopsis>
        Reloads an Asterisk module, blocking the channel until the reload has completed.
    </synopsis>
    <syntax>
        <parameter name="module" required="false">
            <para>The full name(s) of the target module(s) or resource(s) to reload.
            If omitted, everything will be reloaded.</para>
            <para>The full names MUST be specified (e.g. <literal>chan_iax2</literal>
            to reload IAX2 or <literal>pbx_config</literal> to reload the dialplan.</para>
        </parameter>
    </syntax>
    <description>
        <para>Reloads the specified (or all) Asterisk modules and reports success or failure.
        Success is determined by each individual module, and if all reloads are successful,
        that is considered an aggregate success. If multiple modules are specified and any
        module fails, then FAILURE will be returned. It is still possible that other modules
        did successfully reload, however.</para>
        <para>Sets <variable>RELOADSTATUS</variable> to one of the following values:</para>
        <variablelist>
            <variable name="RELOADSTATUS">
                <value name="SUCCESS">
                    Specified module(s) reloaded successfully.
                </value>
                <value name="FAILURE">
                    Some or all of the specified modules failed to reload.
                </value>
            </variable>
        </variablelist>
    </description>
</application>

The parsing failure is that SUCCESS and FAILURE are nowhere to be found in the parsed array! This seems to be because most XML parsers ignore attributes in leaf nodes.
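One possible explanation, offered as an assumption rather than a diagnosis of the exact code used: with SimpleXML the attributes are usually still present on the object, and it is the array cast or json_encode round trip that collapses a text-only element to a plain string. The sketch below reads the name attributes directly from the SimpleXMLElement; the $values array is a hypothetical structure for this example.

```php
<?php
$xml = simplexml_load_string(
    '<variable name="RELOADSTATUS">'
    . '<value name="SUCCESS">Specified module(s) reloaded successfully.</value>'
    . '<value name="FAILURE">Some or all of the specified modules failed to reload.</value>'
    . '</variable>'
);

// Round trip through JSON, as in many posted answers; dump it to compare.
$asArray = json_decode(json_encode($xml), true);

// Direct access: attributes of leaf elements are read with array syntax.
$values = [];
foreach ($xml->value as $value) {
    $values[(string) $value['name']] = trim((string) $value);
}

print_r($asArray);
print_r($values);
```

Comparing the two dumps shows whether the loss happens in the parser or in the conversion step.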

  • Another likely requirement is that leaf nodes which contain only text, and whose parent contains other text, should not be parsed as separate elements. For example, in the XML above, notice that the variable tag is used in two ways: as an inline formatter similar to literal and replaceable, and as a node type of its own, as in variablelist.

  • The solution needs to be contained within a single script (but I would be okay with installing Debian packages). I’m most familiar with how to do this kind of thing in PHP, but I am open to other tools, especially if they are POSIX portable.
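The inline-formatting requirement above can be sketched with DOM alone, without the CDATA preprocessing: render the children of a para in document order and map inline tags to HTML. The function renderInline and the tag mapping passed to it are hypothetical choices for this example; whether variable counts as inline is decided here purely by being called on a para, which is an assumption about the documentation format.

```php
<?php
// Hypothetical helper: render the mixed content of one element as HTML.
// $inlineTags maps documentation tag names to HTML tag names.
function renderInline(DOMNode $node, array $inlineTags): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $inner = renderInline($child, $inlineTags);
            if (isset($inlineTags[$child->nodeName])) {
                $tag = $inlineTags[$child->nodeName];
                $html .= "<$tag>" . $inner . "</$tag>";
            } else {
                $html .= $inner; // unknown element: keep its text only
            }
        } elseif ($child instanceof DOMText) {
            // Covers both plain text and CDATA sections.
            $html .= htmlspecialchars($child->nodeValue);
        }
    }
    return $html;
}

// Example mapping; the HTML tags chosen here are arbitrary.
$inlineTags = ['literal' => 'code', 'replaceable' => 'i', 'variable' => 'b'];

$doc = new DOMDocument();
$doc->loadXML('<para>Sets <variable>RELOADSTATUS</variable> to one of the following values:</para>');

$html = '<p>' . renderInline($doc->documentElement, $inlineTags) . '</p>';
echo $html, "\n";
```

A variable element that sits under variablelist would be handled by separate code that reads its name attribute, so the two uses of the tag never collide.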

Ultimately, I’m not looking for the most elegant solution or output, but something that will at least work and fully capture everything. I seem to have exhausted the built-in PHP tools and common answers – any suggestions on how to approach this?

Source: Ask PHP