Get XML text that is not inside tags and split it into an array keyed by the tag in front of each piece

Here's my XML content:

<paragraph>
    <textInfo number="1" />Example text one.<textInfo number="2"/>Example text two
</paragraph>

I would like to parse it and create an array like this:

$array = array(
    1 => "Example text one",
    2 => "Example text two"
);

I tried this:

$xml = simplexml_load_file($file);
var_dump(explode("<textInfo/>", $xml));

The result was an array with only one element, so explode() apparently doesn't see the XML tags:

array(1) {
  [0]=>
  string(37) "
    Example text one.Example text two
"
}

I also tried this, but it gives me only two empty objects:

$paragraphs = $xml->xpath('//textInfo');

Can you suggest a solution, please?

Answers 1

  • explode() casts the paragraph SimpleXMLElement to a string. That cast returns only the text content, so the <textInfo/> tags are no longer there to split on.

    $xml = <<<'XML'
    <paragraph>
        <textInfo number="1" />Example text one.<textInfo number="2"/>Example text two
    </paragraph>
    XML;
    $p = simplexml_load_string($xml);
    var_dump($p->getName(), (string)$p);
    

    Output:

    string(9) "paragraph"
    string(39) "
        Example text one.Example text two
    "
    

    You can use text() in XPath expressions to address text nodes. However, this does not seem to work with SimpleXML; it returns the parent element node instead:

    $p = simplexml_load_string($xml);
    $text = $p->xpath('/paragraph/text()')[0];
    var_dump($text->getName(), (string)$text);
    

    Output:

    string(9) "paragraph"
    string(39) "
        Example text one.Example text two
    "
    
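    The elements returned by xpath('//textInfo') in the question are not actually empty: they carry the number attribute, which SimpleXML exposes through array syntax. A minimal sketch, assuming the same sample XML ($numbers is just an illustrative variable name):

    ```php
    <?php
    $xml = <<<'XML'
    <paragraph>
        <textInfo number="1" />Example text one.<textInfo number="2"/>Example text two
    </paragraph>
    XML;

    $p = simplexml_load_string($xml);

    $numbers = [];
    foreach ($p->xpath('//textInfo') as $info) {
        // Attributes are read with array syntax; cast to get a plain value.
        $numbers[] = (int) $info['number'];
    }
    var_dump($numbers);
    ```

    This gives you the keys, but the text between the elements is still out of reach this way.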

    So you might need to use DOM. In DOM, everything is a node, which lets you get the separate text nodes:

    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);
    
    foreach ($xpath->evaluate('/paragraph/text()') as $text) {
        var_dump($text->textContent);
    }
    

    Output:

    string(5) "
        "
    string(17) "Example text one."
    string(17) "Example text two
    "
    

    The first text node in this example is the line break and the indentation before the first <textInfo/>. Here is a way to detect and skip whitespace-only text nodes like that:

    $lines = [];
    foreach ($xpath->evaluate('/paragraph/text()') as $text) {
        if (!$text->isWhitespaceInElementContent()) {
            $lines[] = $text->textContent;
        }
    }
    var_dump($lines);
    

    Output:

    array(2) {
      [0]=>
      string(17) "Example text one."
      [1]=>
      string(17) "Example text two
    "
    }
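
    To get the array keyed by the number attribute, as the question asks, one option is to iterate the textInfo elements instead and fetch the node that directly follows each one. This is a sketch under a few assumptions: each text belongs to the textInfo immediately before it, the number values are unique, and surrounding whitespace should be trimmed. $result is an illustrative name.

    ```php
    <?php
    $xml = <<<'XML'
    <paragraph>
        <textInfo number="1" />Example text one.<textInfo number="2"/>Example text two
    </paragraph>
    XML;

    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);

    $result = [];
    foreach ($xpath->evaluate('/paragraph/textInfo') as $info) {
        // Take the node right after this element, but only if it is a text node;
        // string() turns it into a PHP string ('' if there is no such node).
        $text = $xpath->evaluate(
            'string(following-sibling::node()[1][self::text()])',
            $info
        );
        $result[(int) $info->getAttribute('number')] = trim($text);
    }
    var_dump($result);
    ```

    With the sample XML this should give 1 => "Example text one." and 2 => "Example text two". The trailing period of the first text is kept because it is part of the source; strip it separately (for example with rtrim($text, '.')) if you do not want it.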
    
