Generating content from DOMDocument

This document describes an internal feature for working with DOMDocument when Laminas is not present.

As of netcurl 6.1.5, we have started working with DOMDocument again. Since we read HTML to generate RSS links, DOMDocument and DOMElement play an important part in fetching content. The alternative is regex-based fetching, where we would have to parse HTML tags and items manually, so we use DOMDocument instead. In the test suite and in this example we use a stored HTML page from moviezine, which does not generate RSS data itself. At the time of writing, our goal is to fetch all articles from their autogenerated list of news. We know that they store the content in two kinds of elements, both of which have classes applied.

Features are available from the master branch.

In this particular case we use XPath, which is what the task (#5) is based on. The elements we want to look for are:


Class XPath                         Description
//*[@class="inner_article"]/a       Based on the container for featured articles; in our case it returns three articles.
//*[@class="articles_wrapper"]/a    After the featured articles, each article container has this class as its "main" class.

In both cases above we are looking for specific data inside those element containers, as explained here. We use XPath here as well, but slightly differently, since some of the elements we look for have more than one class applied. These two "sub-xpaths" are appended to the node list generated by the classes above.

Class XPath (Sub)                    Description
/*[contains(@class, "subtitle")]    This class, subtitle, contains the shorter title of the article.
/*[contains(@class, "lead")]        This class, lead, is the longer article text under the bolded titles for each element.

So, to sum up so far: we want to find two kinds of elements by XPath, each containing several elements that should be rendered into an array that could serve as the basis for an initial RSS feed. What is described so far corresponds to the following code in the test suite.

        $elements = [
            'subtitle' => '/*[contains(@class, "subtitle")]',
            'lead' => '/*[contains(@class, "lead")]',
        ];

        $xData = GenericParser::getFromXPath(
            file_get_contents(__DIR__ . '/templates/domdocument_mz.html'),
            [
                '//*[@class="inner_article"]/a',
                '//*[@class="articles_wrapper"]/a',
            ]
        );
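To illustrate why the sub-xpaths use contains() rather than an exact comparison, here is a minimal sketch using only PHP's built-in DOMDocument and DOMXPath. The markup is hypothetical and much simpler than the stored moviezine page, and the sketch does not use GenericParser at all.

```php
<?php
// Hypothetical markup; the real page is more complex.
$html = '<div class="inner_article"><a href="/a1">'
    . '<span class="subtitle bold">First title</span>'
    . '<span class="lead">First lead text</span>'
    . '</a></div>';

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$main = '//*[@class="inner_article"]/a';

// Exact comparison: does not match class="subtitle bold".
$exact = $xpath->query($main . '/*[@class="subtitle"]');

// Substring comparison: matches class="subtitle bold".
$loose = $xpath->query($main . '/*[contains(@class, "subtitle")]');

$exactCount = $exact->length;
$looseCount = $loose->length;
$looseText = $loose->item(0)->nodeValue;
```

Note that contains() is a plain substring test, so it would also match a class such as "subtitles"; for this page that trade-off appears to be acceptable.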

Now we need to start rendering an array containing the data found by the xpaths. First of all, we need to get the nodes. This is done by sending the content (an array) in $xData to an elements parser:

        $nodeInfo = GenericParser::getElementsByXPath($xData, $elements, ['href', 'value']);

This function renders a long list of nodes that you can use as you see fit. mainNode is the current node and subNode is a child one level below it. For example, inner_article contains an <a href> tag. The data about this tag resides in the mainNode, as you can see in the image. But to get a properly formatted title value (compared to innerHtml or innerText) we want to extract some of the values from the subNode. Which values are extracted is decided by the last array (where you see href and value), and the elements to be found live in the $elements array.

In short this happens (example):

  • Scan inner_article and articles_wrapper for all <a> tags.
  • When found, look further in the <a>-tags found, for class elements named subtitle and lead.
  • When elements with subtitle and lead are found, extract values based on href and value.
  • Merge everything into $nodeInfo, based on each element containing inner_article and articles_wrapper.
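The steps above can be sketched with plain DOMDocument and DOMXPath. This is not how GenericParser is implemented; it only mirrors the flow. The markup is hypothetical, and the sub-paths are written as relative paths (./*) because DOMXPath evaluates a leading / from the document root. How GenericParser joins the main and sub paths internally is not shown in this document.

```php
<?php
// Hypothetical markup standing in for the stored moviezine page.
$html = '<div class="inner_article"><a href="/news/1">'
    . '<span class="subtitle">Title one</span>'
    . '<span class="lead">Lead one</span></a></div>'
    . '<div class="articles_wrapper"><a href="/news/2">'
    . '<span class="subtitle big">Title two</span>'
    . '<span class="lead">Lead two</span></a></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$mainPaths = [
    '//*[@class="inner_article"]/a',
    '//*[@class="articles_wrapper"]/a',
];
// Relative variants of the sub-xpaths (an assumption for this sketch).
$elements = [
    'subtitle' => './*[contains(@class, "subtitle")]',
    'lead' => './*[contains(@class, "lead")]',
];

$found = [];
foreach ($mainPaths as $mainPath) {
    // Step 1: scan the containers for <a> tags.
    foreach ($xpath->query($mainPath) as $anchor) {
        // Step 3 (href): read the attribute from the <a> itself.
        $entry = ['href' => $anchor->getAttribute('href')];
        // Step 2: look inside the <a> for the subtitle and lead elements.
        foreach ($elements as $name => $subPath) {
            $sub = $xpath->query($subPath, $anchor)->item(0);
            // Step 3 (value): take the text content of the child element.
            $entry[$name] = $sub !== null ? trim($sub->nodeValue) : null;
        }
        // Step 4: merge everything into one list.
        $found[] = $entry;
    }
}
```

Each entry in $found then holds the href, subtitle and lead for one article link.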

When this is done, we have everything we need to render an array. This is done with the foreach loop for $nodeInfo. To properly generate content, we now use GenericParser::getValuesFromXPath, which is basically a recursive fetcher for mainNode and subNode:

$href = GenericParser::getValuesFromXPath($node, ['subtitle', 'mainNode', 'href']);

In this query, getValuesFromXPath follows the subtitle element and proceeds to fetch the mainNode. In mainNode, the two requested values href and value can now be extracted. In this case we want the href. If we instead want the article title, we look up the value in the subNode:

$hrefText = GenericParser::getValuesFromXPath($node, ['subtitle', 'subNode', 'value']);
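The idea of following a list of keys down to a value can be sketched as a small recursive lookup. Both the helper name fetchByPath and the array shape of $node are assumptions made for illustration; the actual structure returned by getElementsByXPath and the internals of getValuesFromXPath may differ.

```php
<?php
/**
 * Hypothetical helper: walk a nested array along a list of keys.
 * Returns null when any key along the path is missing.
 */
function fetchByPath($data, array $path)
{
    if ($path === []) {
        return $data;
    }
    $key = array_shift($path);
    if (!is_array($data) || !array_key_exists($key, $data)) {
        return null;
    }
    return fetchByPath($data[$key], $path);
}

// Assumed shape of one entry in $nodeInfo; the real structure may differ.
$node = [
    'subtitle' => [
        'mainNode' => ['href' => '/news/1', 'value' => ''],
        'subNode' => ['href' => '', 'value' => 'Title one'],
    ],
];

$href = fetchByPath($node, ['subtitle', 'mainNode', 'href']);
$title = fetchByPath($node, ['subtitle', 'subNode', 'value']);
$missing = fetchByPath($node, ['lead', 'subNode', 'value']);
```

Returning null for a missing key keeps the caller simple: an empty() check, like the one used for $href in the loop, is enough to skip incomplete entries.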

We can after this extraction collect each article element in a more "human friendly" array and start rendering content.

        // Collect articles keyed by href so that duplicate links collapse into one entry.
        $articles = [];
        foreach ($nodeInfo as $node) {
            $href = GenericParser::getValuesFromXPath($node, ['subtitle', 'mainNode', 'href']);
            $hrefText = GenericParser::getValuesFromXPath($node, ['subtitle', 'subNode', 'value']);
            $description = GenericParser::getValuesFromXPath($node, ['lead', 'subNode', 'value']);
            if (!empty($href)) {
                $articles[$href] = [
                    'title' => $hrefText,
                    'description' => $description,
                ];
            }
        }

The above example renders the $articles array. From here on, the data is much easier to handle. The main reason we key the array by href in this example is to avoid duplicate hrefs.
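The effect of keying by href can be shown in isolation. The rows below are made-up data, not output from the moviezine page; the sketch only demonstrates that a repeated href overwrites the earlier entry instead of adding a second one.

```php
<?php
// Hypothetical extracted rows; the second and third share an href.
$rows = [
    ['href' => '/news/1', 'title' => 'Title one'],
    ['href' => '/news/2', 'title' => 'Title two'],
    ['href' => '/news/2', 'title' => 'Title two (repeated)'],
];

$articles = [];
foreach ($rows as $row) {
    if (!empty($row['href'])) {
        // Same key twice: the later row replaces the earlier one.
        $articles[$row['href']] = ['title' => $row['title']];
    }
}

$articleCount = count($articles);
```

With this approach the last occurrence wins; if the first occurrence should be kept instead, an isset() check before assigning would be one way to do that.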

What if the above work is horrible?

It can also be handled in one call, as long as the use case is as "standard" as possible. Changes and adaptations may follow.

This is a one-shot call that executes all of the above actions at once.

        $nodeList = GenericParser::getContentFromXPath(
            file_get_contents(__DIR__ . '/templates/domdocument_mz.html'),
            [
                '//*[@class="inner_article"]/a',
                '//*[@class="articles_wrapper"]/a',
            ],
            [
                'subtitle' => '/*[contains(@class, "subtitle")]',
                'lead' => '/*[contains(@class, "lead")]',
            ],
            ['href', 'value'],
            ['subtitle' => 'mainNode', 'lead' => 'subNode'],
        );

The nodeList, when successful, will contain two variables:

Variable    Description
nodeInfo    Raw node info.
rendered    The rendered array.