
A PHP HTML Parser That Lets Me Do Class Select and Get Parent Nodes

So I'm in a situation where I am scraping a website with PHP and I need to be able to get a node based on its CSS class. I need to get a ul tag that doesn't have an id attribute b

Solution 1:

You could use QueryPath, and then something like this might work:

htmlqp($html)->find("ul.class")->not("#id")
             ->find('li a[href*="specific"]')->parent()
// then foreach over it or use ->writeHTML() for extraction

See http://api.querypath.org/docs/class_query_path.html for the API.

(Traversing is much easier, if you don't use the fiddly DOMDocument.)

Solution 2:

You can do this with DOMDocument and DOMXPath. Selecting by class in XPath is a pain, but it can be done.

Here is some sample (and totally valid!) HTML:

$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<title>Document Title</title>
<ul id="myid"><li>myid-listitem1</ul>
<ul class="foo 
theclass
"><li>list2-item1<li>list2-item2</ul>
<ul id="myid2" class="foo&#xD;theclass bar"><li>list3-item1<li>list3-item2</ul>
EOT;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$nodes = $xp->query("/html/body//ul[not(@id) and contains(concat(' ',normalize-space(@class),' '), ' theclass ')]");

var_dump($nodes->length);
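
The class test in that query is worth unpacking. normalize-space(@class) trims the attribute and collapses runs of whitespace (including the newlines and the &#xD; in the sample) to single spaces, and concat() pads it with one space on each side, so searching for ' theclass ' only matches a whole class token. Here is a minimal sketch of the idea, assuming a hypothetical helper named classPredicate (my own name, not part of DOMXPath) and some made-up markup:

```php
<?php
// Hypothetical helper (not part of DOMXPath): builds an XPath predicate that
// is true when the class attribute contains $class as a whole token.
// Assumes $class itself contains no quote characters.
function classPredicate($class) {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Made-up markup for illustration only.
$doc = new DOMDocument();
$doc->loadHTML('<ul class="theclass-x"><li>a</li></ul>'
    . '<ul class="foo theclass"><li>b</li></ul>'
    . '<ul id="x" class="theclass"><li>c</li></ul>');
$xp = new DOMXPath($doc);

// A naive substring test also matches "theclass-x".
$naive = $xp->query("//ul[contains(@class, 'theclass')]");
// The padded test matches whole tokens only; not(@id) drops the third list.
$exact = $xp->query("//ul[not(@id) and " . classPredicate('theclass') . "]");

var_dump($naive->length); // int(3)
var_dump($exact->length); // int(1)
```

The naive contains() is the usual source of false positives, which is why the longer expression is worth the trouble.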

If you are using PHP 5.3 or later, you can simplify this a bit by registering a PHP function for use in XPath. (Note that you can register functions for use in XPath expressions by XSLTProcessor starting at PHP 5.1, but not directly for DOMXPath.)

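function hasToken($nodearray, $token) {
    // An empty node-set (no class attribute at all) must not count as a match.
    if (empty($nodearray)) {
        return false;
    }
    foreach ($nodearray as $node) {
        if ($node->nodeValue === null or !hasTokenS($node->nodeValue, $token)) {
            return false;
        }
    }
    return true;
    // I could even return nodes or document fragments if I wanted!
}
function hasTokenS($str, $token) {
    // Split on the whitespace characters allowed in an HTML class attribute.
    $str = trim($str, "\r\n\t ");
    $tokens = preg_split('/[\r\n\t ]+/', $str);
    return in_array($token, $tokens);
}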

$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions(array('hasToken', 'hasTokenS'));

// These two are equivalent:
$nodes1 = $xp->query("/html/body//ul[not(@id) and php:function('hasToken', @class, 'theclass')]");
$nodes2 = $xp->query("/html/body//ul[not(@id) and php:functionString('hasTokenS', @class, 'theclass')]");

var_dump($nodes1->length);
var_dump($nodes1->item(0));
var_dump($nodes2->length);
var_dump($nodes2->item(0));

If DOMDocument is just not parsing your HTML very well, you can use the html5lib parser, which will return a DOMDocument:

require_once('lib/HTML5/Parser.php'); // or wherever you put it
$dom = HTML5_Parser::parse($html);
// $dom is a plain DOMDocument object, created according to html5 parsing rules
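
Since the question also asks about parent nodes: whichever parser produced the DOMDocument, you can step up from a matched node with its parentNode property, or let XPath pick the parent directly. Here is a rough sketch, using made-up markup and the 'specific' href substring borrowed from Solution 1; adjust both to your page:

```php
<?php
// Sample markup is made up for illustration.
$html = '<ul class="foo theclass">'
      . '<li><a href="/other/page">one</a></li>'
      . '<li><a href="/specific/page">two</a></li>'
      . '</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// ul elements without an id whose class list contains the token "theclass".
$ul = "//ul[not(@id) and contains(concat(' ', normalize-space(@class), ' '), ' theclass ')]";

// Option 1: select the links, then step up with parentNode.
$parents = array();
foreach ($xp->query($ul . "//li/a[contains(@href, 'specific')]") as $a) {
    $parents[] = $a->parentNode; // the enclosing <li>
}

// Option 2: let XPath select the parent <li> directly.
$lis = $xp->query($ul . "//li[a[contains(@href, 'specific')]]");

var_dump($parents[0]->nodeName);      // string(2) "li"
var_dump($lis->item(0)->textContent); // string(3) "two"
```

Option 2 keeps everything in one query; option 1 is handier when you also need the link itself.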

Solution 3:

I've had good luck with: http://simplehtmldom.sourceforge.net/
