A PHP HTML Parser That Lets Me Do Class Select and Get Parent Nodes
Solution 1:
You could use QueryPath; something like this might work:
htmlqp($html)->find("ul.class")->not("#id")
->find('li a[href*="specific"]')->parent()
// then foreach over it or use ->writeHTML() for extraction
See http://api.querypath.org/docs/class_query_path.html for the API.
(Traversing is much easier if you don't use the fiddly DOMDocument.)
Solution 2:
You can do this with DOMDocument and DOMXPath. Selecting by class in XPath is a pain, but it can be done.
Here is some sample (and totally valid!) HTML:
$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<title>Document Title</title>
<ul id="myid"><li>myid-listitem1</ul>
<ul class="foo
theclass
"><li>list2-item1<li>list2-item2</ul>
<ul id="myid2" class="foo
theclass bar"><li>list3-item1<li>list3-item2</ul>
EOT
;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$nodes = $xp->query("/html/body//ul[not(@id) and contains(concat(' ',normalize-space(@class),' '), ' theclass ')]");
var_dump($nodes->length);
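The query above only selects the lists themselves. To cover the "get parent nodes" part of the question, here is a minimal sketch that finds links whose href contains a substring and then reads each link's parentNode. The sample markup, the "skipme" id and the "specific" substring are illustrative assumptions, not part of the original answer; the class test is the same concat/normalize-space idiom used above.

```php
<?php
// Hypothetical sample markup, used only for this illustration.
$html = <<<EOT
<!DOCTYPE html>
<html><head><title>Demo</title></head><body>
<ul class="foo theclass">
  <li><a href="/specific/page">match</a></li>
  <li><a href="/other">no match</a></li>
</ul>
<ul id="skipme" class="theclass">
  <li><a href="/specific/ignored">ignored</a></li>
</ul>
</body></html>
EOT;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Token-safe class test: matches "theclass" as a whole word in @class.
$classTest = "contains(concat(' ', normalize-space(@class), ' '), ' theclass ')";

// Links inside lists that have the class but no id, whose href contains "specific".
$links = $xp->query("//ul[not(@id) and $classTest]/li/a[contains(@href, 'specific')]");

// Collect the parent element (the enclosing <li>) of every matching link.
$parents = array();
foreach ($links as $a) {
    $parents[] = $a->parentNode;
}

// Passing a node to saveHTML() requires PHP 5.3.6 or later.
foreach ($parents as $li) {
    echo $doc->saveHTML($li), "\n";
}
```

The parent step could also be done inside XPath by appending /.. to the expression; reading parentNode in PHP just keeps the expression shorter.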
If you are using PHP 5.3 or later, you can simplify this a bit by registering a PHP function for use in XPath. (XSLTProcessor has been able to register PHP functions for use in XPath expressions since PHP 5.1, but DOMXPath could not do so directly until 5.3.)
function hasToken($nodearray, $token) {
    foreach ($nodearray as $node) {
        if ($node->nodeValue === null or !hasTokenS($node->nodeValue, $token)) {
            return False;
        }
    }
    return True;
    // I could even return nodes or document fragments if I wanted!
}
function hasTokenS($str, $token) {
    $str = trim($str, "\r\n\t ");
    $tokens = preg_split('/[\r\n\t ]+/', $str);
    return in_array($token, $tokens);
}
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions(array('hasToken', 'hasTokenS'));
// These two are equivalent:
$nodes1 = $xp->query("/html/body//ul[not(@id) and php:function('hasToken', @class, 'theclass')]");
$nodes2 = $xp->query("/html/body//ul[not(@id) and php:functionString('hasTokenS', @class, 'theclass')]");
var_dump($nodes1->length);
var_dump($nodes1->item(0));
var_dump($nodes2->length);
var_dump($nodes2->item(0));
If DOMDocument is just not parsing your HTML very well, you can use the html5lib parser, which will return a DOMDocument:
require_once('lib/HTML5/Parser.php'); // or wherever you put it
$dom = HTML5_Parser::parse($html);
// $dom is a plain DOMDocument object, created according to html5 parsing rules
Solution 3:
I've had good luck with: http://simplehtmldom.sourceforge.net/