How do you parse and process HTML/XML in PHP?

How can one parse HTML/XML and extract information from it?

Native XML Extensions

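I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd-party libs, and give me all the control I need over the markup.

DOM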

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

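DOM is capable of parsing and modifying real-world (broken) HTML, and it can run XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but in my opinion that time is well worth it. Since DOM is a language-agnostic interface, you will find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API.

A basic usage example can be found in "Grabbing the href attribute of an A element", and a general conceptual overview can be found at "DOMDocument in php".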
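As a minimal sketch of that kind of basic usage, the snippet below loads an HTML string with DOMDocument and collects the href attribute of every link through an XPath query. The sample markup and the variable names are made up for illustration; only the DOM and libxml calls themselves come from PHP.

```php
<?php
// Hypothetical sample markup (an assumption for this sketch).
$html = '<html><body><p>See <a href="/docs">the docs</a> or '
      . '<a href="https://example.com/">an example</a>. '
      . '<a name="top">No href here</a></p></body></html>';

// Collect libxml diagnostics internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);   // parsed with libxml's HTML parser module
libxml_clear_errors();

// Select only <a> elements that actually carry an href attribute.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

print_r($hrefs);
```

The XPath predicate `[@href]` keeps named anchors without a target out of the result, which is the sort of control that is awkward to get from string matching.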

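How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be fairly sure that most of the issues you run into can be solved by searching or browsing Stack Overflow.

XMLReader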

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

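XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML parser module, so using XMLReader to parse broken HTML may be less robust than using DOM, where you can explicitly tell it to use libxml's HTML parser module.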
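To make the cursor model concrete, here is a small sketch that walks a well-formed XML string with XMLReader and picks up the text of each title element. The document and element names are invented for the example; for a large file you would more likely use XMLReader::open() on a path.

```php
<?php
// Hypothetical sample document (an assumption for this sketch).
$xml = '<?xml version="1.0"?>'
     . '<catalog>'
     . '<book id="1"><title>First</title></book>'
     . '<book id="2"><title>Second</title></book>'
     . '</catalog>';

$reader = new XMLReader();
$reader->XML($xml);

$titles = [];
// read() advances the cursor one node at a time through the stream.
while ($reader->read()) {
    // React only when the cursor sits on an opening <title> element.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        $titles[] = $reader->readString();
    }
}
$reader->close();

print_r($titles);
```

Because only the current node is held in memory, the same loop shape should scale to documents that would be too large to load into a DOM tree.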

A basic usage example can be found at "getting all values from h1 tags using php".

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.
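The event-handler model described in the manual excerpt can be sketched roughly as follows. The feed markup and the variable names are assumptions made for this example; the xml_* functions are the ones provided by the extension.

```php
<?php
// Hypothetical sample document (an assumption for this sketch).
$xml = '<feed><entry>Alpha</entry><entry>Beta</entry></feed>';

$entries = [];
$current = null;   // buffer for the text of the <entry> being read, if any

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them (the default).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// The parser pushes events at these callbacks as it moves through the input.
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$current) {
        if ($name === 'entry') {
            $current = '';
        }
    },
    function ($parser, $name) use (&$entries, &$current) {
        if ($name === 'entry') {
            $entries[] = $current;
            $current = null;
        }
    }
);
xml_set_character_data_handler($parser, function ($parser, $data) use (&$current) {
    // Character data may arrive in several chunks, so append rather than assign.
    if ($current !== null) {
        $current .= $data;
    }
});

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}
unset($parser);

print_r($entries);
```

Note how the state tracking ($current) has to be done by hand; that bookkeeping is what makes a push parser harder to work with than a pull parser.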

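The XML Parser library is also based on libxml and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but it will be more difficult to work with than the pull parser implemented by XMLReader.

SimpleXML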

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML, because it will choke.
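A short sketch of both halves of that statement: well-formed markup maps onto properties and iterators, while markup with an unclosed tag makes simplexml_load_string() return false. The sample markup is invented, and it deliberately omits the XHTML namespace declaration to keep the property access simple.

```php
<?php
// Hypothetical well-formed markup (an assumption for this sketch).
$xhtml = '<html><body><h1>Title</h1>'
       . '<ul><li class="a">One</li><li class="b">Two</li></ul>'
       . '</body></html>';

$doc = simplexml_load_string($xhtml);

// Child elements are exposed as properties, attributes via array access.
$heading = (string) $doc->body->h1;
$items = [];
foreach ($doc->body->ul->li as $li) {
    $items[(string) $li['class']] = (string) $li;
}

// The same list with unclosed <li> tags is not well-formed XML.
libxml_use_internal_errors(true);
$broken = simplexml_load_string('<ul><li>One<li>Two</ul>');
libxml_clear_errors();

var_dump($heading, $items, $broken);
```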

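A basic usage example can be found at "A simple program to CRUD node and node values of xml file", and there are lots of additional examples in the PHP Manual.

3rd-Party Libraries (libxml-based)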

If you prefer to use a 3rd-party lib, I'd suggest using one that actually uses DOM/libxml underneath instead of string parsing.

FluentDOM

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).

See also: https://github.com/electorlinux/phpquery

Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

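sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.

3rd-Party Libraries (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

PHP Simple HTML DOM Parser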

An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!

Require PHP 5+.

Supports invalid HTML.

Find tags on an HTML page with selectors just like jQuery.

Extract contents from HTML in a single line.

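I generally do not recommend this parser. The codebase is horrible, and the parser itself is rather slow and memory hungry. Not all jQuery selectors (such as child selectors) are possible. Any of the libxml-based libraries should outperform it easily.

PHP Html Parser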

PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape html, whether it's valid or not! This project was originally supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow, with high CPU usage. There is also no function to clear the memory of created DOM objects. These problems scale particularly badly with nested loops. The documentation itself is inaccurate and misspelled, and there have been no responses to fixes since April 14.

Ganon

A universal tokenizer and HTML/XML/RSS DOM Parser

Ability to manipulate elements and their attributes

Supports invalid HTML and UTF8

Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)

A HTML beautifier (like HTML Tidy)

Minify CSS and Javascript

Sort attributes, change character case, correct indentation, etc.

Extensible

Parsing documents using callbacks based on current character/token

Operations separated in smaller functions for easy overriding

Fast and Easy

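I have never used it, so I can't tell whether it is any good.

HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like

html5lib

Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled "How-To for html 5 parsing" that is worth checking out.

WebServices

If you don't feel like programming PHP, you can also use Web services. In general, I have found very little utility in these, but that's just me and my use cases.

ScraperWiki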

ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

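Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the regex fail when it is not properly written. You should know what you are doing before using regex on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules again for each new regex you write. Regex is fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job.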
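The brittleness is easy to reproduce. In the sketch below, a typical copy-and-paste pattern finds the link in one snippet but misses it once a class attribute, extra whitespace, and single quotes are introduced, while a DOM-based lookup returns the same result for both. The pattern, the markup, and the helper function names are all made up for this illustration.

```php
<?php
// A naive pattern of the kind often found in snippets (an assumption for this sketch).
$pattern = '#<a href="([^"]+)">#';

$original = '<p><a href="/home">Home</a></p>';
// Same link, but with an extra attribute, extra whitespace, and single quotes.
$changed  = '<p><a class="nav"  href=\'/home\'>Home</a></p>';

function hrefsViaRegex(string $html, string $pattern): array
{
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

function hrefsViaDom(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $hrefs[] = $anchor->getAttribute('href');
    }
    return $hrefs;
}

var_dump(hrefsViaRegex($original, $pattern), hrefsViaRegex($changed, $pattern));
var_dump(hrefsViaDom($original), hrefsViaDom($changed));
```

The regex could of course be extended to cover these variations, but every such extension is another HTML rule re-implemented by hand.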

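Also see "Parsing Html The Cthulhu Way".

Books

If you want to spend some money, have a look at

PHP Architect's Guide to Webscraping with PHP

I am not affiliated with PHP Architect or the authors.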

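Related discussion

That depends on your needs. I have no need for CSS selector queries, which is why I use DOM with XPath exclusively. phpQuery aims to be a jQuery port. Zend_Dom is lightweight. You really have to check them out to see which one you like best.
s/HTML5/HTML/g. Any previous version of HTML already allowed the syntactic constructs that HTML5 allows.
@ms2ger Mostly, but not completely. As already pointed out above, you can use the libxml-based parsers, but there are special cases where they will choke. If you need maximum compatibility, you are better off with a dedicated parser. I prefer to keep the distinction.
Your argument for not using PHP Simple HTML DOM Parser seems moot.
As of March 29, 2012, DOM does not support HTML5, XMLReader does not support HTML, and the last commit on html5lib for PHP was in September 2009. What should one use to parse HTML5, HTML4 and XHTML?
@Shiplu The answer above lists all the options I know of. DOM can parse anything that has a Schema or DTD. HTML5 does not (officially).
Just to add some experience: I have used a few of these and now always recommend Ganon, because in most of my cases it is actually faster than the native extensions, due to the way it works, and it also copes well with invalid/broken/incomplete documents (which none of the others I know of can handle). Sometimes it is also worthwhile to write your own parser or to use regex, but only if you have very specific and simple needs (for example, having to support only 2 tags in a fixed format).
@Jimmy It doesn't include anything about cURL because cURL is not a tool for parsing and processing HTML/XML. cURL is a client for a variety of network protocols. You can use it to fetch websites, for instance. Most of the libraries above have means of loading remote URLs directly, so cURL isn't needed at all; for example, DOM has loadHTMLFile().
Regarding the 3rd-party libraries (libxml-based), I found: QueryPath does not work for me because it chokes on malformed HTML (even when using htmlqp()); phpQuery is a bit hard to approach; and html5lib has a very active Python part, but the PHP port seems poorly maintained. If you are looking for a quick and dirty solution, I can recommend github.com/hk12369/php-html-parser
@Gordon I'd suggest adding Symfony's "CssSelector" component, for adding "CSS selector"-based DOM crawling to DOMDocument (as explained at stackoverflow.com/questions/3577641/…), and Symfony's "DomCrawler" component, depending on whether you want low-level access to the DOM or prefer a more high-level approach.
Remember? You can't parse (X)HTML with regex! (Ever since the day I read that, mentioning regex next to HTML has struck me as a sin.)
@Nasha I deliberately left the infamous Zalgo rant out of the list above because it isn't too helpful on its own and has led to quite some cargo culting since it was written. People get slapped down with that link no matter how appropriate a regex would have been as a solution. For a more balanced opinion, please see the link I did include and go through the comments at stackoverflow.com/questions/4245008/…
Glaringly missing from this list is the TagFilter class of the Ultimate Web Scraper Toolkit. For years I used Simple HTML DOM because it was the most reliably consistent option I could find. TagFilter is something I originally wrote because I needed to handle Word HTML cleanly, but then I realized I could replace both Simple HTML DOM and HTML Purifier with something more flexible, more scalable (it handles multi-MB HTML files without memory leaks), and faster. It is much smaller than the 1MB+ HTML Purifier library, and it is standalone. It is also maintained.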

Try Simple HTML DOM Parser

An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
Requires PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.

Examples:

How to get HTML elements:

// Assumes the library's simple_html_dom.php has already been included.
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';

How to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html;

Extract contents from HTML:

// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;

Scraping Slashdot:

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
foreach($html->find('div.article') as $article) {
    $item['title'] = $article->find('div.title', 0)->plaintext;
    $item['intro'] = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

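Related discussion

Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, it does not choke on malformed HTML.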
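A quick way to check that claim for yourself is to feed loadHTML() some deliberately sloppy markup. The sketch below uses an invented fragment with unclosed p and span tags; the parser's complaints are collected through libxml's internal error buffer rather than raised as PHP warnings, and the resulting tree is still usable.

```php
<?php
// Deliberately malformed fragment (an assumption for this sketch).
$broken = '<div><p>First<p>Second <span>unclosed</div>';

libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($broken);

// Whatever the parser complained about is available here for inspection.
$problems = libxml_get_errors();
libxml_clear_errors();

// The second <p> implicitly closes the first, as a browser would do.
$paragraphs = $dom->getElementsByTagName('p');
$firstText = $paragraphs->item(0)->textContent;

echo $dom->saveHTML();
```

The serialized output shows how libxml chose to repair the structure, which is worth looking at before relying on any particular nesting.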

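Related discussion

Why you shouldn't, and when you should, use regular expressions

First off, a common misnomer: regexps are not for "parsing" HTML. Regexes can, however, "extract" data. Extracting is what they are made for. The major drawbacks of regex HTML extraction compared with proper SGML toolkits or baseline XML parsers are the syntactic effort involved and the varying reliability.

Consider that a somewhat dependable HTML extraction regex such as:

]+id="(\d+)".+? <a\s+class="[\w\s]title [\w\s]"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+).+?

is far less readable than a simple phpQuery or QueryPath equivalent:

$div->find(".stationcool a")->attr("title");

There are, however, specific use cases where they can help.

Many DOM traversal frontends do not reveal HTML comments. It can also be worthwhile to pre-extract a snippet of HTML with a regular expression built around (.+?) and then process the remainder using the simpler HTML parser frontends.

Note: I actually have an app in which I use XML parsing and regular expressions alternately. Just last week the PyQuery parsing broke, and the regex still worked. Yes, weird, and I can't explain it myself. But so it happened. So please don't vote real-world considerations down just because they don't match the regex=evil meme. But let's also not vote this up too much. It's just a side note for this topic.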

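Related discussion

phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That is also why they are two of the easiest approaches to properly parsing HTML in PHP.

Examples for QueryPath

Basically, you first create a queryable DOM tree from an HTML string:

$qp = qp("<html><body>title..."); // or give filename or URL

The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

$qp->find("div.classname")->children()->...;

foreach ($qp->find("p img") as $img) {
    print qp($img)->attr("src");
}

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which are sometimes faster. Typical jQuery methods like ->children() and ->text(), and particularly ->attr(), also simplify extracting the right HTML snippets. (And they already have their SGML entities decoded.)

$qp->xpath("//div/p[1]"); // get first paragraph in a div

QueryPath also allows injecting new tags into the stream (->append), and later outputting and prettifying the updated document (->writeHTML). It can parse not only malformed HTML but also various XML dialects (with namespaces), and it can even extract data from HTML microformats (XFN, vCard).

$qp->find("a[target=_blank]")->toggleClass("usability-blunder");

phpQuery or QueryPath?

Generally, QueryPath is better suited for manipulating documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to resemble jQuery more closely. It is said that phpQuery is often faster than QueryPath (because it has fewer features overall).

For further information on the differences, see the comparison from tagbyte.org on the Wayback Machine. (The original source went missing, so here is an Internet Archive link. Yes, you can still locate missing pages, people.)

And here is a comprehensive QueryPath introduction.

Advantages

Simplicity and reliability
Simple to use alternatives ->find("a img, a object, div a")
Proper data unescaping (in comparison to regular expression grepping)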

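Simple HTML DOM is a great open-source parser:

simplehtmldom.sourceforge

It treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like those you would see in JavaScript, such as the "find" function, which returns all instances of elements with that tag name.

I have used it in a number of tools and tested it on many different types of web pages, and I think it works great.

One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.

But for your specific problem, you should take a look at this project: http://fivefilters.org/content-only/. It is a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.

For 1a and 2: I would vote for the new Symfony component class DOMCrawler (DomCrawler). This class allows queries similar to CSS selectors. Take a look at this presentation for real-world examples: news-of-the-symfony2-world.

The component is designed to work standalone and can be used without Symfony.

The only drawback is that it will only work with PHP 5.3 or newer.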

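Related discussion

This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML DOM Parser.

Related discussion

We have created quite a few crawlers for our own needs before. At the end of the day, it is usually simple regular expressions that do the job best. While the libraries listed above are good for the reasons they were created, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle invalid HTML/XHTML structures, which would fail if loaded via most of the parsers.

I recommend PHP Simple HTML DOM Parser.

It really has nice features, like:

foreach($html->find('img') as $element)
    echo $element->src . '<br>';

This sounds like a good task description for the W3C XPath technology. It is easy to express queries like "return all href attributes in img tags that are nested in specific elements." Not being a PHP buff, I can't tell you in what form XPath may be available. If you can call an external program to process the HTML file, you should be able to use a command-line version of XPath. For a quick intro, see http://en.wikipedia.org/wiki/xpath.

Third-party alternatives to SimpleHtmlDom that use DOM instead of string parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.

Related discussion

Yes, you can use Simple HTML DOM for this purpose. However, I have worked quite a lot with Simple HTML DOM, particularly for web scraping, and have found it to be too fragile. It does the basic job, but I would not recommend it anyway.

I have never used cURL for this purpose, but what I have learned is that cURL can do the job much more efficiently and is much more solid.

Kindly check out this link: scraping-websites-with-curl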

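Related discussion

QueryPath is good, but be careful of "tracking state", because if you don't realise what it means, you can waste a lot of debugging time trying to find out what happened and why the code doesn't work.

What it means is that each call on the result set modifies the result set stored in the object. It is not chainable like in jQuery, where each link in the chain is a new set. You have a single set, which is the result of your query, and each function call modifies that single set.

In order to get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation, which mirrors what happens in jQuery much more closely.

$results = qp("div p");
$forename = $results->find("input[name='forename']");

$results now contains the result set for input[name='forename'], not the result set of the original query "div p". This tripped me up a lot. What I found was that QueryPath tracks the filters and finds, and everything else that modifies your results, and stores them in the object. You need to do this instead:

$forename = $results->branch()->find("input[name='forename']");

Then $results won't be modified, and you can reuse the result set again and again. Perhaps somebody with much more knowledge can clear this up a bit, but from what I have found, this is basically how it works.

Advanced Html Dom is a drop-in replacement for Simple HTML DOM that offers the same interface, but it is DOM-based, which means none of the associated memory issues occur.

It also has full CSS support, including jQuery extensions.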

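Related discussion

For HTML5, html5lib has been abandoned for years now. The only HTML5 library I can find with recent updates and maintenance records is html5-php, which was brought to beta 1.0 just a little over a week ago.

I created a library named PHPPowertools/DOM-Query, which allows you to crawl HTML5 and XML documents just like you do with jQuery.

Under the hood, it uses symfony/DomCrawler to convert CSS selectors to XPath selectors. It always uses the same DOMDocument, even when passing one object to another, to ensure decent performance.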

Example use:

namespace PowerTools;

// Get file content
$htmlcode = file_get_contents('https://github.com');

// Define your DOMCrawler based on file string
$H = new DOM_Query($htmlcode);

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query($H->select('body'));

// Passing a string (CSS selector)
$s = $H->select('div.foo');

// Passing an element object (DOM Element)
$s = $H->select($documentBody);

// Passing a DOM Query object
$s = $H->select( $H->select('p + p'));

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
    return $i ." -" . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('');

// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
    return $i ." -" . $val->attr('class') ." - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new selectors
$sitefooter->wrap('<section></section>');

[...]

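Supported methods:

[x] $ (1)
[x] $.parseHTML
[x] $.parseJSON
[x] $selection.add
[x] $selection.addClass
[x] $selection.after
[x] $selection.append
[x] $selection.attr
[x] $selection.before
[x] $selection.children
[x] $selection.closest
[x] $selection.contents
[x] $selection.detach
[x] $selection.each
[x] $selection.eq
[x] $selection.empty (2)
[x] $selection.find
[x] $selection.first
[x] $selection.get
[x] $selection.insertAfter
[x] $selection.insertBefore
[x] $selection.last
[x] $selection.parent
[x] $selection.parents
[x] $selection.remove
[x] $selection.removeAttr
[x] $selection.removeClass
[x] $selection.text
[x] $selection.wrap

(1) Renamed to "select", for obvious reasons.
(2) Renamed to "void", since "empty" is a reserved word in PHP.

Note:

The library also includes its own zero-configuration autoloader for PSR-0 compatible libraries. The included example should work out of the box without any additional configuration. Alternatively, you can use it with Composer.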

Related discussion

I have written a general-purpose XML parser that can easily handle GB-sized files. It is based on XMLReader and is very easy to use:

$source = new XmlExtractor("path/to/tag","/path/to/file.xml");
foreach ($source as $tag) {
    echo $tag->field1;
    echo $tag->field2->subfield1;
}

Here is the GitHub repo: XmlExtractor

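You could try using something like HTML Tidy to clean up any "broken" HTML and convert the HTML to XHTML, which you can then parse with an XML parser.

Another option you can try is QueryPath. It is inspired by jQuery, but runs on the server in PHP and is used in Drupal.

XML_HTMLSax is rather stable, even if it is no longer maintained. Another option could be to pipe your HTML through HTML Tidy and then parse it with standard XML tools.

There are many ways to process the HTML/XML DOM, most of which have already been mentioned. Hence, I won't make any attempt to list them myself.

I merely want to add that I personally prefer using the DOM extension, and why:

it makes optimal use of the performance advantage of the underlying C code
it is OO PHP (and allows me to subclass it)
it is rather low level (which allows me to use it as a non-bloated foundation for more advanced behavior)
it provides access to every part of the DOM (unlike, for example, SimpleXML, which ignores some of the lesser-known XML features)
it has a syntax for DOM crawling that is similar to the syntax used in native JavaScript.

And while I miss the ability to use CSS selectors with DOMDocument, there is a rather simple and convenient way to add this feature: subclass DOMDocument and add JS-like querySelectorAll and querySelector methods to your subclass.

For parsing the selectors, I recommend using the very minimalistic CssSelector component from the Symfony framework. This component just translates CSS selectors to XPath selectors, which can then be fed into a DOMXPath to retrieve the corresponding NodeList.

You can then use this (still very low-level) subclass as a foundation for more high-level classes, intended, for example, to parse very specific types of XML or to add more jQuery-like behavior.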
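To illustrate the idea without pulling in any dependency, here is a toy sketch of the same technique. It does not use Symfony's CssSelector; instead a tiny hand-written converter (the hypothetical function toyCssToXPath) translates only tag names, #id, .class, and the descendant combinator into XPath, and a hypothetical DOMDocument subclass feeds the result into DOMXPath. A real project would swap the toy converter for a proper one.

```php
<?php
// Toy converter (an assumption for this sketch): supports "tag", "#id",
// ".class", combinations such as "div#nav.menu", and whitespace as the
// descendant combinator. Anything else is rejected.
function toyCssToXPath(string $selector): string
{
    $xpath = '';
    foreach (preg_split('/\s+/', trim($selector)) as $part) {
        $ok = preg_match('/^([a-zA-Z][\w-]*)?(?:#([\w-]+))?((?:\.[\w-]+)*)$/', $part, $m);
        if ($part === '' || !$ok) {
            throw new InvalidArgumentException("Unsupported selector part: $part");
        }
        $step = '//' . ($m[1] !== '' ? $m[1] : '*');
        if ($m[2] !== '') {
            $step .= "[@id='{$m[2]}']";
        }
        if ($m[3] !== '') {
            foreach (explode('.', ltrim($m[3], '.')) as $class) {
                // Match one class name inside a space-separated class attribute.
                $step .= "[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
            }
        }
        $xpath .= $step;
    }
    return $xpath;
}

// Hypothetical subclass; the method name queryCss is made up for this sketch.
class SelectableDocument extends DOMDocument
{
    public function queryCss(string $selector): DOMNodeList
    {
        return (new DOMXPath($this))->query(toyCssToXPath($selector));
    }
}

libxml_use_internal_errors(true);
$doc = new SelectableDocument();
$doc->loadHTML(
    '<div id="nav" class="menu active"><a href="/a">A</a></div>'
    . '<div class="menu"><a href="/b">B</a></div>'
);
libxml_clear_errors();

$links = [];
foreach ($doc->queryCss('div.active a') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

print_r($links);
```

Keeping the translation step separate from the subclass means the converter can be replaced later without touching any calling code.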

The code below comes straight out of my DOM-Query library and uses the technique I described.

For HTML parsing:

namespace PowerTools;

use \Symfony\Component\CssSelector\CssSelector as CssSelector;

class DOM_Document extends \DOMDocument {
    public function __construct($data = false, $doctype = 'html', $encoding = 'UTF-8', $version = '1.0') {
        parent::__construct($version, $encoding);
        if ($doctype && $doctype === 'html') {
            @$this->loadHTML($data);
        } else {
            @$this->loadXML($data);
        }
    }

    public function querySelectorAll($selector, $contextnode = null) {
        if (isset($this->doctype->name) && $this->doctype->name == 'html') {
            CssSelector::enableHtmlExtension();
        } else {
            CssSelector::disableHtmlExtension();
        }
        $xpath = new \DOMXpath($this);
        return $xpath->query(CssSelector::toXPath($selector, 'descendant::'), $contextnode);
    }

    [...]

    public function loadHTMLFile($filename, $options = 0) {
        $this->loadHTML(file_get_contents($filename), $options);
    }

    public function loadHTML($source, $options = 0) {
        if ($source && $source != '') {
            $data = trim($source);
            $html5 = new HTML5(array('targetDocument' => $this, 'disableHtmlNsInDom' => true));
            $data_start = mb_substr($data, 0, 10);
            if (strpos($data_start, '<!DOCTYPE ') === 0 || strpos($data_start, '<html>') === 0) {
                $html5->loadHTML($data);
            } else {
                @$this->loadHTML('<!DOCTYPE html><html><head><meta charset="' . $encoding . '" /></head><body></body></html>');
                $t = $html5->loadHTMLFragment($data);
                $docbody = $this->getElementsByTagName('body')->item(0);
                while ($t->hasChildNodes()) {
                    $docbody->appendChild($t->firstChild);
                }
            }
        }
    }

    [...]
}

Also see "Parsing XML documents with CSS selectors" by Symfony's creator Fabien Potencier on his decision to create the CssSelector component for Symfony and how to use it.

The Symfony framework has bundles that can parse HTML, and you can use CSS-style selectors to select DOM nodes instead of using XPath.

With FluidXML you can query and iterate over XML using XPath and CSS selectors.

$doc = fluidxml('<html>...</html>');

$title = $doc->query('//head/title')[0]->nodeValue;

$doc->query('//body/p', 'div.active', '#bgId')
    ->each(function($i, $node) {
        // $node is a DOMNode.
        $tag = $node->nodeName;
        $text = $node->nodeValue;
        $class = $node->getAttribute('class');
    });

https://github.com/servo-php/fluidxml

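JSON and array from XML in three lines:

$xml = simplexml_load_string($xml_string);
$json = json_encode($xml);
$array = json_decode($json,TRUE);

Ta da!

There are several reasons not to parse HTML with regular expressions. But if you have total control over what HTML will be generated, then you can do it with a simple regular expression.

Below is a function that parses HTML with a regular expression. Note that this function is very sensitive and requires the HTML to obey certain rules, but it works very well in many scenarios. If you want a simple parser and don't want to install libraries, give it a shot: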

function array_combine_($keys, $values) {
    $result = array();
    foreach ($keys as $i => $k) {
        $result[$k][] = $values[$i];
    }
    // Collapse single-element lists to their only value.
    // A closure replaces create_function(), which was removed in PHP 8.
    array_walk($result, function (&$v) {
        $v = (count($v) == 1) ? array_pop($v) : $v;
    });

    return $result;
}

function extract_data($str) {
    return (is_array($str))
        ? array_map('extract_data', $str)
        : ((!preg_match_all('#<([A-Za-z0-9_]*)[^>]*>(.*?)</\1>#s', $str, $matches))
            ? $str
            : array_map(('extract_data'), array_combine_($matches[1], $matches[2])));
}

print_r(extract_data(file_get_contents("http://www.google.com/")));

I have created a library named HTML5DOMDocument that is freely available at https://github.com/ivopetkov/html5-dom-document-php

It supports query selectors too, which I think will be extremely helpful in your case. Here is some example code:

$dom = new IvoPetkov\HTML5DOMDocument();
$dom->loadHTML('<!DOCTYPE html><html><body><h1>Hello</h1><div>This is some text</div></body></html>');
echo $dom->querySelector('h1')->innerHTML;

The best method for parsing XML:

$xml = 'http://www.example.com/rss.xml';
// simplexml_load_file() loads from a URL or path; simplexml_load_string() expects the XML text itself
$rss = simplexml_load_file($xml);
$i = 0;
foreach ($rss->channel->item as $feedItem) {
    $i++;
    echo $title = $feedItem->title;
    echo '<br>';
    echo $link = $feedItem->link;
    echo '<br>';
    if ($feedItem->description != '') { $des = $feedItem->description; } else { $des = ''; }
    echo $des;
    echo '<br>';
    if ($i > 5) break;
}

If you are familiar with jQuery selectors, you can use ScarletsQuery for PHP

<?php
include "ScarletsQuery.php";

// Load the HTML content and parse it
$html = file_get_contents('https://www.lipsum.com');
$dom = Scarlets\Library\MarkupLanguage::parseText($html);

// Select meta tag on the HTML header
$description = $dom->selector('head meta[name="description"]')[0];

// Get 'content' attribute value from meta tag
print_r($description->attr('content'));

$description = $dom->selector('#Content p');

// Get element array
print_r($description->view);

This library usually takes less than 1 second to process offline HTML. It also accepts invalid HTML and missing quotes on tag attributes.