An Example of Extracting the Main Content of a Web Page with PHP

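The difficulty lies in identifying and keeping the article portion of a web page while removing all the other useless information, and in doing this generically. Unlike the 火车头 scraping tool, collection rules cannot be written per target site, because search engine results contain all kinds of pages. Once a page has been fetched, how can the main text be matched? On the way home from work, Zheng Xiao came up with these ideas:

1. Extract the body tag section, remove all links, remove all script tags and comments, remove all empty tags (including tags that contain no Chinese text), and take what is left as the result.
2. Directly match the Chinese text that is not inside links and that sits within div, p, or h tags?

Either way, quite a lot of extraneous information remains, such as footer text. What to do? Does anyone have ideas or suggestions?

The class below is a PHP implementation, found online, of an algorithm for extracting the main text of a web page.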
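Idea 1 above can be made concrete with PHP's built-in DOM extension. The following is only a rough sketch under stated assumptions, not the algorithm used by the class in this article: the function name `roughBodyText` is hypothetical, "Chinese text" is approximated as any character in the Unicode Han script, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper illustrating idea 1; it is not part of the Readability class.
function roughBodyText(string $html): string
{
    $dom = new DOMDocument();
    // Assumption: the input is UTF-8. Prefixing an XML encoding declaration is a
    // common way to make loadHTML() read it as such; "@" hides markup warnings.
    @$dom->loadHTML('<?xml encoding="utf-8">' . $html);

    // Step 1: work only on the <body> element.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $xpath = new DOMXPath($dom);

    // Steps 2 and 3: remove all links, scripts, styles, and comments.
    $junk = $xpath->query('.//a | .//script | .//style | .//comment()', $body);
    foreach (iterator_to_array($junk) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Step 4: remove elements that contain no Chinese (Han) characters.
    // Reverse document order handles descendants before their ancestors.
    $elements = iterator_to_array($xpath->query('.//*', $body));
    foreach (array_reverse($elements) as $element) {
        if (!preg_match('/\p{Han}/u', $element->textContent)) {
            $element->parentNode->removeChild($element);
        }
    }

    // Step 5: collapse whitespace and return the remaining text.
    return trim(preg_replace('/\s+/u', ' ', $body->textContent));
}
```

A footer written in Chinese would survive this filter, which is exactly the weakness described above: tag stripping alone cannot tell the article from other text on the page.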

Zheng Xiao also tested the class locally, and its accuracy was very high. The code is as follows:

```php
<?php
class Readability {
    // Name of the attribute used to store the score on a node
    const ATTR_CONTENT_SCORE = "contentScore";

    // The DOM parser currently supports only UTF-8
    const DOM_DEFAULT_CHARSET = "utf-8";

    // Message shown when the content cannot be determined
    const MESSAGE_CAN_NOT_GET = "Readability was unable to parse this page for content.";

    // DOM parser (built into PHP 5)
    protected $DOM = null;

    // Source code to be parsed
    protected $source = "";

    // List of the parent elements of the paragraphs
    private $parentNodes = array();

    // Tags to be removed
    // Note: added extra tags from ...
    private $junkTags = array("style", "form", "iframe", "script", "button", "input", "textarea",
                              "noscript", "select", "option", "object", "applet", "basefont",
                              "bgsound", "blink", "canvas", "command", "menu", "nav", "datalist",
                              "embed", "frame", "frameset", "keygen", "label", "marquee", "link");

    // Attributes to be removed
    private $junkAttrs = array("style", "class", "onclick", "onmouseover", "align", "border", "margin");

    /**
     * Constructor
     * @param $input_char Encoding of the string. Defaults to utf-8 and may be omitted
     */
    function __construct($source, $input_char = "utf-8") {
        $this->source = $source;

        // The DOM parser can only handle UTF-8 characters
        // (the HTML-ENTITIES pseudo-encoding is deprecated as of PHP 8.2)
        $source = mb_convert_encoding($source, 'HTML-ENTITIES', $input_char);

        // Preprocess the HTML: strip redundant tags and so on
        $source = $this->preparSource($source);

        // Create the DOM parser
        $this->DOM = new DOMDocument('1.0', $input_char);
        try {
            //libxml_use_internal_errors(true);
            // There will be some error messages, but that does not matter :)
            if (!@$this->DOM->loadHTML('<?xml encoding="'.Readability::DOM_DEFAULT_CHARSET.'">'.$source)) {
                throw new Exception("Parse HTML Error!");
            }

            foreach ($this->DOM->childNodes as $item) {
                if ($item->nodeType == XML_PI_NODE) {
                    $this->DOM->removeChild($item); // remove hack
                }
            }

            // insert proper
            $this->DOM->encoding = Readability::DOM_DEFAULT_CHARSET;
        } catch (Exception $e) {
            // ...
        }
    }

    /**
     * Preprocess the HTML so that the DOM parser can handle it correctly
     *
     * @return String
     */
    private function preparSource($string) {
        // Strip the redundant HTML charset declaration to avoid parse errors
        preg_match("/charset=([\w\-]+);?/", $string, $match);
        if (isset($match[1])) {
            $string = preg_replace("/charset=([\w\-]+);?/", "", $string, 1);
        }

        // Replace all doubled-up <BR> tags with <P> tags, and remove fonts.
        $string = preg_replace("/<br\/?>[ \r\n\s]*<br\/?>/i", "</p><p>", $string);
        $string = preg_replace("/<\/?font[^>]*>/i", "", $string);

        // see ...
        //   - from ...
        $string = preg_replace("#<script(.*?)>(.*?)</script>#is", "", $string);

        return trim($string);
    }

    /**
     * Remove all $TagName tags from the DOM element
     *
     * @return DOMDocument
     */
    private function removeJunkTag($RootNode, $TagName) {
        $Tags = $RootNode->getElementsByTagName($TagName);

        // Note: always index 0, because removing a tag removes it from the results as well.
        while ($Tag = $Tags->item(0)) {
            $parentNode = $Tag->parentNode;
            $parentNode->removeChild($Tag);
        }

        return $RootNode;
    }

    /**
     * Remove an unwanted attribute from every element
     */
    private function removeJunkAttr($RootNode, $Attr) {
        $Tags = $RootNode->getElementsByTagName("*");

        $i = 0;
        while ($Tag = $Tags->item($i++)) {
            $Tag->removeAttribute($Attr);
        }

        return $RootNode;
    }

    /**
     * Find the box that holds the page's main content, based on scoring
     * Scoring algorithm from: ...
     * Reposted here by Zheng Xiao's blog
     * @return DOMNode
     */
    private function getTopBox() {
        // Get all paragraphs on the page
        $allParagraphs = $this->DOM->getElementsByTagName("p");

        // Study all the paragraphs and find the chunk that has the best score.
        // A score is determined by things like: Number of <p>'s, commas, special classes, etc.
        $i = 0;
        while ($paragraph = $allParagraphs->item($i++)) {
            $parentNode   = $paragraph->parentNode;
            $contentScore = intval($parentNode->getAttribute(Readability::ATTR_CONTENT_SCORE));

            // ... (items 12 and 13 of the source listing are missing here) ...

            // Add points for any commas within this paragraph
            if (strlen($paragraph->nodeValue) > 10) {
                $contentScore += strlen($paragraph->nodeValue);
            }

            // Store the score on the parent element
            $parentNode->setAttribute(Readability::ATTR_CONTENT_SCORE, $contentScore);

            // Remember the paragraph's parent element for quick access later
            array_push($this->parentNodes, $parentNode);
        }

        $topBox = null;

        // Assignment from index for performance.
        //     See ...
        for ($i = 0, $len = sizeof($this->parentNodes); $i < $len; $i++) {
            $parentNode      = $this->parentNodes[$i];
            $contentScore    = intval($parentNode->getAttribute(Readability::ATTR_CONTENT_SCORE));
            $orgContentScore = intval($topBox ? $topBox->getAttribute(Readability::ATTR_CONTENT_SCORE) : 0);

            if ($contentScore && $contentScore > $orgContentScore) {
                $topBox = $parentNode;
            }
        }

        // At this point $topBox should be the main content element of the page
        return $topBox;
    }

    /**
     * Get the HTML page title
     *
     * @return String
     */
    public function getTitle() {
        $split_point = ' - ';
        $titleNodes = $this->DOM->getElementsByTagName("title");

        if ($titleNodes->length
            && $titleNode = $titleNodes->item(0)) {
            // see ...
            $title  = trim($titleNode->nodeValue);
            $result = array_map('strrev', explode($split_point, strrev($title)));
            return sizeof($result) > 1 ? array_pop($result) : $title;
        }

        return null;
    }

    /**
     * Get Leading Image Url
     *
     * @return String
     */
    public function getLeadImageUrl($node) {
        $images = $node->getElementsByTagName("img");

        if ($images->length && $leadImage = $images->item(0)) {
            return $leadImage->getAttribute("src");
        }

        return null;
    }

    /**
     * Get the main content of the page (the content after Readability processing)
     *
     * @return Array
     */
    public function getContent() {
        if (!$this->DOM) return false;

        // Get the page title
        $ContentTitle = $this->getTitle();

        // Get the main content of the page
        $ContentBox = $this->getTopBox();

        // Check if we found a suitable top-box.
        if ($ContentBox === null)
            throw new RuntimeException(Readability::MESSAGE_CAN_NOT_GET);

        // Copy the content into a new DOMDocument
        $Target = new DOMDocument;
        $Target->appendChild($Target->importNode($ContentBox, true));

        // Remove unwanted tags
        foreach ($this->junkTags as $tag) {
            $Target = $this->removeJunkTag($Target, $tag);
        }

        // Remove unwanted attributes
        foreach ($this->junkAttrs as $attr) {
            $Target = $this->removeJunkAttr($Target, $attr);
        }

        $content = mb_convert_encoding($Target->saveHTML(), Readability::DOM_DEFAULT_CHARSET, "HTML-ENTITIES");

        // Several values, returned as an array
        return array(
            'lead_image_url' => $this->getLeadImageUrl($Target),
            'word_count' => mb_strlen(strip_tags($content), Readability::DOM_DEFAULT_CHARSET),
            'title' => $ContentTitle ? $ContentTitle : null,
            'content' => $content
        );
    }

    function __destruct() { }
}
```

The class is also very simple to use: pass the page's HTML source and its encoding when instantiating it, then call its getContent method directly to get the extracted main content. The extracted article may still contain a few links, which you can clean up yourself afterwards.
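The leftover links mentioned above could be cleaned up with a small post-processing step. This is a minimal sketch rather than part of the original class: the helper name `stripLinks` and its `$keepText` parameter are hypothetical, and it uses a regular expression in the same spirit as the script-stripping pattern in `preparSource`, so it assumes reasonably well-formed, non-nested `<a>` tags.

```php
<?php
// Hypothetical post-processing helper; it is not part of the Readability class.
function stripLinks(string $html, bool $keepText = true): string
{
    // Match an opening <a ...> tag, its contents (non-greedy), and the closing </a>.
    // "\b" keeps tags such as <abbr> or <article> from being matched.
    $pattern = '#<a\b[^>]*>(.*?)</a>#is';

    // '$1' keeps the anchor text; '' drops the link together with its text.
    return preg_replace($pattern, $keepText ? '$1' : '', $html);
}
```

Assuming the class above is loaded and `$html` holds a fetched page, it could be applied as `stripLinks((new Readability($html, 'utf-8'))->getContent()['content'])`. Keeping the anchor text by default avoids cutting words out of sentences where the article links inline.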
