My enthusiasm for web page parsing and crawler building has never faded. Today I built a crawler with the open-source simple_html_dom.php parsing library:
<?php
/*
*.Pho spider v1.0
*.Written by Radish.ghost 2015.1.20
*/
//error_reporting(1); // Report fatal errors only (1 == E_ERROR)
// cURL mode: planned for a later version
include_once("simple_html_dom.php");
$html=file_get_html('http://www.baidu.com'); // The URL to crawl
$tmp=array(); // URLs collected in the first pass
foreach($html->find('a') as $e)
{
$f=$e->href;
//if($f[10]==':')continue;
if($f[0]=='/')$f='http://www.baidu.com'.$f; // Complete a site-relative URL
if($f[4]=='s')continue; // Skip "https://" URLs (simple_html_dom may not be able to fetch https:// pages)
if(stripos($f,"baidu")===false)continue; // Skip URLs that are not on this site
echo $f . '<br>';
$tmp[]=$f; // Append the URL to the array
}
foreach($tmp as $r) // Crawl each URL collected in $tmp
{
$html2=file_get_html($r); // Repeat the same steps for this page
foreach($html2->find('a') as $a)
{
$u=$a->href;
if($u[0]=='/')$u='http://www.baidu.com'.$u;
if($u[4]=='s')continue;
if(stripos($u,"baidu")===false)continue;
echo $u.'<br>';
}
$html2=null;
}
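?>
In the end I always get the error "Fatal error: Call to a member function find() on a non-object in D:\xampp\htdocs\html\index.php on line 21". After talking it over with a senior classmate I fixed a lot of small mistakes, but this one is still unsolved. I hope an expert can point me in the right direction.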
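One possible cause, offered as a guess and not a confirmed diagnosis: in the simple_html_dom versions I am aware of, `file_get_html()` returns `false` instead of an object when a page cannot be fetched, comes back empty, or exceeds the library's size limit, so the next `->find('a')` call fails on a non-object. The sketch below shows one way to guard against that. The names `normalize_link` and `collect_links`, the `$fetch` parameter, and the stricter "only keep `http://` links" rule are my own assumptions for illustration. They are not part of simple_html_dom or of the script above.

```php
<?php
// Hypothetical helpers for illustration only; not part of simple_html_dom.

// Return an absolute http:// URL on the target site, or null if the link
// should be skipped. Stricter than the original checks: anything that is
// not http:// (https, javascript:, mailto:, bare relative paths) is dropped.
function normalize_link($href, $base = 'http://www.baidu.com', $keyword = 'baidu')
{
    if (!is_string($href) || $href === '') {
        return null;                      // <a> tag without a usable href
    }
    if (substr($href, 0, 2) === '//') {
        $href = 'http:' . $href;          // protocol-relative link
    } elseif ($href[0] === '/') {
        $href = $base . $href;            // site-relative link
    }
    if (stripos($href, 'http://') !== 0) {
        return null;                      // does not start with http://
    }
    if (stripos($href, $keyword) === false) {
        return null;                      // link points to another site
    }
    return $href;
}

// Fetch every URL with $fetch and gather its filtered links.
// $fetch is any callable that returns a DOM-like object or false.
function collect_links(array $urls, $fetch)
{
    $links = array();
    $skipped = array();
    foreach ($urls as $url) {
        $dom = call_user_func($fetch, $url);
        if (!is_object($dom)) {           // fetch failed: nothing to call find() on
            $skipped[] = $url;
            continue;
        }
        foreach ($dom->find('a') as $a) {
            $link = normalize_link($a->href);
            if ($link !== null) {
                $links[] = $link;
            }
        }
    }
    return array($links, $skipped);
}
```

With simple_html_dom.php included, the second loop could then be replaced by something like `list($links, $skipped) = collect_links($tmp, 'file_get_html');`, and printing `$skipped` would show which pages failed to load.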
---------------------Divider---------------------
simple_html_dom download:
https://github.com/Ph0enixxx/simple_html_dom
= = I can't use git4win on my home computer