A PHP Web Crawler: Design, Implementation, and Challenges
This article describes a PHP‑based web crawler that extracts links and images using regular expressions, stores URLs in MySQL, handles duplicate detection via MD5, discusses performance limitations, and provides the full source code and usage instructions.
To simplify large‑scale data collection, the author built a PHP web crawler that has already fetched nearly one million pages and now faces storage and retrieval challenges.
The crawler works by downloading a page, extracting all hyperlinks and image sources with regular expressions, and recursively fetching those links.
MySQL is used for persistent storage because it supports efficient queries, and PHP is chosen for its built‑in PCRE support, easy MySQL connectivity, and cross‑platform deployment.
Regular expressions used for link and image extraction:

<code>#<a[^>]+href=(['"])(.+)\1#isU // extract links</code>

<code>#<img[^>]+src=(['"])(.+)\1#isU // extract images</code>

To avoid re-downloading the same URL and to prevent cycles, the crawler records the MD5 hash of each processed URL in the database; this simple check could later be replaced by a more sophisticated algorithm.
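As a sketch of how the link pattern and MD5 de-duplication fit together: the in-memory `$seen` array below stands in for the MySQL lookup table and is an assumption for illustration, not the author's actual storage code.

```php
<?php
// Apply the link-extraction pattern, then drop URLs we have already seen.
$html = '<a href="http://example.com/a">A</a> <a href=\'http://example.com/a\'>A again</a>';

preg_match_all('#<a[^>]+href=([\'"])(.+)\1#isU', $html, $matches);
$links = $matches[2];                 // capture group 2 holds the URL itself

$seen  = array();                     // stand-in for a MySQL table keyed by MD5
$fresh = array();
foreach ($links as $link) {
    $key = md5($link);                // fixed-width key, cheaper to index than raw URLs
    if (!isset($seen[$key])) {
        $seen[$key] = true;
        $fresh[] = $link;             // only previously unseen URLs enter the queue
    }
}
print_r($fresh);
```

In the real crawler the `isset` check would become a lookup against an indexed MD5 column, and the insert a row written on first sight of the URL.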
The crawler respects the robots.txt protocol in principle, though this feature is not yet implemented.
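Since the robots.txt feature is not yet implemented, the following is only a sketch of how such a check might look; the `robots_allowed` helper is hypothetical, it only handles `Disallow` lines in a single section, and a production crawler should use a full parser that honors per-user-agent rules.

```php
<?php
// Hypothetical sketch: check a path against the site's robots.txt Disallow rules.
function robots_allowed($base, $path) {
    $robots = @file_get_contents($base . '/robots.txt');
    if ($robots === false) {
        return true;                              // no robots.txt: assume allowed
    }
    foreach (explode("\n", $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)) {
            if (strpos($path, $m[1]) === 0) {
                return false;                     // path falls under a Disallow rule
            }
        }
    }
    return true;
}
```

The crawler would call this once per URL before `curl_get()`, caching the parsed rules per host to avoid re-fetching robots.txt.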
Usage example (command line):
<code>php -f spider.php 20</code>

The main PHP script includes functions for fetching pages with cURL, extracting URLs, converting relative URLs to absolute ones, filtering out external domains, and deduplicating URLs based on their query parameters. It also maintains a temporary file, /tmp/url.txt, as the URL queue.
<code><?php
# Fetch a page with cURL
function curl_get($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    $result = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch); // always release the handle, even on failure
    if($code != 404 && $result){
        return $result;
    }
}

# Extract URL links from a page
function get_page_urls($spider_page_result, $base_url){
    $get_url_result = preg_match_all('/<[aA].*?href=[\'"]{0,1}([^>\'"]*)[^>]*>/', $spider_page_result, $out);
    if($get_url_result){
        return $out[1];
    }
    return;
}

# Convert relative URLs to absolute ones
function xdtojd($base_url, $url_list){
    if(!is_array($url_list)){
        return;
    }
    $result_url_list = array();
    foreach($url_list as $url_item){
        if(preg_match("/^(http:\/\/|https:\/\/|javascript:)/", $url_item)){
            $result_url_list[] = $url_item;
        }else{
            if(preg_match("/^\//", $url_item)){
                $real_url = $base_url . $url_item;
            }else{
                $real_url = $base_url . "/" . $url_item;
            }
            $result_url_list[] = $real_url;
        }
    }
    return $result_url_list;
}

# Drop URLs that belong to other sites
function other_site_url_del($jd_url_list, $url_base){
    if(!is_array($jd_url_list)){
        return;
    }
    $all_url_list = array();
    foreach($jd_url_list as $all_url){
        if(strpos($all_url, $url_base) === 0){
            $all_url_list[] = $all_url;
        }
    }
    return $all_url_list;
}

# Drop URLs that were already queued (treating URLs that differ only in parameter values as duplicates)
function url_same_del($array_url){
    if(!is_array($array_url)){
        return;
    }
    $insert_url = array();
    $pizza = file_get_contents("/tmp/url.txt");
    if($pizza){
        $pizza = explode("\r\n", $pizza);
        foreach($array_url as $array_value_url){
            if(!in_array($array_value_url, $pizza)){
                $insert_url[] = $array_value_url;
            }
        }
        if($insert_url){
            foreach($insert_url as $key => $insert_url_value){
                // normalize parameter values before comparing
                $update_insert_url = preg_replace('/=[^&]*/', '=leesec', $insert_url_value);
                foreach($pizza as $pizza_value){
                    $update_pizza_value = preg_replace('/=[^&]*/', '=leesec', $pizza_value);
                    if($update_insert_url == $update_pizza_value){
                        unset($insert_url[$key]);
                        continue;
                    }
                }
            }
        }
    }else{
        $insert_url = $array_url;
        $insert_new_url = array();
        foreach($insert_url as $insert_url_value){
            $update_insert_url = preg_replace('/=[^&]*/', '=leesec', $insert_url_value);
            $insert_new_url[] = $update_insert_url;
        }
        $insert_new_url = array_unique($insert_new_url);
        $insert_url_bf = array();
        foreach($insert_new_url as $key => $insert_new_url_val){
            $insert_url_bf[] = $insert_url[$key];
        }
        $insert_url = $insert_url_bf;
    }
    return $insert_url;
}

$current_url = $argv[1];
$fp_puts = fopen("/tmp/url.txt", "ab"); // append newly found URLs to the queue
$fp_gets = fopen("/tmp/url.txt", "r");  // read the next URL from the queue
$url_base_url = parse_url($current_url);
if(empty($url_base_url['scheme'])){
    $url_base = "http://" . $url_base_url['host'];
}else{
    $url_base = $url_base_url['scheme'] . "://" . $url_base_url['host'];
}
do{
    $spider_page_result = curl_get($current_url);
    $url_list = get_page_urls($spider_page_result, $url_base);
    if(!$url_list){
        continue;
    }
    $jd_url_list = xdtojd($url_base, $url_list);
    $result_url_arr = other_site_url_del($jd_url_list, $url_base);
    var_dump($result_url_arr); // debug output of the URLs found on this page
    $result_url_arr = url_same_del($result_url_arr);
    if(is_array($result_url_arr)){
        $result_url_arr = array_unique($result_url_arr);
        foreach($result_url_arr as $new_url){
            fputs($fp_puts, $new_url . "\r\n");
        }
    }
}while($current_url = fgets($fp_gets, 1024)); // keep pulling URLs from the queue
?>
</code>

Because the crawler processes URLs sequentially, it performs well on small datasets but slows down significantly as the crawl history grows; indexing the URL table improves lookup speed, but loading tens of thousands of records per check remains costly, and the script lacks multithreading.
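One PHP-native way to mitigate the lack of multithreading, without rewriting the crawler, is cURL's multi interface, which drives several downloads concurrently in a single process. The `curl_multi_get` helper below is a sketch, not part of the original script:

```php
<?php
// Sketch: fetch a batch of URLs concurrently with the cURL multi interface.
function curl_multi_get(array $urls){
    $mh = curl_multi_init();
    $handles = array();
    foreach($urls as $url){
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    do{ // drive all transfers until none are still running
        $status = curl_multi_exec($mh, $running);
        if($running){
            curl_multi_select($mh); // wait for socket activity instead of busy-looping
        }
    }while($running && $status == CURLM_OK);
    $results = array();
    foreach($handles as $url => $ch){
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results; // map of URL => response body (false on failure)
}
```

The crawl loop could then fetch, say, ten queued URLs per iteration with one `curl_multi_get()` call instead of ten sequential `curl_get()` calls.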
To run the crawler, create a MySQL database named <code>net_spider</code>, import the provided db.sql schema, configure credentials in config.php, and execute the script with the desired depth and start URL.
Overall, building a functional crawler is straightforward, but efficient storage, retrieval, and scalability of the collected data present the real challenges.