Python Web Scraping: Intelligent Pagination for Batch File Download
This article demonstrates a practical Python web-scraping workflow that uses requests, lxml, re, and os to paginate a target site automatically, extract PDF file names and URLs, create a folder per report category, and download every file in bulk with minimal manual effort (illustrated with the 京客隆 (Jingkelong) investment-reports page).
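Before the step-by-step script, here is a minimal self-contained sketch of two helpers this kind of workflow leans on throughout: turning relative hrefs into absolute URLs and turning link text into safe file names. The URLs and names below are illustrative assumptions, not values from the real site, and `urljoin` is used here instead of the plain string concatenation the script itself performs.

```python
import re
from urllib.parse import urljoin

def absolutize(base, hrefs):
    """Join relative hrefs (as returned by an XPath query) onto the base page URL."""
    return [urljoin(base, h) for h in hrefs]

def safe_name(text):
    """Strip whitespace and replace characters Windows forbids in file names."""
    return re.sub(r'[\\/:*?"<>|]', '.', text.strip())

# Illustrative values only -- the real hrefs come from lxml XPath queries.
links = absolutize('http://www.jkl.com.cn/cn/invest.aspx', ['invest/report1.aspx'])
print(links)                            # ['http://www.jkl.com.cn/cn/invest/report1.aspx']
print(safe_name(' 2020/Q1 业绩报告 '))  # '2020.Q1 业绩报告'
```

Unlike plain concatenation, `urljoin` handles both path-relative and root-relative hrefs correctly.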
1. Import required libraries
import requests
from lxml import etree
import re
import os

2. Parse the initial page
baseUrl = 'http://www.jkl.com.cn/cn/invest.aspx' # target page URL
heade = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'
}
res = requests.get(url=baseUrl, headers=heade).text
html = etree.HTML(res)

3. Obtain category names and their URLs
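As a self-contained preview of what this style of XPath extraction yields, the snippet below runs the same two queries against a made-up fragment mimicking the page's infoLis block; the markup is an assumption reconstructed from the queries, not copied from the live site.

```python
from lxml import etree

snippet = '''
<div class="infoLis">
  <ul>
    <li><a href="invest/dqbg.aspx"> 定期报告 </a></li>
    <li><a href="invest/lsgg.aspx"> 临时公告 </a></li>
  </ul>
</div>
'''
doc = etree.HTML(snippet)
# text() returns the link texts (with surrounding whitespace), //@href the attribute values
names = [n.strip() for n in doc.xpath('//div[@class="infoLis"]//a/text()')]
hrefs = doc.xpath('//div[@class="infoLis"]//@href')
print(names)  # ['定期报告', '临时公告']
print(hrefs)  # ['invest/dqbg.aspx', 'invest/lsgg.aspx']
```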
data_name = html.xpath('//div[@class="infoLis"]//a/text()') # category titles
data_link = html.xpath('//div[@class="infoLis"]//@href') # category links
name = [n.strip() for n in data_name]
link = ['http://www.jkl.com.cn/cn/' + l for l in data_link]
file = dict(zip(name, link))

4. Create a folder for each category
for name, link in file.items():
    # sanitize the category title so it is a legal Windows folder name
    name = name.replace('/', '.').replace('...', '报表')
    path = 'E:/' + name
    if not os.path.exists(path):
        os.mkdir(path)

5. Determine the total number of pages for each category
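The page count is read from the pager's 尾页 ("last page") link. Assuming its href embeds the page number, e.g. something like `list.aspx?page=9` (the exact format is my assumption; check the real markup), the extraction logic can be verified offline:

```python
import re

def page_count(last_page_hrefs):
    """Return the total page count from a list of 'last page' hrefs, defaulting to 1."""
    if last_page_hrefs:
        m = re.search(r"(\d+)", last_page_hrefs[0])
        if m:
            return int(m.group(1))
    return 1  # no pager link on the page: assume a single page

print(page_count(['list.aspx?page=9']))  # 9
print(page_count([]))                    # 1
```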
    # still inside the category loop from step 4
    res_list = requests.get(url=link, headers=heade).text
    list_html = etree.HTML(res_list)
    weiYe = list_html.xpath('//a[text()="尾页"]/@href')  # "尾页" = last-page pager link
    if weiYe:
        get_weiYe = re.search(r"(\d+)", weiYe[0])  # raw string avoids an invalid-escape warning
        get_yeMa = get_weiYe.group(1)
    else:
        get_yeMa = 1

6. Extract file names and download links on each page
    for page in range(1, int(get_yeMa) + 1):
        # replay the AspNetPager postback arguments as query parameters
        yaMa = {'__EVENTTARGET': 'AspNetPager1', '__EVENTARGUMENT': page}
        get_lei_html = requests.get(url=link, headers=heade, params=yaMa).text
        res3 = etree.HTML(get_lei_html)
        pdf_name = res3.xpath('//div[@class="newsLis"]//li/a/text()')
        pdf_url = res3.xpath('//div[@class="newsLis"]//li//@href')

7. Clean the data, build full URLs, and download the PDFs
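One caveat worth adding before the final snippet: `dict(zip(...))` silently truncates to the shorter of its two lists, and duplicate titles overwrite earlier entries. A small defensive helper (my own addition, not part of the original script) surfaces count mismatches instead of hiding them:

```python
def pair_files(names, urls):
    """Pair link texts with URLs, refusing to guess when the counts differ."""
    if len(names) != len(urls):
        raise ValueError(f"got {len(names)} names but {len(urls)} urls")
    return dict(zip(names, urls))

print(pair_files(['a'], ['http://example.com/a.pdf']))  # {'a': 'http://example.com/a.pdf'}
```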
        pdf_names = [n.strip() for n in pdf_name]
        if all(pdf_url):  # skip pages where any href is empty
            pdf_urls = ['http://www.jkl.com.cn' + u for u in pdf_url]
            pdf_data = dict(zip(pdf_names, pdf_urls))
            for pdfName, pdfUrl in pdf_data.items():
                pdfName = pdfName.replace('/', '.')  # keep '/' out of the file name
                res_pdf = requests.get(url=pdfUrl, headers=heade).content
                ext = pdfUrl.split('.')[-1]  # file extension taken from the URL
                pdf_path = path + '/' + pdfName + '.' + ext
                with open(pdf_path, 'wb') as f:
                    f.write(res_pdf)
                print(pdfName, '下载成功')  # "downloaded successfully"

By following these steps and running the script, you can automatically retrieve and store all PDF reports from the target site without manual pagination.
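As a closing refinement: `.content` pulls each PDF fully into memory, which is fine for small reports but wasteful for large ones. The sketch below streams to disk instead; it assumes a `requests.Session` is passed in, and the `chunk_size` and timeout values are my own choices rather than anything from the original script.

```python
import os

def target_path(folder, name, url):
    """Build the destination path from the link text and the URL's extension."""
    ext = url.rsplit('.', 1)[-1]
    return os.path.join(folder, name.replace('/', '.') + '.' + ext)

def download_streaming(session, url, folder, name, chunk_size=8192):
    """Stream a file to disk in chunks instead of buffering it whole."""
    dest = target_path(folder, name, url)
    with session.get(url, stream=True, timeout=30) as resp:
        # fail loudly on 4xx/5xx instead of saving an error page under a .pdf name
        resp.raise_for_status()
        with open(dest, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return dest
```

`raise_for_status` also catches dead links that would otherwise be saved as HTML error pages with a `.pdf` extension.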