
Python Web Scraping: Intelligent Pagination for Batch File Download

This guide explains how to use Python libraries such as requests, lxml, re, and os to automatically paginate through a website, extract PDF file names and URLs, create a folder per category, and download all files in bulk with minimal manual effort.

Python Programming Learning Circle

This article demonstrates a practical Python web‑scraping workflow for intelligently paginating and batch‑downloading PDF files from a target site (illustrated with the 京客隆 investment reports page).

1. Import required libraries

import requests
import pandas as pd  # imported in the original script but unused below
from lxml import etree
import re
import os

2. Parse the initial page

baseUrl = 'http://www.jkl.com.cn/cn/invest.aspx'  # target page URL
heade = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'
}
res = requests.get(url=baseUrl, headers=heade).text
html = etree.HTML(res)

3. Obtain category names and their URLs

data_name = html.xpath('//div[@class="infoLis"]//a/text()')  # category titles
data_link = html.xpath('//div[@class="infoLis"]//a/@href')   # category links (same <a> nodes, so the two lists stay aligned)
name = [n.strip() for n in data_name]
link = ['http://www.jkl.com.cn/cn/' + l for l in data_link]
file = dict(zip(name, link))  # {category title: absolute URL}
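As a self-contained illustration of this step, the same XPath-plus-zip pattern can be run against an inline HTML snippet. The markup and category names below are invented for the demo; only the `infoLis` class name and URL prefix come from the article:

```python
from lxml import etree

# Hypothetical markup mimicking the category list on the target page.
snippet = '''
<div class="infoLis">
  <ul>
    <li><a href="invest.aspx?type=1"> 年度报告 </a></li>
    <li><a href="invest.aspx?type=2"> 中期报告 </a></li>
  </ul>
</div>
'''

html = etree.HTML(snippet)
names = [n.strip() for n in html.xpath('//div[@class="infoLis"]//a/text()')]
links = ['http://www.jkl.com.cn/cn/' + h
         for h in html.xpath('//div[@class="infoLis"]//a/@href')]
categories = dict(zip(names, links))
print(categories)
```

Because both XPath expressions select from the same `<a>` nodes in document order, `zip` pairs each title with its own link.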

4. Create a folder for each category

for name, link in file.items():
    # '/' is illegal in Windows folder names; the site's truncated '...' labels are renamed to '报表' ("reports")
    name = name.replace('/', '.').replace('...', '报表')
    path = 'E:/' + name  # base directory is hard-coded; adjust for your machine
    if not os.path.exists(path):
        os.mkdir(path)
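A slightly more robust variant of the `os.path.exists`/`os.mkdir` pair is `os.makedirs(..., exist_ok=True)`, which avoids a race between the check and the creation. The sketch below uses a temporary directory instead of the hard-coded `E:/` drive so it runs anywhere, and the two category names are invented examples:

```python
import os
import tempfile

# A temporary base directory stands in for the hard-coded 'E:/' in the article.
base = tempfile.mkdtemp()

for raw_name in ['定期报告/年报', '临时公告...']:  # hypothetical category names
    # Same sanitisation as the article: '/' is illegal in folder names,
    # and truncated '...' labels become '报表'.
    folder = raw_name.replace('/', '.').replace('...', '报表')
    path = os.path.join(base, folder)
    os.makedirs(path, exist_ok=True)  # no exists()/mkdir() race, creates parents too
```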

5. Determine the total number of pages for each category

# This block runs inside the per-category loop from step 4, so `link` is the current category URL.
res_list = requests.get(url=link, headers=heade).text
list_html = etree.HTML(res_list)
weiYe = list_html.xpath('//a[text()="尾页"]/@href')  # href of the "last page" (尾页) link
if weiYe:
    get_weiYe = re.search(r"(\d+)", weiYe[0])  # raw string avoids the invalid-escape warning
    get_yeMa = get_weiYe.group(1)
else:
    get_yeMa = 1  # no pager link: the category fits on a single page
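The page-count extraction can be exercised offline. The href format below is a guess at what the 尾页 ("last page") link might look like; only the `(\d+)` pattern and the fallback to 1 come from the article:

```python
import re

# Hypothetical href of the "last page" link; the real site's format may differ.
last_page_href = 'invest.aspx?page=12'

match = re.search(r"(\d+)", last_page_href)
total_pages = int(match.group(1)) if match else 1  # default to one page when no pager exists
print(total_pages)  # 12
```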

6. Extract file names and download links on each page

for page in range(1, int(get_yeMa) + 1):
    # AspNetPager postback fields select the requested page number
    yaMa = {'__EVENTTARGET': 'AspNetPager1', '__EVENTARGUMENT': page}
    get_lei_html = requests.get(url=link, headers=heade, params=yaMa).text
    res3 = etree.HTML(get_lei_html)
    pdf_name = res3.xpath('//div[@class="newsLis"]//li/a/text()')  # report titles
    pdf_url = res3.xpath('//div[@class="newsLis"]//li/a/@href')    # same <a> nodes, so titles and hrefs stay aligned
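Independent of the network calls, the pager parameters sent for each request can be previewed offline. The `__EVENTTARGET`/`__EVENTARGUMENT` field names come from the article; the page count of 3 is arbitrary for the demo:

```python
total_pages = 3  # arbitrary here; produced by step 5 in the real script

# One parameter dict per page, exactly as the loop above would send them.
pager_params = [
    {'__EVENTTARGET': 'AspNetPager1', '__EVENTARGUMENT': page}
    for page in range(1, total_pages + 1)
]
print(pager_params)
```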

7. Clean the data, build full URLs, and download the PDFs

pdf_names = [n.strip() for n in pdf_name]
if all(pdf_url):  # skip pages where any href came back empty
    pdf_urls = ['http://www.jkl.com.cn' + u for u in pdf_url]
    pdf_data = dict(zip(pdf_names, pdf_urls))
    for pdfName, pdfUrl in pdf_data.items():
        pdfName = pdfName.replace('/', '.')  # '/' is illegal in file names
        res_pdf = requests.get(url=pdfUrl, headers=heade).content
        ext = pdfUrl.split('.')[-1]  # keep the original extension (usually pdf)
        pdf_path = path + '/' + pdfName + '.' + ext
        with open(pdf_path, 'wb') as f:
            f.write(res_pdf)
        print(pdfName, '下载成功')  # "downloaded successfully"
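The file-naming logic in this step can be factored into a small helper and tested without touching the network. The helper name and the sample title/URL below are mine, not from the article:

```python
import os

def build_pdf_path(folder: str, pdf_name: str, pdf_url: str) -> str:
    """Build a safe local path: sanitise the title and reuse the URL's extension."""
    safe_name = pdf_name.replace('/', '.')  # '/' is illegal in file names
    ext = pdf_url.split('.')[-1]            # usually 'pdf'
    return os.path.join(folder, safe_name + '.' + ext)

# Hypothetical report title and URL for the demo.
p = build_pdf_path('E:/年度报告', '2019年度/中期报告', 'http://www.jkl.com.cn/a/b.pdf')
print(p)
```

For large files, `requests.get(..., stream=True)` combined with `response.iter_content()` writes the PDF in chunks instead of holding the whole body in memory.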

By following these steps and running the provided script, users can automatically retrieve and store all PDF reports from the target site without manual pagination.

Tags: Python, Automation, pagination, file download, web scraping, Requests, lxml
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
