Python Web Scraping: Intelligent Pagination for Batch File Download
This article demonstrates a practical Python web-scraping workflow that uses requests, lxml, re, and os to paginate a target site automatically, extract PDF file names and URLs, create a folder per report category, and download every file in bulk with minimal manual effort (illustrated with the 京客隆 (Jingkelong) investment-reports page).
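Before the step-by-step script, here is a minimal self-contained sketch of two helpers this kind of workflow leans on throughout: turning relative hrefs into absolute URLs and turning link text into safe file names. The URLs and names below are illustrative assumptions, not values from the real site, and `urljoin` is used here instead of the plain string concatenation the script itself performs.

```python
import re
from urllib.parse import urljoin

def absolutize(base, hrefs):
    """Join relative hrefs (as returned by an XPath query) onto the base page URL."""
    return [urljoin(base, h) for h in hrefs]

def safe_name(text):
    """Strip whitespace and replace characters Windows forbids in file names."""
    return re.sub(r'[\\/:*?"<>|]', '.', text.strip())

# Illustrative values only -- the real hrefs come from lxml XPath queries.
links = absolutize('http://www.jkl.com.cn/cn/invest.aspx', ['invest/report1.aspx'])
print(links)                            # ['http://www.jkl.com.cn/cn/invest/report1.aspx']
print(safe_name(' 2020/Q1 业绩报告 '))  # '2020.Q1 业绩报告'
```

Unlike plain concatenation, `urljoin` handles both path-relative and root-relative hrefs correctly.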
1. Import required libraries
import requests
from lxml import etree
import re
import os

2. Parse the initial page
baseUrl = 'http://www.jkl.com.cn/cn/invest.aspx' # target page URL
heade = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'
}
res = requests.get(url=baseUrl, headers=heade).text
html = etree.HTML(res)

3. Obtain category names and their URLs
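As a self-contained preview of what this style of XPath extraction yields, the snippet below runs the same two queries against a made-up fragment mimicking the page's infoLis block; the markup is an assumption reconstructed from the queries, not copied from the live site.

```python
from lxml import etree

snippet = '''
<div class="infoLis">
  <ul>
    <li><a href="invest/dqbg.aspx"> 定期报告 </a></li>
    <li><a href="invest/lsgg.aspx"> 临时公告 </a></li>
  </ul>
</div>
'''
doc = etree.HTML(snippet)
# text() returns the link texts (with surrounding whitespace), //@href the attribute values
names = [n.strip() for n in doc.xpath('//div[@class="infoLis"]//a/text()')]
hrefs = doc.xpath('//div[@class="infoLis"]//@href')
print(names)  # ['定期报告', '临时公告']
print(hrefs)  # ['invest/dqbg.aspx', 'invest/lsgg.aspx']
```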
data_name = html.xpath('//div[@class="infoLis"]//a/text()') # category titles
data_link = html.xpath('//div[@class="infoLis"]//@href') # category links
name = [n.strip() for n in data_name]
link = ['http://www.jkl.com.cn/cn/' + l for l in data_link]
file = dict(zip(name, link))

4. Create a folder for each category
for name, link in file.items():
    # sanitize the category title so it is a legal Windows folder name
    name = name.replace('/', '.').replace('...', '报表')
    path = 'E:/' + name
    if not os.path.exists(path):
        os.mkdir(path)

5. Determine the total number of pages for each category
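The page count is read from the pager's 尾页 ("last page") link. Assuming its href embeds the page number, e.g. something like `list.aspx?page=9` (the exact format is my assumption; check the real markup), the extraction logic can be verified offline:

```python
import re

def page_count(last_page_hrefs):
    """Return the total page count from a list of 'last page' hrefs, defaulting to 1."""
    if last_page_hrefs:
        m = re.search(r"(\d+)", last_page_hrefs[0])
        if m:
            return int(m.group(1))
    return 1  # no pager link on the page: assume a single page

print(page_count(['list.aspx?page=9']))  # 9
print(page_count([]))                    # 1
```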
    # still inside the category loop from step 4
    res_list = requests.get(url=link, headers=heade).text
    list_html = etree.HTML(res_list)
    weiYe = list_html.xpath('//a[text()="尾页"]/@href')  # "尾页" = last-page pager link
    if weiYe:
        get_weiYe = re.search(r"(\d+)", weiYe[0])  # raw string avoids an invalid-escape warning
        get_yeMa = get_weiYe.group(1)
    else:
        get_yeMa = 1

6. Extract file names and download links on each page
    for page in range(1, int(get_yeMa) + 1):
        # replay the AspNetPager postback arguments as query parameters
        yaMa = {'__EVENTTARGET': 'AspNetPager1', '__EVENTARGUMENT': page}
        get_lei_html = requests.get(url=link, headers=heade, params=yaMa).text
        res3 = etree.HTML(get_lei_html)
        pdf_name = res3.xpath('//div[@class="newsLis"]//li/a/text()')
        pdf_url = res3.xpath('//div[@class="newsLis"]//li//@href')

7. Clean the data, build full URLs, and download the PDFs
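One caveat worth adding before the final snippet: `dict(zip(...))` silently truncates to the shorter of its two lists, and duplicate titles overwrite earlier entries. A small defensive helper (my own addition, not part of the original script) surfaces count mismatches instead of hiding them:

```python
def pair_files(names, urls):
    """Pair link texts with URLs, refusing to guess when the counts differ."""
    if len(names) != len(urls):
        raise ValueError(f"got {len(names)} names but {len(urls)} urls")
    return dict(zip(names, urls))

print(pair_files(['a'], ['http://example.com/a.pdf']))  # {'a': 'http://example.com/a.pdf'}
```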
        pdf_names = [n.strip() for n in pdf_name]
        if all(pdf_url):  # skip pages where any href is empty
            pdf_urls = ['http://www.jkl.com.cn' + u for u in pdf_url]
            pdf_data = dict(zip(pdf_names, pdf_urls))
            for pdfName, pdfUrl in pdf_data.items():
                pdfName = pdfName.replace('/', '.')  # keep '/' out of the file name
                res_pdf = requests.get(url=pdfUrl, headers=heade).content
                ext = pdfUrl.split('.')[-1]  # file extension taken from the URL
                pdf_path = path + '/' + pdfName + '.' + ext
                with open(pdf_path, 'wb') as f:
                    f.write(res_pdf)
                print(pdfName, '下载成功')  # "downloaded successfully"

By following these steps and running the script, you can automatically retrieve and store all PDF reports from the target site without manual pagination.
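As a closing refinement: `.content` pulls each PDF fully into memory, which is fine for small reports but wasteful for large ones. The sketch below streams to disk instead; it assumes a `requests.Session` is passed in, and the `chunk_size` and timeout values are my own choices rather than anything from the original script.

```python
import os

def target_path(folder, name, url):
    """Build the destination path from the link text and the URL's extension."""
    ext = url.rsplit('.', 1)[-1]
    return os.path.join(folder, name.replace('/', '.') + '.' + ext)

def download_streaming(session, url, folder, name, chunk_size=8192):
    """Stream a file to disk in chunks instead of buffering it whole."""
    dest = target_path(folder, name, url)
    with session.get(url, stream=True, timeout=30) as resp:
        # fail loudly on 4xx/5xx instead of saving an error page under a .pdf name
        resp.raise_for_status()
        with open(dest, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return dest
```

`raise_for_status` also catches dead links that would otherwise be saved as HTML error pages with a `.pdf` extension.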