Integrating Scrapy with Selenium for Dynamic Web Page Crawling
This guide explains how to combine Scrapy and Selenium to scrape dynamically rendered web pages, covering installation, project setup, middleware configuration, Selenium driver handling, and code examples that demonstrate a complete end‑to‑end crawling workflow.
Background: When using Scrapy to crawl websites, dynamic content loaded by JavaScript is often missed; Selenium can render such pages, allowing Scrapy to retrieve the missing data.
Scrapy Overview: Scrapy is a Python framework for extracting structured data from websites, suitable for data mining, API crawling, and general web scraping.
Installation & Project Setup:
pip install Scrapy
scrapy startproject project_name
cd project_name
scrapy genspider spider_name spider_domain

The project directory includes spiders, items.py, middlewares.py, pipelines.py, settings.py, and scrapy.cfg.
Scrapy Execution Flow: requests travel engine → scheduler → downloader → spider → pipelines, with downloader and spider middlewares hooking into each hand-off to process requests and responses along the way.
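The loop above can be sketched as a toy model in plain Python (illustration only, not Scrapy's actual internals — all names here are made up):

```python
from collections import deque

def crawl(start_requests, downloader, parse):
    """Toy model of Scrapy's engine loop: the scheduler queues requests,
    the downloader fetches them, and the spider's parse() yields items
    and/or follow-up requests, which go back to the scheduler."""
    scheduler = deque(start_requests)       # engine -> scheduler
    items = []
    while scheduler:
        request = scheduler.popleft()       # scheduler -> downloader
        response = downloader(request)      # downloader fetches the page
        for result in parse(response):      # response -> spider callback
            if isinstance(result, str):     # a follow-up "request" (a URL)
                scheduler.append(result)
            else:                           # an "item" goes to the pipelines
                items.append(result)
    return items

# Fake site: page "a" links to "b", and "b" yields one data item.
pages = {"a": ["b"], "b": [{"data": 42}]}
print(crawl(["a"], lambda url: pages[url], iter))  # [{'data': 42}]
```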
Selenium Overview: Selenium automates browsers, simulating user actions such as typing, clicking, and executing JavaScript, and is essential for rendering JavaScript-driven pages.
Selenium Installation:
pip install selenium

Driver Installation: Download the appropriate ChromeDriver version from http://npm.taobao.org/mirrors/chromedriver/ (or Opera/IE drivers as needed), ensuring version compatibility with the browser.
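Version mismatches between Chrome and ChromeDriver are a common failure mode. A small helper to sanity-check the pairing (the version strings below are illustrative, not real installed versions):

```python
def major_version(version: str) -> int:
    """Return the leading major number of a version string like '114.0.5735.90'."""
    return int(version.split(".")[0])

def drivers_compatible(browser_version: str, driver_version: str) -> bool:
    """ChromeDriver releases track Chrome's major version, so the two
    major numbers must match for a compatible pairing."""
    return major_version(browser_version) == major_version(driver_version)

# Illustrative version strings:
print(drivers_compatible("114.0.5735.90", "114.0.5735.16"))  # True
print(drivers_compatible("114.0.5735.90", "115.0.5790.24"))  # False
```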
Using Requests for Simple Pages:

import requests

header = {'User-Agent': 'Mozilla/5.0 ...'}
url = "https://192.168.1.1/aqistudy/monthdata.php?city=北京"
res = requests.get(url, headers=header)

if res.status_code == 200:
    print("Request succeeded")
    with open("aqistudy.txt", "w+", encoding="utf8") as f:
        f.write(res.text)
else:
    print("Request failed")

Scrapy + Selenium Integration:
import scrapy

class ApistudyMainSpider(scrapy.Spider):
    name = 'apistudy_main'
    allowed_domains = ['192.168.1.1']

    def start_requests(self):
        start_url = r'https://192.168.1.1/aqistudy/monthdata.php?city=北京'
        yield scrapy.Request(url=start_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        yield {'text': response.text}

Pipeline to Save Data:
class AqistudyPipeline(object):
    def open_spider(self, spider):
        self.file = open('my.html', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(str(item['text']))
        return item  # return the item so any later pipelines can process it

Custom Middleware (RandomHeaderMiddleWare) uses random User-Agents and launches a headless Chrome instance to fetch the rendered page, then returns an HtmlResponse to Scrapy:
import random

from scrapy.http import HtmlResponse
from selenium import webdriver

# USER_AGENTS is assumed to be a list of User-Agent strings defined elsewhere (e.g. in settings.py)

class RandomHeaderMiddleWare:
    def __init__(self):
        self.user_agents = USER_AGENTS

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
        option = webdriver.ChromeOptions()
        option.add_argument('--headless')
        option.add_argument('--disable-gpu')
        option.add_argument('--no-sandbox')
        option.add_argument('--disable-blink-features=AutomationControlled')
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        driver = webdriver.Chrome(options=option)
        # Hide navigator.webdriver before any page script runs
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument",
                               {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"})
        driver.get(request.url)
        driver.implicitly_wait(5)  # note: implicit waits apply to element lookups; for slow-rendering pages, an explicit WebDriverWait is more reliable
        content = driver.page_source
        driver.quit()
        # Returning an HtmlResponse short-circuits the downloader: Scrapy hands this response straight to the spider
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

Settings add the custom middleware and disable ROBOTSTXT_OBEY:
SPIDER_MIDDLEWARES = {'aqistudy.middlewares.AqistudySpiderMiddleware': 543}
DOWNLOADER_MIDDLEWARES = {
    'aqistudy.middlewares.AqistudyDownloaderMiddleware': 543,
    'aqistudy.middlewares.RandomHeaderMiddleWare': 545,
}
ROBOTSTXT_OBEY = False

Result: After running the Scrapy project, the saved HTML contains a fully rendered table with weather data (month, AQI, pollutant levels, etc.), demonstrating that Scrapy + Selenium can bypass anti-scraping measures and extract dynamic content.
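One detail the settings above leave implicit: the AqistudyPipeline shown earlier only runs if it is registered in settings.py. A sketch, assuming the project is named aqistudy (matching the middleware paths above):

```python
ITEM_PIPELINES = {
    'aqistudy.pipelines.AqistudyPipeline': 300,
}
```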
Conclusion: For pages that render data via JavaScript, combining Scrapy with Selenium provides a powerful solution; although Selenium slows down crawling, the approach can be scaled with tools like scrapy-redis for distributed scraping.
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.