Integrating Scrapy with Selenium for Dynamic Web Page Crawling
This guide explains how to combine Scrapy and Selenium to scrape dynamically rendered web pages, covering installation, project setup, middleware configuration, Selenium driver handling, and code examples that demonstrate a complete end‑to‑end crawling workflow.
Background: When using Scrapy to crawl websites, dynamic content loaded by JavaScript is often missed; Selenium can render such pages, allowing Scrapy to retrieve the missing data.
Scrapy Overview: Scrapy is a Python framework for extracting structured data from websites, suitable for data mining, API crawling, and general web scraping.
Installation & Project Setup:
pip install Scrapy
scrapy startproject project_name
cd project_name
scrapy genspider spider_name spider_domain

The project directory includes spiders, items.py, middlewares.py, pipelines.py, settings.py, and scrapy.cfg.
Scrapy Execution Flow: requests travel engine → scheduler → downloader → spider → pipelines, with downloader and spider middlewares hooking into each hand-off to process requests and responses along the way.
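The loop above can be sketched as a toy model in plain Python (illustration only, not Scrapy's actual internals — all names here are made up):

```python
from collections import deque

def crawl(start_requests, downloader, parse):
    """Toy model of Scrapy's engine loop: the scheduler queues requests,
    the downloader fetches them, and the spider's parse() yields items
    and/or follow-up requests, which go back to the scheduler."""
    scheduler = deque(start_requests)       # engine -> scheduler
    items = []
    while scheduler:
        request = scheduler.popleft()       # scheduler -> downloader
        response = downloader(request)      # downloader fetches the page
        for result in parse(response):      # response -> spider callback
            if isinstance(result, str):     # a follow-up "request" (a URL)
                scheduler.append(result)
            else:                           # an "item" goes to the pipelines
                items.append(result)
    return items

# Fake site: page "a" links to "b", and "b" yields one data item.
pages = {"a": ["b"], "b": [{"data": 42}]}
print(crawl(["a"], lambda url: pages[url], iter))  # [{'data': 42}]
```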
Selenium Overview: Selenium automates browsers, simulating user actions such as typing, clicking, and executing JavaScript, and is essential for rendering JavaScript-driven pages.
Selenium Installation:
pip install selenium

Driver Installation: Download the appropriate ChromeDriver version from http://npm.taobao.org/mirrors/chromedriver/ (or Opera/IE drivers as needed), ensuring version compatibility with the browser.
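Version mismatches between Chrome and ChromeDriver are a common failure mode. A small helper to sanity-check the pairing (the version strings below are illustrative, not real installed versions):

```python
def major_version(version: str) -> int:
    """Return the leading major number of a version string like '114.0.5735.90'."""
    return int(version.split(".")[0])

def drivers_compatible(browser_version: str, driver_version: str) -> bool:
    """ChromeDriver releases track Chrome's major version, so the two
    major numbers must match for a compatible pairing."""
    return major_version(browser_version) == major_version(driver_version)

# Illustrative version strings:
print(drivers_compatible("114.0.5735.90", "114.0.5735.16"))  # True
print(drivers_compatible("114.0.5735.90", "115.0.5790.24"))  # False
```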
Using Requests for Simple Pages:

import requests

header = {'User-Agent': 'Mozilla/5.0 ...'}
url = "https://192.168.1.1/aqistudy/monthdata.php?city=北京"
res = requests.get(url, headers=header)

if res.status_code == 200:
    print("Request succeeded")
    with open("aqistudy.txt", "w+", encoding="utf8") as f:
        f.write(res.text)
else:
    print("Request failed")

Scrapy + Selenium Integration:
import scrapy

class ApistudyMainSpider(scrapy.Spider):
    name = 'apistudy_main'
    allowed_domains = ['192.168.1.1']

    def start_requests(self):
        start_url = r'https://192.168.1.1/aqistudy/monthdata.php?city=北京'
        yield scrapy.Request(url=start_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        yield {'text': response.text}

Pipeline to Save Data:
class AqistudyPipeline(object):
    def open_spider(self, spider):
        self.file = open('my.html', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(str(item['text']))
        return item  # return the item so any later pipelines can process it

Custom Middleware (RandomHeaderMiddleWare) uses random User-Agents and launches a headless Chrome instance to fetch the rendered page, then returns an HtmlResponse to Scrapy:
import random

from scrapy.http import HtmlResponse
from selenium import webdriver

# USER_AGENTS is assumed to be a list of User-Agent strings defined elsewhere (e.g. in settings.py)

class RandomHeaderMiddleWare:
    def __init__(self):
        self.user_agents = USER_AGENTS

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
        option = webdriver.ChromeOptions()
        option.add_argument('--headless')
        option.add_argument('--disable-gpu')
        option.add_argument('--no-sandbox')
        option.add_argument('--disable-blink-features=AutomationControlled')
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        driver = webdriver.Chrome(options=option)
        # Hide navigator.webdriver before any page script runs
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument",
                               {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"})
        driver.get(request.url)
        driver.implicitly_wait(5)  # note: implicit waits apply to element lookups; for slow-rendering pages, an explicit WebDriverWait is more reliable
        content = driver.page_source
        driver.quit()
        # Returning an HtmlResponse short-circuits the downloader: Scrapy hands this response straight to the spider
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

Settings add the custom middleware and disable ROBOTSTXT_OBEY:
SPIDER_MIDDLEWARES = {'aqistudy.middlewares.AqistudySpiderMiddleware': 543}
DOWNLOADER_MIDDLEWARES = {
    'aqistudy.middlewares.AqistudyDownloaderMiddleware': 543,
    'aqistudy.middlewares.RandomHeaderMiddleWare': 545,
}
ROBOTSTXT_OBEY = False

Result: After running the Scrapy project, the saved HTML contains a fully rendered table with weather data (month, AQI, pollutant levels, etc.), demonstrating that Scrapy + Selenium can bypass anti-scraping measures and extract dynamic content.
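One detail the settings above leave implicit: the AqistudyPipeline shown earlier only runs if it is registered in settings.py. A sketch, assuming the project is named aqistudy (matching the middleware paths above):

```python
ITEM_PIPELINES = {
    'aqistudy.pipelines.AqistudyPipeline': 300,
}
```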
Conclusion: For pages that render data via JavaScript, combining Scrapy with Selenium provides a powerful solution; although Selenium slows down crawling, the approach can be scaled with tools like scrapy-redis for distributed scraping.
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.