Bypassing Anti‑Scraping Mechanisms: User‑Agent Spoofing and IP Rate Limiting with Python
This article explains how to overcome common anti‑scraping defenses such as identity verification and IP rate limiting by spoofing the User‑Agent header and adding request delays, providing complete Python code examples using requests and BeautifulSoup to scrape Douban's Top 250 movies.
When crawling websites, many sites employ anti‑scraping mechanisms such as identity verification and IP rate limiting, which can block simple HTTP requests.
(1) Identity verification – Websites check the request headers, especially the User‑Agent field, to distinguish browsers from bots. By inspecting the browser’s network panel you can copy the full User‑Agent string and send it with your request.
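You can see why this check works by printing the identity that requests sends by default; it advertises the library rather than a browser (a quick sketch using requests' own `requests.utils.default_user_agent()` helper):

```python
import requests

# By default, requests identifies itself as 'python-requests/<version>',
# which anti-scraping checks can flag immediately as a bot
print(requests.utils.default_user_agent())
```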
Example of a basic request that receives no data because the default Python requests header is identified as a bot:
```python
import requests

# Douban Top 250 URL
url = 'https://movie.douban.com/top250'
res = requests.get(url)
print(res.text)
```

Adding a realistic User-Agent resolves the block:
```python
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
url = 'https://movie.douban.com/top250'
res = requests.get(url, headers=headers)
print(res.text)
```

(2) IP rate limiting – Excessive request frequency can trigger an IP ban. To stay under the limit, insert a delay between requests, typically with time.sleep(), and parse each page with BeautifulSoup.
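Beyond copying a single browser string, a common extension (not covered in the original code) is to rotate among several realistic User-Agent values so successive requests don't all present the same identity. A minimal sketch, with an illustrative `USER_AGENTS` pool and a hypothetical `random_headers()` helper:

```python
import random

# A small pool of realistic User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
]

def random_headers():
    # Pick a different browser identity for each request
    return {'user-agent': random.choice(USER_AGENTS)}

print(random_headers()['user-agent'])
```

Passing `headers=random_headers()` to each `requests.get()` call then varies the advertised identity per request.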
```python
import requests
import time
from bs4 import BeautifulSoup

def get_douban_movie(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Each movie entry sits in a <div class="hd"> block
    items = soup.find_all('div', class_='hd')
    for item in items:
        tag = item.find('a')
        name = tag.find(class_='title').text
        link = tag['href']
        print(name, link)

# The Top 250 list is paginated, 25 movies per page
url_template = 'https://movie.douban.com/top250?start={}&filter='
urls = [url_template.format(num * 25) for num in range(10)]
for page_url in urls:
    get_douban_movie(page_url)
    time.sleep(1)  # pause between pages to avoid being blocked
```

The two techniques illustrated here, spoofing the User-Agent and throttling request frequency, are simple but effective ways to bypass common anti-scraping defenses.
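A fixed one-second pause is easy for a server to recognize as machine-generated. A common refinement, sketched here with a hypothetical `polite_sleep()` helper, is to randomize the delay within a range so the request pattern looks less mechanical:

```python
import random
import time

def polite_sleep(min_s=1.0, max_s=3.0):
    # Sleep for a random interval between min_s and max_s seconds
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Very short bounds here just to demonstrate the call
print(round(polite_sleep(0.01, 0.02), 3))
```

Replacing `time.sleep(1)` in the loop above with `polite_sleep()` keeps the average pace similar while breaking the regular rhythm.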