Bypassing Anti‑Scraping Mechanisms: User‑Agent Spoofing and IP Rate Limiting with Python
This article explains how to overcome common anti‑scraping defenses such as identity verification and IP rate limiting by spoofing the User‑Agent header and adding request delays, providing complete Python code examples using requests and BeautifulSoup to scrape Douban's Top 250 movies.
When crawling websites, many sites employ anti‑scraping mechanisms such as identity verification and IP rate limiting, which can block simple HTTP requests.
(1) Identity verification – Websites check the request headers, especially the User‑Agent field, to distinguish browsers from bots. By inspecting the browser’s network panel you can copy the full User‑Agent string and send it with your request.
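You can see why this check works by printing the identity that requests sends by default; it advertises the library rather than a browser (a quick sketch using requests' own `requests.utils.default_user_agent()` helper):

```python
import requests

# By default, requests identifies itself as 'python-requests/<version>',
# which anti-scraping checks can flag immediately as a bot
print(requests.utils.default_user_agent())
```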
Example of a basic request that receives no data because the default Python requests header is identified as a bot:
```python
import requests

# Douban Top 250 URL
url = 'https://movie.douban.com/top250'
res = requests.get(url)
print(res.text)
```

Adding a realistic User-Agent resolves the block:
```python
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
url = 'https://movie.douban.com/top250'
res = requests.get(url, headers=headers)
print(res.text)
```

(2) IP rate limiting – Excessive request frequency can trigger an IP ban. To stay under the limit, insert a delay between requests, typically with time.sleep(), and parse each page with BeautifulSoup.
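Beyond copying a single browser string, a common extension (not covered in the original code) is to rotate among several realistic User-Agent values so successive requests don't all present the same identity. A minimal sketch, with an illustrative `USER_AGENTS` pool and a hypothetical `random_headers()` helper:

```python
import random

# A small pool of realistic User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
]

def random_headers():
    # Pick a different browser identity for each request
    return {'user-agent': random.choice(USER_AGENTS)}

print(random_headers()['user-agent'])
```

Passing `headers=random_headers()` to each `requests.get()` call then varies the advertised identity per request.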
```python
import requests
import time
from bs4 import BeautifulSoup

def get_douban_movie(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Each movie entry sits in a <div class="hd"> block
    items = soup.find_all('div', class_='hd')
    for item in items:
        tag = item.find('a')
        name = tag.find(class_='title').text
        link = tag['href']
        print(name, link)

# The Top 250 list is paginated, 25 movies per page
url_template = 'https://movie.douban.com/top250?start={}&filter='
urls = [url_template.format(num * 25) for num in range(10)]
for page_url in urls:
    get_douban_movie(page_url)
    time.sleep(1)  # pause between pages to avoid being blocked
```

The two techniques illustrated here, spoofing the User-Agent and throttling request frequency, are simple but effective ways to bypass common anti-scraping defenses.
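A fixed one-second pause is easy for a server to recognize as machine-generated. A common refinement, sketched here with a hypothetical `polite_sleep()` helper, is to randomize the delay within a range so the request pattern looks less mechanical:

```python
import random
import time

def polite_sleep(min_s=1.0, max_s=3.0):
    # Sleep for a random interval between min_s and max_s seconds
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Very short bounds here just to demonstrate the call
print(round(polite_sleep(0.01, 0.02), 3))
```

Replacing `time.sleep(1)` in the loop above with `polite_sleep()` keeps the average pace similar while breaking the regular rhythm.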