Overview of Web Crawling, Anti‑Crawling Techniques, and 58 Anti‑Crawling System
This article introduces the fundamentals of web crawlers, typical crawling methods, and a comprehensive set of anti‑crawling strategies—including IP control, browser and device simulation, CAPTCHA cracking, and traffic analysis—while detailing the architecture and capabilities of the 58 anti‑crawling platform.
0x00 Introduction
Web crawlers, also known as spiders or bots, are programs that simulate normal client protocol behavior to retrieve target data at scale over long periods. A general crawler starts from seed links, continuously collecting pages and expanding to newly discovered URLs, while a focused crawler targets specific content structures.
Crawlers increase server load and can expose sensitive resources such as real‑estate listings, recruitment data, or used‑car information. Exploiting business‑logic or system vulnerabilities, crawlers may also harvest user, merchant, or platform data, leading to information‑leakage incidents and legal issues.
0x01 Search Engines
Major search engines (Google, Baidu, 360, Bing) use crawlers that identify themselves via User‑Agent strings; for example, Baidu's PC spider uses: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html). Since UA strings can be forged, relying solely on them is insufficient; host verification and behavioral analysis are also needed.
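Host verification can be sketched as a reverse-DNS lookup on the claimed spider's IP followed by a forward-resolution check. The hostname suffixes below follow Baidu's published guidance for its spiders, but treat them as an assumption to confirm:

```python
import socket

# Sketch of spider host verification; the suffix list is an assumption
# based on Baidu's published spider hostnames, not an authoritative source.
ALLOWED_SUFFIXES = (".baidu.com", ".baidu.jp")

def hostname_is_baidu(hostname: str) -> bool:
    """True if a reverse-DNS hostname ends in an official Baidu spider suffix."""
    return hostname.rstrip(".").endswith(ALLOWED_SUFFIXES)

def verify_baiduspider(ip: str) -> bool:
    """Reverse-resolve the IP, check the suffix, then forward-confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
        if not hostname_is_baidu(hostname):
            return False
        # Forward confirmation: the hostname must resolve back to the same IP,
        # otherwise the PTR record itself may be forged.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False
```

The same pattern applies to Googlebot and other spiders, each with its own published hostname suffixes.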
Robots Protocol
The Robots Exclusion Protocol (robots.txt) tells crawlers which pages they may access and which they must avoid. Although legitimate crawlers respect it, the protocol is voluntary and cannot enforce compliance.
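Python's standard library can check a URL against a robots.txt policy. A minimal sketch, with the robots.txt body inlined for illustration (a live crawler would fetch it with `set_url()` and `read()`):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt body; real crawlers fetch it from the target host.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/data"))  # False
```

As the article notes, this check is purely advisory: nothing stops a crawler from skipping it entirely.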
0x02 Typical Crawling Techniques
Crawlers are produced by various actors: students and hobbyists, data‑service companies, commercial competitors, and uncontrolled bots running on compromised servers. Python is the most common language, with libraries such as Scrapy, BeautifulSoup, pyquery, and Mechanize.
Data‑service companies offer custom data sets and crawling services. Competitors may scrape each other’s platforms for competitive analysis. Uncontrolled bots may reside on cloud servers or infected machines, operating without supervision.
Setting Request Frequency
Limiting crawl frequency reduces server load, but sophisticated crawlers randomize sleep intervals to evade simple rate‑limiting.
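The randomized pacing described above can be sketched as follows, seen from the crawler's side; the interval values are illustrative:

```python
import random
import time

# Sketch of randomized crawl pacing: sleeping a base interval plus uniform
# jitter leaves no fixed period for naive rate-limiters to key on.
def polite_delay(base: float = 2.0, jitter: float = 1.5) -> float:
    """Sleep for base plus a random jitter and return the delay in seconds."""
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay
```

Because each inter-request gap differs, per-window counting (rather than fixed-period detection) is needed on the defending side.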
Proxy IPs
Crawlers often use multi‑threaded, distributed approaches with rotating proxy IPs—free or paid—to bypass IP‑based blocks and to overcome CAPTCHA challenges by changing IP addresses.
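Proxy rotation is typically a simple round-robin over a pool. A minimal sketch, with placeholder addresses from the documentation range:

```python
import itertools

# Illustrative proxy pool (203.0.113.0/24 is a reserved documentation range);
# real pools hold hundreds or thousands of rotating exit IPs.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxy() -> dict:
    """Return a requests-style proxies dict using the next pool entry."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

# Usage with the (assumed installed) requests library:
# requests.get(url, proxies=next_proxy(), timeout=10)
```

Because consecutive requests exit from different IPs, per-IP blocks and per-IP CAPTCHA triggers see each address only briefly.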
Browser Spoofing
By randomizing User‑Agent strings or using full browser automation (e.g., headless Chrome, PhantomJS), crawlers can evade UA‑based detection. Some tools (Octoparse, Firefly) embed real browser engines to pass advanced checks.
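The simplest form of browser spoofing is rotating the User‑Agent header per request. A sketch with a deliberately tiny, illustrative UA pool:

```python
import random

# Illustrative UA pool; real crawlers rotate far larger, regularly updated lists.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def random_headers() -> dict:
    """Pick a random UA for the next request's headers."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Full browser automation (headless Chrome and the like) goes further, executing JavaScript so that fingerprint and behavior checks also see browser-like values.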
Device Simulation
Device fingerprints (JS‑generated or SDK‑based) uniquely identify browsers or apps. Anti‑crawling can combine IP and fingerprint data, but simulated fingerprints can also be generated to bypass checks.
CAPTCHA Cracking
CAPTCHAs are a primary barrier; attackers use manual solving, machine‑learning recognition, or third‑party solving services to bypass them.
Network Parameter Forgery
Advanced crawlers may set or forge cookies, Referer headers, and other HTTP parameters to mimic genuine traffic.
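Header forgery amounts to assembling the fields a real browsing session would carry. A sketch using the standard library, with all values illustrative:

```python
from urllib.request import Request

# Sketch of parameter forgery: the cookie name, referer, and UA below are
# illustrative placeholders, not any site's actual session scheme.
def forged_request(url: str, session_cookie: str, referer: str) -> Request:
    return Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Referer": referer,                       # pretend we navigated here
        "Cookie": f"sessionid={session_cookie}",  # replay a captured session
        "Accept-Language": "zh-CN,zh;q=0.9",
    })
```

This is why validating any single header is weak on its own; the countermeasures below cross-check several signals at once.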
0x03 Common Anti‑Crawling Countermeasures
IP Controls
Rate‑limit per IP, with granularity for time windows, regions, page types, and protocol variations.
Browser Detection
Inspect User‑Agent, plugin list, language, WebGL, and other browser‑specific properties to differentiate real browsers from bots.
Network Parameter Checks
Validate cookies, Referer, and other headers; distinguish between WEB, APP, and mobile clients.
CAPTCHA Enforcement
Deploy image, sliding‑puzzle, click, SMS, or voice CAPTCHAs, possibly combined with behavioral biometrics.
Device Fingerprinting
Collect SDK‑based or JS‑based fingerprints to detect emulators, rooted devices, or repeated identifiers.
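Server-side, collected attributes are typically canonicalized and hashed into a stable fingerprint ID that can be counted and correlated. A sketch (the attribute set is illustrative, not 58's actual SDK fields):

```python
import hashlib
import json

# Sketch: derive a stable fingerprint ID from collected client attributes.
# Sorting keys makes the hash independent of attribute order.
def fingerprint_id(attrs: dict) -> str:
    canonical = json.dumps(attrs, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

The same fingerprint appearing across many IPs, or many accounts on one fingerprint, is then a strong crawler signal.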
Web‑Side Techniques
Use JS obfuscation, encrypted scripts, asynchronous Ajax/Fetch calls, hidden or dummy links, CSS tricks, IFRAME loading, and dynamic HTML changes to hinder scraping.
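The hidden-link trick can be sketched as a honeypot URL: it is emitted into the page but hidden from humans via CSS, so any client that fetches it is almost certainly a scraper following raw HTML. The route shape and token scheme here are hypothetical:

```python
import secrets

# Sketch of a honeypot link registry; in practice this state would live in a
# shared store, and hits would feed the penalty/decision pipeline.
TRAP_TOKENS = set()

def make_trap_link() -> str:
    """Mint a trap URL, e.g. served as <a href="..." style="display:none">."""
    token = secrets.token_hex(8)
    TRAP_TOKENS.add(token)
    return f"/item/{token}"

def is_trap_hit(path: str) -> bool:
    """True if a requested path is one of the hidden trap links."""
    return path.rsplit("/", 1)[-1] in TRAP_TOKENS
```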
Behavioral Analysis
Compare access patterns such as localStorage usage, request bursts, and parameter traversal to distinguish bots from human users.
API Rate Limiting
Set per‑IP or per‑fingerprint thresholds, encrypt API payloads, and embed data‑level monitoring.
Account‑Based Controls
Enforce login requirements, limit per‑account request frequency, device count, and geographic access.
Security Portraits
58’s security portrait service combines big‑data threat intelligence with risk‑control to provide pre‑alert, real‑time detection, and post‑incident forensics, integrating multiple risk tags (IP, device, account, phone).
0x04 58 Anti‑Crawling System Overview
The 58 Anti‑Crawling SCF service offers low‑cost, rapid integration, handling nearly 1 billion requests daily with a baseline throughput of ~10 k RPS and an average latency of 0.5 ms. It covers real‑estate, recruitment, classifieds, and related business lines.
Clients connect via the SCF gateway; a strategy management system configures rule sets; an analysis engine executes strategies and forwards hits to a decision engine; real‑time monitoring and a big‑data platform provide analytics.
The strategy management system enables batch automation of generic policies, while the real‑time monitoring module alerts on abnormal traffic.
Risk penalties consider dimensions such as UID, cookie, IP, and device fingerprint.
Interception methods include various CAPTCHAs, fake data responses, and interrupt pages.
0x05 Anti‑Crawling Traffic Analysis Platform
Traffic analysis based on Nginx logs identifies malicious crawlers, bots, and simulators across PC, mobile, and app channels, providing alerts, target identification, and trend monitoring. It offers heat‑maps for domains, interfaces, and business lines.
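A minimal version of such log-based ranking might look like the sketch below, assuming Nginx's default "combined" log format (the sample lines in the test are synthetic):

```python
import re
from collections import Counter

# Sketch: rank client IPs by request count from Nginx "combined"-format
# access-log lines; a real pipeline would also bucket by UA, URL, and time.
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "([A-Z]+) (\S+)[^"]*"')

def top_ips(log_lines, n=3):
    """Return the n most frequent client IPs as (ip, count) pairs."""
    counts = Counter()
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common(n)
```

Sustained dominance of a few IPs, UAs, or URLs in such rankings is what feeds the alerts and heat-maps described above.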
Further analyses include domain feature extraction, IP/UA/URL ranking, and future extensions for finer‑grained statistics and risk output.
0x06 Conclusion
This document covered crawler basics, common crawling techniques, anti‑crawling countermeasures, and an overview of 58’s anti‑crawling capabilities. Continuous innovation and close alignment with business scenarios are essential for staying ahead of evolving crawling threats.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.