
Anti‑Crawling Strategies and System Design: Insights from Ctrip Hotel R&D

This article shares practical anti‑crawling concepts, classifications of crawlers, design principles, traditional and JavaScript‑based countermeasures, and operational trade‑offs, illustrating how Ctrip's hotel R&D team balances commercial protection with technical feasibility.

Ctrip Technology

Editor: The content originates from a Ctrip Hotel R&D manager’s talk in the third "Ctrip Tech Micro‑Share" session, summarizing anti‑crawling experiences and recommendations.

Why anti‑crawling matters

1. Crawlers consume a large share of page views, especially during the March peak, leading to wasted resources and revenue loss.

2. Free public resources (e.g., OTA price data) can be harvested at scale, reducing competitive advantage.

3. Legal uncertainty exists; while lawsuits may be possible, technical safeguards remain essential.

Which crawlers to target

1. Low‑skill crawlers written by fresh graduates – simple, aggressive, and likely to overload servers.

2. Small start‑up crawlers – built out of data hunger, often uncontrolled.

3. Unmaintained rogue crawlers – continue to run despite errors, consuming bandwidth.

4. Established commercial competitors – well‑funded and technically capable.

5. Misbehaving search engines – can cause performance degradation similar to attacks.

Definitions

Crawler: Any technique that batch‑retrieves website information.

Anti‑crawler: Any technique that blocks others from batch‑retrieving your site’s data.

False positive (mis‑hit): Legitimate users mistakenly identified as crawlers; high mis‑hit rates render a strategy unusable.

Interception: Successful blocking of a crawler, usually measured by interception rate.

Resources: Combined machine and human costs; human cost increasingly dominates due to rising developer salaries.

How to write a simple crawler

Typical steps: analyze the request format, craft matching HTTP requests, and batch-fetch the data. Example: inspecting Ctrip's price-loading requests in the browser's developer tools and sorting responses by data volume to identify the target URL.
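The steps above can be sketched in a few lines. The endpoint, parameter names, and response shape below are hypothetical stand-ins for whatever the developer-tools inspection reveals, not Ctrip's real API:

```python
# Minimal crawler sketch: reproduce an observed request format and
# parse the response. All names here are illustrative assumptions.
import json
from urllib.parse import urlencode

BASE_URL = "https://example.com/hotel/prices"  # hypothetical endpoint

def build_price_request(hotel_id, checkin):
    """Steps 1-2: rebuild the request seen in the browser's dev tools."""
    params = {"hotelId": hotel_id, "checkin": checkin}
    return f"{BASE_URL}?{urlencode(params)}"

def parse_prices(body):
    """Step 3: pull the fields of interest out of the JSON payload."""
    data = json.loads(body)
    return {room["name"]: room["price"] for room in data["rooms"]}

# Batch fetch (network call omitted); in practice something like:
#   for hotel_id in hotel_ids:
#       resp = urllib.request.urlopen(build_price_request(hotel_id, "2024-03-01"))
#       prices = parse_prices(resp.read())
```

Once the request format is understood, batching is just a loop, which is exactly why a low-skill crawler can hammer a server so easily.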

Advanced crawler techniques

• Distributed crawling – mainly to evade IP bans, not a true performance boost.

• JavaScript simulation – often unnecessary if the target does not employ anti‑crawling measures.

• PhantomJS – powerful but inefficient and detectable.

Pros and cons of crawler levels

Low‑level crawlers are cheap and fast but easy to block; high‑level crawlers are harder to block but costly, with diminishing returns beyond a certain investment.

Typical anti‑crawling architecture

1. Pre‑process incoming requests for identification.

2. Detect whether the request originates from a crawler.

3. Apply appropriate mitigation (e.g., serve fake data, block IP, throttle).

However, without reliable detection, mitigation cannot be targeted effectively.
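The three stages above can be sketched as a pipeline. The feature names, thresholds, and mitigation choices here are illustrative toy rules, not a production policy:

```python
# Three-stage anti-crawling pipeline: preprocess -> detect -> mitigate.
# Thresholds and actions are illustrative assumptions.
def preprocess(raw):
    """Stage 1: extract the features detection will look at."""
    return {
        "ip": raw.get("ip", ""),
        "ua": raw.get("headers", {}).get("User-Agent", ""),
        "rate": raw.get("requests_last_minute", 0),
    }

def detect(req):
    """Stage 2: flag a request as crawler-like (toy rules)."""
    if not req["ua"]:
        return True              # missing User-Agent header
    return req["rate"] > 120     # sustained high request rate

def mitigate(req):
    """Stage 3: pick a response; everything hinges on detection."""
    if not detect(req):
        return "serve"
    return "fake_data" if req["rate"] <= 600 else "block_ip"
```

Notice that `mitigate` is only as good as `detect`: if detection is unreliable, the same logic either blocks legitimate users or lets crawlers through, which is the point made above.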

Traditional anti‑crawling methods

• IP‑based rate limiting – easy to evade by purchasing IPs.

• Session‑based limiting – ineffective as sessions are cheap to recreate.

• User‑Agent limiting – powerful but risks high false‑positive rates.

Combining these methods can improve effectiveness while reducing mis‑hits.
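As a sketch of such a combination, the following pairs a per-IP sliding-window rate limit with a User-Agent blocklist. The window size, threshold, and blocklist tokens are illustrative and would need tuning against the mis-hit rate:

```python
# Combined check: per-IP sliding-window rate limit + UA blocklist.
# Constants are illustrative, not recommended production values.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100
BAD_UA_TOKENS = ("python-requests", "curl")  # illustrative blocklist

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow(ip, user_agent, now):
    """Return True if the request passes both checks."""
    ua = user_agent.lower()
    if any(tok in ua for tok in BAD_UA_TOKENS):
        return False
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()          # drop hits outside the window
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```

Each check alone is weak (IPs can be bought, User-Agents forged), but requiring both raises the crawler's cost while keeping either threshold loose enough to spare legitimate users.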

JavaScript‑based countermeasures

A simple demo uses JavaScript to modify the request URL or a signing key, so clients that fetch the data without executing the script receive incorrect prices. Combining this with browser-specific quirks (IE bugs, Firefox's strictness, Chrome-only features) makes the checks much harder for crawlers to replicate.
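A server-side sketch of the wrong-price tactic: the page's JavaScript derives a token from the request path, and the server quietly poisons the response when the token is missing or wrong instead of blocking. The token scheme below is a hypothetical illustration:

```python
# "Wrong price" countermeasure sketch: validate a JS-derived token and
# return plausible but incorrect data on failure. The token scheme and
# the 17% markup are illustrative assumptions.
import hashlib

SECRET = "rotate-me-daily"  # embedded (obfuscated) in the page's JavaScript

def expected_token(path):
    """Token the page's JS would compute for this request path."""
    return hashlib.sha256((SECRET + path).encode()).hexdigest()[:8]

def price_response(path, token, real_price):
    if token == expected_token(path):
        return real_price
    # Poison the well: subtly wrong data is harder for a crawler
    # operator to notice than an outright 403.
    return round(real_price * 1.17, 2)
```

Serving bad data rather than an error is also a psychological move: the crawler appears to work, so its operator has no immediate signal to start an arms race.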

Psychological and operational tactics

1. Technical suppression – avoid overly aggressive actions early on.

2. Psychological warfare – provocation, mockery, or empathy to influence opponent behavior.

3. “Watering down” – deliberately allow limited crawling to avoid escalation.

Conclusion

Anti‑crawling is a cost‑benefit game; beyond a certain investment, additional effort yields little return. Effective strategies often rely on lightweight JavaScript checks and balanced mitigation to protect commercial interests without harming legitimate users.
