How to Crawl Responsibly: Avoid Legal Risks and Server Overload
This guide outlines responsible web-crawling practices: complying with robots.txt, avoiding legal pitfalls such as unauthorized personal data and copyrighted content, choosing reasonable request intervals, and staying within relevant Chinese data-security regulations, so that developers can avoid overloading servers and inviting lawsuits.
General Precautions
When writing and configuring a crawler, pay attention to the type of content you scrape and the interval between requests.
Scrapy’s ROBOTSTXT_OBEY setting, when set to True, makes the crawler respect the target site’s robots.txt rules.
What is robots.txt?
Robots.txt is a plain-text file placed at a site’s root that tells crawlers which paths they should not fetch. When obeying it, Scrapy downloads this file at the start of a crawl and restricts its requests accordingly.
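As a quick sketch of how these rules are interpreted, the standard library’s urllib.robotparser can parse a robots.txt and answer “may I fetch this URL?” questions. The file contents and the example.com paths below are illustrative, not taken from any real site:

```python
# Parse a sample robots.txt with the standard library.
from urllib.robotparser import RobotFileParser

sample_robots = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# Paths under a Disallow rule are off-limits to the matching user agents.
allowed_home = parser.can_fetch("*", "https://example.com/")
allowed_admin = parser.can_fetch("*", "https://example.com/admin/secret")
```

Here `allowed_home` is True while `allowed_admin` is False, which is exactly the check Scrapy performs for every request when ROBOTSTXT_OBEY is enabled.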
Since we are not building a search engine, we may sometimes need to ignore robots.txt to access data that is otherwise blocked, in which case ROBOTSTXT_OBEY = False should be set.
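Putting the options together, a Scrapy project’s settings.py might look like the fragment below. The delay values are illustrative choices, not recommendations from any particular site:

```python
# settings.py (fragment) -- illustrative values, adjust per target site.

# Respect robots.txt by default; set to False only when you have a
# legitimate reason and accept the responsibility for doing so.
ROBOTSTXT_OBEY = True

# Wait between requests to the same site to avoid overloading it.
DOWNLOAD_DELAY = 1.0

# Let Scrapy adapt the delay to the server's observed latency.
AUTOTHROTTLE_ENABLED = True
```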
Content Guidelines
Avoid scraping unauthorized personal information. For example, in the 2016 lawsuit between Weibo and Maimai, Maimai was ordered to stop improper competition and compensate for damages after it harvested non‑Maimai users’ avatars, names, occupations, and education via APIs.
Do not scrape content that the target site explicitly forbids. For instance, many items on Taobao are prohibited from being crawled.
Do not harvest large volumes of copyrighted material for profit. A 2018 case involved prosecution for scraping video data.
Crawl Interval
Reducing the request interval can speed up crawling, but hammering a site with rapid requests resembles a denial-of-service (DoS) attack: it can exhaust the server’s bandwidth and bring it down.
According to Article 16 of the draft “Data Security Management Measures” from the Cyberspace Administration of China, automated data collection must not impede normal website operation; if traffic exceeds one‑third of the site’s average daily traffic, the site may demand that the activity stop.
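One simple way to keep request rates polite, outside of Scrapy’s built-in DOWNLOAD_DELAY, is a small throttle that enforces a minimum gap between requests to the same host. This is a minimal sketch; the Throttle class and the interval values are my own illustrative choices:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_request = None  # monotonic timestamp of the last call

    def wait(self) -> None:
        """Sleep just long enough to keep min_interval between calls."""
        now = time.monotonic()
        if self._last_request is not None:
            elapsed = now - self._last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

# Usage: call wait() before each request to the same host.
throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()
    # fetch(url) would go here
elapsed = time.monotonic() - start
```

With three calls and a 0.05-second interval, the loop takes at least about 0.1 seconds: the first call returns immediately and each later call sleeps off the remaining gap.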
Conclusion
Personally, I mainly use open datasets that can be downloaded after providing an email address. If you encounter a similar situation in the future, think carefully before proceeding.