
Python Web Crawler for Downloading Drama Links from cn163.net

This article describes how to build a Python web crawler that automatically generates numeric URLs, checks their validity, extracts download links for TV dramas from cn163.net, saves them to text files, and discusses practical challenges such as regex parsing, filename restrictions, and multithreading performance.


The author, who enjoys watching foreign TV series, wanted a more convenient way to download episodes from the "天天美剧" website. By leveraging Python web scraping skills, they created a crawler that automatically generates article URLs based on numeric IDs, filters out non‑existent pages using HTTP status codes, extracts download links, and writes them to plain‑text files for easy use with download managers.

The crawling logic relies on the observation that each drama page URL follows the pattern http://cn163.net/archives/&lt;id&gt;/. A range of IDs (e.g., 2015 to 25000) is iterated, and for each URL the response status is checked: 404 responses are skipped, while successful pages are processed to extract links with regular expressions.

The full implementation is provided below. It includes functions for generating URLs, saving links, and a main routine that runs the crawler in a separate thread. The code uses the requests library for HTTP requests, re for pattern matching, and threading for basic concurrency.

```python
def get_urls(self):
    try:
        for i in range(2015, 25000):
            base_url = 'http://cn163.net/archives/'
            url = base_url + str(i) + '/'
            # Skip article IDs that do not exist
            if requests.get(url).status_code == 404:
                continue
            self.save_links(url)
    except Exception:
        pass
```
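The routine above delegates to `self.save_links`, which the article describes as extracting ed2k download links with a regular expression and writing them to a text file, but whose body is not shown. A minimal sketch of the extraction step, assuming the links follow the standard `ed2k://|file|<name>|<size>|<hash>|/` format (the actual markup on cn163.net may differ):

```python
import re

# Standard ed2k link layout: ed2k://|file|<name>|<size in bytes>|<md4 hash>|/
ED2K_RE = re.compile(r'ed2k://\|file\|[^|]+\|\d+\|[0-9A-Fa-f]+\|/')

def extract_ed2k_links(html):
    """Return all ed2k links found in a page's HTML, in document order."""
    return ED2K_RE.findall(html)
```

Inside `save_links`, the page body fetched via `requests.get(url).text` would be passed to this function and the matches written out one per line.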

During development the author found that regular expressions outperformed BeautifulSoup for extracting the required ed2k links, though roughly half of the links were still missed, leaving clear room for improvement. Filename handling also proved tricky: Windows filesystems reject characters such as slashes and other reserved symbols, so episode titles needed additional sanitization before being used as filenames.
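Windows forbids the characters `\ / : * ? " < > |` in filenames. A small sanitizer in the spirit of what the article describes (the underscore replacement is an arbitrary choice, not taken from the original code):

```python
import re

# Characters that Windows filesystems reject in file names
INVALID_FILENAME_CHARS = r'[\\/:*?"<>|]'

def sanitize_filename(name, replacement='_'):
    """Replace characters Windows disallows and trim surrounding whitespace."""
    return re.sub(INVALID_FILENAME_CHARS, replacement, name).strip()
```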

Although multithreading was added, the author noted the Global Interpreter Lock (GIL) as a limiting factor; even so, the entire crawl of roughly twenty thousand pages completed in under twenty minutes. The author considered using Redis for distributed crawling across multiple Linux machines but decided it was unnecessary for a dataset of this size.
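Since crawling is dominated by network I/O, threads can still overlap waiting time even though the GIL serializes Python bytecode execution. A hedged sketch of parallelizing the ID sweep with a thread pool (the `fetch` callable and worker count are illustrative, not from the original code):

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_ids(ids, fetch, workers=8):
    """Apply `fetch` (a function taking one article ID) across a thread pool.

    The GIL is released while each thread waits on the network, so I/O-bound
    fetches overlap and wall-clock time drops without multiprocessing.
    Results come back in the same order as `ids`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, ids))
```

With the original range, this would be invoked as something like `crawl_ids(range(2015, 25000), fetch_one_page)`.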

Overall, the article provides a practical example of building a Python‑based web scraper for media content, discusses common pitfalls, and offers a complete codebase that can be adapted for similar crawling tasks.

Tags: Python, multithreading, regex, web scraping, Requests, file handling, crawling
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
