
Python Web Scraper for Downloading Online Comics

This article explains how to build a Python script that searches a comic website, extracts chapter links and image URLs using requests and BeautifulSoup, and downloads the images into organized folders with multithreaded support, while outlining required modules and potential improvements.

Python Programming Learning Circle

The article demonstrates a complete Python solution for downloading online comics by scraping the website https://www.mkzhan.com/. It starts by listing the required modules (requests, bs4 (BeautifulSoup), urllib, threading, os, sys) and notes that urllib alone could replace requests.

1. How to implement

First, the script builds a search URL using urllib.parse to encode the comic name entered by the user, then sends a GET request to retrieve the search results page.

<code>from urllib import parse

_name = input('Enter the name of the comic you want to read: ')
name_ = parse.urlencode({'keyword': _name})  # percent-encode the query parameter
url = 'https://www.mkzhan.com/search/?{}'.format(name_)
</code>

After obtaining the search results, the script parses the HTML with BeautifulSoup to extract comic titles, links, and keywords, then prompts the user to select a comic.
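The parsing step can be previewed on a static snippet that mirrors the search-result markup the article describes (the HTML below is an illustrative stand-in, not a real response from the site; the class names match those used in the full code in section 2):

```python
from bs4 import BeautifulSoup

# Static snippet mirroring the search-result markup described in the article.
sample_html = '''
<div class="common-comic-item">
  <p class="comic__title"><a href="/209871/">Example Comic</a></p>
  <p class="comic-feature">Action | Fantasy</p>
</div>
'''
soup = BeautifulSoup(sample_html, 'html.parser')
names, hrefs, keywords = [], [], []
for item in soup.select('div.common-comic-item'):
    link = item.select('p.comic__title>a')[0]
    names.append(link.get_text())          # comic title
    hrefs.append(link['href'])             # detail-page link
    keywords.append(item.select('p.comic-feature')[0].get_text())
print(names, hrefs, keywords)
```

The same three-list pattern (titles, links, keywords) is what the user is later prompted to choose from.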

1.1 Extracting required data (comic URL, name, chapter list)

The selected comic’s detail page is fetched, and the chapter list is located inside the ul.chapter__list-box.clearfix.hide element. The script collects each chapter’s link (the data-hreflink attribute) and title.

<code>html1 = requests.get(url=url1)
content1 = html1.text
soup1 = BeautifulSoup(content1, 'lxml')
str2 = soup1.select('ul.chapter__list-box.clearfix.hide')[0]
list2 = str2.select('li>a')
name1 = []
href1 = []
for str3 in list2:
    href1.append(str3['data-hreflink'])   # chapter link
    name1.append(str3.get_text().strip()) # chapter title
</code>

Each chapter page contains only images. The script extracts all image URLs from the div.rd-article__pic.hide > img.lazy-read selector and downloads them.

<code>def Downlad(href1, path):
    """Fetch one chapter page and save every image it contains into `path`."""
    headers = {'User-Agent': 'Mozilla/5.0 ...'}
    url2 = 'https://www.mkzhan.com' + href1
    html2 = requests.get(url=url2, headers=headers)
    soup2 = BeautifulSoup(html2.text, 'lxml')
    # The real image URL sits in the data-src attribute (lazy loading)
    list_1 = soup2.select('div.rd-article__pic.hide>img.lazy-read')
    urls = [img['data-src'] for img in list_1]
    for i, url in enumerate(urls):
        content3 = requests.get(url=url, headers=headers)
        with open(file=path + f'/{i+1}.jpg', mode='wb') as f:
            f.write(content3.content)
    return True
</code>

To speed up downloading, a multithreaded driver launches 30 threads, each running Main_Downlad, which creates a folder per chapter and keeps popping chapters from the shared lists until none remain.

<code>lock = threading.Lock()

def Main_Downlad(href1: list, name1: list):
    while True:
        with lock:  # pop link and title together so they stay paired
            if len(href1) == 0:
                break
            href = href1.pop()
            name = name1.pop()
        try:
            path = f'./{_name}/{name}'
            os.mkdir(path)
            if Downlad(href, path):
                print('Thread {} is downloading chapter {}'.format(threading.current_thread().getName(), name))
        except OSError:
            pass  # skip chapters whose folder already exists

threading_1 = []
for i in range(30):
    t = threading.Thread(target=Main_Downlad, args=(href1, name1))
    t.start()
    threading_1.append(t)
for t in threading_1:
    t.join()
print('Current thread: {}'.format(threading.current_thread().getName()))
</code>
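Keeping the link and its title in two separate shared lists is fragile: without care, a thread switch between the two pop() calls can pair a chapter title with the wrong link. A minimal thread-safe alternative uses the standard-library queue module and stores each (link, title) pair as one item (worker and download_stub below are illustrative names, not from the original script):

```python
import queue
import threading

def worker(jobs, download):
    # Each queue item pairs a chapter link with its title,
    # so they can never be mismatched by a thread switch.
    while True:
        try:
            href, name = jobs.get_nowait()
        except queue.Empty:
            break
        download(href, name)
        jobs.task_done()

# Demo with a stub in place of the real network download.
results = []
results_lock = threading.Lock()

def download_stub(href, name):
    with results_lock:
        results.append((href, name))

jobs = queue.Queue()
for pair in [('/ch1', 'Chapter 1'), ('/ch2', 'Chapter 2'), ('/ch3', 'Chapter 3')]:
    jobs.put(pair)

threads = [threading.Thread(target=worker, args=(jobs, download_stub)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

queue.Queue is internally locked, so no extra synchronization is needed around get_nowait(), and idle threads exit cleanly once the queue is drained.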

After execution, the script creates a directory named after the comic, containing sub‑folders for each chapter with the downloaded JPEG images.

2. Full code

<code>import requests
from urllib import parse
from bs4 import BeautifulSoup
import threading
import os
import sys

_name = input('Enter the name of the comic you want to read: ')
try:
    os.mkdir(f'./{_name}')
except FileExistsError:
    print('A folder with that name already exists; the program cannot continue!')
    sys.exit()

name_ = parse.urlencode({'keyword': _name})
url = f'https://www.mkzhan.com/search/?{name_}'
html = requests.get(url=url)
content = html.text
soup = BeautifulSoup(content, 'lxml')
list1 = soup.select('div.common-comic-item')
names, hrefs, keywords = [], [], []
for item in list1:
    names.append(item.select('p.comic__title>a')[0].get_text())
    hrefs.append(item.select('p.comic__title>a')[0]['href'])
    keywords.append(item.select('p.comic-feature')[0].get_text())
print('Matching results:')
for i, (n, k) in enumerate(zip(names, keywords), 1):
    print(f'【{i}】-{n}     {k}')

i = int(input('Enter the number of the comic you want to read: '))
url1 = 'https://www.mkzhan.com' + hrefs[i-1]
# ... (the rest follows the functions shown above)
</code>

The author notes that the script is for learning and entertainment only, and suggests future enhancements such as adding an IP proxy pool and an automatic image viewer.
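The suggested IP proxy pool could be sketched as a thin wrapper around requests' proxies parameter. The addresses below are placeholders, not real proxies, and get_with_proxy is a hypothetical helper, not part of the original script:

```python
import random
import requests

# Hypothetical proxy pool: placeholder addresses for illustration only.
PROXY_POOL = [
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
    'http://127.0.0.1:8003',
]

def get_with_proxy(url, headers=None):
    """Send a GET request through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy},
                        timeout=10)
```

Rotating proxies per request spreads the load across addresses, which helps when a site rate-limits a single IP.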

Tags: threading, web scraping, Requests, BeautifulSoup, Manga Downloader
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
