
Scrapy Tutorial: Crawling Comic Images with BeautifulSoup and Saving Locally

This article provides a step‑by‑step guide on configuring Scrapy, creating a spider project, extracting comic page URLs and images using BeautifulSoup, handling pagination, and saving the downloaded images locally with Python code.


Scrapy is a Python framework for extracting structured data from websites; this tutorial demonstrates how to use it to crawl comic images.

Scrapy environment configuration

Install the required packages on macOS:

pip install Scrapy
pip install beautifulsoup4
pip install html5lib

Verify the installation by running scrapy with no arguments; it should print the version and a list of available commands.

Project creation

Create a new Scrapy project named Comics:

scrapy startproject Comics

The generated directory structure looks like:

|____Comics
| |______init__.py
| |______pycache__
| |______items.py
| |______pipelines.py
| |______settings.py
| |______spiders
| | |______init__.py
| | |______pycache__
|____scrapy.cfg

Print the structure with:

find . -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'

Creating the Spider class

In Comics/spiders/comics.py define a spider that inherits from scrapy.Spider:

# coding:utf-8
import scrapy

class Comics(scrapy.Spider):
    name = "comics"

    def start_requests(self):
        urls = ['http://www.xeall.com/shenshi']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.body)

Run the spider with:

scrapy crawl comics

The spider prints log information and the HTML source of the target page.

Crawling comic URLs

In parse, use BeautifulSoup to extract the list of comics:

from bs4 import BeautifulSoup

content = response.body
soup = BeautifulSoup(content, "html5lib")
listcon_tag = soup.find('ul', class_='listcon')
com_a_list = listcon_tag.find_all('a', attrs={'href': True})

comics_url_list = []
base = 'http://www.xeall.com'
for tag_a in com_a_list:
    url = base + tag_a['href']
    comics_url_list.append(url)
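The same extraction idea can be tested without installing BeautifulSoup by using the standard library's html.parser; the sample markup below is invented to mirror the site's ul.listcon structure:

```python
from html.parser import HTMLParser

# Invented sample markup mirroring the site's ul.listcon structure.
SAMPLE_HTML = """
<ul class="listcon">
  <li><a href="/shenshi/comic-1.html">Comic 1</a></li>
  <li><a href="/shenshi/comic-2.html">Comic 2</a></li>
</ul>
"""

class ListconParser(HTMLParser):
    """Collects href values of <a> tags inside <ul class="listcon">."""

    def __init__(self):
        super().__init__()
        self.in_listcon = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'ul' and attrs.get('class') == 'listcon':
            self.in_listcon = True
        elif tag == 'a' and self.in_listcon and 'href' in attrs:
            self.hrefs.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'ul':
            self.in_listcon = False

parser = ListconParser()
parser.feed(SAMPLE_HTML)
base = 'http://www.xeall.com'
comics_url_list = [base + href for href in parser.hrefs]
print(comics_url_list)
```

This is only a sketch for experimenting with the selection logic; the spider itself uses BeautifulSoup as shown above.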

Handle pagination by locating the ul.pagelist element, extracting the next‑page URL, and yielding a new request unless the current page is the last one:

page_tag = soup.find('ul', class_='pagelist')
page_a_list = page_tag.find_all('a', attrs={'href': True})

select_tag = soup.find('select', attrs={'name': 'sldd'})
option_list = select_tag.find_all('option')
last_option = option_list[-1]
current_option = select_tag.find('option', attrs={'selected': True})
is_last = (last_option.string == current_option.string)

if not is_last:
    next_page = 'http://www.xeall.com/shenshi/' + page_a_list[-2]['href']
    yield scrapy.Request(next_page, callback=self.parse)
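The last-page check boils down to comparing the currently selected pager option against the final one. A minimal sketch, using hypothetical pager data in place of the real `<select name="sldd">` element:

```python
# Hypothetical pager option texts; the real spider reads these from
# the <select name="sldd"> element on the page.
def is_last_page(option_texts, selected_text):
    """Return True when the selected page is the last option in the pager."""
    return option_texts[-1] == selected_text

print(is_last_page(['1', '2', '3'], '3'))  # True: stop paginating
print(is_last_page(['1', '2', '3'], '2'))  # False: request the next page
```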

For each comic URL, schedule a request to self.comics_parse:

for url in comics_url_list:
    yield scrapy.Request(url=url, callback=self.comics_parse)

Extracting comic images

In comics_parse, parse the page and locate the image tag:

def comics_parse(self, response):
    content = response.body
    soup = BeautifulSoup(content, "html5lib")
    li_tag = soup.find('li', id='imgshow')
    img_tag = li_tag.find('img')
    img_url = img_tag['src']
    title = img_tag['alt']

Save the image locally using a helper method, where page_num is the index of the current page within the comic (tracked by the spider as it follows the pager):

self.save_img(page_num, title, img_url)

Saving images to disk

# Import the required libraries first
import os
import urllib.request
import zlib

def save_img(self, img_num, title, img_url):
    self.log('saving pic: ' + img_url)
    document = '/Users/moshuqi/Desktop/cartoon'
    comics_path = document + '/' + title
    if not os.path.exists(comics_path):
        os.makedirs(comics_path)
    pic_name = comics_path + '/' + str(img_num) + '.jpg'
    if os.path.exists(pic_name):
        self.log('pic exists: ' + pic_name)
        return
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib.request.Request(img_url, headers=headers)
        response = urllib.request.urlopen(req, timeout=30)
        data = response.read()
        # Some servers return gzip-compressed bodies; decompress if needed
        if response.info().get('Content-Encoding') == 'gzip':
            data = zlib.decompress(data, 16 + zlib.MAX_WBITS)
        with open(pic_name, "wb") as fp:
            fp.write(data)
        self.log('save image finished: ' + pic_name)
    except Exception as e:
        self.log('save image error.')
        self.log(str(e))
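The wbits argument 16 + zlib.MAX_WBITS tells zlib to expect a gzip wrapper around the deflate stream. A quick round-trip check of that idiom:

```python
import gzip
import zlib

# gzip.compress produces a gzip-framed stream; zlib.decompress with
# wbits = 16 + zlib.MAX_WBITS unwraps exactly that framing.
payload = b'example image bytes'
compressed = gzip.compress(payload)
restored = zlib.decompress(compressed, 16 + zlib.MAX_WBITS)
print(restored == payload)  # True
```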

The spider continues to request the next image page until the next-page link points to "#", which indicates the last page.

Running results

When executed, the console shows log messages for each request, and the downloaded images are stored in folders named after each comic title, with filenames corresponding to page numbers. Scrapy runs multiple requests concurrently, so several comics are crawled in parallel.

Note that the target site may be slow, causing occasional timeouts; patience is required.
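Concurrency and timeout behavior can be tuned in the project's settings.py. The values below are illustrative assumptions, not the article's configuration; all three setting names are standard Scrapy settings:

```python
# Comics/settings.py -- example values, tune for the target site
CONCURRENT_REQUESTS = 8   # lower concurrency is gentler on a slow server
DOWNLOAD_TIMEOUT = 30     # seconds before a request is considered timed out
RETRY_TIMES = 2           # retries performed by Scrapy's RetryMiddleware
```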

More advanced approaches, such as Scrapy's built-in FilesPipeline and ImagesPipeline or its native XPath selectors, can further improve efficiency.
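For instance, switching to the built-in ImagesPipeline might look like the sketch below, reusing the article's download directory; the spider would then yield items carrying an image_urls field instead of downloading files itself:

```python
# Comics/settings.py -- enable Scrapy's built-in image pipeline (sketch)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/Users/moshuqi/Desktop/cartoon'
```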

Tags: Python, automation, Scrapy, web crawling, BeautifulSoup, image download
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
