
Scrapy Tutorial: Crawling Comic Images with BeautifulSoup and Saving Locally

This article provides a step‑by‑step guide on configuring Scrapy, creating a spider project, extracting comic page URLs and images using BeautifulSoup, handling pagination, and saving the downloaded images locally with Python code.


Scrapy is a Python framework for extracting structured data from websites; this tutorial demonstrates how to use it to crawl comic images.

Scrapy environment configuration

Install the required packages on macOS:

pip install Scrapy
pip install beautifulsoup4
pip install html5lib

Verify the installation by running scrapy with no arguments; it should print the version and a list of available commands.

Project creation

Create a new Scrapy project named Comics:

scrapy startproject Comics

The generated directory structure looks like:

|____Comics
| |______init__.py
| |______pycache__
| |______items.py
| |______pipelines.py
| |______settings.py
| |______spiders
| | |______init__.py
| | |______pycache__
|____scrapy.cfg

Print the structure with:

find . -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'

Creating the Spider class

In Comics/spiders/comics.py define a spider that inherits from scrapy.Spider:

# coding:utf-8
import scrapy

class Comics(scrapy.Spider):
    name = "comics"

    def start_requests(self):
        urls = ['http://www.xeall.com/shenshi']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.body)

Run the spider with:

scrapy crawl comics

The spider prints log information and the HTML source of the target page.

Crawling comic URLs

In parse, use BeautifulSoup to extract the list of comics:

from bs4 import BeautifulSoup

content = response.body
soup = BeautifulSoup(content, "html5lib")
listcon_tag = soup.find('ul', class_='listcon')
com_a_list = listcon_tag.find_all('a', attrs={'href': True})

comics_url_list = []
base = 'http://www.xeall.com'
for tag_a in com_a_list:
    url = base + tag_a['href']
    comics_url_list.append(url)
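The same extraction idea can be tested without installing BeautifulSoup by using the standard library's html.parser; the sample markup below is invented to mirror the site's ul.listcon structure:

```python
from html.parser import HTMLParser

# Invented sample markup mirroring the site's ul.listcon structure.
SAMPLE_HTML = """
<ul class="listcon">
  <li><a href="/shenshi/comic-1.html">Comic 1</a></li>
  <li><a href="/shenshi/comic-2.html">Comic 2</a></li>
</ul>
"""

class ListconParser(HTMLParser):
    """Collects href values of <a> tags inside <ul class="listcon">."""

    def __init__(self):
        super().__init__()
        self.in_listcon = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'ul' and attrs.get('class') == 'listcon':
            self.in_listcon = True
        elif tag == 'a' and self.in_listcon and 'href' in attrs:
            self.hrefs.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'ul':
            self.in_listcon = False

parser = ListconParser()
parser.feed(SAMPLE_HTML)
base = 'http://www.xeall.com'
comics_url_list = [base + href for href in parser.hrefs]
print(comics_url_list)
```

This is only a sketch for experimenting with the selection logic; the spider itself uses BeautifulSoup as shown above.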

Handle pagination by locating the ul.pagelist element, extracting the next‑page URL, and yielding a new request unless the current page is the last one:

page_tag = soup.find('ul', class_='pagelist')
page_a_list = page_tag.find_all('a', attrs={'href': True})

select_tag = soup.find('select', attrs={'name': 'sldd'})
option_list = select_tag.find_all('option')
last_option = option_list[-1]
current_option = select_tag.find('option', attrs={'selected': True})
is_last = (last_option.string == current_option.string)

if not is_last:
    next_page = 'http://www.xeall.com/shenshi/' + page_a_list[-2]['href']
    yield scrapy.Request(next_page, callback=self.parse)
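The last-page check boils down to comparing the currently selected pager option against the final one. A minimal sketch, using hypothetical pager data in place of the real `<select name="sldd">` element:

```python
# Hypothetical pager option texts; the real spider reads these from
# the <select name="sldd"> element on the page.
def is_last_page(option_texts, selected_text):
    """Return True when the selected page is the last option in the pager."""
    return option_texts[-1] == selected_text

print(is_last_page(['1', '2', '3'], '3'))  # True: stop paginating
print(is_last_page(['1', '2', '3'], '2'))  # False: request the next page
```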

For each comic URL, schedule a request to self.comics_parse:

for url in comics_url_list:
    yield scrapy.Request(url=url, callback=self.comics_parse)

Extracting comic images

In comics_parse, parse the page and locate the image tag:

def comics_parse(self, response):
    content = response.body
    soup = BeautifulSoup(content, "html5lib")
    li_tag = soup.find('li', id='imgshow')
    img_tag = li_tag.find('img')
    img_url = img_tag['src']
    title = img_tag['alt']

Save the image locally using a helper method, where page_num is the index of the current page within the comic (tracked by the spider as it follows the pager):

self.save_img(page_num, title, img_url)

Saving images to disk

# Import the required libraries first
import os
import urllib.request
import zlib

def save_img(self, img_num, title, img_url):
    self.log('saving pic: ' + img_url)
    document = '/Users/moshuqi/Desktop/cartoon'
    comics_path = document + '/' + title
    if not os.path.exists(comics_path):
        os.makedirs(comics_path)
    pic_name = comics_path + '/' + str(img_num) + '.jpg'
    if os.path.exists(pic_name):
        self.log('pic exists: ' + pic_name)
        return
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib.request.Request(img_url, headers=headers)
        response = urllib.request.urlopen(req, timeout=30)
        data = response.read()
        # Some servers return gzip-compressed bodies; decompress if needed
        if response.info().get('Content-Encoding') == 'gzip':
            data = zlib.decompress(data, 16 + zlib.MAX_WBITS)
        with open(pic_name, "wb") as fp:
            fp.write(data)
        self.log('save image finished: ' + pic_name)
    except Exception as e:
        self.log('save image error.')
        self.log(str(e))
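The wbits argument 16 + zlib.MAX_WBITS tells zlib to expect a gzip wrapper around the deflate stream. A quick round-trip check of that idiom:

```python
import gzip
import zlib

# gzip.compress produces a gzip-framed stream; zlib.decompress with
# wbits = 16 + zlib.MAX_WBITS unwraps exactly that framing.
payload = b'example image bytes'
compressed = gzip.compress(payload)
restored = zlib.decompress(compressed, 16 + zlib.MAX_WBITS)
print(restored == payload)  # True
```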

The spider continues to request the next image page until the next-page link points to "#", which indicates the last page.

Running results

When executed, the console shows log messages for each request, and the downloaded images are stored in folders named after each comic title, with filenames corresponding to page numbers. Scrapy runs multiple requests concurrently, so several comics are crawled in parallel.

Note that the target site may be slow, causing occasional timeouts; patience is required.
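Concurrency and timeout behavior can be tuned in the project's settings.py. The values below are illustrative assumptions, not the article's configuration; all three setting names are standard Scrapy settings:

```python
# Comics/settings.py -- example values, tune for the target site
CONCURRENT_REQUESTS = 8   # lower concurrency is gentler on a slow server
DOWNLOAD_TIMEOUT = 30     # seconds before a request is considered timed out
RETRY_TIMES = 2           # retries performed by Scrapy's RetryMiddleware
```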

More advanced approaches, such as Scrapy's built-in FilesPipeline and ImagesPipeline or its native XPath selectors, can further improve efficiency.
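For instance, switching to the built-in ImagesPipeline might look like the sketch below, reusing the article's download directory; the spider would then yield items carrying an image_urls field instead of downloading files itself:

```python
# Comics/settings.py -- enable Scrapy's built-in image pipeline (sketch)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/Users/moshuqi/Desktop/cartoon'
```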

Tags: Python, automation, Scrapy, web crawling, BeautifulSoup, image download
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
