Backend Development 13 min read

Scrapy Tutorial: Installation, Project Structure, Basic Usage, and Real‑World Example

This article provides a comprehensive, step‑by‑step guide to the Scrapy web‑crawling framework, covering its core components, installation methods, project layout, spider creation, data extraction techniques, pagination handling, pipeline configuration, and how to run the crawler to collect and store data.


Scrapy is a fast, high‑level Python framework for web crawling and data extraction, allowing developers to build spiders with minimal code.

The framework is built from several cooperating components: the Engine, the Scheduler, the Downloader, Spiders, Items, Item Pipelines, and two middleware layers (Downloader Middlewares and Spider Middlewares) that hook into the request/response flow.

Installation can be performed via pip:

$ pip install scrapy

or by downloading the package first:

$ pip download scrapy -d ./
# Using a domestic mirror
$ pip download -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy -d ./

After downloading, install the wheel file (note that pip download also fetches Scrapy's dependencies, which must be installed the same way for a fully offline setup):

$ pip install Scrapy-1.5.0-py2.py3-none-any.whl

Creating a project is done with:

scrapy startproject mySpider

Generate a spider:

scrapy genspider demo "demo.cn"

Typical workflow includes four steps: creating a project, generating a spider, extracting data (e.g., using XPath or CSS selectors), and saving the data via pipelines.

Running a spider can be done from the command line:

scrapy crawl qb   # qb is the spider name

or programmatically in PyCharm:

from scrapy import cmdline
cmdline.execute("scrapy crawl qb".split())

The project directory contains configuration files such as scrapy.cfg, the Python module folder mySpider/, items.py, pipelines.py, settings.py, and the spiders/ directory where spider code resides.
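The standard layout produced by scrapy startproject (for a project named mySpider, with a spider generated by scrapy genspider demo) looks roughly like this:

```text
mySpider/
├── scrapy.cfg            # deployment configuration
└── mySpider/
    ├── __init__.py
    ├── items.py          # item (data model) definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        ├── __init__.py
        └── demo.py       # generated by "scrapy genspider demo"
```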

Example items.py definition:

import scrapy

class MyspiderItem(scrapy.Item):
    # one Field per CSV column written by the pipeline below
    imgLink = scrapy.Field()
    title = scrapy.Field()
    types = scrapy.Field()
    vistor = scrapy.Field()   # spelling kept to match the pipeline's fieldnames
    comment = scrapy.Field()
    likes = scrapy.Field()

Example pipelines.py for CSV output:

import csv
from itemadapter import ItemAdapter

class MyspiderPipeline:
    def __init__(self):
        # open the CSV file once when the pipeline is instantiated
        self.f = open('Zcool.csv', 'w', encoding='utf-8', newline='')
        self.writer = csv.DictWriter(
            self.f,
            fieldnames=['imgLink', 'title', 'types', 'vistor', 'comment', 'likes'])
        self.writer.writeheader()

    def process_item(self, item, spider):
        # ItemAdapter handles both plain dicts and scrapy.Item objects
        self.writer.writerow(ItemAdapter(item).asdict())
        return item

    def close_spider(self, spider):
        self.f.close()
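For the pipeline to run, it must also be enabled in settings.py via the ITEM_PIPELINES setting; the dotted path below assumes the project is named mySpider as above:

```python
# settings.py
# The integer (0-1000) sets the order in which pipelines run; lower runs first.
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
}
```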

A sample spider skeleton for ZCOOL (the genspider output, with the domain and start URL pointed at zcool.com.cn):

import scrapy
class ZcSpider(scrapy.Spider):
    name = 'zc'  # the name used with "scrapy crawl zc"
    allowed_domains = ['zcool.com.cn']
    start_urls = ['https://www.zcool.com.cn/']
    def parse(self, response):
        pass  # extraction logic here

Data extraction uses selectors such as response.xpath() or response.css(), with methods like extract(), extract_first(), get(), and getall() (get() and getall() are the modern equivalents of extract_first() and extract()).

Pagination can be handled by following the "next" link:

next_href = response.xpath("//a[@class='laypage_next']/@href").extract_first()
if next_href:
    next_url = response.urljoin(next_href)  # resolve relative links against the current page
    yield scrapy.Request(next_url, callback=self.parse)

Alternatively, construct URLs manually using a page counter.
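The counter approach can be sketched as follows; the base URL and the ?p= query parameter are hypothetical placeholders, not ZCOOL's real pagination scheme:

```python
# Build page URLs from a counter instead of following "next" links.
# BASE is a hypothetical placeholder URL pattern.
BASE = "https://example.com/discover?p={}"

def page_urls(first, last):
    """Return the page URLs from first to last, inclusive."""
    return [BASE.format(n) for n in range(first, last + 1)]

urls = page_urls(1, 3)
# Inside a spider you would then yield one request per URL:
#   for url in urls:
#       yield scrapy.Request(url, callback=self.parse)
```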

Running the crawler via a helper script (start.py) simplifies execution:

from scrapy import cmdline
cmdline.execute('scrapy crawl zc'.split())

After execution, the scraped data is saved to Zcool.csv , confirming successful collection.

Tags: Python, Data Extraction, tutorial, web scraping, Scrapy, crawler
Written by Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
