Backend Development 13 min read

Scrapy Tutorial: Installation, Project Structure, Basic Usage, and Real‑World Example

This article provides a comprehensive, step‑by‑step guide to the Scrapy web‑crawling framework, covering its core components, installation methods, project layout, spider creation, data extraction techniques, pagination handling, pipeline configuration, and how to run the crawler to collect and store data.


Scrapy is a fast, high‑level Python framework for web crawling and data extraction, allowing developers to build spiders with minimal code.

The framework is built from several cooperating components: the Engine, the Scheduler, the Downloader, Spiders, Items, Item Pipelines, and two middleware layers (Downloader Middlewares and Spider Middlewares) that hook into the request/response flow.

Installation can be performed via pip:

$ pip install scrapy

or by downloading the package first:

$ pip download scrapy -d ./
# Using a domestic mirror
$ pip download -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy -d ./

After downloading, install the wheel file (note that pip download also fetches Scrapy's dependencies, which must be installed the same way for a fully offline setup):

$ pip install Scrapy-1.5.0-py2.py3-none-any.whl

Creating a project is done with:

scrapy startproject mySpider

Generate a spider:

scrapy genspider demo "demo.cn"

Typical workflow includes four steps: creating a project, generating a spider, extracting data (e.g., using XPath or CSS selectors), and saving the data via pipelines.

Running a spider can be done from the command line:

scrapy crawl qb   # qb is the spider name

or programmatically in PyCharm:

from scrapy import cmdline
cmdline.execute("scrapy crawl qb".split())

The project directory contains configuration files such as scrapy.cfg, the Python module folder mySpider/, items.py, pipelines.py, settings.py, and the spiders/ directory where spider code resides.
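The standard layout produced by scrapy startproject (for a project named mySpider, with a spider generated by scrapy genspider demo) looks roughly like this:

```text
mySpider/
├── scrapy.cfg            # deployment configuration
└── mySpider/
    ├── __init__.py
    ├── items.py          # item (data model) definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        ├── __init__.py
        └── demo.py       # generated by "scrapy genspider demo"
```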

Example items.py definition:

import scrapy

class MyspiderItem(scrapy.Item):
    # one Field per CSV column written by the pipeline below
    imgLink = scrapy.Field()
    title = scrapy.Field()
    types = scrapy.Field()
    vistor = scrapy.Field()   # spelling kept to match the pipeline's fieldnames
    comment = scrapy.Field()
    likes = scrapy.Field()

Example pipelines.py for CSV output:

import csv
from itemadapter import ItemAdapter

class MyspiderPipeline:
    def __init__(self):
        # open the CSV file once when the pipeline is instantiated
        self.f = open('Zcool.csv', 'w', encoding='utf-8', newline='')
        self.writer = csv.DictWriter(
            self.f,
            fieldnames=['imgLink', 'title', 'types', 'vistor', 'comment', 'likes'])
        self.writer.writeheader()

    def process_item(self, item, spider):
        # ItemAdapter handles both plain dicts and scrapy.Item objects
        self.writer.writerow(ItemAdapter(item).asdict())
        return item

    def close_spider(self, spider):
        self.f.close()
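For the pipeline to run, it must also be enabled in settings.py via the ITEM_PIPELINES setting; the dotted path below assumes the project is named mySpider as above:

```python
# settings.py
# The integer (0-1000) sets the order in which pipelines run; lower runs first.
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
}
```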

A sample spider skeleton for ZCOOL (the genspider output, with the domain and start URL pointed at zcool.com.cn):

import scrapy
class ZcSpider(scrapy.Spider):
    name = 'zc'  # the name used with "scrapy crawl zc"
    allowed_domains = ['zcool.com.cn']
    start_urls = ['https://www.zcool.com.cn/']
    def parse(self, response):
        pass  # extraction logic here

Data extraction uses selectors such as response.xpath() or response.css(), with methods like extract(), extract_first(), get(), and getall() (get() and getall() are the modern equivalents of extract_first() and extract()).

Pagination can be handled by following the "next" link:

next_href = response.xpath("//a[@class='laypage_next']/@href").extract_first()
if next_href:
    next_url = response.urljoin(next_href)  # resolve relative links against the current page
    yield scrapy.Request(next_url, callback=self.parse)

Alternatively, construct URLs manually using a page counter.
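The counter approach can be sketched as follows; the base URL and the ?p= query parameter are hypothetical placeholders, not ZCOOL's real pagination scheme:

```python
# Build page URLs from a counter instead of following "next" links.
# BASE is a hypothetical placeholder URL pattern.
BASE = "https://example.com/discover?p={}"

def page_urls(first, last):
    """Return the page URLs from first to last, inclusive."""
    return [BASE.format(n) for n in range(first, last + 1)]

urls = page_urls(1, 3)
# Inside a spider you would then yield one request per URL:
#   for url in urls:
#       yield scrapy.Request(url, callback=self.parse)
```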

Running the crawler via a helper script (start.py) simplifies execution:

from scrapy import cmdline
cmdline.execute('scrapy crawl zc'.split())

After execution, the scraped data is saved to Zcool.csv , confirming successful collection.

Tags: Python, Data Extraction, tutorial, web scraping, Scrapy, crawler
Written by Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
