Backend Development · 12 min read

Overview of Web Crawler Types and the Architecture of the Mole Crawler System

This article explains the evolution and classification of web crawlers, describes the design and components of the Mole distributed crawler—including scheduler, fetcher, processor, rate‑limiting, URL deduplication, and Elasticsearch storage optimization—and outlines common anti‑anti‑crawling strategies.

Sohu Tech Products

Web crawlers are core components of search engines, and as the Internet has grown, the need for efficient data acquisition has driven the development of various crawler types.

The earliest crawler, the World Wide Web Wanderer by Matthew Gray, began as a server‑statistics tool and later evolved to discover domain names. Modern crawlers fall into four main types, distinguished by scale, target sites, and objectives:

1. General‑purpose crawlers: aim to fetch the entire web for commercial search engines such as Google, Bing, and Baidu, storing massive amounts of textual content in distributed databases.

2. Focused crawlers: target specific topics or vertical domains, using relevance‑prediction algorithms (e.g., Best‑First, Fish, Shark) to prioritize URLs.

3. Site‑specific crawlers: crawl all pages of a particular website, handling rich media with lower data volume and a simpler architecture.

4. Incremental (directed) crawlers: continuously update from specific content sources (e.g., RSS/Atom feeds), requiring real‑time processing and format‑specific parsing.
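The best‑first prioritization used by focused crawlers can be pictured as a priority URL frontier. A minimal in‑memory sketch follows; the class name and the caller‑supplied relevance score are illustrative, since a real focused crawler would derive scores from anchor text, parent‑page relevance, or a trained classifier:

```python
import heapq

class BestFirstFrontier:
    """Priority URL frontier for a focused crawler (best-first strategy).

    URLs predicted to be more relevant to the target topic are popped first.
    """

    def __init__(self):
        self._heap = []      # entries: (negated score, tie-breaker, url)
        self._seen = set()   # simple in-memory de-duplication
        self._counter = 0    # stable tie-breaking for equal scores

    def push(self, url, score):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the highest first
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = BestFirstFrontier()
frontier.push("https://example.com/sports/news", 0.9)
frontier.push("https://example.com/about", 0.1)
frontier.push("https://example.com/sports/scores", 0.8)
```

Popping now yields the sports pages before the low‑relevance "about" page, which is the whole point of the best‑first strategy: crawl budget goes to the URLs most likely to match the topic.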

The Mole crawler system, developed by Sogou’s community search team, is a directed crawler that aggregates user‑generated content from various platforms. Its architecture consists of three core components:

Scheduler: manages URL queues, de‑duplication, and flow control.

Fetcher: provides four downloaders – an asynchronous Tornado‑based HTTP downloader, a PhantomJS simulator for JavaScript pages, a Puppeteer‑driven Chromium browser emulator, and an Anyproxy‑Android simulator for app‑like requests.

Processor: extracts structured data from HTML, XML, and JSON documents.

All components communicate via message queues; the Scheduler runs as a single instance because URL de‑duplication requires a globally consistent view of which URLs have been seen, while the Fetcher and Processor can be scaled horizontally.
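The queue‑decoupled pipeline described above can be sketched in miniature, with Python's standard `queue` standing in for the real message‑queue middleware (the worker count, queue names, and placeholder "download" are all illustrative):

```python
import queue
import threading

fetch_q = queue.Queue()     # Scheduler -> Fetcher
process_q = queue.Queue()   # Fetcher -> Processor

def fetcher_worker():
    """One fetcher instance; several run in parallel, mirroring horizontal scaling."""
    while True:
        url = fetch_q.get()
        if url is None:                     # poison pill: shut the worker down
            break
        body = f"<html>{url}</html>"        # placeholder for a real HTTP download
        process_q.put((url, body))          # hand off to the Processor stage
        fetch_q.task_done()

workers = [threading.Thread(target=fetcher_worker) for _ in range(3)]
for w in workers:
    w.start()

# The single Scheduler enqueues work; any idle fetcher picks it up.
for url in ["http://example.com/1", "http://example.com/2"]:
    fetch_q.put(url)
fetch_q.join()

for _ in workers:
    fetch_q.put(None)
for w in workers:
    w.join()
```

The design point is that stages never call each other directly: adding fetcher or processor capacity means starting more consumers on the same queue, with no change to the Scheduler.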

Rate‑limiting is implemented using a token‑bucket algorithm, which accommodates burst traffic better than the traditional leaky‑bucket approach.
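A minimal token‑bucket sketch makes the burst behavior concrete (the class and its parameters are illustrative, not Mole's actual implementation):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter (sketch).

    Tokens accumulate at `rate` per second up to `capacity`; each request
    consumes one. Unlike a leaky bucket's fixed outflow, a full bucket lets
    a burst of up to `capacity` requests through at once.
    """

    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, n=1):
        now = time.monotonic()
        # refill for elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)   # steady ~5 req/s, bursts up to 10
if bucket.allow():
    pass  # issue the HTTP request here; otherwise back off and retry
```

With a leaky bucket, a sudden batch of URLs would drain at the fixed outflow rate no matter what; the token bucket lets an idle crawler spend its saved-up tokens immediately, then settles back to the steady rate.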

URL de‑duplication in Mole relies on a database‑backed strategy that assigns each URL a task with an expiration (age) and a force flag, allowing fine‑grained scheduling while avoiding repeated fetches.
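The age/force scheme can be sketched as follows; a plain dict stands in for the backing database, and the class and method names are hypothetical, not Mole's API:

```python
import hashlib
import time

class UrlDedup:
    """Database-backed URL de-duplication (sketch).

    Each scheduled URL gets an expiration (`age` seconds); until it expires,
    repeat fetches are suppressed unless `force=True` demands a re-fetch.
    """

    def __init__(self):
        self._store = {}   # url hash -> expiry timestamp (dict standing in for a DB)

    @staticmethod
    def _key(url):
        # hashing keeps keys fixed-size regardless of URL length
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def should_fetch(self, url, age=3600, force=False, now=None):
        now = time.time() if now is None else now
        expiry = self._store.get(self._key(url))
        if force or expiry is None or expiry <= now:
            # schedule the fetch and refresh the task's expiration
            self._store[self._key(url)] = now + age
            return True
        return False
```

Tuning `age` per source gives the fine‑grained scheduling the article mentions: a fast‑moving feed can expire in minutes while a static page stays suppressed for days, and `force` covers mandatory re‑crawls.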

Collected data are stored in Elasticsearch. Optimizations include disabling the _all field, turning off indexing for unnecessary fields, and disabling doc_values for certain text fields, halving index size without affecting query performance. Write‑performance improvements involve adjusting refresh intervals, separating hot and cold data onto SSDs, and managing bulk queues to prevent timeouts.
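The mapping‑level optimizations might look roughly like the fragment below, written for a pre‑7.x Elasticsearch where the `_all` field still exists. The field names (`raw_html`, `source_url`, `title`) are illustrative, not from the article; note that `doc_values` is toggled on `keyword` fields, since analyzed `text` fields do not carry doc values in the first place:

```python
# Sketch of index settings/mappings implementing the optimizations above.
index_config = {
    "settings": {
        "refresh_interval": "30s",   # batch refreshes instead of the 1s default
    },
    "mappings": {
        "doc": {
            "_all": {"enabled": False},   # skip the catch-all copy of every field
            "properties": {
                "raw_html": {
                    "type": "text",
                    "index": False,       # stored for display, never queried
                },
                "title": {
                    "type": "text",       # queried full-text field, left indexed
                },
                "source_url": {
                    "type": "keyword",
                    "doc_values": False,  # never sorted or aggregated on
                },
            },
        }
    },
}
```

Each switch trades a capability the workload never uses (catch‑all search, querying raw HTML, aggregating on URLs) for index size and write throughput, which is how the article's roughly halved index size comes about.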

The article also outlines anti‑anti‑crawling techniques such as respecting robots.txt, throttling request rates, rotating user‑agents, using IP proxy pools, modifying Referer and Cookie headers, and targeting lightweight WAP pages.
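A few of those measures — rotated User‑Agent strings, a spoofed Referer, and a proxy pool — can be combined in one request builder. This is a standalone sketch using only the standard library; the UA strings and any proxy addresses are placeholders a real crawler would load from a managed pool:

```python
import itertools
import random
import urllib.request

# Placeholder User-Agent pool; production pools are larger and kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
]
_ua_cycle = itertools.cycle(USER_AGENTS)

def build_request(url, referer=None, proxies=None):
    """Build a urllib Request with a rotated UA, optional Referer and proxy."""
    headers = {"User-Agent": next(_ua_cycle)}   # rotate UA on every request
    if referer:
        headers["Referer"] = referer            # mimic in-site navigation
    req = urllib.request.Request(url, headers=headers)
    if proxies:
        # route through a randomly chosen proxy from the pool
        req.set_proxy(random.choice(proxies), "http")
    return req

req = build_request("http://example.com/wap/page",
                    referer="http://example.com/")
```

Combined with the token‑bucket throttling described earlier and a check of `robots.txt` before scheduling, this keeps the request pattern closer to organic traffic.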

Tags: Elasticsearch, Rate Limiting, distributed systems, anti-crawling, URL deduplication, web crawler
Written by Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
