Tag

URL deduplication


Full-Stack Internet Architecture
Sep 13, 2020 · Backend Development

URL Deduplication Techniques in Java, Redis, and Databases

This article reviews six practical URL deduplication methods—including Java Set, Redis Set, database queries, unique indexes, Guava Bloom filter, and Redis Bloom filter—explaining their principles, providing complete implementation code, and recommending the most suitable approach for different system scales.

Bloom Filter · Database · Java
13 min read
Sohu Tech Products
Dec 5, 2018 · Backend Development

Overview of Web Crawler Types and the Architecture of the Mole Crawler System

This article explains the evolution and classification of web crawlers, describes the design and components of the Mole distributed crawler—including scheduler, fetcher, processor, rate‑limiting, URL deduplication, and Elasticsearch storage optimization—and outlines common anti‑anti‑crawling strategies.

Elasticsearch · Rate Limiting · URL deduplication
12 min read