Tag

information fingerprint

1 view collected around this technical thread.

360 Quality & Efficiency
Oct 19, 2018 · Big Data

Information Fingerprint and Simhash Algorithm for Large-Scale Duplicate Detection

This article explains the concept of information fingerprints, reviews traditional set-equality methods for comparison, introduces the Simhash algorithm, which reduces high-dimensional text features to a compact fingerprint for similarity comparison, and demonstrates how partitioning 64-bit fingerprints into blocks enables efficient duplicate detection on massive web data.
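As a rough illustration of the idea summarized above, here is a minimal Python sketch of a 64-bit Simhash with a block-partitioned lookup. It is not the article's implementation: the use of MD5 as the per-token hash, equal token weights, the split into 4 blocks of 16 bits, the Hamming-distance threshold of 3, and the names `simhash`, `blocks`, and `FingerprintIndex` are all assumptions chosen for the example. The partitioning relies on the pigeonhole principle: if two fingerprints differ in at most 3 bits, at least one of their 4 blocks must be identical, so only fingerprints sharing a block need a full comparison.

```python
import hashlib
from collections import defaultdict


def simhash(tokens, bits=64):
    """Fold token hashes into one fingerprint (equal weights assumed)."""
    weights = [0] * bits
    for tok in tokens:
        # Assumption: first 8 bytes of MD5 serve as the 64-bit token hash.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, w in enumerate(weights):
        if w > 0:  # a positive column sum sets that fingerprint bit
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def blocks(fp, parts=4, bits=64):
    """Split a fingerprint into (block_index, block_value) keys, lowest bits first."""
    width = bits // parts
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(parts)]


class FingerprintIndex:
    """Hypothetical in-memory index keyed by fingerprint blocks."""

    def __init__(self, max_distance=3):
        self.max_distance = max_distance
        # max_distance + 1 blocks guarantee at least one exact block match.
        self.parts = max_distance + 1
        self.table = defaultdict(set)

    def add(self, fp):
        for key in blocks(fp, self.parts):
            self.table[key].add(fp)

    def near_duplicates(self, fp):
        # Collect candidates that share any block, then verify each one.
        candidates = set()
        for key in blocks(fp, self.parts):
            candidates |= self.table[key]
        return {c for c in candidates if hamming(c, fp) <= self.max_distance}
```

In use, each document would be tokenized, passed through `simhash`, and added to the index; a new document's fingerprint is then checked with `near_duplicates`. A production system would more likely weight tokens (for example by term frequency) and store the block tables in a distributed key-value store rather than in memory.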

Duplicate Detection · Simhash · Big Data