
Understanding Simhash: From Traditional Hash to Random Projection LSH

This article explains the principles and implementation of Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, and practical optimizations for large‑scale duplicate detection.

Sohu Tech Products

Preface

The WeChat Official Accounts platform owes much of its commercial value to its strong protection of original content: if you try to claim someone else's article as your own original work, WeChat rejects the claim with an originality-check failure.

If you inspect the platform's network traffic, you can see the error response it returns when the originality check fails.

Changing a few characters does not evade detection either. The check is built on the Simhash algorithm, which Google famously used for large-scale web-page deduplication: Simhash generates a compact fingerprint for each article and compares fingerprints to estimate similarity. This article explains how it works.

The article’s structure is as follows:

Traditional Hash and Its Limitations

Cosine Similarity and Its Limitations

Dimensionality Reduction via Random Projection

Simhash Principle and Implementation

Traditional Hash and Its Limitations

To compare two articles for equality, a common approach is:

Hash the article with a function such as MD5 to obtain a fixed‑length string (e.g., 32 characters).

Compare the two fixed‑length strings for equality.

The first step maps a large space to a small one, producing a “fingerprint”. However, traditional hash functions are designed for high randomness: even a single character change yields a completely different hash, making them unsuitable for similarity measurement.

For example, the SHA-1 digests of the Chinese strings “我是中国人” and “我是中国人啊” differ entirely despite a single extra character. The two hashes share nothing in common, so they tell us nothing about similarity; measuring similarity would fall back to character-by-character comparison, which is inefficient at scale.
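A quick Python sketch (using the two example strings above) makes the point concrete:

```python
# One extra character produces a completely different SHA-1 digest,
# so traditional hashes cannot measure similarity.
import hashlib

h1 = hashlib.sha1("我是中国人".encode("utf-8")).hexdigest()
h2 = hashlib.sha1("我是中国人啊".encode("utf-8")).hexdigest()
print(h1)
print(h2)
print(h1 == h2)  # the digests share no useful relationship
```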

We need a hash function that satisfies two conditions:

Supports local similarity.

Produces a result that is easy to compare.

To make comparison easy, the hash result can be converted to a fixed‑length binary string (0/1). By XOR‑ing two such strings, the number of 1s indicates the number of differing bits—this is the core idea of Simhash.

For example, if the XOR of two fingerprints contains only two 1 bits, the fingerprints differ in exactly two positions.
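As a sketch, the bit-counting check takes only a few lines of Python; the two 8-bit fingerprints here are made-up values for illustration:

```python
# XOR two binary fingerprints; the number of 1 bits in the result
# is the number of positions where they differ (the Hamming distance).
a = 0b10110100
b = 0b10010110
diff = a ^ b                     # 1 wherever the fingerprints disagree
distance = bin(diff).count("1")  # Hamming distance
print(f"{diff:08b}", distance)   # → 00100010 2
```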

Next, we examine how Simhash yields locally similar results. Its computation resembles cosine similarity, so we first review cosine similarity.

Cosine Similarity

A well-known exposition of this idea appears in Wu Jun’s "The Beauty of Mathematics". By representing a document as an n‑dimensional vector, the cosine of the angle between two vectors indicates their similarity.

Sentence A: 我喜欢看电视,不喜欢看电影。(I like watching TV, and I don't like watching movies.)
Sentence B: 我不喜欢看电视,也不喜欢看电影。(I don't like watching TV, and I don't like watching movies either.)

Step 1: Tokenization

Sentence A: 我/喜欢/看/电视,不/喜欢/看/电影。
Sentence B: 我/不/喜欢/看/电视,也/不/喜欢/看/电影。

Step 2: List all keywords.

我 (I), 喜欢 (like), 看 (watch), 电视 (TV), 电影 (movie), 不 (not), 也 (also).
Side note: function words such as “的”, “地”, “得” carry little meaning; in practice they are filtered out as stop words, and TF‑IDF weighting further suppresses uninformative common terms.

Step 3: Compute term frequencies.

Sentence A: 我 1, 喜欢 2, 看 2, 电视 1, 电影 1, 不 1, 也 0.
Sentence B: 我 1, 喜欢 2, 看 2, 电视 1, 电影 1, 不 2, 也 1.

Step 4: Form frequency vectors.

Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]

Note: In practice TF‑IDF weights are used instead of raw counts.

The similarity problem becomes computing the cosine of the angle between the two vectors; a cosine close to 1 means a small angle and high similarity.

The cosine formula is:

cos θ = (A · B) / (|A| |B|) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))

Applying the formula: A · B = 13, |A| = √12, |B| = √16 = 4, so cos θ = 13 / (4√12) ≈ 0.938, i.e., 93.8% similarity, which matches intuition. However, cosine similarity requires many multiplications and square roots on high-dimensional vectors, making it too slow for massive web-scale deduplication.
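The worked example above can be checked with a short Python sketch, using the frequency vectors from Step 4:

```python
# Cosine similarity of the two sentence vectors from the example.
import math

A = [1, 2, 2, 1, 1, 1, 0]  # frequency vector of Sentence A
B = [1, 2, 2, 1, 1, 2, 1]  # frequency vector of Sentence B

dot = sum(x * y for x, y in zip(A, B))     # A · B = 13
norm_a = math.sqrt(sum(x * x for x in A))  # |A| = sqrt(12)
norm_b = math.sqrt(sum(x * x for x in B))  # |B| = 4
cosine = dot / (norm_a * norm_b)
print(round(cosine, 3))  # → 0.938
```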

Dimensionality Reduction via Random Projection

Meaning of Vector Dot Product

Random projection relies on vector dot products. Understanding dot product is essential.

In 2‑D space, the dot product of two vectors a and b is defined as:

a · b = a₁b₁ + a₂b₂ = |a| |b| cos θ

Geometrically, it equals the length of a multiplied by the length of b’s projection onto a. In particular, the dot product of a vector with a unit vector is the length of the vector’s projection onto that unit vector’s direction.

Projecting vector A onto a new axis reduces its dimensionality from 2‑D to 1‑D. In general, a random Gaussian vector can project an M‑dimensional vector down to K dimensions while approximately preserving angles (Johnson–Lindenstrauss lemma).
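A minimal Python illustration of the projection interpretation, with a made-up 2-D vector: dotting with a unit vector yields the length of the projection onto that axis, i.e., a single 1-D coordinate.

```python
# Dot product with a unit vector = projection length onto that axis.
a = (3.0, 4.0)   # a 2-D vector of length 5
u = (1.0, 0.0)   # unit vector: the new 1-D axis
projection = a[0] * u[0] + a[1] * u[1]
print(projection)  # → 3.0, the x-axis coordinate of a
```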

Random Projection Dimensionality Reduction & LSH

After random projection, the resulting coordinates are often floating‑point numbers, which are costly to store and compute. By discretizing each coordinate to 0 or 1 (negative values become 0, non‑negative become 1), we obtain a binary representation suitable for fast similarity checks.

This discretization is known as Locality‑Sensitive Hashing (LSH) based on random projection. The binary vectors retain proximity relationships while enabling rapid Hamming‑distance calculations.
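As a tiny sketch (the projected coordinates below are hypothetical), the discretization rule is just a sign test:

```python
# Discretize projected coordinates: negative -> 0, non-negative -> 1.
coords = [0.73, -1.2, 0.05, -0.4]  # hypothetical projection results
bits = "".join("1" if c >= 0 else "0" for c in coords)
print(bits)  # → 1010
```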

Random Hyperplane Hash

Random hyperplane hashing is a variant of the above LSH. For an n‑dimensional vector v, to obtain an f‑bit signature (f ≪ n):

Randomly generate f n‑dimensional vectors r₁…r_f.

For each r_i, if the dot product v·r_i > 0, set the i‑th bit to 1; otherwise set it to 0.

This effectively places f random hyperplanes in space; each hyperplane splits the space into two halves, assigning a 1 or 0 depending on which side v lies.

Illustration: points falling into different regions receive different binary signatures such as 110.

The final f‑bit signature is the article’s fingerprint; comparing fingerprints via XOR reveals the number of differing bits, i.e., the Hamming distance.
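The procedure above can be sketched in Python. The Gaussian hyperplanes, the seed, and the small dimensions n = 7 and f = 6 are illustrative choices, not values from the text:

```python
# Random hyperplane hashing: one signature bit per random hyperplane.
import random

def signature(v, planes):
    """1 if v lies on the positive side of the hyperplane (v · r > 0), else 0."""
    return "".join(
        "1" if sum(x * r for x, r in zip(v, plane)) > 0 else "0"
        for plane in planes
    )

random.seed(0)
n, f = 7, 6  # n-dimensional input, f-bit signature (f << n for real data)
planes = [[random.gauss(0, 1) for _ in range(n)] for _ in range(f)]

A = [1, 2, 2, 1, 1, 1, 0]
B = [1, 2, 2, 1, 1, 2, 1]
sig_a, sig_b = signature(A, planes), signature(B, planes)
hamming = sum(x != y for x, y in zip(sig_a, sig_b))
```

Vectors with a small angle between them tend to fall on the same side of most hyperplanes, so their signatures tend to have a small Hamming distance.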

Simhash Principle and Implementation

Simhash builds on random hyperplane hashing. Its generation process consists of five steps: tokenization → hashing → weighting → aggregation → dimensionality reduction.

Tokenization : Split the text into meaningful words and assign a weight (e.g., 1‑5) to each term.

Hash : Hash each term to a binary code (e.g., "美国" → 100101, "51区" → 101011). Convert 0 to –1 to spread vectors across quadrants.

Weighting : Multiply each bit of the term’s hash by its weight, producing weighted vectors.

Aggregation : Sum the weighted vectors element‑wise, yielding a single vector.

Dimensionality Reduction : Convert each element to 1 if positive, otherwise 0, forming the final Simhash signature.

Example (simplified to two terms):

100101 → 1 -1 -1 1 -1 1
101011 → 1 -1 1 -1 1 1

Weighted vectors (weights 4 and 5) become:

美国: 4 -4 -4 4 -4 4
51区: 5 -5 5 -5 5 5

Aggregating yields (9 -9 1 -1 1 9); after reduction the final 6‑bit Simhash is 101011.
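The five steps can be condensed into a short Python sketch; the term hashes and weights are the hypothetical values from the example above:

```python
# Simhash core: weight each hash bit (0 acts as -1), sum element-wise,
# then reduce each sum to 1 (positive) or 0 (otherwise).
def simhash(terms, f=6):
    v = [0] * f
    for bits, weight in terms:
        for i, b in enumerate(bits):
            v[i] += weight if b == "1" else -weight
    return "".join("1" if x > 0 else "0" for x in v)

terms = [("100101", 4), ("101011", 5)]  # (term hash, term weight)
print(simhash(terms))  # → 101011
```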

In practice Simhash signatures are 64 bits; two documents are considered similar if their Hamming distance ≤ 3.

Simhash Query Optimization

Directly XOR‑comparing a 64‑bit signature against billions of entries is infeasible. Using the pigeonhole principle, the signature can be split into four 16‑bit blocks. If two signatures differ in ≤ 3 bits, at least one block must match exactly.

Side note: the pigeonhole principle says that placing three apples into four drawers leaves at least one drawer empty. Here the (at most) three differing bits are the apples and the four blocks are the drawers, so at least one block contains no differing bit and therefore matches exactly.

We store each article as four key‑value pairs: each 16‑bit block serves as a key (K) and the remaining three blocks as the value (V). During a query we first look up exact matches on K (O(1) time). For each matching K we then compare the corresponding V’s, reducing the number of full‑signature comparisons dramatically.

Assuming a database of 2³⁰ (~1 billion) entries:

Without the pigeonhole trick: ~1 billion comparisons.

With the trick: each 16-bit key selects on average 2³⁰ / 2¹⁶ = 2¹⁴ entries, so the four lookups compare only about 4 × 2¹⁴ ≈ 65,000 candidates.

This space‑for‑time trade‑off reduces query time by several orders of magnitude at the cost of roughly four‑fold increased storage.
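The key-value scheme can be sketched in Python; the block layout and helper names are illustrative, not from a real system:

```python
# Pigeonhole index: split each 64-bit signature into four 16-bit blocks
# and index it under every block. A signature within Hamming distance 3
# must share at least one block exactly with the query.
from collections import defaultdict

BITS, BLOCKS = 64, 4
WIDTH = BITS // BLOCKS          # 16 bits per block
MASK = (1 << WIDTH) - 1

def blocks(sig):
    """Split a 64-bit signature into four 16-bit blocks."""
    return [(sig >> (i * WIDTH)) & MASK for i in range(BLOCKS)]

index = defaultdict(set)        # (block position, block value) -> signatures

def add(sig):
    for i, b in enumerate(blocks(sig)):
        index[(i, b)].add(sig)

def query(sig, max_dist=3):
    candidates = set()
    for i, b in enumerate(blocks(sig)):
        candidates |= index[(i, b)]
    return [c for c in candidates if bin(sig ^ c).count("1") <= max_dist]
```

Only the candidates sharing an exact block are checked bit-by-bit, which is what reduces a billion comparisons to tens of thousands.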

Simhash Drawbacks

Simhash works best on massive collections of long texts. For short texts, the few weighted features make the fingerprint unstable and the fixed Hamming-distance threshold of 3 unreliable, so misjudgments are common; this is one reason WeChat requires articles to exceed 300 characters before they can claim originality.

Conclusion

The key to mastering Simhash is understanding random hyperplane hashing, which enables high‑dimensional vectors to be projected into low‑dimensional binary fingerprints. Many existing tutorials skip the dimensionality‑reduction step, leaving readers confused. This article provides a step‑by‑step walkthrough and references for deeper study.

References

https://www.cnblogs.com/shaosks/p/9121774.html

A discussion of locality-sensitive hashing and its underlying ideas: https://my.oschina.net/u/4367429/blog/3261406

http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html

https://zhuanlan.zhihu.com/p/81026564

http://www.hanting.tech/2017/05/23/simhash.html

https://zhuanlan.zhihu.com/p/92155250

https://blog.csdn.net/sunny_ss12/article/details/46949315

https://cloud.tencent.com/developer/article/1189493

https://www.cnblogs.com/sddai/p/10088007.html

https://www.cnblogs.com/hxsyl/p/4518506.html

https://www.iyunying.org/seo/dataanalysis/152232.html

https://cloud.tencent.com/developer/article/1390215

Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
