
How to Use Redis for Efficient Deduplication in Operations Data Analysis

This article explains practical methods for deduplicating and counting data in operational analytics using Redis, covering SET, ZSET, BITSET, HyperLogLog, and Bloom filter structures, their advantages, limitations, and suitable scenarios for real‑time and large‑scale metric calculations.

Efficient Ops

Mar 15, 2016

Overview

Today, a data platform engineer shares practical ways to perform deduplication and counting in operational data analysis using Redis. The discussion relies on the following Redis data structures, each shown with the command that writes to it:

SET: sadd key member

ZSET: zadd key score member

HYPERLOGLOG: pfadd key element

STRING (used as a bitmap): setbit key offset value

Two key terms are also used: Dimension (e.g., version, OS type, device model) and Composite Dimension (a combination of two or more dimensions).

1. Set‑based Deduplication

This straightforward method adds every element to a Redis SET with SADD, which ignores members that already exist, and reads the deduplicated count with SCARD. Dimensions can be encoded in the key name.

Advantages: Simple to use, precise counting.

Disadvantages: Not suitable for real‑time statistics; memory consumption is high and grows with each additional dimension, which makes it impractical for composite dimensions built from three or more dimensions.

Applicable Scenarios: When the unique count is very small, such as counting a limited number of active devices in a mobile app.
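One way to see why memory grows so quickly is to count the keys a single event touches. The sketch below is only an illustration: the key layout (`metric:day:name=value:...`) and the helper `dimension_keys` are hypothetical, and it assumes one SET is kept for every non-empty combination of dimensions.

```python
from itertools import combinations

def dimension_keys(metric, day, dims):
    """Return one SET key per non-empty combination of dimensions."""
    names = sorted(dims)
    keys = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            suffix = ":".join(f"{name}={dims[name]}" for name in combo)
            keys.append(f"{metric}:{day}:{suffix}")
    return keys

# Hypothetical event with three dimensions.
event = {"version": "1.2.0", "os": "android", "model": "pixel"}
keys = dimension_keys("active", "2016-03-15", event)
print(len(keys))  # 2**3 - 1 combinations

# With a redis-py client, each event would then cost one SADD per key,
# and each count one SCARD, for example:
#   for key in keys: client.sadd(key, device_id)
#   client.scard(keys[0])
```

Under this assumption every device ID is stored once per key, so three dimensions already mean seven copies of each member.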

2. ZSET‑based Deduplication

Using a sorted set, the key stores dimension information, the score stores a timestamp, and the member stores the device ID. ZCOUNT retrieves the number of members within a time range.

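Advantages: Insertion and counting are O(log N); can precisely count users from any past point up to now while preserving raw data.

Disadvantages: Only time windows that end at the present are supported, because writing a member again with ZADD overwrites its score, so each device keeps only its most recent timestamp. Memory usage also grows with each dimension, making the approach unsuitable for composite dimensions built from three or more dimensions.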

Applicable Scenarios: Real‑time active‑device counting over short windows (e.g., 1 min, 5 min, 10 min) where only a few dimensions are needed.
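The windowing limitation can be illustrated without a Redis server. In the sketch below a plain dictionary stands in for one ZSET; the class `LastSeenIndex` and the sample timestamps are made up for illustration, and only the overwrite-on-add behaviour is meant to mirror ZADD and ZCOUNT.

```python
class LastSeenIndex:
    """In-memory stand-in for one ZSET: member -> score (last-seen time)."""

    def __init__(self):
        self.scores = {}

    def zadd(self, member, score):
        # Like ZADD, adding an existing member overwrites its score.
        self.scores[member] = score

    def zcount(self, low, high):
        # Like ZCOUNT key low high: members with low <= score <= high.
        return sum(1 for s in self.scores.values() if low <= s <= high)

index = LastSeenIndex()
index.zadd("device-a", 100)
index.zadd("device-b", 130)
index.zadd("device-a", 170)   # device-a is active again; its score becomes 170

now = 180
active_last_60s = index.zcount(now - 60, float("inf"))   # window ending now
active_90_to_140 = index.zcount(90, 140)                 # window in the past
print(active_last_60s, active_90_to_140)
```

The window ending now counts both devices correctly. The past window reports one device even though two were active in it, because device-a's earlier timestamp was overwritten.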

3. Bitset‑based Deduplication

Map each user ID to a single bit in a bitmap, set it with SETBIT, and read the count with BITCOUNT; modern CPUs execute such bit operations very quickly.

Advantages: Precise results; supports AND/OR aggregation across dimensions; low‑cost operations for cross‑dimension calculations.

Disadvantages: Not suitable for minute‑level granularity; mapping user IDs to bits can cause hash collisions or require a large in‑memory mapping table.

Applicable Scenarios: Storing atomic data for long‑term or flexible time‑span aggregations.
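The AND/OR aggregation mentioned above can be sketched with Python integers standing in for Redis bitmaps. The user indexes and variable names are invented for illustration; in Redis the same steps would use SETBIT to write, BITOP AND or BITOP OR to combine, and BITCOUNT to count.

```python
def bitmap(user_indexes):
    """Build an integer bitmap with one bit set per user index."""
    bits = 0
    for index in user_indexes:
        bits |= 1 << index
    return bits

def popcount(bits):
    """Number of set bits, i.e. the deduplicated user count."""
    return bin(bits).count("1")

# Hypothetical per-dimension and per-day bitmaps.
android_users = bitmap([1, 2, 3, 5])
v2_users = bitmap([2, 3, 4])
monday = bitmap([1, 2])
tuesday = bitmap([2, 6])

android_and_v2 = popcount(android_users & v2_users)   # composite dimension
active_either_day = popcount(monday | tuesday)        # longer time span
print(android_and_v2, active_either_day)
```

Because composite dimensions are derived on demand with AND, only the single-dimension bitmaps need to be stored.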

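Note: Java's BitSet numbers bits from the least significant bit of each byte (low to high), while Redis numbers them from the most significant bit (high to low), so a bitmap exchanged between the two needs the bit order reversed within every byte.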
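The difference in bit order is easiest to see on concrete bytes. The following sketch builds both layouts in plain Python and converts between them; the function names are hypothetical, and the Java side assumes the little-endian layout produced by `BitSet.toByteArray()`.

```python
def redis_bytes(offsets, size):
    """Bytes as Redis SETBIT lays them out: offset 0 is the MSB of byte 0."""
    buf = bytearray(size)
    for off in offsets:
        buf[off // 8] |= 0x80 >> (off % 8)
    return bytes(buf)

def java_bytes(offsets, size):
    """Bytes as BitSet.toByteArray() lays them out: bit 0 is the LSB of byte 0."""
    buf = bytearray(size)
    for off in offsets:
        buf[off // 8] |= 1 << (off % 8)
    return bytes(buf)

def reverse_bits_per_byte(data):
    """Reverse the bit order inside every byte to convert between layouts."""
    return bytes(int(f"{b:08b}"[::-1], 2) for b in data)

offsets = [0, 9]
r = redis_bytes(offsets, 2)
j = java_bytes(offsets, 2)
print(r.hex(), j.hex())
print(reverse_bits_per_byte(r) == j)
```

The same two offsets yield different bytes in each layout, and reversing the bits within each byte maps one onto the other.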

4. HyperLogLog‑based Deduplication

HyperLogLog provides probabilistic cardinality estimation using at most about 12 KB per key, with a standard error of 0.81%.

http://blog.codinglabs.org/tag.html#基数估计
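Both figures follow from the structure's parameters. The short calculation below assumes the Redis implementation's 16384 registers of 6 bits each and the usual HyperLogLog standard-error formula 1.04 / sqrt(m); it ignores the small per-key header.

```python
import math

REGISTERS = 2 ** 14        # 16384 registers in a Redis HyperLogLog
BITS_PER_REGISTER = 6

# Size of the dense register array in bytes.
dense_bytes = REGISTERS * BITS_PER_REGISTER // 8

# Standard error of the estimate for m registers.
standard_error = 1.04 / math.sqrt(REGISTERS)

print(dense_bytes)                  # 12288 bytes, i.e. 12 KB
print(f"{standard_error:.2%}")      # 0.81%
```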

Advantages: Extremely low memory usage; can merge multiple HyperLogLog structures to obtain deduplicated counts for arbitrary time ranges; O(1) update time.

Disadvantages: Not suitable when exact counts are required.

Applicable Scenarios: Situations where a small error margin is acceptable, similar to bitset use cases.
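Merging across time ranges could look like the sketch below. It assumes one HyperLogLog key per metric per day; that naming scheme and the helper `daily_hll_keys` are hypothetical and do not come from the original article.

```python
from datetime import date, timedelta

def daily_hll_keys(metric, start, end):
    """Hypothetical naming scheme: one HyperLogLog key per metric per day."""
    days = (end - start).days + 1
    return [f"{metric}:{start + timedelta(days=i):%Y-%m-%d}" for i in range(days)]

keys = daily_hll_keys("active", date(2016, 3, 13), date(2016, 3, 15))
print(keys)

# With a redis-py client these keys could be combined in one call:
#   client.pfcount(*keys)                 # count of the union, nothing stored
#   client.pfmerge("active:3d", *keys)    # or store the union under a new key
```

Keeping the finest granularity as separate keys is what makes arbitrary ranges possible, since any set of daily keys can be merged later.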

5. Bloom Filter‑based Deduplication

Bloom filters use multiple hash functions to achieve low false‑positive rates with high space efficiency. Redis does not have a native Bloom filter type, but it can be implemented via Lua scripts.

https://github.com/erikdubbelboer/redis-lua-scaling-bloom-filter
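To show how a Bloom filter turns into a deduplicating counter, here is a minimal fixed-size sketch in plain Python. It is not the algorithm of the Lua library linked above: the class `BloomCounter`, the SHA-256 double hashing, and the filter size are assumptions chosen for illustration. In Redis, each computed offset would be a GETBIT/SETBIT offset on a STRING key.

```python
import hashlib

class BloomCounter:
    """Counts distinct items approximately with a fixed-size Bloom filter."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)
        self.count = 0

    def _offsets(self, item):
        # Derive k bit offsets from two 64-bit slices of one SHA-256 digest
        # (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Set the item's bits; bump the counter only if it looked new."""
        is_new = False
        for off in self._offsets(item):
            byte, mask = off // 8, 0x80 >> (off % 8)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        if is_new:
            self.count += 1
        return is_new

counter = BloomCounter(num_bits=1 << 16, num_hashes=7)
for device in ["d1", "d2", "d1", "d3", "d2"]:
    counter.add(device)
print(counter.count)
```

An item is counted only when at least one of its bits was still unset, so repeats never inflate the count. A false positive can only cause a genuinely new item to be skipped, which means the count errs on the low side.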

Advantages: For 2 million users, an error rate below 0.01 % can be achieved with roughly 8 MB of Redis memory.

Disadvantages: Bloom filter results cannot be merged.

Applicable Scenarios: Real‑time metrics with few dimensions, such as displaying today’s active device count on a dashboard.
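The memory figure in the Advantages line can be put in context with the textbook sizing formulas for a classic, fixed-size Bloom filter: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The helper `bloom_size` below is a hypothetical name, and the result is a theoretical lower bound rather than a measurement of any Redis implementation.

```python
import math

def bloom_size(n_items, false_positive_rate):
    """Optimal bit count and hash count for a classic Bloom filter."""
    bits = -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    hashes = bits / n_items * math.log(2)
    return math.ceil(bits), round(hashes)

# 2 million users at a 0.01% false-positive rate.
bits, hashes = bloom_size(2_000_000, 0.0001)
mib = bits / 8 / 1024 / 1024
print(f"{mib:.1f} MiB, {hashes} hash functions")
```

This works out to roughly 4.6 MiB. The larger figure of about 8 MB quoted above may reflect the extra layers a scaling filter allocates, though that explanation is an assumption and is not stated in the source.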


