How to Use Redis for Efficient Deduplication in Operations Data Analysis
This article explains practical methods for deduplicating and counting data in operational analytics using Redis, covering SET, ZSET, BITSET, HyperLogLog, and Bloom filter structures, their advantages, limitations, and suitable scenarios for real‑time and large‑scale metric calculations.
Overview
Today, a data platform engineer shares practical ways to perform deduplication and counting in operational data analysis using Redis. The article defines the following Redis data structures for the discussion:
SET: sadd key member
ZSET: zadd key score member
HYPERLOGLOG: pfadd key element
STRING: setbit key offset value
Key terms are also defined:
Dimension (e.g., version, OS type, device model) and Composite Dimension (a combination of two or more dimensions).
1. Set‑based Deduplication
This straightforward method adds each element (for example, a device ID) as a member of a Redis SET, which ignores duplicates, and uses SCARD to get the cardinality. Dimensions can be encoded in the key name.
Advantages: Simple to use, precise counting.
Disadvantages: Not suitable for real‑time statistics; memory consumption is high and grows with each additional dimension, because every dimension key stores its own copy of the member IDs, which makes it impractical for three or more composite dimensions.
Applicable Scenarios: When the unique count is very small, such as counting a limited number of active devices in a mobile app.
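The SET approach can be sketched as follows. This is a minimal illustration, not the author's implementation: `client` is assumed to be a redis-py style connection exposing `sadd` and `scard`, and the key layout (`metric:day:name=value`) and all function names are hypothetical choices made for this example.

```python
from itertools import combinations


def dimension_key(metric, day, dims):
    """Build a key such as 'active:2024-01-01:os=ios:version=1.2'.

    Dimension names are sorted so the same composite dimension always
    maps to the same key, whatever order the caller passes them in.
    """
    parts = [f"{name}={dims[name]}" for name in sorted(dims)]
    return ":".join([metric, day, *parts])


def keys_for_event(metric, day, dims, max_combo=2):
    """Return the overall key plus one key per dimension combination.

    Every extra combination is another SET holding the same device IDs,
    which is where the memory cost of this method comes from.
    """
    keys = [dimension_key(metric, day, {})]
    names = sorted(dims)
    for size in range(1, max_combo + 1):
        for combo in combinations(names, size):
            keys.append(dimension_key(metric, day, {n: dims[n] for n in combo}))
    return keys


def record_in_sets(client, metric, day, dims, device_id):
    """SADD the device into every relevant SET; duplicates are ignored."""
    for key in keys_for_event(metric, day, dims):
        client.sadd(key, device_id)


def unique_in_set(client, metric, day, dims):
    """SCARD returns the exact number of distinct members."""
    return client.scard(dimension_key(metric, day, dims))
```

With three dimensions and combinations of up to two, `keys_for_event` already yields seven keys per event, so each device ID is stored seven times. That multiplication is the memory growth described above.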
2. ZSET‑based Deduplication
Using a sorted set, the key stores dimension information, the score stores a timestamp, and the member stores the device ID. ZCOUNT retrieves the number of members within a time range.
Advantages: Insertion and counting are O(log N); can precisely count users from now to any past point while preserving raw data.
Disadvantages: Only time windows that end at the current moment are supported, because each member keeps just its most recent timestamp; memory usage grows with each dimension, making it unsuitable for three or more composite dimensions.
Applicable Scenarios: Real‑time active‑device counting over short windows (e.g., 1 min, 5 min, 10 min) where only a few dimensions are needed.
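A minimal sketch of the sorted-set window count, under stated assumptions: `client` is taken to be a redis-py style connection exposing `zadd(name, mapping)` and `zcount(name, min, max)`, and the function names are hypothetical.

```python
import time


def touch_device(client, key, device_id, now=None):
    """Record activity for a device.

    ZADD overwrites the score of an existing member, so each device
    keeps only its most recent timestamp.
    """
    now = time.time() if now is None else now
    client.zadd(key, {device_id: now})


def active_in_last(client, key, window_seconds, now=None):
    """Count devices whose last activity falls in [now - window, now]."""
    now = time.time() if now is None else now
    return client.zcount(key, now - window_seconds, now)
```

Because a later visit replaces the earlier score, a device seen at t=100 and again at t=500 no longer appears in a query for the range 50 to 200. Counts are therefore only reliable for windows that end at the current time.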
3. Bitset‑based Deduplication
Map each user ID to a single bit in a bitmap; modern CPUs can compute bit operations very quickly.
Advantages: Precise results; supports AND/OR aggregation across dimensions; low‑cost operations for cross‑dimension calculations.
Disadvantages: Not suitable for minute‑level granularity; mapping user IDs to bits can cause hash collisions or require a large in‑memory mapping table.
Applicable Scenarios: Storing atomic data for long‑term or flexible time‑span aggregations.
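The bitmap operations can be sketched like this. It assumes `client` is a redis-py style connection exposing `setbit`, `bitcount`, and `bitop`, and that each user has already been mapped to a dense integer index (the mapping table itself is not shown). Function and key names are hypothetical.

```python
def mark_active(client, key, user_index):
    """Set the bit for this user; user_index must be a small dense integer."""
    client.setbit(key, user_index, 1)


def count_active(client, key):
    """BITCOUNT returns the number of set bits, i.e. distinct users."""
    return client.bitcount(key)


def active_on_all(client, dest, keys):
    """Users present in every source bitmap (BITOP AND), stored in dest."""
    client.bitop("AND", dest, *keys)
    return client.bitcount(dest)


def active_on_any(client, dest, keys):
    """Users present in at least one source bitmap (BITOP OR), stored in dest."""
    client.bitop("OR", dest, *keys)
    return client.bitcount(dest)
```

Keeping one bitmap per day (or per dimension value) as the atomic unit lets a weekly unique count be derived with OR and a "both days" or "both dimensions" count with AND, without touching the raw events again.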
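Note: Java's BitSet numbers bits from the least significant bit to the most significant bit within each byte, while Redis numbers them from the most significant bit to the least, so bitmaps exchanged between the two need a per‑byte bit reversal.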
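The bit-order difference between Java and Redis can be illustrated with a small, Redis-free sketch. It assumes the input bytes come from Java's `BitSet.toByteArray()`; the helper names are hypothetical.

```python
def reverse_bits(byte):
    """Reverse the bit order inside one byte (0b00000001 -> 0b10000000)."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (byte & 1)
        byte >>= 1
    return result


# Precompute the mapping once; converting is then a table lookup per byte.
_REVERSED = bytes(reverse_bits(b) for b in range(256))


def java_bitset_to_redis(data):
    """Convert bytes from Java's BitSet.toByteArray() to Redis bit order."""
    return bytes(data).translate(_REVERSED)


def redis_bit(data, offset):
    """Read a bit the way Redis GETBIT does: offset 0 is the MSB of byte 0."""
    byte_index, bit_index = divmod(offset, 8)
    if byte_index >= len(data):
        return 0
    return (data[byte_index] >> (7 - bit_index)) & 1
```

For example, a Java BitSet with bits 0 and 9 set serializes to the bytes `01 02`. After conversion the bytes are `80 40`, and reading offsets 0 and 9 in Redis order returns 1 for both.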
4. HyperLogLog‑based Deduplication
HyperLogLog provides probabilistic cardinality estimation using at most about 12 KB per key, with a standard error of about 0.81%.
<code>http://blog.codinglabs.org/tag.html#基数估计</code>
Advantages: Extremely low memory usage; can merge multiple HyperLogLog structures to obtain deduplicated counts for arbitrary time ranges; O(1) update time.
Disadvantages: Not suitable when exact counts are required.
Applicable Scenarios: Situations where a small error margin is acceptable, similar to bitset use cases.
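The 12 KB and 0.81% figures, and the merge-based counting, can be sketched as below. The arithmetic follows from Redis using 16384 six-bit registers per HyperLogLog. The functions assume `client` is a redis-py style connection exposing `pfadd`, `pfcount`, and `pfmerge`; their names and the key names in the usage note are hypothetical.

```python
import math

REGISTERS = 2 ** 14          # Redis HyperLogLog uses 16384 registers
BITS_PER_REGISTER = 6

# Dense encoding size: 16384 * 6 bits = 12288 bytes, i.e. 12 KB.
DENSE_BYTES = REGISTERS * BITS_PER_REGISTER // 8

# Standard error of HyperLogLog: 1.04 / sqrt(m) = 1.04 / 128.
STANDARD_ERROR = 1.04 / math.sqrt(REGISTERS)


def pf_record(client, key, device_id):
    """PFADD updates the sketch for one period or dimension key."""
    client.pfadd(key, device_id)


def unique_across(client, keys):
    """PFCOUNT over several keys estimates the cardinality of their union."""
    return client.pfcount(*keys)


def merge_into(client, dest, keys):
    """PFMERGE stores the union in dest, e.g. a weekly key from daily keys."""
    client.pfmerge(dest, *keys)
    return client.pfcount(dest)
```

Keeping one HyperLogLog per day and calling `unique_across(client, ["uv:day1", "uv:day2"])` gives an estimated deduplicated total for the two days without storing any device IDs.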
5. Bloom Filter‑based Deduplication
Bloom filters use multiple hash functions to achieve low false‑positive rates with high space efficiency. Redis does not have a native Bloom filter type, but it can be implemented via Lua scripts.
<code>https://github.com/erikdubbelboer/redis-lua-scaling-bloom-filter</code>
Advantages: For 2 million users, an error rate below 0.01 % can be achieved with roughly 8 MB of Redis memory.
Disadvantages: Bloom filter results cannot be merged.
Applicable Scenarios: Real‑time metrics with few dimensions, such as displaying today’s active device count on a dashboard.
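The sizing claim and the counting idea can be sketched in plain Python. This is an in-memory stand-in using the classic Bloom filter formulas, not the linked Lua implementation. The class, its double-hashing scheme, and the choice of SHA-256 are assumptions made for illustration. In Redis, each bit write below would correspond to one SETBIT on a string key.

```python
import hashlib
import math


def bloom_size(n_items, false_positive_rate):
    """Return (bits, hash_count) for a classic Bloom filter."""
    bits = math.ceil(-n_items * math.log(false_positive_rate) / math.log(2) ** 2)
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


class BloomCounter:
    """Counts items that look new; false positives cause slight undercounting."""

    def __init__(self, n_items, false_positive_rate):
        self.bits, self.hashes = bloom_size(n_items, false_positive_rate)
        self.data = bytearray((self.bits + 7) // 8)
        self.count = 0

    def _offsets(self, item):
        # Double hashing: derive k offsets from two 64-bit parts of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def add(self, item):
        """Return True (and increment the count) if any bit was still unset."""
        is_new = False
        for offset in self._offsets(item):
            byte_index, bit_index = divmod(offset, 8)
            mask = 0x80 >> bit_index
            if not self.data[byte_index] & mask:
                is_new = True
                self.data[byte_index] |= mask
        if is_new:
            self.count += 1
        return is_new
```

For 2 million items at a 0.01 % false-positive rate, `bloom_size` gives about 38 million bits (under 5 MB) and 13 hash functions. That is the theoretical minimum for a fixed-size filter, so a scaling implementation landing at roughly 8 MB is plausible. The running `count` is what a dashboard would display as today's active devices.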
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.