
How to Use Redis for Efficient Deduplication in Operations Data Analysis

This article explains practical methods for deduplicating and counting data in operational analytics using Redis, covering SET, ZSET, BITSET, HyperLogLog, and Bloom filter structures, their advantages, limitations, and suitable scenarios for real‑time and large‑scale metric calculations.

Efficient Ops

Overview

Today, a data platform engineer shares practical ways to perform deduplication and counting in operational data analysis using Redis. The article defines the following Redis data structures for the discussion:

SET: sadd key member

ZSET: zadd key score member

HYPERLOGLOG: pfadd key element

STRING: setbit key offset value

Key terms are also defined: Dimension (e.g., version, OS type, device model) and Composite Dimension (a combination of two or more dimensions).

1. Set‑based Deduplication

This straightforward method inserts each unique element as a member of a Redis SET and uses SCARD to get the cardinality. Dimensions can be encoded in the key name.

Advantages: Simple to use, precise counting.

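Disadvantages: Not suitable for real‑time statistics; every member is stored in full, and because each dimension combination needs its own key, memory consumption multiplies with each additional dimension, making the approach impractical for three or more composite dimensions.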

Applicable Scenarios: When the unique count is very small, such as counting a limited number of active devices in a mobile app.
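
The SET approach can be sketched as follows. This is a minimal illustration, not the author's code: the key layout built by `set_key` is an assumed convention, and `client` is assumed to be a redis-py style object (for example `redis.Redis()`) exposing `sadd` and `scard`.

```python
def set_key(metric, day, **dimensions):
    """Build a key that encodes the dimensions (hypothetical naming scheme)."""
    parts = [metric, day]
    # Sort dimension names so the same combination always yields the same key.
    parts += [f"{name}={value}" for name, value in sorted(dimensions.items())]
    return ":".join(parts)


def record_device(client, key, device_id):
    """SADD the device ID; Redis ignores members that are already present."""
    return client.sadd(key, device_id)


def unique_devices(client, key):
    """SCARD returns the exact number of distinct members."""
    return client.scard(key)
```

Each distinct combination of dimension values produces its own key and its own full copy of the member IDs, which is where the memory cost of composite dimensions comes from.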

2. ZSET‑based Deduplication

Using a sorted set, the key stores dimension information, the score stores a timestamp, and the member stores the device ID. ZCOUNT retrieves the number of members within a time range.

Advantages: Insertion and counting are O(log N); can precisely count users from now to any past point while preserving raw data.

Disadvantages: Only intervals that end at the present are counted reliably, because each member keeps a single score (its latest timestamp); memory usage grows with each dimension, making it unsuitable for three or more composite dimensions.

Applicable Scenarios: Real‑time active‑device counting over short windows (e.g., 1 min, 5 min, 10 min) where only a few dimensions are needed.
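
A sliding-window count along these lines might look like the sketch below. The function names and the trimming step are assumptions added for illustration; `client` is assumed to be a redis-py style object exposing `zadd`, `zcount`, and `zremrangebyscore`.

```python
import time


def record_activity(client, key, device_id, now=None):
    """Store the device with its activity time as the score."""
    now = time.time() if now is None else now
    # ZADD overwrites the score of an existing member, so each device
    # keeps only its most recent timestamp.
    client.zadd(key, {device_id: now})


def active_in_last(client, key, window_seconds, now=None):
    """Count devices whose latest activity falls in [now - window, now]."""
    now = time.time() if now is None else now
    return client.zcount(key, now - window_seconds, now)


def trim_older_than(client, key, max_age_seconds, now=None):
    """Drop members that are too old to matter, to bound memory use."""
    now = time.time() if now is None else now
    return client.zremrangebyscore(key, "-inf", now - max_age_seconds)
```

Because a device's earlier timestamps are overwritten, a window that ends in the past would miss devices that have been active since; only windows ending at the current time are counted precisely.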

3. Bitset‑based Deduplication

Map each user ID to a single bit in a bitmap; modern CPUs can compute bit operations very quickly.

Advantages: Precise results; supports AND/OR aggregation across dimensions; low‑cost operations for cross‑dimension calculations.

Disadvantages: Not suitable for minute‑level granularity; mapping user IDs to bits can cause hash collisions or require a large in‑memory mapping table.

Applicable Scenarios: Storing atomic data for long‑term or flexible time‑span aggregations.
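
As a rough sketch of the bitmap approach, the helpers below set one bit per user and combine bitmaps across dimensions. They assume each user already has a dense integer index (the mapping table mentioned above is not shown), and that `client` is a redis-py style object exposing `setbit`, `bitcount`, and `bitop`; the function names are illustrative.

```python
def mark_active(client, key, user_index):
    """Set the bit at position user_index (a dense integer ID assigned elsewhere)."""
    client.setbit(key, user_index, 1)


def count_active(client, key):
    """BITCOUNT returns the number of bits set, i.e. the distinct users."""
    return client.bitcount(key)


def count_active_in_all(client, dest_key, *keys):
    """Users present in every bitmap: BITOP AND, then count the result."""
    client.bitop("AND", dest_key, *keys)
    return client.bitcount(dest_key)


def count_active_in_any(client, dest_key, *keys):
    """Users present in at least one bitmap: BITOP OR, then count the result."""
    client.bitop("OR", dest_key, *keys)
    return client.bitcount(dest_key)
```

For example, with one bitmap per day, the OR of seven daily keys gives weekly unique users, and the AND of an "os" bitmap with a "version" bitmap gives users matching both dimensions.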

Note: In Java, BitSet stores bits from low to high within a byte, while Redis stores them from high to low.
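
To make the bit-order difference concrete: `SETBIT key 0 1` in Redis produces the byte 0x80, whereas Java's `BitSet.valueOf(byte[])` treats the least significant bit of the first byte as bit 0. One way to bridge the two, shown here as an illustrative helper rather than anything from the original article, is to reverse the bits inside each byte of the raw value.

```python
# Lookup table mapping every byte value to the same byte with its bits reversed.
_REVERSED = bytes(int(f"{value:08b}"[::-1], 2) for value in range(256))


def reverse_bits_in_each_byte(data: bytes) -> bytes:
    """Convert between Redis bit order (high to low) and Java BitSet order (low to high)."""
    return data.translate(_REVERSED)
```

The same function converts in both directions, since reversing twice restores the original bytes.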

4. HyperLogLog‑based Deduplication

HyperLogLog provides probabilistic cardinality estimation. In Redis, each instance uses at most about 12 KB and has a standard error of 0.81 %.

<code>http://blog.codinglabs.org/tag.html#基数估计</code>
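
Both figures follow from the structure's parameters. The short calculation below assumes the Redis dense representation of 16384 registers at 6 bits each and the usual HyperLogLog standard-error approximation of 1.04 / sqrt(m); the function names are illustrative.

```python
import math

REGISTERS = 16384        # number of registers in Redis's HyperLogLog
BITS_PER_REGISTER = 6    # bits used by each register in the dense encoding


def hll_dense_bytes(registers=REGISTERS, bits=BITS_PER_REGISTER):
    """Size of the register array in bytes (excluding the small header)."""
    return registers * bits // 8


def hll_standard_error(registers=REGISTERS):
    """Approximate relative standard error of the estimate."""
    return 1.04 / math.sqrt(registers)
```

With these parameters the register array is 12288 bytes (12 KB), and the standard error is 1.04 / 128, or about 0.81 %.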

Advantages: Extremely low memory usage; can merge multiple HyperLogLog structures to obtain deduplicated counts for arbitrary time ranges; O(1) update time.

Disadvantages: Not suitable when exact counts are required.

Applicable Scenarios: Situations where a small error margin is acceptable, similar to bitset use cases.
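
A sketch of per-day HyperLogLog keys combined over a time range is shown below. The per-day key layout and function names are assumptions; `client` is assumed to be a redis-py style object exposing `pfadd`, `pfcount`, and `pfmerge`.

```python
def record_visit(client, day_key, device_id):
    """PFADD the device ID into that day's HyperLogLog."""
    client.pfadd(day_key, device_id)


def estimate_unique(client, *day_keys):
    """PFCOUNT over several keys estimates the cardinality of their union."""
    return client.pfcount(*day_keys)


def merge_days(client, dest_key, *day_keys):
    """PFMERGE stores the union in dest_key so it can be reused later."""
    client.pfmerge(dest_key, *day_keys)
    return client.pfcount(dest_key)
```

Passing seven daily keys to `estimate_unique` gives an approximate weekly unique count without storing any device IDs.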

5. Bloom Filter‑based Deduplication

Bloom filters use multiple hash functions to achieve low false‑positive rates with high space efficiency. Redis does not have a native Bloom filter type, but it can be implemented via Lua scripts.

<code>https://github.com/erikdubbelboer/redis-lua-scaling-bloom-filter1</code>

Advantages: For 2 million users, an error rate below 0.01 % can be achieved with roughly 8 MB of Redis memory.

Disadvantages: Bloom filter results cannot be merged.

Applicable Scenarios: Real‑time metrics with few dimensions, such as displaying today’s active device count on a dashboard.
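
The sketch below shows one way to size a Bloom filter and use it for counting on top of a Redis bitmap. It is a plain client-side illustration, not the Lua implementation linked above: the standard sizing formulas, the SHA-256 double-hashing scheme, and all function names are assumptions, and `client` is assumed to be a redis-py style object exposing `setbit` and `incr`.

```python
import hashlib
import math


def bloom_bits(expected_items, false_positive_rate):
    """Theoretical minimum number of bits for a fixed-size Bloom filter."""
    return math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2)


def bloom_hashes(expected_items, num_bits):
    """Number of hash functions that minimizes the false-positive rate."""
    return max(1, round(num_bits / expected_items * math.log(2)))


def bloom_offsets(item, num_bits, num_hashes):
    """Derive num_hashes bit positions from one SHA-256 digest (double hashing)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd and non-zero
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]


def add_if_new(client, key, item, num_bits, num_hashes):
    """Set the item's bits; return True if any bit was previously 0."""
    is_new = False
    for offset in bloom_offsets(item, num_bits, num_hashes):
        # SETBIT returns the bit's previous value.
        if client.setbit(key, offset, 1) == 0:
            is_new = True
    return is_new


def count_device(client, filter_key, counter_key, device_id, num_bits, num_hashes):
    """Increment the running unique count only for devices not seen before."""
    if add_if_new(client, filter_key, device_id, num_bits, num_hashes):
        client.incr(counter_key)
```

For 2 million items at a 0.01 % false-positive rate, `bloom_bits` gives a little under 5 MB, so the roughly 8 MB quoted above is above the theoretical minimum. Note that the separate SETBIT and INCR calls here are not atomic; running the logic in a Lua script, as the article suggests, avoids that.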
