
How Reddit Counts Page Views at Scale Using HyperLogLog and Kafka

The article explains Reddit's large‑scale page‑view counting system, detailing its real‑time requirements, the challenges of naive hash‑set storage, and how a hybrid approach combining linear probabilistic counting and the HyperLogLog algorithm with Kafka, Redis, and Cassandra achieves accurate, low‑memory, near‑real‑time analytics.


Reddit needs a real‑time, accurate page‑view counting system that records each user only once per time window, keeps error within a few percent, and processes counts within seconds in production.

A simple solution of maintaining a per‑article hash table of unique user IDs works for small traffic but becomes infeasible when popular posts attract millions of viewers, as storing millions of 8‑byte IDs would require many megabytes of memory per article.
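The naive approach can be sketched in a few lines; the snippet below is an illustration of the idea and its memory arithmetic, not Reddit's actual code.

```python
# Naive unique-view counting: one hash set of user IDs per article.
# Illustrative sketch only, not Reddit's production implementation.
views = {}  # article_id -> set of 64-bit user IDs

def record_view(article_id, user_id):
    views.setdefault(article_id, set()).add(user_id)

def unique_views(article_id):
    return len(views.get(article_id, ()))

record_view("post-1", 42)
record_view("post-1", 42)   # repeat view by the same user is deduplicated
record_view("post-1", 43)
print(unique_views("post-1"))  # 2

# Back-of-the-envelope cost of the raw ID payload alone for a viral post:
print(1_000_000 * 8)  # 8,000,000 bytes, ~8 MB, before any set overhead
```

The hash set gives exact answers, but its memory grows linearly with the number of unique viewers, which is exactly what breaks at Reddit's scale.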

To reduce memory while preserving accuracy, Reddit evaluates two cardinality‑estimation methods: linear probabilistic counting, which is highly precise but whose memory use grows with the number of unique viewers, and HyperLogLog (HLL), which trades a small amount of precision for a far smaller, sub‑linear memory footprint.
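Linear probabilistic counting can be sketched as follows: each ID is hashed to one bit of a fixed-size bitmap, and the cardinality is estimated from the fraction of bits still zero. The bitmap size and hash choice below are illustrative assumptions, not parameters from the article.

```python
import hashlib
import math

# Linear probabilistic counting, sketched. Parameters are illustrative.
M = 1 << 16                 # 65,536-bit bitmap (8 KB)
bitmap = bytearray(M // 8)

def lc_add(item: str) -> None:
    # Hash the ID to a single bit position and set that bit.
    h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big") % M
    bitmap[h // 8] |= 1 << (h % 8)

def lc_estimate() -> float:
    # n_hat = -m * ln(V), where V is the fraction of bits still zero.
    zero_bits = sum(bin(b ^ 0xFF).count("1") for b in bitmap)
    return -M * math.log(zero_bits / M)

for i in range(10_000):
    lc_add(f"user-{i}")
print(round(lc_estimate()))  # close to 10,000, typically within ~1%
```

The accuracy is excellent while the bitmap is sparse, but the bitmap must be sized in proportion to the expected cardinality, which is why the method is described as memory-intensive at high cardinalities.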

An example shows that storing one million unique user IDs would need about 8 MB, whereas an HLL sketch for the same data consumes only ~12 KB (0.15 % of the naive approach).
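A minimal HyperLogLog sketch makes the 12 KB figure concrete: with precision p=14 there are 2^14 = 16,384 six-bit registers, roughly 12 KB of state regardless of how many IDs are added. This is an illustrative toy implementation, not Redis's.

```python
import hashlib
import math

# Minimal HyperLogLog sketch (illustrative; Redis's implementation differs).
P = 14
M = 1 << P                        # 16,384 registers, ~12 KB at 6 bits each
ALPHA = 0.7213 / (1 + 1.079 / M)  # standard bias-correction constant
registers = [0] * M

def hll_add(item: str) -> None:
    x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
    idx = x >> (64 - P)                      # first p bits select a register
    rest = x & ((1 << (64 - P)) - 1)         # remaining bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
    registers[idx] = max(registers[idx], rank)

def hll_count() -> int:
    est = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    if est <= 2.5 * M:                       # small-range correction
        zeros = registers.count(0)
        if zeros:
            est = M * math.log(M / zeros)
    return round(est)

for i in range(1_000_000):
    hll_add(f"user-{i}")
print(hll_count())  # within a few percent of 1,000,000, from ~12 KB of state
```

One million 8-byte IDs would occupy 8 MB stored exactly; the register array above never grows past its fixed ~12 KB, which is the 0.15% ratio quoted in the article.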

In practice Reddit adopts a hybrid strategy: use the precise linear method for low‑cardinality data and switch to HLL once a threshold is reached, combining the strengths of both techniques.
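The threshold-switching idea can be sketched as a counter that stays exact while small and promotes itself to an HLL-style sketch once it crosses a limit. The class, threshold, and use of a plain set for the precise stage are all illustrative assumptions.

```python
import hashlib

class HybridCounter:
    """Sketch of a hybrid counter: exact below a threshold, HLL-style above.
    Illustrative only; the threshold and structure are hypothetical."""

    P = 10  # 1,024 registers once promoted, ~3% typical error

    def __init__(self, threshold=1_000):
        self.threshold = threshold
        self.exact = set()     # precise stage for low cardinality
        self.registers = None  # sketch stage, created on promotion

    def add(self, user_id: str) -> None:
        if self.registers is None:
            self.exact.add(user_id)
            if len(self.exact) > self.threshold:
                self.registers = [0] * (1 << self.P)
                for uid in self.exact:  # replay exact members into the sketch
                    self._sketch_add(uid)
                self.exact = None
        else:
            self._sketch_add(user_id)

    def _sketch_add(self, user_id: str) -> None:
        x = int.from_bytes(hashlib.sha1(user_id.encode()).digest()[:8], "big")
        idx = x >> (64 - self.P)
        rest = x & ((1 << (64 - self.P)) - 1)
        rank = (64 - self.P) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> int:
        if self.registers is None:
            return len(self.exact)  # exact while small
        m = 1 << self.P
        alpha = 0.7213 / (1 + 1.079 / m)
        return round(alpha * m * m / sum(2.0 ** -r for r in self.registers))

c = HybridCounter()
for i in range(50):
    c.add(f"user-{i}")
print(c.count())  # 50: still in the exact stage
```

Small posts get exact counts for free; only the minority of viral posts pay the (bounded) error of the sketch.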

Several implementations are considered, including Twitter's Scala‑based Algebird library, the Java‑based stream‑lib HyperLogLog++ implementation, and Redis's built‑in HLL support, which was ultimately chosen for its documentation, API, and reduced CPU/memory concerns.

The data pipeline relies on Apache Kafka: each view event is sent to a Kafka topic and then processed by two sequential consumers. The first consumer, called Nazar, filters events, de‑duplicates rapid repeat views, and tags events for counting. The second consumer, Abacus, performs the actual counting: it checks Redis for an existing HLL sketch and issues a PFADD if one is present; otherwise it creates a new sketch in Cassandra and stores it in Redis.
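The Abacus read path described above can be sketched with plain dicts standing in for Redis and Cassandra, and Python sets standing in for HLL sketches. Every name here is an illustrative stand-in, not Reddit's code or the Redis API.

```python
# Sketch of the Abacus counting step. Dicts stand in for Redis and
# Cassandra; sets stand in for HLL sketches. Names are illustrative.
redis_store = {}      # hot sketches, as Redis would hold them
cassandra_store = {}  # durable sketches

def abacus_handle(event: dict) -> None:
    """Count one tagged view event for its post."""
    key = f"views:{event['post_id']}"
    if key in redis_store:
        # Sketch already hot in the cache: update it (analogous to PFADD).
        redis_store[key].add(event["user_id"])
    else:
        # Cache miss: restore a copy from Cassandra if one exists,
        # otherwise create a new durable sketch, then cache it in Redis.
        if key in cassandra_store:
            sketch = set(cassandra_store[key])
        else:
            sketch = set()
            cassandra_store[key] = set()  # new empty sketch persisted
        sketch.add(event["user_id"])
        redis_store[key] = sketch

abacus_handle({"post_id": "p1", "user_id": "u1"})
abacus_handle({"post_id": "p1", "user_id": "u1"})  # duplicate: no growth
abacus_handle({"post_id": "p1", "user_id": "u2"})
print(len(redis_store["views:p1"]))  # 2 unique viewers
```

Note that in this toy version the Redis copy runs ahead of the Cassandra copy; the real pipeline reconciles them with the periodic write-back described next.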

To persist HLL data and avoid Redis eviction, Abacus periodically batches HLL sketches from Redis and writes them back to Cassandra in 10‑second groups, ensuring durability without overloading the cluster.
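The batched write-back can be sketched as a flusher that collects dirty keys and writes them to the durable store in one grouped operation once the interval has elapsed. The stores, interval handling, and function names are illustrative stand-ins.

```python
# Sketch of the periodic write-back described above. Dicts stand in for
# Redis and Cassandra; names and structure are illustrative.
FLUSH_INTERVAL = 10.0  # seconds, per the 10-second grouping in the article

redis_store = {"views:p1": {"u1", "u2"}, "views:p2": {"u3"}}
cassandra_store = {}
dirty_keys = set(redis_store)  # keys updated since the last flush
last_flush = 0.0

def maybe_flush(now: float) -> int:
    """Write all dirty sketches back in one batch if the interval elapsed.
    Returns the number of sketches flushed."""
    global last_flush
    if now - last_flush < FLUSH_INTERVAL or not dirty_keys:
        return 0
    batch = {k: set(redis_store[k]) for k in dirty_keys}  # snapshot the batch
    cassandra_store.update(batch)  # one grouped write, not per-key traffic
    dirty_keys.clear()
    last_flush = now
    return len(batch)

print(maybe_flush(now=5.0))   # 0: interval not yet elapsed
print(maybe_flush(now=12.0))  # 2: both sketches flushed in one batch
```

Batching trades a few seconds of potential data loss for far fewer, larger writes, which is what keeps the Cassandra cluster from being overloaded.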

Tags: Big Data, HyperLogLog, Redis, Kafka, page view counting, Reddit
Written by Selected Java Interview Questions
A professional Java tech channel sharing common knowledge to help developers fill gaps.