
Designing a Scalable Real‑Time Mobile Analytics Platform with Kafka, Storm, and Amazon EMR

The article describes how a mobile analytics service processes billions of events daily using a Lambda‑style architecture that combines Kafka, Storm, Amazon EMR, and S3 to achieve scalable, fault‑tolerant batch and real‑time computation, while ensuring reliable event ingestion and graceful degradation.

Art of Distributed System Architecture Design

Last year the team launched "Answers," a mobile analytics service that now handles about 5 billion sessions per day, with hundreds of millions of devices sending millions of analytics events every second.

The core challenge is to turn this massive stream of data into reliable, real‑time insights for mobile developers.

At a high level, the architecture follows three principles: component decoupling, asynchronous communication, and graceful degradation. It uses a Lambda-style design in which a batch pipeline and a streaming pipeline run side by side and their results are merged at query time.

Event collection is designed to minimize battery and network impact on the device: devices batch and compress analytics payloads, retry with exponential back-off on failure, and trigger uploads based on elapsed time, message count, or the app moving to the background. The result is tens of thousands of compressed payloads per second arriving at the service.
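The device-side behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `EventBatcher`, the thresholds, the JSON-plus-gzip payload format, and the "full jitter" back-off variant are all hypothetical choices, not details taken from the Answers SDK.

```python
import gzip
import json
import random


def backoff_delays(base=1.0, cap=60.0, attempts=5, rng=None):
    """Return jittered exponential back-off delays, in seconds."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on every attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": pick a random delay below the ceiling so that
        # many devices do not retry at the same instant.
        delays.append(rng.uniform(0, ceiling))
    return delays


class EventBatcher:
    """Buffers events and decides when a compressed upload is due."""

    def __init__(self, max_events=100, max_age_s=300.0):
        self.max_events = max_events  # hypothetical count trigger
        self.max_age_s = max_age_s    # hypothetical time trigger
        self.events = []
        self.oldest_ts = None

    def add(self, event, now):
        if not self.events:
            self.oldest_ts = now
        self.events.append(event)

    def should_flush(self, now, backgrounded=False):
        """True when any trigger fires: backgrounding, count, or age."""
        if not self.events:
            return False
        return (
            backgrounded
            or len(self.events) >= self.max_events
            or now - self.oldest_ts >= self.max_age_s
        )

    def flush(self):
        """Return one gzip-compressed JSON payload and reset the buffer."""
        payload = gzip.compress(json.dumps(self.events).encode("utf-8"))
        self.events = []
        self.oldest_ts = None
        return payload
```

A caller would check `should_flush` whenever an event is added or the app changes state, send the result of `flush()`, and on failure wait for the next value from `backoff_delays()` before retrying.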

Incoming payloads pass through an Amazon ELB and are written to a durable Kafka queue. Kafka persists and replicates the data, which is then quickly off-loaded to Amazon S3 for long-term storage.

For batch processing, the team uses Amazon EMR with the Cascading framework to run MapReduce jobs on the S3 data, writing results back to S3 and finally loading them into a Cassandra cluster for sub‑second API queries.
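To make the batch step concrete, here is a rough Python stand-in for the kind of aggregation such a job might perform. It is not Cascading code and not the team's actual job: the event fields (`app_id`, `device_id`, `ts`) and the two metrics are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def map_event(event):
    """Map step: emit ((app_id, day), device_id) for one session event."""
    day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).date().isoformat()
    return (event["app_id"], day), event["device_id"]


def reduce_events(pairs):
    """Reduce step: session count and exact distinct devices per key."""
    sessions = defaultdict(int)
    devices = defaultdict(set)
    for key, device_id in pairs:
        sessions[key] += 1
        devices[key].add(device_id)
    return {
        key: {
            "sessions": sessions[key],
            "daily_active_devices": len(devices[key]),
        }
        for key in sessions
    }
```

Because the batch layer can afford to keep an exact set of device ids per key, its results can be treated as the accurate reference that later replaces any approximate real-time figures.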

Because the batch pipeline alone is not truly real-time, a separate Storm topology now consumes the same Kafka topics, applying the same logic as the batch jobs but in a streaming fashion. Probabilistic algorithms such as Bloom filters and HyperLogLog are used to keep resource usage low while maintaining acceptable accuracy.
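As an illustration of the trade-off, the following is a minimal Bloom filter in Python, used to count first-time devices in a stream without storing every id. It is a sketch, not the Storm implementation: the sizing uses the standard formulas for bit count and hash count, and the double-hashing scheme built on SHA-256 is an assumption made to keep the example dependency-free.

```python
import hashlib
import math


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            1, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        )
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Insert item; return True if it was possibly seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

A streaming counter can then increment a "new devices" total whenever `add` returns `False`. A false positive makes the filter occasionally skip a genuinely new device, so the count can only undershoot, which is the accuracy given up in exchange for a fixed memory footprint.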

The API merges the two data sets: historical queries rely on batch‑generated Cassandra data, while recent data (the last day or two) is supplemented with real‑time results, ensuring up‑to‑date answers.
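One way to picture the merge is a small function over per-day counts. The rule used here (batch values win whenever they exist, and real-time values only fill gaps inside a short recent window) is an assumption about how "supplemented" might be implemented; the function name and the two-day default are hypothetical.

```python
from datetime import date, timedelta


def merge_daily_counts(batch, realtime, today, realtime_window_days=2):
    """Combine {date: count} maps from the batch and real-time layers."""
    # Days on or before the cutoff are served from batch data only.
    cutoff = today - timedelta(days=realtime_window_days)
    merged = dict(batch)
    for day, value in realtime.items():
        # Real-time values only fill recent days the batch layer has not reached.
        if day > cutoff and day not in merged:
            merged[day] = value
    return dict(sorted(merged.items()))
```

With `today = date(2015, 3, 10)`, batch data through March 8 and real-time data through March 10 would produce a series in which March 9 and 10 come from the streaming layer and everything earlier comes from the batch layer.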

Error handling is built in at every layer: device‑side retries guarantee eventual delivery, Kafka’s durability prevents data loss during real‑time failures, and the decoupled design allows the batch layer to continue operating when streaming fails (and vice‑versa). Engineers are alerted to any outages, and the system gracefully degrades without user‑visible downtime.
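The graceful-degradation idea at the API layer can be sketched as follows. This is an illustrative pattern, not the service's code: `fetch_batch` and `fetch_realtime` are hypothetical callables that each return a dictionary of results, and the logger name is arbitrary.

```python
import logging

logger = logging.getLogger("analytics.api")


def query_with_degradation(fetch_batch, fetch_realtime):
    """Return (data, degraded); either layer may fail independently."""
    data, degraded = {}, False
    for name, fetch in (("batch", fetch_batch), ("realtime", fetch_realtime)):
        try:
            for key, value in fetch().items():
                # Batch is queried first, so its values take precedence.
                data.setdefault(key, value)
        except Exception:
            # Log for alerting, then keep serving whatever the other layer has.
            logger.exception("%s layer unavailable; serving partial results", name)
            degraded = True
    return data, degraded
```

The `degraded` flag lets the caller decide whether to annotate the response or raise an alert, while the request itself still succeeds with partial data.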

Overall, the system consists of four main components (event ingestion, durable storage, batch computation, and real-time computation) linked by persistent queues that isolate failures and enable seamless recovery, providing a robust analytics platform for mobile applications.
