Backend Development

Using Kafka as a Storage System for Twitter’s Account Activity Replay API

The article explains how Twitter built the Account Activity Replay API by repurposing Kafka as a storage layer, detailing the system’s architecture, partitioning strategy, request handling, deduplication, and performance optimizations to provide reliable event recovery for developers.


When developers consume Twitter’s public data via its APIs, they need guarantees of reliability, speed, and stability. To address this, Twitter introduced the Account Activity Replay API, a data‑recovery tool that can retrieve events up to five days old and recover events missed because of server interruptions.

The initiative also aimed to boost internal engineers’ productivity, keep the system maintainable, and reduce context‑switching for developers, site‑reliability engineers, and other stakeholders.

To minimize redesign effort, the replay system reuses the existing real‑time publish‑subscribe architecture and adopts Kafka as a storage backend, turning the streaming platform into a durable log.

Two data centers generate real‑time events that are written to cross‑DC Kafka topics for redundancy. An internal filtering service consumes these topics, checks each event against rules stored in a key‑value store, and decides whether to forward the event to developers via Webhook URLs, each identified by a unique ID.
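The filtering step can be sketched as a predicate lookup per webhook. This is an illustrative sketch, not Twitter's code: the rule store is modeled as a plain dictionary mapping webhook IDs to predicates, standing in for the key‑value store described above.

```python
# Minimal sketch of the filtering service (all names are illustrative).
# Each rule maps a webhook ID to a predicate over events; matching events
# would then be forwarded to that webhook's URL.

def filter_events(events, rules):
    """Yield (webhook_id, event) pairs for every rule an event matches.

    `rules` is a mapping of webhook_id -> predicate(event) -> bool,
    standing in for the key-value store of per-developer rules.
    """
    for event in events:
        for webhook_id, predicate in rules.items():
            if predicate(event):
                yield (webhook_id, event)

# Example: one webhook subscribed to "favorite" events, one to "follow" events.
rules = {
    "wh-123": lambda e: e["type"] == "favorite",
    "wh-456": lambda e: e["type"] == "follow",
}
events = [{"type": "favorite", "id": 1}, {"type": "follow", "id": 2}]
deliveries = list(filter_events(events, rules))
# deliveries == [("wh-123", {"type": "favorite", "id": 1}),
#                ("wh-456", {"type": "follow", "id": 2})]
```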

Instead of a Hadoop/HDFS stack, Kafka was chosen because the real‑time system already uses a pub/sub model, the replay workload does not reach petabyte scale, and Kafka offers lower latency and cost. Topics named delivery_log are cross‑replicated between data centers, partitioned by the hash of the Webhook ID, and stored on SSDs with Snappy compression, chosen for its fast decompression.
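Partitioning by the hash of the Webhook ID means every event for a given webhook lands in the same partition, preserving per‑webhook ordering. A minimal sketch of that mapping (the hash function here is an illustrative choice; Kafka's default partitioner actually uses murmur2):

```python
import hashlib

def partition_for(webhook_id: str, num_partitions: int) -> int:
    """Map a webhook ID to a delivery_log partition.

    A stable hash of the ID, mod the partition count, keeps all events
    for one webhook in a single partition. MD5 is used here only for
    illustration; Kafka's default key partitioner uses murmur2.
    """
    digest = hashlib.md5(webhook_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# The mapping is deterministic, so replays can find the right partition later:
p1 = partition_for("wh-123", 16)
p2 = partition_for("wh-123", 16)
# p1 == p2, and 0 <= p1 < 16
```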

Replay requests are submitted through an API, persisted in MySQL as a queue, and contain the Webhook ID and the date range of events to replay. The replay service polls MySQL, determines the correct Kafka partition using the Webhook ID, and uses the consumer’s offsetsForTimes method to locate the starting offset. Jobs transition through states (OPEN, STARTED, ONGOING, COMPLETED) with heartbeat monitoring to restart stalled jobs.
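The job lifecycle above can be sketched as a small state machine. The state names come from the article; the heartbeat timeout value and all class/field names are assumptions for illustration, and a real monitor would run as a separate process against the MySQL queue.

```python
import time

# Sketch of the replay job lifecycle (state names from the article;
# the timeout and all identifiers are illustrative assumptions).
STATES = ["OPEN", "STARTED", "ONGOING", "COMPLETED"]
HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before a restart

class ReplayJob:
    def __init__(self, webhook_id, start_ts, end_ts):
        self.webhook_id = webhook_id
        self.start_ts = start_ts      # replay window bounds, e.g. epoch millis
        self.end_ts = end_ts
        self.state = "OPEN"
        self.last_heartbeat = time.monotonic()

    def advance(self, new_state):
        # Only allow forward, single-step transitions through the lifecycle.
        if STATES.index(new_state) != STATES.index(self.state) + 1:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.heartbeat()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def is_stalled(self):
        # A monitor would reset stalled jobs so another worker can restart them.
        return (self.state in ("STARTED", "ONGOING")
                and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT)

job = ReplayJob("wh-123", start_ts=1_700_000_000_000, end_ts=1_700_086_400_000)
job.advance("STARTED")
job.advance("ONGOING")
# job.state == "ONGOING" and job.is_stalled() is False (heartbeat just refreshed)
```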

During event retrieval, a hash cache deduplicates events so that identical events are not delivered twice.
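A minimal sketch of that deduplication step, assuming events can be hashed by their serialized content; the unbounded set here is for clarity, where a production cache would be bounded (e.g. an LRU):

```python
import hashlib
import json

# Sketch of the dedup hash cache: an event seen twice during replay is
# delivered only once. The unbounded `seen` set is a simplification; a
# real cache would be bounded (e.g. LRU) to cap memory use.

def dedup(events, seen=None):
    seen = set() if seen is None else seen
    for event in events:
        key = hashlib.sha256(
            json.dumps(event, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            yield event

events = [{"id": 1}, {"id": 2}, {"id": 1}]  # third event duplicates the first
unique = list(dedup(events))
# unique == [{"id": 1}, {"id": 2}]
```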

Overall, the solution demonstrates a novel use of Kafka as a storage system, delivering faster event recovery, easier maintenance, and meeting the performance expectations of Twitter’s developers.

Tags: backend, Kafka, API, Twitter, storage, replay, infrastructure
Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
