
Designing a High‑Performance Asynchronous Event System for Video Likes Using CQRS, Kafka, and Multi‑Level Caching

This article traces the evolution of a video-like service from a simple database-centric design to a robust, CQRS-based, Kafka-driven asynchronous architecture. Along the way it addresses CPU bottlenecks, connection-pool limits, duplicate consumption, consumer scaling, flow control, hotspot isolation, error retry, and MQ failure, culminating in a unified messaging platform.


Introduction: Bilibili’s near‑100 million daily active users generate massive interaction traffic, putting extreme pressure on the backend. To improve scalability the team adopted a micro‑service + CQRS architecture and built the Railgun asynchronous event platform, which now supports hundreds of business services.

Example – Video Like Feature: The basic like functionality requires two operations: counting likes and querying a user's like status. Two MySQL tables (counts and actions) store the aggregate counts and the per-user actions, respectively.

Initial Architecture: A stateless Like service writes directly to the MySQL cluster. This works when traffic is low but quickly exhausts CPU resources on both the service and the database.

CPU & Connection Issues: Scaling the stateless service with Kubernetes solves the service‑side bottleneck, but the database becomes the limiting factor. A cache‑aside pattern is introduced: writes go to MySQL then update Redis; reads prefer Redis and fall back to MySQL when a cache miss occurs.
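The cache-aside read/write paths described above can be sketched as follows. This is a minimal illustration, with plain maps standing in for Redis and MySQL (the names `cache`, `db`, `getLikeCount`, and `addLike` are illustrative, not the service's actual API):

```go
package main

import "fmt"

// Stand-ins for Redis and MySQL; a real service would use their clients.
var cache = map[int64]int64{}   // video_id -> like count (Redis stand-in)
var db = map[int64]int64{42: 7} // video_id -> like count (MySQL stand-in)

// getLikeCount reads through the cache: hit -> return; miss -> load
// from the DB, backfill the cache, then return.
func getLikeCount(videoID int64) int64 {
	if n, ok := cache[videoID]; ok {
		return n // cache hit
	}
	n := db[videoID]   // cache miss: fall back to MySQL
	cache[videoID] = n // backfill Redis so the next read is a hit
	return n
}

// addLike writes MySQL first, then refreshes the cache.
func addLike(videoID int64) {
	db[videoID]++
	cache[videoID] = db[videoID]
}

func main() {
	fmt.Println(getLikeCount(42)) // miss -> 7, then cached
	addLike(42)
	fmt.Println(getLikeCount(42)) // hit -> 8
}
```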

Connection‑Pool Saturation: As the number of Like service instances grows, each opens a MySQL connection, eventually exhausting the connection pool. The solution is to add a database‑access proxy service to multiplex connections.

CQRS & Kafka Integration: To decouple writes from the database, the architecture adopts CQRS and introduces Kafka. The Like service publishes a message to Kafka and returns immediately. A Job service consumes the messages, writes to MySQL and updates the cache, and uses key‑ordering partitions (video_id) to avoid row‑level lock contention.
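The key-ordering trick boils down to hashing the message key onto a partition, so every like for one video is consumed serially by one Job instance. A minimal sketch (the function name `partitionFor` is illustrative; Kafka's default partitioner does the equivalent with murmur2):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a message key (video_id) to a Kafka partition.
// All likes for the same video hash to the same partition, so a single
// consumer processes them in order and row-level lock contention on the
// counts table is avoided.
func partitionFor(videoID string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(videoID))
	return int(h.Sum32()) % numPartitions
}

func main() {
	// Two events for the same video always land on the same partition.
	fmt.Println(partitionFor("video-7", 16) == partitionFor("video-7", 16)) // true
}
```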

Duplicate Consumption Handling: Kafka rebalance can cause duplicate deliveries. The system checks the actions table for existing likes to ensure idempotency, and optionally uses a temporary cache as a lightweight behavior table.
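The idempotency check can be sketched like this, with a map standing in for the actions table (the names `actions` and `applyLike` are illustrative):

```go
package main

import "fmt"

// actions stands in for the MySQL actions table: (uid, video_id) -> liked.
var actions = map[[2]int64]bool{}

// applyLike is idempotent: a redelivered Kafka message finds the row
// already present and is skipped, so the counter is not incremented twice.
func applyLike(uid, videoID int64, counts map[int64]int64) bool {
	key := [2]int64{uid, videoID}
	if actions[key] {
		return false // duplicate delivery, ignore
	}
	actions[key] = true
	counts[videoID]++
	return true
}

func main() {
	counts := map[int64]int64{}
	applyLike(1, 42, counts)
	applyLike(1, 42, counts) // duplicate from a rebalance
	fmt.Println(counts[42])  // still 1
}
```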

Consumption Capacity Problems: Adding more Job nodes did not increase throughput because the number of Kafka partitions was a hard limit. After increasing partitions, throughput improved, but further scaling required multi‑threaded processing within each Job instance.

Multi‑Threaded Job Design: Each Job instance now runs a thread pool with per‑key memory queues, preserving order by routing messages with the same key to the same queue. This raises processing speed without OOM risks.

Message Loss on Restart: ACKs were previously sent per‑thread, so messages still in flight on other threads could be acknowledged implicitly and lost after a restart. The new design tracks in‑flight messages in a linked list and acknowledges an offset only after all preceding messages have been processed, ensuring none are lost.
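The commit-watermark idea can be sketched as a tracker that only advances past contiguous completed offsets (the `AckTracker` name and map-based bookkeeping are illustrative; the article describes a linked list, which achieves the same invariant):

```go
package main

import "fmt"

// AckTracker commits only the highest contiguous offset. If offset 2
// finishes before 0 and 1, nothing is acknowledged yet, so a restart
// replays 0..2 instead of losing them.
type AckTracker struct {
	next int64          // smallest offset not yet processed
	done map[int64]bool // processed offsets waiting for predecessors
}

func NewAckTracker(start int64) *AckTracker {
	return &AckTracker{next: start, done: map[int64]bool{}}
}

// Complete marks an offset processed and returns the offset that is now
// safe to ACK (one past the last contiguous processed offset).
func (t *AckTracker) Complete(offset int64) int64 {
	t.done[offset] = true
	for t.done[t.next] {
		delete(t.done, t.next)
		t.next++
	}
	return t.next
}

func main() {
	t := NewAckTracker(0)
	fmt.Println(t.Complete(2)) // 0: offsets 0 and 1 still in flight
	fmt.Println(t.Complete(0)) // 1
	fmt.Println(t.Complete(1)) // 3: 0..2 are all done now
}
```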

Data Aggregation: To reduce DB writes, likes for the same video are aggregated in memory and flushed in batches, dramatically cutting the number of update statements.
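The aggregation step can be sketched as an in-memory delta buffer that is flushed periodically (the `Aggregator` type is illustrative):

```go
package main

import "fmt"

// Aggregator sums like deltas per video in memory; Flush emits one
// batched update per video instead of one UPDATE per like event.
type Aggregator struct {
	deltas map[int64]int64
}

func NewAggregator() *Aggregator { return &Aggregator{deltas: map[int64]int64{}} }

func (a *Aggregator) Add(videoID, delta int64) { a.deltas[videoID] += delta }

// Flush returns the accumulated deltas and resets the buffer; the caller
// turns each entry into a single "UPDATE counts SET n = n + ?" statement.
func (a *Aggregator) Flush() map[int64]int64 {
	out := a.deltas
	a.deltas = map[int64]int64{}
	return out
}

func main() {
	agg := NewAggregator()
	for i := 0; i < 1000; i++ {
		agg.Add(42, 1) // 1000 like events for one video...
	}
	agg.Add(43, 1)
	fmt.Println(len(agg.Flush())) // ...collapse into 2 DB writes
}
```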

ACK Optimization: Synchronous ACKs are replaced with asynchronous or batched ACKs, reducing network I/O and improving throughput by over tenfold.

Flow Control: A hybrid rate‑limiting scheme combines a global Redis token bucket with local token‑bucket throttling, allowing fine‑grained control while avoiding single‑point latency spikes.

Hotspot Event Isolation: When a single video becomes a hot key, its events are detected and redirected to a dedicated hotspot queue with separate consumer threads, preventing one hot video from throttling the entire system.
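Hot-key detection can be sketched as a windowed counter with a threshold; keys over the threshold are routed to the dedicated hot queue (the `HotspotRouter` type, threshold value, and "hot"/"normal" labels are illustrative):

```go
package main

import "fmt"

// HotspotRouter counts events per key within a window; once a key
// crosses the threshold it is routed to a dedicated hot queue so one
// viral video cannot stall the normal per-key queues.
type HotspotRouter struct {
	threshold int
	counts    map[string]int
}

func NewHotspotRouter(threshold int) *HotspotRouter {
	return &HotspotRouter{threshold: threshold, counts: map[string]int{}}
}

// Route returns "hot" for keys over the threshold, "normal" otherwise.
func (r *HotspotRouter) Route(videoID string) string {
	r.counts[videoID]++
	if r.counts[videoID] > r.threshold {
		return "hot"
	}
	return "normal"
}

// ResetWindow clears counters at the end of each detection window.
func (r *HotspotRouter) ResetWindow() { r.counts = map[string]int{} }

func main() {
	r := NewHotspotRouter(2)
	for i := 0; i < 4; i++ {
		fmt.Println(r.Route("viral")) // normal normal hot hot
	}
	fmt.Println(r.Route("quiet")) // normal
}
```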

Error Retry Strategies: Both in‑place exponential‑backoff retries and delayed re‑queueing of failed messages are employed, achieving a 99.999% success rate.

MQ Failure Mitigation: In case of Kafka outage, producers automatically fall back to pushing messages directly into the Job’s in‑memory queues, preserving ordering via consistent hashing and ensuring continuity without a separate backup MQ.

Platformization: All the above patterns are abstracted into a unified messaging programming model with a control plane that supports multiple underlying MQs (Kafka, Pulsar, internal Databus), dynamic processing modes, global consumption throttling, alerting, and automated diagnostics.

Code Example – Producer (Go):

```go
producer, err := railgun.NewProducer("id")

// Send Bytes
err = producer.SendBytes(ctx, "test", []byte("test"))
// Send String
err = producer.SendString(ctx, "test", "test")
// Send JSON
err = producer.SendJSON(ctx, "test", struct{ Demo string }{Demo: "test"})
// Batch Send
err = producer.SendBatch(ctx, []*message.Message{{Key: "test", Value: []byte("test")}})
```

Code Example – Consumer (Go):

```go
// Create processor (single or batch)
processor := single.New(unpack, do) // or batch.New(unpack, preDo, batchDo)
// Start consumer
consumer, err := railgun.NewConsumer("id", processor)
```

Conclusion: By defining a unified messaging model, providing a rich control plane, and applying the described optimizations, the team solved common high‑performance, high‑reliability challenges of asynchronous event processing, offering a reference architecture for building scalable backend systems.

Tags: microservices, scalability, Kafka, Message Queue, Rate Limiting, CQRS, Async Processing
Written by Architect: Professional architect sharing high-quality architecture insights. Topics include high-availability, high-performance, high-stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large-scale architecture case studies.
