
Design and Implementation of a Scalable Live‑Streaming Full‑Stream Data System

The article details a scalable live‑stream full‑stream data system that replaces a tightly‑coupled legacy architecture with a producer‑consumer model using a custom key‑value store, bucket sharding, gRPC server‑streaming, versioned caching, and comprehensive observability, achieving sub‑second queries, horizontal scalability, and reliable support for thousands of downstream services.

Bilibili Tech

The article introduces the concept of "full‑stream (全量在播)" in live‑streaming: the complete set of online anchors' information (room IDs, UIDs, etc.) that serves as the foundational data for many downstream services.

Existing legacy systems could not support higher traffic volumes and had caused several production incidents due to slow queries, large payloads, and tightly coupled architecture.

Challenges

1. Legacy system is tightly coupled with room‑service resources and needs to be decoupled.

2. Full‑scan SQL jobs cannot finish within 5 seconds, leading to DB throttling and time‑outs.

3. Interfaces are fragmented, non‑standard, and return massive payloads (>10 MB).

4. Storing millions of keys in Redis creates large‑key risks.

5. The system runs in a multi‑zone active‑active setup and must preserve that property.

6. Low real‑time capability and poor horizontal scalability cause hot‑key problems.

7. Lack of observability and data backup hampers incident diagnosis.

Design Principles

Ease of understanding

Adaptability to change

Elasticity

Recoverability

Technical Selection

For storage, a self‑developed key‑value store “taishan kv” is chosen because it supports large keys, persistence, and native multi‑active deployment, directly addressing challenge 5.

Data transmission uses gRPC server‑streaming instead of unary RPC to handle large payloads efficiently.
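The article does not show the actual service definition, but a hypothetical proto sketch illustrates the server‑streaming shape (all names here are illustrative, not Bilibili's real API):

```protobuf
// Hypothetical service definition. The `stream` keyword on the response
// is what distinguishes server-streaming from unary RPC: one request in,
// a sequence of bounded-size chunks back, so no single response message
// has to carry a >10 MB payload.
service FullStreamService {
  rpc FetchFullStream(FetchRequest) returns (stream FullStreamChunk);
}

message FetchRequest {
  int64 known_version = 1;     // last snapshot version the client holds
}

message FullStreamChunk {
  int64 version = 1;           // cycle version of this snapshot
  int32 segment_index = 2;     // which segment of the snapshot this is
  int32 segment_total = 3;     // total segments in this version
  repeated RoomInfo rooms = 4; // rooms carried in this segment
}

message RoomInfo {
  int64 room_id = 1;
  int64 uid = 2;
}
```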

The overall architecture follows a classic producer‑consumer model, consisting of three layers: production, consumption, and SDK.

Production System

The production side processes dynamic data each cycle, merging base data with incremental “on/off‑stream” information to produce near‑real‑time results. The workflow includes:

1. Fetch the bucket configuration used to partition the massive stream data.

2. Initialize an incremental key for the consumer side.

3. Mark the current cycle as active.

4. Execute batched SQL queries to retrieve the latest stream records.

5. Aggregate and transform the data, then split it according to the bucket rules.

6. Persist the partitioned data into taishan kv.

7. Clean up compensation data, update metadata, and write the latest cycle info.

Bucket‑based sharding reduces the size of each batch, mitigating challenge 4. Parallel queries (≈250 ms for 2 M rows) replace the previous 5‑6 s full scans, solving challenge 2. Monitoring of SQL latency is also added.

The consumer side simply checks the flag set in step 3 to decide whether to write new‑cycle increments or fall back to the previous cycle. MQ ensures ordered delivery, and a message‑expiry mechanism prevents stale data from contaminating the stream.

Because multiple consumer threads may write to the same large key, taishan’s casMultiKey API is used with a small auxiliary key to achieve atomic updates.
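casMultiKey itself is taishan‑specific, so the sketch below only models the pattern with an in‑memory store: a small auxiliary version key is compare‑and‑swapped together with the large data key, and writers retry on conflict:

```go
package main

import (
	"fmt"
	"sync"
)

// kvStore stands in for taishan kv: one small auxiliary version key
// paired with one large data key, updated together atomically.
type kvStore struct {
	mu      sync.Mutex
	version int64    // small auxiliary key
	data    []string // large data key (increment list)
}

// read returns the current version and a copy of the data.
func (s *kvStore) read() (int64, []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.version, append([]string(nil), s.data...)
}

// casWrite succeeds only if the caller observed the latest version,
// mimicking an atomic multi-key compare-and-swap.
func (s *kvStore) casWrite(seen int64, data []string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.version != seen {
		return false
	}
	s.version++
	s.data = data
	return true
}

// appendWithRetry is what each consumer thread runs: read, merge the
// new increment, and retry until the CAS succeeds.
func (s *kvStore) appendWithRetry(item string) {
	for {
		v, cur := s.read()
		if s.casWrite(v, append(cur, item)) {
			return
		}
	}
}

func main() {
	var s kvStore
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.appendWithRetry(fmt.Sprintf("room-%d", i))
		}(i)
	}
	wg.Wait()
	_, data := s.read()
	fmt.Println(len(data)) // prints 8: no increment is lost despite contention
}
```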

Persisted data are kept for seven days at minimal storage cost, providing a backup for downstream troubleshooting.

Consumption System

The consumer fetches the latest metadata, retrieves base shards and increments, merges them, and serves the result to the SDK. A local in‑memory cache holds a versioned snapshot, enabling horizontal scaling without I/O bottlenecks.
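A minimal sketch of such a versioned snapshot cache, assuming an atomic pointer swap (the struct and field names are illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Snapshot is one cycle's merged result (base shards plus increments).
type Snapshot struct {
	Version int64 // timestamp-based cycle version
	Rooms   []string
}

// Cache serves reads from an atomically swapped pointer, so request
// handling never blocks on I/O and instances scale horizontally.
type Cache struct {
	current atomic.Pointer[Snapshot]
}

func (c *Cache) Get() *Snapshot { return c.current.Load() }

// Replace installs a snapshot only if it is strictly newer, which also
// guards against out-of-order refreshes from the background fetcher.
func (c *Cache) Replace(s *Snapshot) bool {
	for {
		old := c.current.Load()
		if old != nil && old.Version >= s.Version {
			return false
		}
		if c.current.CompareAndSwap(old, s) {
			return true
		}
	}
}

func main() {
	var c Cache
	c.Replace(&Snapshot{Version: 100, Rooms: []string{"1001"}})
	c.Replace(&Snapshot{Version: 99}) // stale refresh is rejected
	fmt.Println(c.Get().Version)      // prints 100
}
```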

Data is streamed to upstream services via gRPC server‑streaming, delivering complete versioned chunks in ten segments, which balances latency and reliability.
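The segmentation step is pure slicing and can be sketched as follows; the actual chunk encoding is not described in the article, and ten is simply the segment count it cites:

```go
package main

import "fmt"

// splitIntoSegments divides one version's payload into at most
// `segments` chunks for gRPC server-streaming, using ceiling division
// so every room appears in exactly one chunk.
func splitIntoSegments(rooms []string, segments int) [][]string {
	if len(rooms) == 0 || segments <= 0 {
		return nil
	}
	size := (len(rooms) + segments - 1) / segments // ceiling division
	var out [][]string
	for start := 0; start < len(rooms); start += size {
		end := start + size
		if end > len(rooms) {
			end = len(rooms)
		}
		out = append(out, rooms[start:end])
	}
	return out
}

func main() {
	rooms := make([]string, 95)
	for i := range rooms {
		rooms[i] = fmt.Sprintf("room-%d", i)
	}
	chunks := splitIntoSegments(rooms, 10)
	fmt.Println(len(chunks)) // prints 10 (nine chunks of 10 rooms, one of 5)
}
```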

Custom metrics for gRPC streaming are added to fill the monitoring gap.

SDK Design

The SDK unifies the previously scattered interfaces, hides the complexity of gRPC streaming, and offers a rich API set. It also employs the same local cache and versioning strategy to ensure real‑time data delivery while reducing QPS pressure on the backend.

Data Versioning

Both production and consumption layers generate a timestamp‑based version for each cycle. The version is monotonically increasing, allowing easy detection of stale or missing updates. Each service instance reports its current and previous version metrics for health monitoring.

The SDK also reports its version metrics, enabling detection of upstream anomalies (stagnant upstream version) or internal issues (stagnant SDK version).

Data Correctness Verification

The new system’s output is compared against the legacy system using a diff‑based formula that counts stream opens and closes on both sides. Tests with 1‑5 s update intervals show that the new system achieves slightly better latency at 4‑5 s intervals and a clear advantage at 3 s.

Observability Construction

A full‑link monitoring suite covers production, consumption, and SDK layers, providing metrics such as live‑stream count, open/close trends, query latency, MQ delay, CAS collisions, and version health.

Additional dashboards visualize vertical health (service‑level data status) and horizontal scaling capacity.

Consumer‑layer monitoring adds gRPC‑stream metrics and SDK‑API QPS; SDK‑layer adds task latency and configuration watches.

Summary

The new live‑stream full‑stream system benefits both new and legacy services, reducing memory usage for historical jobs and supporting over 30 downstream services that have run stably for more than half a year across major events.

It meets the “ten‑million online, one‑hundred‑thousand live” target and provides a flexible, extensible foundation for future growth.


Written by Bilibili Tech, which provides introductions and tutorials on Bilibili‑related technologies.