
Real-time Multi-dimensional Analytics and SlimBase State Backend at Kuaishou: Flink Applications and Optimizations

This article presents Kuaishou's extensive use of Apache Flink for real-time multi-dimensional analytics, detailing the platform's architecture, cluster scale, data processing pipelines, the design of a shared state storage engine called SlimBase, and performance improvements achieved through replacing RocksDB with a customized HBase‑based solution.

DataFunTalk

Kuaishou, a short‑video and live‑streaming platform, leverages Apache Flink across a variety of business scenarios such as quality monitoring, user growth analysis, real‑time data processing, and live CDN scheduling. Data flows from DB/Binlog and WebService logs into Kafka, then into Flink for real‑time computation, with results persisted to Druid, Kudu, HBase, or ClickHouse, and offline processing performed on Hadoop via Hive, MapReduce, or Spark.

The typical Flink use cases at Kuaishou are categorized into three groups: 80% statistical monitoring (real‑time metrics and alerts), 15% data processing (cleaning, splitting, joins), and 5% real‑time business logic (e.g., live scheduling). Specific applications include short‑video and live‑stream quality monitoring, user acquisition analysis, real‑time ad impression‑click joins, and CDN traffic allocation.

Kuaishou's Flink cluster comprises roughly 1,500 nodes, handling about 3 trillion events per day with peak throughput of 300 million events per second. The jobs run on YARN, with separate real‑time and batch clusters isolated via YARN labels; the real‑time cluster is dedicated to Flink workloads requiring high stability.

The real‑time multi‑dimensional analysis platform is built on a custom BI tool (KwaiBI) that lets users define cube models (dimensions and metrics). Flink jobs compute these metrics using Cube or GroupingSet approaches, employing a two‑layer reduction model (full‑dimension layer and residual‑dimension layer) to manage DAG complexity. UV calculations use bitmap‑based deduplication after hashing dimension values, with string dimensions converted to long IDs via a dictionary service.
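The two‑layer reduction and bitmap‑based UV counting described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Kuaishou's Flink job: `DictionaryService`, the `device_id` field, and the use of a Python integer as a bitmap are all assumptions made for the example. The first layer builds one bitmap per full‑dimension key; the second layer rolls those bitmaps up to every smaller dimension combination by OR‑ing them, so each raw event is touched only once.

```python
from itertools import combinations


class DictionaryService:
    """Hypothetical stand-in for the dictionary service: string ID -> dense integer ID."""

    def __init__(self):
        self._ids = {}

    def encode(self, key: str) -> int:
        # New keys receive the next unused integer; known keys keep their ID.
        return self._ids.setdefault(key, len(self._ids))


def full_dimension_layer(events, dims, dictionary):
    """Layer 1: build one bitmap of encoded user IDs per full-dimension key."""
    bitmaps = {}
    for event in events:
        key = tuple(event[d] for d in dims)
        uid = dictionary.encode(event["device_id"])  # field name is an assumption
        bitmaps[key] = bitmaps.get(key, 0) | (1 << uid)
    return bitmaps


def residual_dimension_layer(full_bitmaps, dims):
    """Layer 2: roll the full-dimension bitmaps up to every subset of dimensions."""
    rolled = {}
    for size in range(len(dims) + 1):
        for kept in combinations(range(len(dims)), size):
            names = tuple(dims[i] for i in kept)
            for key, bitmap in full_bitmaps.items():
                group = (names, tuple(key[i] for i in kept))
                # OR-ing bitmaps deduplicates users seen under several full keys.
                rolled[group] = rolled.get(group, 0) | bitmap
    return rolled


def uv(bitmap: int) -> int:
    """Unique visitors = number of set bits in the bitmap."""
    return bin(bitmap).count("1")
```

Because the roll‑up merges bitmaps rather than summing counts, a user who appears under several full‑dimension keys is still counted once at the coarser level. A production job would more likely use a compressed bitmap structure than an unbounded integer.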

Metrics such as new users and retention are computed by integrating asynchronous calls to historical user services and maintaining double‑buffered state for retention rate calculations.
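One way to read the double‑buffered retention state is sketched below. The article does not give the actual design, so everything here is an assumption: `HistoricalUserService` is a hypothetical in‑memory stand‑in for the external lookup, retention is taken to mean next‑day retention of new users, and the day rollover is triggered by hand rather than by a timer.

```python
import asyncio


class HistoricalUserService:
    """Hypothetical stand-in for the external historical-user lookup."""

    def __init__(self, known_users):
        self._known = set(known_users)

    async def is_known(self, user_id: str) -> bool:
        await asyncio.sleep(0)  # placeholder for a network round trip
        return user_id in self._known

    def register(self, user_id: str) -> None:
        self._known.add(user_id)


class RetentionState:
    """Double buffer: the previous day's new users are read-only, today's are being filled."""

    def __init__(self):
        self.previous_new = set()
        self.current_new = set()
        self.retained = set()

    def on_new_user(self, user_id):
        self.current_new.add(user_id)

    def on_active(self, user_id):
        if user_id in self.previous_new:
            self.retained.add(user_id)

    def retention_rate(self):
        if not self.previous_new:
            return None  # no cohort to measure yet
        return len(self.retained) / len(self.previous_new)

    def roll_day(self):
        # Swap buffers: today's new users become the cohort measured tomorrow.
        self.previous_new, self.current_new = self.current_new, set()
        self.retained = set()


async def process_day(user_ids, service, state):
    """Look up all distinct users concurrently, then update the state."""
    unique = list(dict.fromkeys(user_ids))
    known = await asyncio.gather(*(service.is_known(u) for u in unique))
    for user_id, is_known in zip(unique, known):
        state.on_active(user_id)
        if not is_known:
            state.on_new_user(user_id)
            service.register(user_id)
```

Issuing the lookups concurrently keeps one slow response from blocking the rest, which is the usual motivation for asynchronous calls inside a streaming operator.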

Computed results are stored in Kudu, which offers low‑latency random reads/writes and efficient column scans. Data is encoded and partitioned by time and dimension combinations to accelerate queries.

To address the high I/O overhead of RocksDB during checkpoints (disk I/O utilization reaching 100% and long checkpoint times), Kuaishou evaluated alternatives and adopted a Flink + Kudu solution. The new design persists results directly to Kudu, which eliminates costly data copying and reduces query latency.

Further optimization led to the development of SlimBase, an embedded shared‑state storage engine that replaces RocksDB. SlimBase is built by slimming down HBase: the client, ZooKeeper, and master components are removed, and only the RegionServer is retained, with its essential modules (cache, memstore, compaction, filesystem). Additional enhancements include a custom BitmapState for efficient bitmap storage, support for ListState, MapState, ValueState, and ReducingState, and a snapshot/restore mechanism that dramatically cuts checkpoint and restore latency.
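The toy model below illustrates why a memstore‑plus‑immutable‑files engine can make snapshots cheap. It is not SlimBase code, and the article does not describe SlimBase's snapshot internals. The class `ToyLsmStore` and its methods are hypothetical. The assumption is that flushed files are immutable and live on storage that survives the task, so a snapshot only needs to flush the memstore and record which files exist.

```python
class ToyLsmStore:
    """Toy memstore + immutable files model (hypothetical, not SlimBase's API)."""

    def __init__(self, memstore_limit=2):
        self.memstore = {}   # mutable in-memory writes
        self.files = []      # immutable sorted runs, newest last
        self.memstore_limit = memstore_limit

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        """Freeze the memstore into an immutable sorted run."""
        if self.memstore:
            self.files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for run in reversed(self.files):  # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def snapshot(self):
        """Flush, then return a manifest that only references existing runs."""
        self.flush()
        return tuple(self.files)

    @classmethod
    def restore(cls, manifest, memstore_limit=2):
        """Rebuild a store by re-attaching the runs listed in the manifest."""
        store = cls(memstore_limit)
        store.files = list(manifest)
        return store
```

In this model both snapshot and restore cost is proportional to the number of files rather than the amount of data, which is one plausible explanation for checkpoint times dropping from minutes to seconds.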

Performance tests show that SlimBase reduces checkpoint/restore time from minutes to seconds, lowers disk I/O by 66%, cuts disk write throughput by 50%, and decreases CPU usage by 33% compared to RocksDB. Ongoing work focuses on implementing a Size‑TieredCompaction strategy and an event‑time‑driven FIFOCompaction to achieve near‑zero I/O during compaction.
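The two compaction ideas mentioned above can be sketched in a few lines. This is a generic illustration of FIFO and size‑tiered selection, not SlimBase's implementation: `StoreFile`, the `ratio` and `min_files` parameters, and the use of a watermark as the clock are assumptions. The FIFO variant drops whole files once even their newest record has expired, so no data is read or rewritten.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreFile:
    """Hypothetical file descriptor; timestamps are event times in milliseconds."""
    name: str
    size_bytes: int
    max_event_time: int  # newest event time contained in the file


def fifo_compact(files, watermark, ttl):
    """Drop every file whose newest record is older than watermark - ttl."""
    cutoff = watermark - ttl
    kept = [f for f in files if f.max_event_time >= cutoff]
    dropped = [f for f in files if f.max_event_time < cutoff]
    return kept, dropped


def size_tiered_candidates(files, ratio=2.0, min_files=2):
    """Bucket files of similar size; return buckets large enough to merge."""
    buckets = []
    for f in sorted(files, key=lambda f: f.size_bytes):
        # A file joins the current bucket if it is within `ratio` of its smallest member.
        if buckets and f.size_bytes <= buckets[-1][0].size_bytes * ratio:
            buckets[-1].append(f)
        else:
            buckets.append([f])
    return [b for b in buckets if len(b) >= min_files]
```

Driving the cutoff from event time rather than wall‑clock time means expiry follows the data being processed, which matters when a job replays or falls behind.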

Future plans aim to further optimize SlimBase and eventually replace RocksDB with SlimBase for all real‑time Flink workloads at Kuaishou.
