How Kuaishou Scales Data Sync: Architecture, Challenges, and Future Plans
This article details the design, evolution, and optimization of Kuaishou's data synchronization platform, covering business overview, architecture, key technologies, performance tuning, data source protection, incremental data lake integration, and future roadmap for a unified data fabric.
1. Business Introduction
Kuaishou's data synchronization middle platform synchronizes production data into the big data platform (the ODS layer) and distributes high‑value data assets back to production systems for online services. The remainder of this article covers four topics:
Business overview
Architecture design
Key technologies
Future planning
2. Architecture Design
Full‑Link Overview
Data sources are divided into three categories: user behavior logs, service logs, and database changes. After entering the message queue, data follows two pipelines: a real‑time chain for second‑/minute‑level processing and an offline chain for longer‑term processing.
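The dual-pipeline split above can be sketched as a simple fan-out: every record from the message queue lands in the offline chain, while latency-sensitive sources are additionally routed to the real-time chain. The class names and routing rule below are illustrative, not Kuaishou's actual code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    source: str   # "behavior_log", "service_log", or "db_change"
    payload: dict

class Pipeline:
    def __init__(self, name: str):
        self.name = name
        self.buffer: List[Record] = []

    def accept(self, record: Record) -> None:
        self.buffer.append(record)

def fan_out(record: Record, realtime: Pipeline, offline: Pipeline) -> None:
    # Every record is retained by the offline chain; sources that need
    # second-/minute-level freshness also go to the real-time chain.
    offline.accept(record)
    if record.source in ("behavior_log", "db_change"):
        realtime.accept(record)

realtime, offline = Pipeline("realtime"), Pipeline("offline")
for rec in [Record("behavior_log", {}), Record("service_log", {})]:
    fan_out(rec, realtime, offline)
```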
Layered Structure
The bottom layer abstracts ~20 data source types (schema‑ful and schema‑less). A data‑source management system provides a unified catalog, turning each source into a virtual table. The middle layer offers a global data‑catalog service that maps virtual tables to physical tables, enabling dynamic data access and transformation.
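A minimal sketch of the catalog idea: virtual tables carry a schema and a binding to a physical table, and callers resolve the virtual name at access time. All class and field names here are illustrative, not Kuaishou's actual API.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class PhysicalTable:
    source_type: str     # e.g. "kafka", "mysql", "hive"
    location: str        # topic name, JDBC table, HDFS path, ...

@dataclass
class VirtualTable:
    name: str
    schema: Dict[str, str]          # column -> type
    physical: Optional[PhysicalTable] = None

class Catalog:
    def __init__(self):
        self._tables: Dict[str, VirtualTable] = {}

    def register(self, vt: VirtualTable) -> None:
        self._tables[vt.name] = vt

    def resolve(self, name: str) -> PhysicalTable:
        # Dynamic access: callers address the virtual name and the
        # catalog returns the concrete physical binding.
        vt = self._tables[name]
        if vt.physical is None:
            raise LookupError(f"{name} has no physical binding")
        return vt.physical

catalog = Catalog()
catalog.register(VirtualTable(
    "ods.user_events",
    {"uid": "bigint", "event": "string"},
    PhysicalTable("kafka", "user_events_topic"),
))
binding = catalog.resolve("ods.user_events")
```

Because the binding is resolved per access, the catalog can repoint a virtual table to a new physical location without touching downstream jobs.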
System Architecture
The synchronization service is split into four layers: API layer (job creation and management), Master (scheduling, schema evolution, job compilation), Worker (execution, stateless, high‑throughput), and a governance system for resource, priority, and health management.
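The Master/Worker split above can be sketched as follows: the Master compiles a user-facing job spec into an executable plan and schedules it, while the Worker is stateless and simply runs whatever plan it receives. Names and the plan format are illustrative.

```python
from collections import deque

class Master:
    def __init__(self):
        self.queue = deque()

    def submit(self, job_spec: dict) -> None:
        # "Compilation": turn the user-facing spec into an executable plan.
        plan = {"table": job_spec["table"],
                "parallelism": job_spec.get("parallelism", 1)}
        self.queue.append(plan)

    def dispatch(self, worker: "Worker") -> str:
        # Scheduling: hand the next compiled plan to any available worker.
        return worker.run(self.queue.popleft())

class Worker:
    # Stateless: holds no job state between runs, so workers can be
    # scaled horizontally and replaced freely for high throughput.
    def run(self, plan: dict) -> str:
        return f"synced {plan['table']} with parallelism {plan['parallelism']}"

master = Master()
master.submit({"table": "ods.orders", "parallelism": 4})
result = master.dispatch(Worker())
```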
3. Key Technologies
Challenges
Massive data scale, dozens of heterogeneous sources, and strict timeliness and accuracy requirements mean that even a small issue is amplified dramatically.
All‑as‑Table
Both schema‑ful and schema‑less sources are abstracted as virtual tables. For Kafka, each topic’s key, attribute, and value (often Protobuf) are described, registered, and mapped to a virtual table, enabling unified processing.
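As a sketch of how a schema-less Kafka topic becomes a virtual table, the descriptor below captures the key, attributes (headers), and Protobuf value, then flattens them into one unified column list. The descriptor format and field names are illustrative.

```python
topic_descriptor = {
    "topic": "user_events_topic",
    "key": {"type": "string"},                     # message key
    "attributes": {"trace_id": "string"},          # Kafka headers
    "value": {
        "format": "protobuf",                      # value payload, often Protobuf
        "message": "UserEvent",
        "fields": {"uid": "int64", "event": "string", "ts": "int64"},
    },
}

def to_virtual_table(desc: dict) -> dict:
    """Flatten key, attributes, and value fields into one column list."""
    columns = {"_key": desc["key"]["type"]}
    columns.update({f"_attr_{k}": t for k, t in desc["attributes"].items()})
    columns.update(desc["value"]["fields"])
    return {"name": desc["topic"], "columns": columns}

vt = to_virtual_table(topic_descriptor)
```

Once registered, the topic can be queried and transformed like any schema-ful table.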
Timeliness Optimization
Multi‑threaded asynchronous processing boosts per‑node performance. When traffic spikes or historical back‑fills occur, the system scales out partitions and threads, and it addresses tail latency through priority‑based throttling, automatic load balancing, and reserved baseline capacity for core jobs.
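The per-node gain from multi-threaded asynchronous processing can be sketched with a thread pool: batches are processed concurrently rather than serially. The pool size is fixed here for illustration; the real system scales threads with partition count and load.

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    # Placeholder for the real per-batch work: deserialize -> transform -> write.
    return sum(batch)

batches = [[1, 2, 3], [4, 5], [6]]

# Hand batches to a pool instead of looping serially; map() still
# returns results in batch order even though execution is concurrent.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_batch, batches))
```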
Data Source Assurance
A dynamic throttling mechanism, driven by the data‑source management system, adapts to real‑time load changes to protect online services. Consistency monitoring selects high‑traffic tables, extracts hourly diffs, and alerts on any loss, eliminating silent data‑loss incidents.
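The dynamic throttling idea can be sketched as a rate function driven by the source's reported load: consumption backs off as the online data source approaches saturation, so a back-fill cannot overwhelm it. The thresholds and the load signal below are illustrative assumptions.

```python
def allowed_rate(base_rate: float, source_load: float,
                 soft_limit: float = 0.6, hard_limit: float = 0.9) -> float:
    """Return the records/sec the sync job may consume.

    base_rate:    normal consumption rate
    source_load:  utilization of the online data source, in [0, 1]
    """
    if source_load >= hard_limit:
        return 0.0                      # stop pulling entirely; protect the source
    if source_load >= soft_limit:
        # Linearly back off between the soft and hard limits.
        headroom = (hard_limit - source_load) / (hard_limit - soft_limit)
        return base_rate * headroom
    return base_rate                    # normal operation

rate_normal = allowed_rate(10_000, 0.3)     # well below soft limit
rate_busy = allowed_rate(10_000, 0.75)      # halfway into the back-off band
rate_overload = allowed_rate(10_000, 0.95)  # above hard limit
```

In practice the load signal would come from the data-source management system's real-time metrics rather than a caller-supplied number.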
Incremental Data Lake
Replacing traditional batch‑only warehouses, an incremental data lake (e.g., Apache Hudi) supports updates, snapshots, and transactions, reducing latency from hours to minutes and cutting storage costs by over 80%.
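The latency and storage win comes from update semantics: instead of rewriting full partitions each batch, changes are upserted by record key and readers see a consistent snapshot. The toy class below illustrates that semantics in memory; it is not the Hudi API.

```python
class IncrementalTable:
    def __init__(self):
        self._rows = {}        # record key -> latest row version

    def upsert(self, rows):
        # Update-in-place by key: only changed records are written,
        # not the whole partition.
        for row in rows:
            self._rows[row["id"]] = row

    def snapshot(self):
        # Readers see one consistent view of the latest row versions.
        return sorted(self._rows.values(), key=lambda r: r["id"])

table = IncrementalTable()
table.upsert([{"id": 1, "amount": 10}, {"id": 2, "amount": 20}])
table.upsert([{"id": 1, "amount": 15}])   # minute-level incremental update
snap = table.snapshot()
```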
4. Future Planning
Data Catalog Service
The catalog will become the foundation for a unified data fabric, aiming to cover all sources, bridge lake‑warehouse metadata, and harmonize streaming and batch tables for one‑click development.
About Kuaishou Big Data
This article was shared by the Kuaishou Big Data account, which publishes technical content on big‑data architectures (Hadoop, Spark, Flink, ClickHouse, etc.), the data middle platform (development, management, services, and analytics tools), and data warehousing.