
How Kuaishou Scales Data Sync: Architecture, Challenges, and Future Plans

This article details the design, evolution, and optimization of Kuaishou's data synchronization platform, covering business overview, architecture, key technologies, performance tuning, data source protection, incremental data lake integration, and future roadmap for a unified data fabric.

1. Business Introduction

Kuaishou's data synchronization middle platform moves production data into the big data platform's ODS layer and distributes high-value data assets back to production systems for online serving. The discussion covers four areas:

Business overview

Architecture design

Key technologies

Future planning

2. Architecture Design

Full‑Link Overview

Data sources are divided into three categories: user behavior logs, service logs, and database changes. After entering the message queue, data follows two pipelines: a real‑time chain for second‑/minute‑level processing and an offline chain for longer‑term processing.
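The dual-pipeline fan-out described above can be sketched as follows. This is an illustrative model, not Kuaishou's implementation: the `Event`, `Chain`, and `dispatch` names are assumptions, as is the rule for which sources take the real-time path.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    source: str   # "behavior_log", "service_log", or "db_change"
    payload: dict

@dataclass
class Chain:
    name: str
    buffer: list = field(default_factory=list)

    def accept(self, event: Event) -> None:
        self.buffer.append(event)

def dispatch(event: Event, realtime: Chain, offline: Chain) -> None:
    # Every event lands in the offline chain for longer-term processing;
    # latency-sensitive sources (an assumed policy here) additionally go
    # through the real-time chain for second-/minute-level handling.
    offline.accept(event)
    if event.source in ("behavior_log", "db_change"):
        realtime.accept(event)
```

In a real deployment the two chains would be separate consumer groups on the message queue, so the pipelines scale and fail independently.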

Layered Structure

The bottom layer abstracts ~20 data source types (schema‑ful and schema‑less). A data‑source management system provides a unified catalog, turning each source into a virtual table. The middle layer offers a global data‑catalog service that maps virtual tables to physical tables, enabling dynamic data access and transformation.
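A minimal sketch of the catalog idea, assuming a simple in-memory registry: each virtual table records its source type, physical binding, and schema, and consumers resolve the virtual name at read time. The class and field names are illustrative, not the platform's actual API.

```python
class DataCatalog:
    """Maps virtual table names to physical data-source bindings."""

    def __init__(self):
        self._registry = {}

    def register(self, virtual_table, source_type, physical_location, schema):
        # A virtual table hides whether the source is schema-ful (e.g. MySQL)
        # or schema-less (e.g. a Kafka topic with Protobuf values).
        self._registry[virtual_table] = {
            "source_type": source_type,
            "physical": physical_location,
            "schema": schema,
        }

    def resolve(self, virtual_table):
        # Dynamic access: callers only know the virtual name; the catalog
        # supplies the physical binding when the job actually runs.
        return self._registry[virtual_table]
```

Keeping the physical binding behind the catalog means a source migration only requires re-registering the mapping, not rewriting every downstream job.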

System Architecture

The synchronization service is split into four layers: API layer (job creation and management), Master (scheduling, schema evolution, job compilation), Worker (execution, stateless, high‑throughput), and a governance system for resource, priority, and health management.
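The Master/Worker split can be illustrated with a small sketch. All class and field names here are assumptions about the shape of such a system, not Kuaishou's code: the Master compiles a job spec into an executable plan, and stateless Workers can run any plan.

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    source_table: str
    target_table: str
    priority: int

class Master:
    def compile(self, spec: JobSpec) -> dict:
        # In the real system, scheduling and schema evolution would also
        # happen here before a plan is handed to a Worker.
        return {
            "plan": f"{spec.source_table}->{spec.target_table}",
            "priority": spec.priority,
        }

class Worker:
    # Stateless by design: any Worker can execute any compiled plan,
    # which is what makes horizontal scaling for throughput cheap.
    def execute(self, plan: dict) -> str:
        return f"executed {plan['plan']}"
```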

3. Key Technologies

Challenges

The platform handles massive data volumes across dozens of heterogeneous sources under strict timeliness and accuracy requirements, so even a small issue is amplified dramatically.

All‑as‑Table

Both schema‑ful and schema‑less sources are abstracted as virtual tables. For Kafka, each topic’s key, attribute, and value (often Protobuf) are described, registered, and mapped to a virtual table, enabling unified processing.
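One way to picture the all-as-table abstraction for a schema-less Kafka topic: a descriptor bundles decoders for the record's key, attributes, and (typically Protobuf) value, and each record is projected into a flat row of the registered virtual table. This is a hedged sketch; the decoder signatures and `TopicDescriptor` name are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TopicDescriptor:
    topic: str
    key_decoder: Callable[[bytes], dict]
    attr_decoder: Callable[[dict], dict]
    value_decoder: Callable[[bytes], dict]  # e.g. a generated Protobuf parser

def to_row(desc: TopicDescriptor, key: bytes, attrs: dict, value: bytes) -> dict:
    # Merge the three decoded parts into one flat row, so downstream
    # operators can treat the topic exactly like any other table.
    row = {}
    row.update(desc.key_decoder(key))
    row.update(desc.attr_decoder(attrs))
    row.update(desc.value_decoder(value))
    return row
```

Once every source produces rows of a declared schema, the same transformation and sync operators apply uniformly to databases and message queues alike.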

Timeliness Optimization

Multi‑threaded asynchronous processing boosts per‑node performance. When traffic spikes or historical back‑fills occur, the system scales partitions and threads, while also addressing tail latency through priority‑based throttling, automatic load balancing, and baseline‑core provisioning.
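The per-node parallelism can be sketched with a thread pool: partitions are drained concurrently, and the thread count is a knob that can be raised during traffic spikes or back-fills. This is a simplified illustration under assumed names, not the platform's worker code.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # Placeholder for the real fetch -> transform -> write of one partition;
    # here we just reduce the partition to a number for demonstration.
    return sum(partition)

def run_worker(partitions, threads=4):
    # Scaling out is a matter of raising `threads` (and, upstream,
    # the partition count) when throughput targets are missed.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(process_partition, partitions))
```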

Data Source Assurance

A dynamic throttling mechanism, driven by the data‑source management system, adapts to real‑time load changes to protect online services. Consistency monitoring selects high‑traffic tables, extracts hourly diffs, and alerts on any loss, eliminating silent data‑loss incidents.
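A minimal sketch of dynamic throttling, assuming a token-bucket limiter whose rate is lowered when the data source reports high load. The load-feedback rule (halve the rate above 80% load) and all names are illustrative assumptions about how the management system might drive such a limiter.

```python
import time

class DynamicThrottle:
    def __init__(self, rate_per_sec):
        self.rate = rate_per_sec
        self.tokens = float(rate_per_sec)
        self.last = time.monotonic()

    def update_rate(self, source_load_pct):
        # Feedback from the data-source management system: back off
        # sharply as the source approaches saturation (assumed policy).
        self.rate = max(1, int(self.rate * (1.0 if source_load_pct < 80 else 0.5)))

    def try_acquire(self):
        # Standard token bucket: refill proportionally to elapsed time,
        # capped at the current rate, then spend one token per request.
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```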

Incremental Data Lake

Replacing traditional batch‑only warehouses, an incremental data lake (e.g., Apache Hudi) supports updates, snapshots, and transactions, reducing latency from hours to minutes and cutting storage costs by over 80%.
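To make the contrast with batch-only warehouses concrete, here is the kind of write configuration an Apache Hudi table (which the article names as an example) uses to support upserts. The option keys are real Hudi write options; the table name and field names are hypothetical placeholders.

```python
# Hypothetical Hudi table configuration for an upsert-capable ODS table.
hudi_write_options = {
    "hoodie.table.name": "ods_user_profile",                # placeholder table name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # lower write amplification
    "hoodie.datasource.write.operation": "upsert",          # updates, not append-only
    "hoodie.datasource.write.recordkey.field": "user_id",   # placeholder record key
    "hoodie.datasource.write.precombine.field": "event_ts", # latest record wins on conflict
}
# With Spark, this would typically be applied as:
#   df.write.format("hudi").options(**hudi_write_options).mode("append").save(path)
```

Because the table accepts upserts transactionally, database change streams can be merged as they arrive instead of waiting for the next full batch rebuild, which is where the hours-to-minutes latency reduction comes from.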

4. Future Planning

Data Catalog Service

The catalog will become the foundation for a unified data fabric, aiming to cover all sources, bridge lake‑warehouse metadata, and harmonize streaming and batch tables for one‑click development.

Tags: architecture, big data, real-time processing, data synchronization, incremental data lake
Written by

Kuaishou Big Data

Technology sharing on Kuaishou Big Data, covering big‑data architectures (Hadoop, Spark, Flink, ClickHouse, etc.), data middle‑platform (development, management, services, analytics tools) and data warehouses. Also includes the latest tech updates, big‑data job listings, and information on meetups, talks, and conferences.
