
ZhongAn's Hundred‑Billion‑Scale Data Integration Service: Architecture, Business Support, and Evolution

This article presents the architecture and practical experience of ZhongAn's hundred‑billion‑scale data integration service, covering common integration technologies, business support scenarios for offline and real‑time data, technical challenges, evolution from single‑machine to service‑oriented designs, and future directions using Flink and DataX.

DataFunTalk

Introduction

The article introduces the architecture practice of ZhongAn's hundred‑billion‑scale data integration service, focusing on three aspects: data integration technologies, business support cases, and the technical evolution roadmap.

1. Data Integration and Common Technologies

Data integration is the first step of a data‑mid‑platform, enabling data exchange between business systems and breaking data silos. Typical ETL tasks move data from source systems to a data warehouse or lake, then to downstream analytic or business stores.

Common integration scenarios include:

Data migration (e.g., moving on‑premise CDH/Hadoop to public clouds)

Data warehouse / data lake ETL (both real‑time and batch)

Platform fusion (pushing data to multiple targets such as relational, graph, or index stores)

Disaster recovery and backup (e.g., copying critical ES indices to HDFS)

Key considerations when building or selecting a data integration service are synchronization mechanisms (CDC via query vs. log), support for full‑ and incremental modes, concurrency and distributed capabilities, minimal component chain for easier monitoring, transformation abilities (masking, type conversion, enrichment), file‑based vs. streaming approaches, low‑code configuration, and community activity of underlying open‑source tools.
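To make the transformation abilities concrete, here is a minimal Python sketch of a single transform step that does masking and type conversion on one record. The field names and the masking rule are hypothetical, not part of ZhongAn's actual pipeline:

```python
import re

def mask_phone(value: str) -> str:
    """Mask the middle digits of an 11-digit phone number,
    e.g. 13812345678 -> 138****5678."""
    return re.sub(r"^(\d{3})\d{4}(\d{4})$", r"\1****\2", value)

def transform(record: dict) -> dict:
    """One ETL transform step: mask sensitive fields, convert types."""
    out = dict(record)
    if "phone" in out:
        out["phone"] = mask_phone(out["phone"])
    if "amount" in out:
        out["amount"] = float(out["amount"])  # string -> float conversion
    return out
```

In a real integration service, a chain of such steps would typically run between the reader and the writer of each job.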

2. Business Support Cases

Offline Data Integration

Offline integration handles traditional warehouse ETL with scheduling frequencies from minutes to days. ZhongAn uses DataX 3.0, publishing jobs via an internal DataIDE platform. In 2017 the system processed ~2,000 daily jobs; by 2022 it handled over 15,000 tasks across 40+ heterogeneous sources (ODPS, RDS, ClickHouse, HDFS, etc.), synchronizing up to 300 billion rows per day.

Real‑time Data Integration

Real‑time integration streams data back to MaxCompute for hourly or minute‑level warehouse refreshes, reducing latency for financial risk‑control scenarios. The stack consists of a custom Binlog collection service (Blcs) and FlinkSQL running on Flink‑on‑YARN. Job count grew from 200+ in 2020 to over 1,000 in 2022, executed on a 60‑node CDH cluster that is being migrated to Kubernetes.

Challenges

Complex job configuration (high learning curve for DataX JSON, FlinkSQL, etc.)

Configuration replacement across environments (test ↔ prod) leading to massive JSON edits

Resource contention on nodes running many DataX processes (CPU load >250, memory pressure)

Stability of long‑running jobs and need for fine‑grained resource isolation

3. Technical Evolution Roadmap

Architecture Evolution

Stage 1 – Single‑machine mode (2014‑2017): DataX runs as a multi‑process local tool.

Stage 2 – Service‑oriented mode: job configuration, data‑source management, and integration services are centralized via DataIDE and a dedicated integration service.

Stage 3 – Real‑time focus: building a unified stream‑batch component and real‑time integration services, co‑developing underlying stream‑processing platforms.

Offline Integration Technology Selection

DataX is chosen because most workloads are under a million rows (handled in minutes) and a significant portion (15 %) are in the 10 million‑to‑billion range, which can be completed within 1‑5 hours after tuning. For the few hundred‑billion‑scale RTA tasks, a combination of file‑based ingestion and high‑performance servers is used.

Single‑Machine Mode Design

Lightweight deployment with minimal dependencies, suitable for moderate data volumes.

Scheduling via custom or open‑source schedulers (DolphinScheduler, XXL‑Job, Azkaban, Airflow).

When a node becomes a bottleneck, the architecture shifts to distributed scheduling, launching multiple DataX JVMs per executor.
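The distributed step above can be sketched as an executor that launches multiple DataX processes (each a JVM) concurrently. DataX is invoked as `python {DATAX_HOME}/bin/datax.py job.json`; the install path and parallelism below are illustrative assumptions:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

DATAX_HOME = "/opt/datax"  # hypothetical install path

def build_cmd(job_json: str) -> list:
    """Build the command line for one DataX job (one job = one JVM)."""
    return ["python", f"{DATAX_HOME}/bin/datax.py", job_json]

def run_jobs(job_files: list, max_parallel: int = 4) -> None:
    """Run up to max_parallel DataX processes concurrently on this node.
    A distributed scheduler would shard job_files across executors first."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        list(pool.map(lambda f: subprocess.run(build_cmd(f), check=True),
                      job_files))
```

Capping `max_parallel` per executor is what keeps the CPU-load and memory-pressure problems described earlier under control.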

Service‑Oriented Mode

Job development is done in a data workbench that auto‑generates source/target mappings and validates configurations before publishing.

Configuration replacement is handled by abstracting data‑source names, allowing seamless switch between test and production environments.
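The data-source abstraction can be illustrated with a small Python sketch: job configs reference a logical data-source name, and the service resolves it to environment-specific connection details at publish time. The registry contents and key names below are hypothetical:

```python
# Per-environment registry mapping logical names to connection details,
# so job configs never hard-code URLs or credentials. Values are made up.
REGISTRY = {
    "test": {"orders_db": {"jdbcUrl": "jdbc:mysql://test-host:3306/orders"}},
    "prod": {"orders_db": {"jdbcUrl": "jdbc:mysql://prod-host:3306/orders"}},
}

def resolve(job_conf: dict, env: str) -> dict:
    """Swap the logical datasource name for env-specific connection info."""
    conf = dict(job_conf)
    name = conf.pop("datasource")
    conf["connection"] = REGISTRY[env][name]
    return conf

job = {"datasource": "orders_db", "table": "t_order"}
```

Promoting a job from test to production then changes only the `env` argument, not the job definition, which eliminates the mass JSON edits listed under Challenges.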

Service APIs manage job lifecycle, worker registration, logging, and system monitoring, enabling HA with dual masters.

Real‑time Integration

Motivation: ultra‑large back‑fill jobs, burst traffic during events (e.g., Double 11), and low‑latency requirements (sub‑50 ms response).

Use cases include real‑time sync to graph databases for risk control, event‑driven tagging to HBase/TableStore, and streaming ETL to Hive or data lakes.

Technology stack: Flink (stream‑batch unified, 1.12+), Flink CDC (embedded Debezium, no external Kafka needed), containerized deployment on YARN/Kubernetes, and plug‑in connectors for diverse sources.

The team designed a “One Configure” model in which offline (DataX) and real‑time (Flink) jobs share a unified JSON configuration, enabling seamless switching between batch and streaming execution.
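A minimal sketch of how such a unified configuration might be translated into a backend-specific job, assuming a hypothetical schema with `mode`, `source`, and `sink` keys (ZhongAn's actual schema is not public):

```python
def translate(conf: dict) -> dict:
    """Translate one unified job config into a backend-specific job:
    batch mode -> a DataX-style reader/writer job,
    stream mode -> a Flink SQL insert pipeline."""
    if conf["mode"] == "batch":
        return {"engine": "datax",
                "job": {"reader": conf["source"], "writer": conf["sink"]}}
    if conf["mode"] == "stream":
        return {"engine": "flink",
                "sql": f"INSERT INTO {conf['sink']} "
                       f"SELECT * FROM {conf['source']}"}
    raise ValueError(f"unknown mode: {conf['mode']}")
```

With this shape, switching a pipeline from batch to streaming is a one-field change in the shared config rather than a rewrite in a different job language.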

Q&A Highlights

DataX can reach ~100 MB/s (≈5‑6 × 10⁸ rows in 40 minutes) on a dedicated server.
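Those figures can be sanity-checked with quick arithmetic (taking the midpoint of the quoted row count):

```python
rows = 5.5e8             # midpoint of the quoted 5-6 x 10^8 rows
seconds = 40 * 60        # 40 minutes
throughput_bytes = 100e6 # quoted ~100 MB/s

rows_per_sec = rows / seconds                     # ~229,000 rows/s
bytes_per_row = throughput_bytes / rows_per_sec   # ~436 bytes/row
```

The implied average record width of roughly 440 bytes is plausible for wide business tables, so the two quoted numbers are mutually consistent.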

Flink CDC can operate without Kafka by embedding Debezium directly.

Other open‑source tools (e.g., TIS) offer comparable features, some with commercial backing, but ZhongAn’s solution remains an internal build.

Future plans include extending CDC to Oracle (via GoldenGate) and integrating more data‑source connectors.

In summary, ZhongAn’s data integration service demonstrates a pragmatic evolution from simple single‑node ETL tools to a sophisticated, service‑oriented platform that balances cost, performance, and scalability for both batch and real‑time data pipelines.

Tags: big data, real-time processing, Flink, data platform, DataX, ETL, data integration
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
