
Modern Data Stack on Alibaba Cloud Using Flink CDC: Architecture, Features, and Use Cases

This article presents a comprehensive overview of Alibaba Cloud's modern data stack built on Flink CDC, detailing its core concepts, extended capabilities, typical application scenarios, performance optimizations, a live demo, and future development plans for large‑scale streaming data integration.


Introduction

The session introduces the topic "Modern Data Stack on Alibaba Cloud Based on Flink CDC" and outlines four main parts: the Flink CDC‑based modern data stack, CDC YAML core functions, typical application scenarios, and a demo with future outlook.

1. Flink CDC Overview

Flink CDC is a distributed data‑integration tool that uses YAML to describe data transfer and transformation, simplifying both batch and streaming data pipelines.
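To make the YAML-first idea concrete, here is a minimal sketch of a pipeline definition. Hostnames, credentials, and the table pattern are placeholders, and exact option names vary across Flink CDC connector versions:

```yaml
# Minimal sketch: sync every table in one MySQL database into Paimon.
# All endpoint values below are placeholders.
source:
  type: mysql
  hostname: mysql.example.internal
  port: 3306
  username: flink_cdc
  password: "******"
  tables: app_db.\.*          # regex: all tables in app_db

sink:
  type: paimon
  catalog.properties.warehouse: /path/to/warehouse

pipeline:
  name: app_db-to-paimon
  parallelism: 2
```

The point of the format is that schema discovery, snapshot-plus-incremental reading, and multi-table writing are all implied by this declaration rather than spelled out per table.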

2. Modern Data Stack Overview

The modern data stack shifts from traditional ETL toward ELT: raw data is synced directly into the target system and transformed there, leveraging cloud elasticity for efficient storage and processing.

3. Alibaba Cloud Practice

Source layer: adds log‑based full‑database sync (e.g., MySQL binlog → Kafka).

Extract & Load layer: supports Flink CDC jobs, DataStream jobs, and SQL jobs.

Warehouse layer: integrates lake and warehouse systems such as Paimon and StarRocks, plus high‑performance storage like Hologres.

Transform layer: uses Spark or Flink jobs for reporting, real‑time dashboards, etc.

4. Real‑time Compute Integration

YAML‑based job templates (e.g., MySQL → Paimon, MySQL → StarRocks).

Automatic connector dependency handling.

Enhanced monitoring metrics for snapshot and incremental phases.

Support for multi‑link sync, full job lifecycle management, and CDC YAML version control.
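The MySQL → StarRocks template mentioned above follows the same shape as any other CDC YAML job; a hedged sketch (endpoints and credentials are placeholders, and sink option names depend on the StarRocks connector version in use):

```yaml
# Sketch of a MySQL -> StarRocks job template.
source:
  type: mysql
  hostname: mysql.example.internal
  port: 3306
  username: flink_cdc
  password: "******"
  tables: app_db.\.*

sink:
  type: starrocks
  jdbc-url: jdbc:mysql://starrocks-fe.example.internal:9030
  load-url: starrocks-fe.example.internal:8030
  username: sr_user
  password: "******"

pipeline:
  name: app_db-to-starrocks
```

On the managed platform, the connector dependencies for such a template are resolved automatically, which is the "automatic connector dependency handling" point above.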

5. CDC YAML Core Functions

5.1 Supported Sync Links

Source: MySQL, Kafka.

Sink: Paimon, StarRocks, Hologres, full‑database Kafka sync, raw binlog → Kafka.

5.2 Transform & Route

Transform: add computed columns, metadata columns, built‑in functions, UDFs, partition/key handling, filtering, column pruning.

Route: one‑to‑one and many‑to‑one table mapping, pattern‑based table naming.
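The Transform and Route capabilities are expressed as rule blocks in the same YAML file. A sketch, with hypothetical table names (the `projection`/`filter` and `source-table`/`sink-table` keys follow recent Flink CDC releases; verify against your version):

```yaml
# Sketch: a transform rule adding a computed column and filtering rows,
# plus a route rule merging shard tables into one target (many-to-one).
transform:
  - source-table: app_db.orders
    projection: "*, UPPER(region) AS region_code"   # computed column
    filter: amount > 0                               # row filtering

route:
  - source-table: app_db.order_\.*    # shards order_01, order_02, ...
    sink-table: dw.orders_all         # many-to-one table mapping
```

Column pruning works the same way: list only the wanted columns in `projection` instead of starting from `*`.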

5.3 Monitoring Metrics

Snapshot status (isSnapshotting, isBinlogReading, table counts, split counts).

Data metrics (read timestamps, lag, record counts per table, snapshot record counts).

5.4 Other Features

Fine‑grained control over which schema changes are applied downstream (e.g., blocking destructive operations such as DROP TABLE).

Support for additional change types (e.g., TRUNCATE).

Tolerant mode for schema evolution.

Raw binlog sync to Kafka.
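These features are driven by configuration rather than code. A sketch of tolerant-mode schema evolution combined with fine-grained change filtering (the option names and enum values follow recent Flink CDC releases and should be checked against the version you run):

```yaml
# Sketch: tolerate schema changes the sink cannot apply, and never
# propagate destructive DDL to the target.
pipeline:
  schema.change.behavior: lenient   # tolerant mode for schema evolution

sink:
  type: paimon
  exclude.schema.changes:
    - drop.table        # do not propagate DROP TABLE
    - truncate.table    # do not propagate TRUNCATE TABLE
```

With `lenient` behavior, a change the sink cannot absorb is tolerated instead of failing the job, which matches the "tolerant mode" point above.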

6. YAML vs. SQL / CTAS / CDAS / DataStream

YAML auto‑detects schemas, propagates schema changes in real time, and handles raw changelog formats.

SQL requires manual schema definition and does not propagate schema changes automatically.

YAML pipelines can read and write multiple tables in one job, whereas SQL jobs are limited to single‑table operations.

7. Typical Application Scenarios

Full‑database sync (initial snapshot + incremental).

Full sync with Transform and Route (e.g., adding version suffixes).

Sharding and merging multiple source tables into one target.

Tolerant mode to map diverse source types to a unified target type.

Full sync to Kafka for downstream consumption.

Binlog raw data sync to Kafka (Debezium JSON, Canal JSON).

Fine‑grained change control (add/drop tables/columns, rename columns, etc.).

Automatic capture of newly added tables (snapshot and incremental options).

Point‑in‑time full refresh to handle upstream version incompatibilities.

MySQL CDC performance optimizations (binlog parameter tuning, filtering out irrelevant tables, parallel parsing and serialization).
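The raw-binlog-to-Kafka scenario above can also be sketched as a pipeline definition; broker addresses are placeholders, and `debezium-json` / `canal-json` are the two encodings named earlier:

```yaml
# Sketch: full-database sync of MySQL changes to Kafka, encoded as
# Debezium JSON for downstream consumers.
source:
  type: mysql
  hostname: mysql.example.internal
  port: 3306
  username: flink_cdc
  password: "******"
  tables: app_db.\.*

sink:
  type: kafka
  properties.bootstrap.servers: kafka-1.example.internal:9092
  value.format: debezium-json   # or canal-json

pipeline:
  name: app_db-binlog-to-kafka
```

Downstream jobs can then consume the Kafka topics as an ordinary changelog stream, decoupling them from the source database.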

8. Demo and Future Outlook

Demo: full‑database sync to Paimon and binlog sync to Kafka, with schema evolution monitoring.

Future plans: dirty‑data handling, data throttling for MySQL CDC, expanding upstream/downstream ecosystem, more data lake and warehouse integrations.

The talk closes with documentation links for the Beta data‑ingestion product and the open‑source Flink CDC project.

Conclusion

Thank you for attending the session.

Tags: Big Data, Streaming, data integration, Alibaba Cloud, Flink CDC, modern data stack
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
