Design and Practice of Baixin Bank's Flink‑Based Real‑Time Computing Platform and Hudi‑Powered Real‑Time Data Lake
This article details Baixin Bank's construction of a Flink‑driven real‑time computing platform integrated with Hudi as a real‑time data lake, covering background, architecture, data collection, transformation, storage layers, technical challenges, future roadmap, and practical lessons for similar big‑data initiatives.
Background: Baixin Bank, China's first direct bank set up as an independent legal entity, needs a high degree of data agility, which led its big data department to build a real‑time computing platform.
Platform design: Built on Flink 1.12, the platform provides real‑time ingestion, computation, storage, complex event processing, visual management, and monitoring. It runs more than 320 online tasks and handles an average daily QPS of about 1.7 million.
Architecture: Three‑layer architecture – data collection layer (custom Databus for MySQL binlog to Kafka), data transformation layer (Flink jobs with UDFs, data masking, enrichment), and storage layer (HDFS, Kudu, TiDB, Kafka, Hudi, MySQL).
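The transformation layer's masking and enrichment steps can be pictured with a small, framework-free sketch. It is only an illustration: the event shape (`table`, `op`, `after`), the `phone` and `branch_id` fields, and the `BRANCH_NAMES` lookup are all hypothetical, and the platform's actual Flink UDFs are not described in the article.

```python
import json

# Hypothetical lookup table used for enrichment; a real job would
# more likely query a dimension table or cache.
BRANCH_NAMES = {"001": "Beijing", "002": "Shanghai"}


def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 characters, replace the middle with '*'."""
    if len(phone) < 8:
        # Too short to keep any part safely: mask everything.
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def transform(raw: str) -> str:
    """Mask sensitive fields and enrich one change event (JSON in, JSON out)."""
    event = json.loads(raw)
    row = dict(event["after"])  # copy so the parsed input is not mutated
    row["phone"] = mask_phone(row["phone"])
    row["branch_name"] = BRANCH_NAMES.get(row["branch_id"], "unknown")
    return json.dumps(
        {"table": event["table"], "op": event["op"], "after": row},
        sort_keys=True,
    )


# A made-up change event, shaped like something a binlog collector might emit.
sample = json.dumps({
    "table": "customer",
    "op": "UPDATE",
    "after": {"id": 42, "phone": "13812345678", "branch_id": "001"},
})
result = json.loads(transform(sample))
```

In a Flink job, logic like `mask_phone` would typically be registered as a scalar UDF and applied per record between the Kafka source and the storage sink.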
Integration with Hudi: The platform integrates Hudi as its unified storage engine, replacing the traditional Lambda architecture. Hudi supports upserts and deletes, incremental queries, and ACID guarantees, and offers two table types: copy‑on‑write (COW) and merge‑on‑read (MOR).
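To make the upsert, delete, and incremental-query semantics concrete, here is a toy in-memory model. It is not Hudi's API: `ToyTable` and its methods are invented for illustration, and the integer commit counter merely stands in for Hudi's commit timeline.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence, Tuple


@dataclass
class ToyTable:
    """In-memory stand-in for a keyed table with a commit timeline."""
    # key -> (commit number of last change, row value or None if deleted)
    rows: Dict[str, Tuple[int, Optional[dict]]] = field(default_factory=dict)
    commit: int = 0

    def write(self, upserts: Dict[str, dict], deletes: Sequence[str] = ()) -> int:
        """Apply one batch of upserts and deletes under a single commit number."""
        self.commit += 1
        for key, value in upserts.items():
            self.rows[key] = (self.commit, value)  # insert or overwrite by key
        for key in deletes:
            self.rows[key] = (self.commit, None)   # tombstone marks the delete
        return self.commit

    def snapshot(self) -> Dict[str, dict]:
        """Latest value of every live (non-deleted) key."""
        return {k: v for k, (_, v) in self.rows.items() if v is not None}

    def incremental(self, since: int) -> Dict[str, Optional[dict]]:
        """Keys changed after commit `since`; None marks a delete."""
        return {k: v for k, (c, v) in self.rows.items() if c > since}


table = ToyTable()
c1 = table.write({"a": {"balance": 10}, "b": {"balance": 20}})
c2 = table.write({"a": {"balance": 15}}, deletes=["b"])
latest = table.snapshot()
changes = table.incremental(since=c1)
```

The incremental read returns only what changed after `c1`, which is the property that lets downstream jobs consume a table as a change stream rather than rescanning it.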
Technical challenges: The team ran into dependency conflicts, classpath issues, oversized checkpoints, and performance bottlenecks when writing to unpartitioned COW tables. These were addressed by shading Hudi dependencies, enabling incremental checkpoints, and choosing the write mode appropriate to each workload.
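One plausible reading of the COW bottleneck is write amplification: when updates are scattered across many base files, copy-on-write rewrites each touched file in full, whereas merge-on-read appends only the changed records to log files. The back-of-the-envelope sketch below uses made-up file sizes and update counts, and deliberately ignores MOR's later compaction and read-side merge costs.

```python
from typing import Sequence


def cow_records_written(file_sizes: Sequence[int], updates_per_file: Sequence[int]) -> int:
    """Copy-on-write: every base file touched by an update is rewritten in full."""
    return sum(size for size, n in zip(file_sizes, updates_per_file) if n > 0)


def mor_records_written(file_sizes: Sequence[int], updates_per_file: Sequence[int]) -> int:
    """Merge-on-read: only the changed records are appended to log files."""
    return sum(updates_per_file)


# Hypothetical layout: 4 base files of 100,000 records each,
# with 10 updates landing in every file during one write batch.
file_sizes = [100_000] * 4
updates = [10, 10, 10, 10]

cow = cow_records_written(file_sizes, updates)
mor = mor_records_written(file_sizes, updates)
```

With these numbers COW writes 400,000 records to apply 40 updates, while MOR writes 40. Partitioning that confines updates to a few files narrows the gap, which is consistent with the article singling out unpartitioned COW writes.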
Future direction: The team plans to replace Kafka with Hudi as the central storage layer and to build the data warehouse entirely on Hudi, with Flink as the unified batch‑stream engine. This would provide schema evolution, primary‑key indexing, and timeline capabilities.
Conclusion: The article shares Baixin Bank’s experience in constructing a Flink‑based real‑time platform and a Hudi‑based real‑time data lake, providing practical guidance for others building similar systems.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.