Design and Implementation of a Lakehouse Data Platform Based on Apache Hudi at Taikang Life Insurance
This article details Taikang Life Insurance's end‑to‑end technical selection, architecture design, implementation, and custom enhancements of an Apache Hudi‑driven lakehouse platform for large‑scale health‑insurance data, covering background, component evaluation, performance benchmarking, multi‑layer architecture, and real‑world results.
Abstract
The article presents the technical selection, overall architecture, and implementation of a lakehouse‑style distributed data processing platform built on Apache Hudi at Taikang Life Insurance, focusing on the big‑health domain and the company's strategic goals.
Background
Driven by national "Healthy China" policies and rapid growth in the health‑insurance market, Taikang faced data silos caused by fragmented business‑line databases and large physical‑machine deployments. To overcome low data‑reuse efficiency and high management costs, a unified lakehouse platform was deemed essential.
Concepts
Definitions of Data Lake, Data Warehouse, and Lakehouse are clarified using Microsoft and IBM descriptions, highlighting the lakehouse’s combination of fast ingestion, multi‑modal data support, transactional guarantees, and centralized governance.
Technical Selection
Three open‑source lake components—Apache Iceberg, Apache Hudi, and Delta Lake—were evaluated across community momentum, feature set, and performance. Community metrics (stars, forks, PRs, issue resolution) favored Hudi, especially in China. Feature comparison showed that Hudi best satisfied Taikang's functional requirements, and benchmark tests on a 7400‑million‑record insurance dataset confirmed performance comparable to Delta Lake and superior to Iceberg.
Selection Result
Active community with diverse contributors and strong development momentum.
Key lake features (fast ingestion, upserts, Flink integration) fully meet business needs.
Performance meets the platform’s throughput requirements.
Lakehouse Architecture
The architecture comprises the following layers:
Data Sources: Enterprise DB2, other commercial databases, and Kafka for unstructured streams.
Processing Layer: Primarily Apache Flink (with some Spark), providing unified batch‑and‑stream processing and custom connectors (e.g., flink‑db2).
Infrastructure Layer: HDFS on physical machines plus Taikang Cloud OSS for object storage.
Lake Platform Layer: Apache Hudi as the transactional table format and streaming data‑lake service.
Data Modeling Layer: Tables built on Hudi, supporting upserts and schema evolution.
Data Access Layer: Hive Metastore (with Kerberos), Trino, ClickHouse, and REST APIs for discovery, governance, and permission control.
Data Application Layer: BI, ad‑hoc queries, visualizations, and downstream analytics.
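To make the modeling layer concrete, below is a minimal sketch of the Spark‑datasource write options typically used when landing a stream into a Hudi table and syncing it to the Hive Metastore. The option keys are standard Hudi configurations; the table, database, and field names (`policy_detail`, `policy_id`, `event_time`, `lakehouse`) are illustrative placeholders, not Taikang's actual schema.

```python
# Hedged sketch: standard Hudi write options for an upsert-oriented,
# Merge-On-Read table synced to the Hive Metastore.
# Table/field names are hypothetical.
hudi_write_options = {
    "hoodie.table.name": "policy_detail",                      # hypothetical table
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",     # MOR suits streaming upserts
    "hoodie.datasource.write.recordkey.field": "policy_id",    # record key for upserts
    "hoodie.datasource.write.precombine.field": "event_time",  # latest event wins on dedup
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.hive_sync.enable": "true",              # publish metadata to Hive
    "hoodie.datasource.hive_sync.database": "lakehouse",       # hypothetical database
    "hoodie.datasource.hive_sync.table": "policy_detail",
}
```

In a Spark job these options would be passed via `df.write.format("hudi").options(**hudi_write_options)`; once synced, the table becomes visible to Trino through its Hive catalog.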
Implementation Details
Version selection for all components is documented, and three typical challenges are addressed:
Synchronizing Hudi metadata with Hive Metastore and exposing it via Trino catalogs.
Mitigating small‑file explosion by combining Merge‑On‑Read tables with Hudi's clustering service after fast ingestion.
Enabling Kerberos authentication for Hudi by patching source code to interoperate with secured HDFS and Hive.
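For the small‑file mitigation above, Hudi's inline clustering can rewrite many small files into fewer large ones after ingestion. The sketch below uses standard Hudi clustering configuration keys; the thresholds (100 MB small‑file limit, 1 GB target file size) are illustrative defaults, not the values tuned in Taikang's production environment.

```python
# Hedged sketch: inline-clustering settings that compact small files
# produced by fast ingestion. Thresholds are illustrative.
clustering_options = {
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",  # trigger clustering every 4 commits
    # files smaller than 100 MB become candidates for rewriting
    "hoodie.clustering.plan.strategy.small.file.limit": str(100 * 1024 * 1024),
    # rewrite candidates into files of up to 1 GB
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
}
```

These options are merged into the same write configuration as the ingestion job, so clustering runs as part of the normal commit cycle rather than as a separate offline compaction service.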
Custom Extensions to Hudi
Two domain‑specific enhancements were developed:
Multi‑field primary‑key upserts: Allows a single upsert to update only the relevant field groups (e.g., policy, pension, dental) without overwriting other columns, reducing reliance on Flink state.
Multi‑event‑time validation: Guarantees that the latest record is persisted even when out‑of‑order events arrive, supporting strict accuracy requirements in health‑insurance data.
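The two extensions can be summarized as a single merge rule: each field group carries its own event time, and an incoming partial record overwrites a group only when its event time is at least as new as the stored one. The pure‑Python sketch below illustrates that semantics; the group names, field names, and `merge` function are illustrative, not Hudi's actual payload API.

```python
# Hedged sketch of the merge semantics behind multi-field-primary-key
# upserts with per-group event-time validation. All names are hypothetical.
FIELD_GROUPS = {
    "policy":  ["policy_status", "policy_amount"],
    "pension": ["pension_balance"],
    "dental":  ["dental_claims"],
}

def merge(existing: dict, incoming: dict) -> dict:
    """Merge an incoming partial record into the stored record, group by group."""
    merged = dict(existing)
    for group, fields in FIELD_GROUPS.items():
        ts_field = f"{group}_event_time"
        if ts_field not in incoming:
            continue  # this upsert does not touch the group; other columns survive
        # Event-time validation: a late, out-of-order event never
        # overwrites fresher stored data for the same group.
        if incoming[ts_field] >= existing.get(ts_field, 0):
            merged[ts_field] = incoming[ts_field]
            for f in fields:
                if f in incoming:
                    merged[f] = incoming[f]
    return merged

stored = {"policy_status": "active", "policy_event_time": 10,
          "pension_balance": 500, "pension_event_time": 20}
# A fresh policy event (time 12) arrives together with a late pension event (time 15):
update = {"policy_status": "lapsed", "policy_event_time": 12,
          "pension_balance": 999, "pension_event_time": 15}
result = merge(stored, update)
# policy group is updated (12 >= 10); pension group is kept (15 < 20)
```

Because the stored record resolves conflicts group by group, a single upsert touching only one business line leaves the other columns intact, removing the need to hold full records in Flink state.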
Use‑Case and Results
The extensions were validated on the "real‑time policy acceptance" scenario, processing >50 k policy updates daily (≈600 k total record operations) with 100 % data accuracy and zero data loss, demonstrating the value of the lakehouse approach.
Since production, the platform manages ~300 TB of data, >100 streaming jobs, >1 200 ETL tasks, and supports diverse analytics such as user behavior, compliance, OLAP, and visualization.
Future Work
Planned directions include expanding component integration for ML/DL workloads, strengthening monitoring, fault‑tolerance, and disaster recovery, and further tailoring Hudi to the unique characteristics of the big‑health domain.
Conclusion
The article provides a comprehensive case study of building a lakehouse data platform with Apache Hudi, covering selection, architecture, implementation, custom development, and measurable business impact, offering practical insights for similar large‑scale data initiatives.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.