
Design and Implementation of a Lakehouse Data Platform Based on Apache Hudi at Taikang Life Insurance

This article details Taikang Life Insurance's end‑to‑end technical selection, architecture design, implementation, and custom enhancements of an Apache Hudi‑driven lakehouse platform for large‑scale health‑insurance data, covering background, component evaluation, performance benchmarking, multi‑layer architecture, and real‑world results.


Abstract

This article presents the technical selection, overall architecture, and implementation of a lakehouse-style distributed data-processing platform built on Apache Hudi at Taikang Life Insurance, focusing on the big-health domain and the company's strategic goals.

Background

Driven by national "Healthy China" policies and rapid growth in the health‑insurance market, Taikang faced data silos caused by fragmented business‑line databases and large physical‑machine deployments. To overcome low data‑reuse efficiency and high management costs, a unified lakehouse platform was deemed essential.

Concepts

Definitions of Data Lake, Data Warehouse, and Lakehouse are clarified using Microsoft and IBM descriptions, highlighting the lakehouse’s combination of fast ingestion, multi‑modal data support, transactional guarantees, and centralized governance.

Technical Selection

Three open‑source lake components—Apache Iceberg, Apache Hudi, and Delta Lake—were evaluated across community momentum, feature set, and performance. Community metrics (stars, forks, PRs, issue resolution) favored Hudi, especially in China. Feature comparison showed Hudi best satisfied Taikang’s functional requirements, and benchmark tests on a 7400‑million‑record insurance dataset confirmed comparable performance to Delta Lake and superiority over Iceberg.

Selection Result

Active community with diverse contributors and strong development momentum.

Key lake features (fast ingestion, upserts, Flink integration) fully meet business needs.

Performance meets the platform’s throughput requirements.

Lakehouse Architecture

The architecture consists of seven layers:

Data Sources: Enterprise DB2, other commercial databases, and Kafka for unstructured streams.

Processing Layer: Primarily Apache Flink (with some Spark), providing unified batch‑and‑stream processing and custom connectors (e.g., flink‑db2).

Infrastructure Layer: HDFS on physical machines plus Taikang Cloud OSS for object storage.

Lake Platform Layer: Apache Hudi as the transactional table format and streaming data‑lake service.

Data Modeling Layer: Tables built on Hudi, supporting upserts and schema evolution.

Data Access Layer: Hive Metastore (with Kerberos), Trino, ClickHouse, and REST APIs for discovery, governance, and permission control.

Data Application Layer: BI, ad‑hoc queries, visualizations, and downstream analytics.
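As a concrete illustration of the Data Access Layer, the sketch below shows what a Trino catalog file for the Hudi tables might look like when the Hive Metastore is secured with Kerberos, as the article describes. The property keys follow Trino's Hudi/Hive connector conventions; the hostnames, principals, and keytab path are placeholders, not values from the article.

```properties
# Hypothetical Trino catalog (etc/catalog/hudi.properties) — illustrative only.
connector.name=hudi
# Point Trino at the Hive Metastore that Hudi syncs table metadata into:
hive.metastore.uri=thrift://metastore.example.com:9083
# Kerberos authentication against the secured metastore:
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/_HOST@EXAMPLE.COM
hive.metastore.client.principal=trino@EXAMPLE.COM
hive.metastore.client.keytab=/etc/security/keytabs/trino.keytab
```

With a catalog like this, tables written by Hudi and synced to the metastore become queryable from Trino without any per-table registration.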

Implementation Details

Version selection for all components is documented, and three typical challenges are addressed:

Synchronizing Hudi metadata with Hive Metastore and exposing it via Trino catalogs.

Mitigating small‑file explosion by running Hudi's clustering service on Merge‑On‑Read tables after fast ingestion.

Enabling Kerberos authentication for Hudi by patching source code to interoperate with secured HDFS and Hive.
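For the small-file challenge above, Hudi exposes write-time and clustering configurations that control file sizing. The fragment below is a hedged example: the keys are standard Hudi options, but the thresholds shown are illustrative defaults, not the values Taikang used.

```properties
# Illustrative Hudi write options for small-file mitigation.
# Pad new inserts into existing base files smaller than ~100 MB:
hoodie.parquet.small.file.limit=104857600
# Run clustering inline every 4 commits to compact small files:
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
# Treat files under ~600 MB as clustering candidates, target ~1 GB output:
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
```

Inline clustering trades some ingestion latency for fewer, larger files; Hudi also supports running clustering asynchronously when ingestion throughput must not be impacted.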

Custom Extensions to Hudi

Two domain‑specific enhancements were developed:

Multi‑field primary‑key upserts: Allows a single upsert to update only the relevant field groups (e.g., policy, pension, dental) without overwriting other columns, reducing reliance on Flink state.

Multi‑event‑time validation: Guarantees that the latest record is persisted even when out‑of‑order events arrive, supporting strict accuracy requirements in health‑insurance data.
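The two enhancements above can be sketched together as a merge function. This is a minimal, hypothetical illustration of the semantics only: the names (`FIELD_GROUPS`, `merge_records`) and the field layout are invented for this example, and the real implementation would live in a custom Hudi record payload, not standalone Python.

```python
# Hypothetical field groups keyed by business line; each upsert touches
# exactly one group and carries that group's own event time.
FIELD_GROUPS = {
    "policy": ["policy_status", "policy_amount"],
    "dental": ["dental_claims"],
}

def merge_records(existing: dict, incoming: dict, group: str) -> dict:
    """Partial upsert with per-group event-time validation.

    Only the columns of the incoming record's field group are written,
    and only when its event time is not older than the stored one, so
    out-of-order events never overwrite newer data.
    """
    merged = dict(existing)
    ts_field = f"{group}_event_time"
    # Multi-event-time validation: drop out-of-order updates for this group.
    if incoming.get(ts_field, 0) < existing.get(ts_field, 0):
        return merged
    # Multi-field primary-key upsert: copy only this group's columns,
    # leaving every other field group untouched.
    for col in FIELD_GROUPS[group] + [ts_field]:
        if col in incoming:
            merged[col] = incoming[col]
    return merged
```

Because the merge only touches one field group per record, updates from different business lines can flow through independent Flink pipelines without each pipeline having to hold the full row in Flink state.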

Use‑Case and Results

The extensions were validated on the "real‑time policy acceptance" scenario, processing >50 k policy updates daily (≈600 k total record operations) with 100 % accuracy and no data loss, demonstrating the value of the lakehouse approach.

Since production, the platform manages ~300 TB of data, >100 streaming jobs, >1 200 ETL tasks, and supports diverse analytics such as user behavior, compliance, OLAP, and visualization.

Future Work

Planned directions include expanding component integration for ML/DL workloads, strengthening monitoring, fault‑tolerance, and disaster recovery, and further tailoring Hudi to the unique characteristics of the big‑health domain.

Conclusion

The article provides a comprehensive case study of building a lakehouse data platform with Apache Hudi, covering selection, architecture, implementation, custom development, and measurable business impact, offering practical insights for similar large‑scale data initiatives.

Tags: Big Data, Flink, data platform, data governance, Apache Hudi, Data Lakehouse, Health Insurance
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
