Applying Apache Hudi in Medical Big Data: Architecture, Synchronization, Storage Choices, and Future Directions
This article examines how Apache Hudi can be used to build a hospital‑wide medical big‑data platform. It covers the project background, the reasons for selecting Hudi, data synchronization methods, the choice of storage type, query optimizations, and future development plans.
1. Project Background
Our company builds big‑data platforms for hospitals that extract data from many source systems, such as HIS, LIS, EMR, and radiology. This raises several challenges: heterogeneous source databases, the need for a unified data model, large differences in data volume between systems, and real‑time requirements.
2. Why Choose Hudi
The previous solution chained binlog → JSON → Kafka → Spark Streaming → HBase → DataX → Hadoop → Impala → Greenplum. It suffered from a long and complex pipeline, difficult data validation, redundant storage, high query load, and high latency.
Hudi was selected for its two storage types (Copy‑On‑Write and Merge‑On‑Read), its support for multiple query engines (Hive, Spark SQL, Presto, Impala), its choice of index implementations (HBase, InMemory, Bloom, Global Bloom), and its Parquet columnar storage with small‑file merging.
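To make these choices concrete, here is a minimal sketch of how the storage type and index type might be expressed as Hudi Spark datasource write options. The helper name, table name, and field names are hypothetical and not taken from the original platform. The option keys are the ones used by recent Hudi releases; older releases named some of them differently (for example `hoodie.datasource.write.storage.type`), so check them against the version in use.

```python
# Hypothetical helper: builds a dict of Hudi write options.
# Only the storage and index types named in the article are accepted here.
VALID_TABLE_TYPES = {"COPY_ON_WRITE", "MERGE_ON_READ"}
VALID_INDEX_TYPES = {"HBASE", "INMEMORY", "BLOOM", "GLOBAL_BLOOM"}


def hudi_write_options(table_name, record_key, precombine_field, partition_field,
                       table_type="COPY_ON_WRITE", index_type="BLOOM"):
    """Return Hudi datasource options for an upsert into the given table."""
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"unknown table type: {table_type}")
    if index_type not in VALID_INDEX_TYPES:
        raise ValueError(f"unknown index type: {index_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        # Field that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the newest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.index.type": index_type,
    }


# Illustrative names only (an assumed lab-result table from an LIS source).
opts = hudi_write_options("lis_lab_result", "result_id", "updated_at", "visit_date")
```

With PySpark and the Hudi bundle on the classpath, such a dict would typically be passed as `df.write.format("hudi").options(**opts).mode("append").save(base_path)`, where `df` and `base_path` are placeholders.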
3. Hudi Data Synchronization
Data synchronization has two parts. The offline full load uses DataX with multithreaded JDBC extraction. In the online near‑real‑time path, change records from multiple tables are written as JSON to Kafka, Flink writes them to partitioned HDFS directories, and a separate service triggers the Hudi merge jobs.
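The merge step can be illustrated with a small, engine‑free sketch of upsert semantics: for each record key, the version with the highest ordering value wins. This is a conceptual model only, not Hudi's implementation. The JSON envelope (`data`), the key field `id`, and the ordering field `ts` are assumptions made for the example, and deletes are omitted.

```python
import json


def merge_changes(base, change_messages, key_field="id", ordering_field="ts"):
    """Apply JSON change messages to a base snapshot keyed by record key.

    A change replaces the stored row only if its ordering value is at least
    as new as the stored one, so late-arriving stale messages are ignored.
    """
    merged = dict(base)  # do not mutate the caller's snapshot
    for raw in change_messages:
        row = json.loads(raw)["data"]
        key = row[key_field]
        current = merged.get(key)
        if current is None or row[ordering_field] >= current[ordering_field]:
            merged[key] = row
    return merged


# Snapshot produced by the offline full load (illustrative data).
base = {1: {"id": 1, "name": "A", "ts": 100}}

# Messages as they might arrive from Kafka, including one stale update.
messages = [
    '{"data": {"id": 1, "name": "B", "ts": 200}}',
    '{"data": {"id": 2, "name": "C", "ts": 150}}',
    '{"data": {"id": 1, "name": "stale", "ts": 50}}',
]

result = merge_changes(base, messages)
```

In this run, record 1 ends up with the `ts` 200 version, record 2 is inserted, and the stale `ts` 50 message is dropped. In Hudi the same roles are played by the record key field and the precombine field.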
4. Storage Type Selection and Query Optimization
We adopted the Copy‑On‑Write storage type to reduce query latency and to use its read‑optimized and incremental views. Spark SQL query performance is further tuned by partitioning, adjusting job parallelism, broadcasting small tables, and avoiding data skew; Presto queries ran about three times faster after incremental view support was enabled.
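As a rough sketch of these tuning knobs, the snippet below collects standard Spark settings for parallelism and broadcast joins, together with Hudi's incremental read options. The concrete values are placeholders rather than the platform's actual settings, the helper names are hypothetical, and the Hudi key names follow recent releases and may differ in older ones. The last function is a plain Python model of what an incremental query returns: rows committed after a given instant.

```python
def spark_tuning_conf(shuffle_partitions, broadcast_threshold_mb):
    """Spark settings for job parallelism and small-table broadcast joins."""
    return {
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        "spark.default.parallelism": str(shuffle_partitions),
        # Tables smaller than this many bytes are broadcast in joins.
        "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold_mb * 1024 * 1024),
    }


def hudi_incremental_read_options(begin_instant):
    """Hudi read options that request only data committed after begin_instant."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def incremental_rows(rows, begin_instant):
    """Model of incremental semantics: keep rows with a later commit time.

    Hudi instant times are timestamp strings, so string comparison
    matches chronological order when the formats are identical.
    """
    return [r for r in rows if r["_hoodie_commit_time"] > begin_instant]


conf = spark_tuning_conf(shuffle_partitions=200, broadcast_threshold_mb=32)
read_opts = hudi_incremental_read_options("20200101000000")

# Illustrative rows carrying Hudi's commit-time metadata column.
rows = [
    {"id": 1, "_hoodie_commit_time": "20191231235900"},
    {"id": 2, "_hoodie_commit_time": "20200101000000"},
    {"id": 3, "_hoodie_commit_time": "20200102080000"},
]
changed = incremental_rows(rows, "20200101000000")
```

Only the row with `id` 3 is returned here, because the begin instant is exclusive. Downstream jobs that read only such changed rows avoid rescanning the whole table.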
5. Future Work and Thoughts
Plans include integrating FlinkX‑style offline synchronization, improving multi‑output consumption in Spark, enhancing Hudi support in Flink, and getting more deeply involved in the community.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies