
Halodoc’s Data Platform Evolution: From Redshift to a LakeHouse Architecture with Apache Hudi

This article describes how Halodoc’s data engineering team identified limitations of their Redshift‑based platform, evaluated a LakeHouse design, selected Apache Hudi for mutable data handling, and outlined the challenges and benefits of building a scalable, decoupled storage‑compute architecture for their growing healthcare services.

Big Data Technology Architecture

1. Abstract

Data platforms have fundamentally changed how companies store, analyze, and use data, but to be more effective they must be reliable, high‑performance, and transparent. As Indonesia’s largest online healthcare provider, Halodoc faced major challenges in democratizing data across the organization, especially as data volume grew exponentially and existing Redshift‑based pipelines showed latency, governance, and scalability issues.

2. Platform Evolution

The legacy platform relied on periodic ELT jobs that loaded data into Redshift, where storage and compute were tightly coupled, leading to high costs and a data latency of 3‑4 hours. The team identified the following problems with this setup:

Coupled storage and compute: Scaling storage also meant paying for more compute.

High latency: Data became available 3‑4 hours after it was generated.

Insufficient governance: Access was controlled only at the group level, with no column‑level or row‑level controls.

Opaque data‑set creation: Tables were duplicated, and the relationships between them were unclear.

No SCD management: Historical changes to attributes such as drug price were not tracked.

Memory‑bound data movement in Airflow: Data was moved in memory by Airflow, which is a workflow orchestrator and not a distributed processing engine.

No data catalog: Users could not discover metadata for Redshift tables.

Lack of lineage: There was no visibility into source‑to‑target transformations.

No framework‑driven platform: Code was repeated across pipelines, and components were hard to decouple.

Manual schema evolution: Schema changes were applied by hand by DBAs, which was error‑prone.

Recognizing these constraints, the team decided to redesign the platform from scratch, adopting a LakeHouse architecture to achieve cost‑effective scalability and handle massive data volumes.
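To make the missing SCD handling concrete, the following is a minimal sketch of Type 2 slowly changing dimension tracking for a drug price. It is an illustration only, not Halodoc's implementation: the `PriceVersion` record, the `apply_price_change` and `price_on` helpers, and the sample drug and prices are all hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class PriceVersion:
    """One row of a hypothetical SCD Type 2 price history."""
    drug_id: str
    price: int                 # price in the smallest currency unit
    valid_from: date           # first day this version applies
    valid_to: Optional[date]   # exclusive end date; None marks the current version


def apply_price_change(history: List[PriceVersion], drug_id: str,
                       new_price: int, effective: date) -> List[PriceVersion]:
    """Return a new history in which the current version is closed and a new one added."""
    updated = []
    for row in history:
        is_current = row.drug_id == drug_id and row.valid_to is None
        if is_current and row.price == new_price:
            return list(history)  # nothing changed, so keep the history as it is
        if is_current:
            row = replace(row, valid_to=effective)  # close the old version
        updated.append(row)
    updated.append(PriceVersion(drug_id, new_price, effective, None))
    return updated


def price_on(history: List[PriceVersion], drug_id: str, day: date) -> Optional[int]:
    """Look up the price that was valid on a given day, or None if unknown."""
    for row in history:
        in_range = row.valid_from <= day and (row.valid_to is None or day < row.valid_to)
        if row.drug_id == drug_id and in_range:
            return row.price
    return None


history = apply_price_change([], "paracetamol", 5000, date(2021, 1, 1))
history = apply_price_change(history, "paracetamol", 5500, date(2021, 6, 1))
print(price_on(history, "paracetamol", date(2021, 3, 1)))
```

Because old versions are closed rather than overwritten, a query for any past date can still recover the price that applied then, which is exactly what an overwrite‑in‑place table loses.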

3. Why Adopt a LakeHouse Architecture?

A LakeHouse combines the flexibility of a data lake with the reliability of a data warehouse, enabling seamless data movement, unified security, and independent scaling of storage and compute. Core requirements identified were:

Decoupled storage and compute for high scalability.

Support for structured, semi‑structured, and unstructured data.

A single source of truth for the organization.

Ability to store and query mutable and immutable data.

Integration with distributed processing engines such as Spark or Hive.

The new design uses Amazon S3 as the data lake, providing virtually unlimited storage. To handle mutable data on S3, the team evaluated Iceberg, Delta Lake, and Apache Hudi, ultimately selecting Hudi for its tight integration with EMR.
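As a rough sketch of what writing mutable data to S3 through Hudi can look like, the helper below assembles the write options that a Spark job would pass to the Hudi data source. The option keys are standard Hudi data source settings, but the helper name `hudi_write_options`, the table and column names, and the bucket path are assumptions made for illustration; the article does not show Halodoc's actual configuration.

```python
from typing import Dict

VALID_TABLE_TYPES = ("COPY_ON_WRITE", "MERGE_ON_READ")


def hudi_write_options(table_name: str, record_key: str,
                       precombine_field: str, partition_field: str,
                       table_type: str = "COPY_ON_WRITE") -> Dict[str, str]:
    """Build a dictionary of Hudi write options for an upsert (hypothetical helper)."""
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"table_type must be one of {VALID_TABLE_TYPES}")
    return {
        "hoodie.table.name": table_name,
        # Column that uniquely identifies a record, used to match updates.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest record when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
    }


# Hypothetical table and column names, for illustration only.
options = hudi_write_options("drug_prices", "drug_id", "updated_at", "category")

# In a Spark job with the Hudi bundle on the classpath, the options would be
# applied roughly like this (not executed here; the bucket is a placeholder):
# df.write.format("hudi").options(**options).mode("append") \
#     .save("s3://example-bucket/lake/drug_prices")
```

Storage stays in S3 while the Spark cluster that runs the write can be sized, or shut down, independently, which is the decoupling the requirements call for.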

4. Why Choose Apache Hudi

Supports upserts on files.

Captures change history via CDC.

Provides ACID guarantees.

Offers both Copy‑on‑Write and Merge‑on‑Read table types.

Enables real‑time, snapshot, and incremental queries.

Offers time‑travel capabilities.

Pre‑installed on EMR for out‑of‑the‑box use.
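The upsert and incremental query features listed above can be pictured with a small in‑memory model. This is a conceptual sketch in plain Python, not Hudi's API: the `upsert` and `incremental` functions, the `_commit_time` field (a simplified stand‑in for Hudi's commit metadata), and the sample records are all hypothetical, and Hudi's real merge behaviour depends on how the table is configured.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def upsert(table: Dict[object, Record], batch: Iterable[Record], commit_time: str,
           key: str = "id", ordering: str = "updated_at") -> None:
    """Insert new records and update existing ones, matched by record key."""
    for rec in batch:
        current = table.get(rec[key])
        # Keep the incoming record unless the stored one has a newer ordering value.
        if current is None or rec[ordering] >= current[ordering]:
            table[rec[key]] = {**rec, "_commit_time": commit_time}


def incremental(table: Dict[object, Record], since: str) -> List[Record]:
    """Return only the records written by commits after the given commit time."""
    return [rec for rec in table.values() if rec["_commit_time"] > since]


table: Dict[object, Record] = {}
upsert(table, [{"id": "a", "price": 1, "updated_at": 1},
               {"id": "b", "price": 2, "updated_at": 1}], commit_time="001")
upsert(table, [{"id": "a", "price": 5, "updated_at": 2}], commit_time="002")

# Only record "a" changed after commit "001", so only it is returned.
print(incremental(table, since="001"))
```

A downstream job that remembers the last commit it processed can call the incremental read with that commit time and handle only the changed records instead of rescanning the whole table.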

5. Summary

This post outlined the limitations of the existing Redshift‑based platform and the motivations for moving to a LakeHouse built on Apache Hudi. Future posts will dive deeper into the LakeHouse design, implementation details, and the operational challenges encountered while rolling out the new platform.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
