
Applying and Practicing Apache Hudi on Tongcheng Elong: Architecture, Challenges, and Solutions

This article describes the background, design choices, and practical challenges of adopting Apache Hudi for data updates at Tongcheng Elong. It analyzes three architectural alternatives, details Hudi's core configurations and write strategies, and presents concrete solutions to version compatibility, upsert semantics, insert behavior, partition management, streaming backlog monitoring, and business-specific requirements, culminating in a productized Hudi service and a future roadmap.

Tongcheng Travel Technology Center

Background: Traditional Hive data warehouses do not support updates, which is problematic for mutable data such as orders, members, and coupons. Three alternative solutions were evaluated: (1) MySQL binlog → Kafka → FlinkSQL → Hive with deduplication, (2) MySQL binlog → Kafka → FlinkSQL → Kudu (upsert) → Hive, and (3) daily full loads via Sqoop. Each approach has limitations regarding partition writes, file size, latency, and operational complexity.

Why Hudi? Hudi (and Delta Lake) enable mutable operations on HDFS files. Hudi offers broader ecosystem compatibility (Spark, Presto, MapReduce, Hive) and supports both copy‑on‑write (COW) and merge‑on‑read (MOR) write strategies, with automatic small‑file merging.

Core Hudi Configuration:

RECORDKEY_FIELD_OPT_KEY – unique record key.

PRECOMBINE_FIELD_OPT_KEY – field used to resolve conflicts when keys collide.

PARTITIONPATH_FIELD_OPT_KEY – partition field (updates across partitions are not supported).
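For reference, these option constants resolve to the following datasource configuration keys; the field values shown here are hypothetical examples for an order table, not the article's actual schema:

```properties
# RECORDKEY_FIELD_OPT_KEY – unique record key
hoodie.datasource.write.recordkey.field=order_id
# PRECOMBINE_FIELD_OPT_KEY – conflict-resolution field
hoodie.datasource.write.precombine.field=update_time
# PARTITIONPATH_FIELD_OPT_KEY – partition field
hoodie.datasource.write.partitionpath.field=create_date
```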

Hudi provides two write modes: COW (read‑optimized) and MOR (write‑optimized using Avro and background compaction). The article focuses on the COW mode, where data is stored as Parquet files with per‑file Bloom filters for efficient key lookup.
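As a rough illustration of why per-file Bloom filters make key lookup efficient, here is a minimal standalone sketch. The class name and hashing scheme are invented for illustration; Hudi's real implementation lives in its own Bloom filter utilities and is serialized into the Parquet footer:

```java
import java.util.BitSet;

// Minimal Bloom filter: membership test with rare false positives but no
// false negatives. Hudi stores a filter like this per Parquet file, so an
// upsert can skip files that definitely do not contain a given record key.
class FileKeyFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    FileKeyFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive several bit positions from the key; the constant is arbitrary.
    private int hash(String key, int seed) {
        int h = key.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    void add(String recordKey) {
        for (int i = 0; i < hashes; i++) bits.set(hash(recordKey, i));
    }

    // false => key is definitely not in this file; true => it might be,
    // so the file must actually be read to confirm.
    boolean mightContain(String recordKey) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(hash(recordKey, i))) return false;
        }
        return true;
    }
}
```

During an upsert, only files whose filter answers "might contain" for an incoming key need to be opened, which is what keeps COW key lookup cheap at scale.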

Identified Problems:

Hive version incompatibility (Hudi requires Hive 2.x, but the cluster runs Hive 1.x).

Upsert semantics not honoring the pre‑combine key, causing older records to overwrite newer ones.

Insert semantics behaving like upsert, leading to unintended overwrites.

Uncontrolled Hive partition creation when timestamps are used as partition keys.

Difficulty monitoring Spark‑Streaming backlog because consumer groups are not fixed.

Business‑specific needs such as pre‑combine key collisions and null‑value filling.

Solution 1 – Hive Compatibility: Leveraged an existing migration path from Hive 1.x to Hive 2.x and switched the ETL engine from MapReduce to SparkSQL, allowing Hudi to run without upgrading the Hive service. Explored using the Hive metastore client while keeping the Hive server at 1.x.

Solution 2 – Fixing Upsert Semantics:

@Override
public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload another) {
  // pick the payload with greatest ordering value
  if (another.orderingVal.compareTo(orderingVal) > 0) {
    return another;
  } else {
    return this;
  }
}

The above method correctly applies the pre‑combine key during the in‑memory aggregation.

public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException {
  Option<IndexedRecord> recordOption = getInsertValue(schema);
  if (!recordOption.isPresent()) {
    return Option.empty();
  }
  GenericRecord genericRecord = (GenericRecord) recordOption.get();
  Object deleteMarker = genericRecord.get("_hoodie_is_deleted");
  if (deleteMarker instanceof Boolean && (boolean) deleteMarker) {
    return Option.empty();
  } else {
    return Option.of(genericRecord);
  }
}

In the stock implementation, this method ignores the record already stored on HDFS (currentValue) and always returns the incoming record; the fix adds a pre-combine key comparison between the two before deciding which record to keep.
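The fixed merge rule can be modeled in isolation, stripped of the Avro and Hudi machinery. This is a simplified sketch with illustrative class and field names, not Hudi's actual payload classes:

```java
// Simplified model of the fixed merge rule: when an incoming record meets an
// existing record with the same key, keep whichever has the greater ordering
// (pre-combine) value, and drop the pair if the winner carries a delete flag
// (mirroring _hoodie_is_deleted).
class Rec {
    final String key;
    final long orderingVal;   // the pre-combine field, e.g. an update timestamp
    final boolean deleted;

    Rec(String key, long orderingVal, boolean deleted) {
        this.key = key;
        this.orderingVal = orderingVal;
        this.deleted = deleted;
    }
}

class MergeRule {
    // Returns the record that should survive, or null when the surviving
    // version is a delete (the key disappears from the table).
    static Rec combine(Rec existing, Rec incoming) {
        Rec winner = incoming.orderingVal >= existing.orderingVal ? incoming : existing;
        return winner.deleted ? null : winner;
    }
}
```

With this rule in place, a stale record replayed from Kafka can no longer overwrite a newer version already on HDFS, which was the original bug.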

Solution 3 – Insert Semantics: Implemented a custom SimpleKeyGenerator that generates a UUID for each incoming record, ensuring unique row keys and preserving true insert behavior.
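The core idea can be sketched as follows. In Hudi this logic would live inside a KeyGenerator subclass; here it is reduced to a plain method with an illustrative class name:

```java
import java.util.UUID;

// Sketch of the idea behind the custom key generator for true inserts:
// instead of deriving the row key from a business field (which makes a second
// insert of the same business key behave like an upsert), assign every
// incoming record a fresh UUID so no two rows can ever collide.
class InsertKeyGenerator {
    static String recordKey() {
        return UUID.randomUUID().toString();
    }
}
```

Because every key is unique, the deduplication path in the writer never matches an existing record, and inserts land as genuinely new rows.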

Solution 4 – Partition Creation Monitoring: Modified HiveSyncTool.syncPartitions to track total partition count and per-batch partition creation, aborting the job when predefined thresholds are exceeded.
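The guard added around partition sync amounts to a pair of threshold checks. This is a minimal sketch with illustrative names and exception type; the article's actual change lives inside HiveSyncTool:

```java
import java.util.List;

// Before registering new Hive partitions, check both the number created in
// this batch and the projected total, and fail fast when either exceeds its
// threshold. This catches a bad partition key (e.g. a raw timestamp) before
// it floods the metastore with partitions.
class PartitionGuard {
    final int maxPerBatch;
    final int maxTotal;

    PartitionGuard(int maxPerBatch, int maxTotal) {
        this.maxPerBatch = maxPerBatch;
        this.maxTotal = maxTotal;
    }

    // Returns normally when the sync may proceed; throws to abort the job.
    void check(int existingPartitions, List<String> partitionsToCreate) {
        if (partitionsToCreate.size() > maxPerBatch) {
            throw new IllegalStateException(
                "batch would create " + partitionsToCreate.size() + " partitions");
        }
        if (existingPartitions + partitionsToCreate.size() > maxTotal) {
            throw new IllegalStateException(
                "table would exceed " + maxTotal + " partitions");
        }
    }
}
```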

Solution 5 – Streaming Backlog Monitoring: Fixed a consumer-group issue by generating a stable group ID per job and manually committing offsets in the onQueryProgress callback of a custom StreamingQueryListener, enabling standard Kafka/TurboMQ lag alerts.
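The stable-group-ID half of this fix can be sketched as deriving the group name deterministically from the job's own identity, instead of letting the engine mint a random one per run (Spark's Kafka source otherwise generates a fresh UUID-based group each time). The naming scheme below is an assumption for illustration, not the article's exact code:

```java
// Deterministic consumer group id: the same job reading the same topic always
// reports lag under the same group name, so off-the-shelf Kafka lag alerting
// can track it across restarts. Prefix and separator are illustrative.
class GroupIdFactory {
    static String groupId(String jobName, String topic) {
        return "hudi-sink-" + jobName + "-" + topic;
    }
}
```

The other half, committing offsets from onQueryProgress under this group so the broker sees the job's actual position, depends on the Kafka client and is omitted here.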

Solution 6 – Business Requirements:

Resolved pre‑combine key collisions by combining the Kafka offset with the event timestamp to create a globally ordered key.

Implemented null‑value filling by extending HoodieRecordPayload to merge non‑null fields from older records into the latest version.
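The composite pre-combine key can be sketched as follows; the zero-padding widths are illustrative, chosen so that plain string comparison matches numeric ordering:

```java
// Sketch of building a globally ordered pre-combine key when the event
// timestamp alone can collide: append the Kafka offset, zero-padded so the
// lexicographic order of the composite key equals (timestamp, offset) order.
class OrderedKey {
    // 13 digits cover millisecond timestamps; 19 digits cover any long offset.
    static String preCombineKey(long eventTimeMillis, long kafkaOffset) {
        return String.format("%013d%019d", eventTimeMillis, kafkaOffset);
    }
}
```

Two records produced in the same millisecond are now distinguished by their Kafka offsets, so the pre-combine step always has a strict winner.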

Hudi Service Productization: Built a three-step UI-driven workflow – Source → SQL → Sink – allowing users to configure data sources (Kafka/TurboMQ, JSON/CSV), perform Spark-SQL transformations, and write to Hudi with selectable insert/upsert modes, row key, pre-combine key, and partition fields. The sink automatically syncs to Hive, creating a queryable table.

Future Plans:

Enable Presto queries on Hudi tables.

Integrate metadata management to auto‑populate source schemas.

Add an online SQL testing console.

Support bulk data import into Hudi.

Develop Hudi on Flink for unified batch‑stream processing.

Tags: big data, streaming, Kafka, Hive, data lake, Spark, Hudi, upsert