
Solving Typical Issues in Migrating to Spark 3.1: Multiple Catalog, Hive‑SQL to Spark‑SQL Migration, and Performance & Stability Optimizations at Xiaomi

This article shares Xiaomi's experience building a next‑generation one‑stop data development platform on Spark 3.1, covering typical challenges such as Multiple Catalog implementation, Hive‑SQL to Spark‑SQL migration, offline Spark performance and stability optimizations, and future roadmap plans.

DataFunSummit

Overview – Apache Spark is a widely used offline big‑data engine. Xiaomi built a new one‑stop data development platform on Spark 3.1 and encountered several issues during job migration, performance tuning, and stability improvement. The following sections describe typical problems and the solutions adopted by the Xiaomi Spark team.

Agenda

Multiple Catalog implementation and usage

Hive SQL migration to Spark SQL

Offline Spark stability and performance optimization

Future roadmap

Q&A

1. Multiple Catalog Implementation and Application

Before 2021, Xiaomi operated several data‑development platforms that only supported Hive data sources. Other sources (MySQL, Kudu, etc.) had to be registered as Hive external tables, leading to fragmented data management and outdated engines.

To address this, the new data platform adopted Spark 3.1 (which offers ~2× performance over Spark 2.4) and introduced a unified metadata service based on Metacat. Metacat provides real‑time metadata without secondary storage, solving table‑name mismatches and schema‑drift problems.

Using Spark 3.1’s CatalogPlugin, Xiaomi built multiple catalogs that allow a single SQL statement to query Doris, Hive, Iceberg, and other sources simultaneously, enabling federated analysis.
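As a sketch of how such catalogs are wired up (catalog names and connection details here are illustrative, not Xiaomi's actual configuration), Spark 3 CatalogPlugin implementations are registered through `spark.sql.catalog.*` properties:

```properties
# Register an Iceberg catalog backed by the Hive Metastore
spark.sql.catalog.iceberg_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_cat.type=hive

# Register a JDBC catalog for a MySQL source (available since Spark 3.1)
spark.sql.catalog.mysql_cat=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog
spark.sql.catalog.mysql_cat.url=jdbc:mysql://mysql-host:3306/db
```

A single query can then reference multiple catalogs by prefix, e.g. `SELECT ... FROM iceberg_cat.db.orders o JOIN mysql_cat.db.users u ON o.user_id = u.id`, which is what makes the federated analysis described above possible.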

2. Hive SQL Migration to Spark SQL

During the platform rollout, Xiaomi encouraged users to migrate legacy Hive and Spark 2 jobs to Spark 3.1. After a year, Spark 3 SQL accounted for 51% of jobs, while Hive SQL and Spark 2 SQL still held a large share.

The migration workflow consists of four steps: syntax checking, data‑consistency verification, automated bulk upgrade of SQL engine version, and post‑upgrade monitoring.

Syntax checking uses Kyuubi’s *Plan‑Only* mode, which parses and analyzes SQL without execution. The resulting LogicalPlan reveals syntax differences between engines.
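In Kyuubi this behavior is controlled by a session configuration; a minimal sketch (the exact value to use depends on the Kyuubi version deployed):

```properties
# Parse and analyze the SQL to produce a LogicalPlan, but do not execute it
kyuubi.operation.plan.only.mode=analyze
```

Statements that fail analysis under the target engine surface as errors at this stage, before any data is touched.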

Data‑consistency verification runs the original and upgraded queries in parallel, comparing row counts and column hash values without persisting intermediate results.
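The comparison step can be sketched as follows. This is a toy model, not Xiaomi's actual implementation: it computes a row count plus one order-insensitive hash per column, so two result sets can be compared without persisting or sorting either one. The function names and the `skip` parameter (used for non-deterministic columns such as timestamps) are hypothetical.

```python
import hashlib

def column_fingerprints(rows, columns):
    """Row count plus an order-insensitive fingerprint per column.

    `rows` is an iterable of dicts. Per-value MD5 hashes are XOR-combined,
    so the result does not depend on row order.
    """
    count = 0
    acc = {c: 0 for c in columns}
    for row in rows:
        count += 1
        for c in columns:
            h = hashlib.md5(repr(row[c]).encode()).digest()
            acc[c] ^= int.from_bytes(h, "big")
    return count, acc

def consistent(rows_a, rows_b, columns, skip=()):
    """Compare two result sets, ignoring columns listed in `skip`."""
    cols = [c for c in columns if c not in skip]
    return column_fingerprints(rows_a, cols) == column_fingerprints(rows_b, cols)
```

In practice the hashing would be pushed into the engines themselves (e.g. as aggregate queries), but the comparison logic is the same.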

Some statements must still be executed even in Plan‑Only mode, because subsequent statements in the same session depend on their side effects:

USE db

CREATE TEMPORARY VIEW ...

CREATE TEMPORARY FUNCTION ...

For columns derived from non‑deterministic functions such as CURRENT_TIMESTAMP(), the verification step skips those columns to avoid false mismatches.

3. Offline Spark Stability and Performance Optimization

3.1 Performance Optimizations

The team focused on data skipping, leveraging Iceberg’s Parquet format with predicate push‑down, column pruning, and row‑group pruning to reduce I/O during the Scan phase.

When partition predicates involve functions, the optimizer extracts the partition column and pushes it down to the source, dramatically improving query speed. Row‑group level skipping sometimes reads entire row groups even if only a few rows match; the team reordered predicates so push‑down‑able filters run first, achieving up to 93% performance gain on heavily filtered queries.

Further gains were obtained by moving from row‑group to page‑level min‑max indexes, which can prune data at the page granularity when the partition column is ordered.
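The idea behind min‑max pruning can be sketched in a few lines (a toy model, not Spark's or Parquet's actual implementation): each page keeps the min and max of its values, and a page is skipped whenever its range cannot overlap the predicate's range.

```python
def prune_pages(pages, lo, hi):
    """Keep only pages whose [min, max] range overlaps the predicate range [lo, hi].

    Each page is a dict with 'min' and 'max' statistics; a page whose range
    falls entirely outside [lo, hi] can be skipped without reading its rows.
    """
    return [p for p in pages if p["max"] >= lo and p["min"] <= hi]
```

When the column is ordered, adjacent pages carry tight, non‑overlapping ranges, which is why page‑level statistics prune so effectively in that case.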

Join optimizations included:

Prefer BroadcastHashJoin for small tables to avoid shuffle.

Replace size‑based broadcast decisions with Adaptive Query Execution (AQE)‑based shuffle size metrics, preventing driver OOM caused by inaccurate size estimates.

For tables slightly larger than the broadcast threshold, switch from SortMergeJoin to ShuffledHashJoin to eliminate costly sort phases, yielding an average 14% query speedup (up to 67% on ordered data).
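A hedged sketch of the Spark settings involved in these join decisions (values are illustrative; the right thresholds depend on cluster sizing):

```properties
# Let AQE re-plan joins from actual shuffle statistics instead of estimates
spark.sql.adaptive.enabled=true
# Allow the planner to pick ShuffledHashJoin over SortMergeJoin
spark.sql.join.preferSortMergeJoin=false
# Size threshold below which a table is broadcast
spark.sql.autoBroadcastJoinThreshold=32m
```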

3.2 Stability Optimizations

During Iceberg writes, excessive small files caused executor OOM because each task created many writers. The solution added a repartition step before the insert to co‑locate data of the same partition, combined with Spark AQE’s dynamic partition merging to reduce shuffle partitions.
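One way to express this co‑location in Iceberg itself (table name hypothetical) is the table's write distribution mode, which hash‑distributes rows by partition before the write so that each task produces few files:

```sql
ALTER TABLE spark_catalog.db.events
SET TBLPROPERTIES ('write.distribution-mode' = 'hash');
```

Combined with AQE partition coalescing, this keeps both the writer count per task and the total file count under control.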

Configuration isolation was introduced: the final stage received a dedicated partition‑merge threshold, leaving earlier stages’ parallelism untouched.

Ranger‑based unified permission management initially cached all permission records on the driver, leading to a 68% throughput drop and OOM risk when records grew to ~100k. The team migrated to a centralized Ranger service, restoring driver throughput to 95% and greatly reducing OOM risk.

4. Future Roadmap

Upcoming work includes deploying a vectorized execution engine based on Spark 3.3 + Gluten 0.5 + Velox. Internal benchmarks show 69% performance improvement on TPC‑DS recompute workloads, 44% on heavy‑I/O workloads, and an overall 43% boost across all TPC‑DS cases.

Additional plans aim to enhance user experience by automatically optimizing mis‑configured jobs.

5. Q&A

Q1: Does federated query always use Spark SQL syntax?

A1: Federated queries choose the most suitable engine at runtime (Presto for small data, Spark for large data). Syntax validation is unified under Spark SQL.

Q2: Does Xiaomi use both Iceberg and Hudi?

A2: Only Iceberg is adopted because of its broader compatibility (Flink, Spark) and better support for real‑time scenarios.

Q3: What compatibility issues arise when migrating Hive ETL to Spark?

A3: Floating‑point differences between Hive and Spark were observed; the team aligned behavior and performed syntax and data‑consistency checks before migration.

Q4: Is the repartition function used in Iceberg custom‑developed?

A4: It is provided by the Iceberg community.

Q5: Does adding bucket to repartition bring much benefit?

A5: Iceberg uses implicit partitioning; bucket maps values to specific buckets, helping to co‑locate data for efficient writes.

Q6: Any recommended configuration for testing Gluten?

A6: Tests are performed with the community‑provided configuration.

Thank you for your attention.

Tags: Performance Optimization, Big Data, Data Platform, Apache Spark, SQL Migration
Written by DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
