Big Data 11 min read

How Uber Upgraded Over 2 Million Spark Jobs from 2.4 to 3.3

Uber migrated more than two million daily Spark applications from version 2.4 to 3.3, detailing the motivations, architecture, four-step migration process, custom tools like Polyglot Piranha and Iron Dome, and the resulting performance, cost, and productivity gains.

Past Memory Big Data
Past Memory Big Data
Past Memory Big Data
How Uber Upgraded Over 2 Million Spark Jobs from 2.4 to 3.3

Introduction

Uber runs over 2 million Spark applications each day, making it one of the largest Spark deployments worldwide. All workloads originally used Spark 2.4. This article explains how Uber migrated to Spark 3.3 and the automation built to support the move.

Why Upgrade to Spark 3.3

Leverage Spark 3.3 optimizations such as Adaptive Query Execution and Dynamic Partition Pruning to improve efficiency and cut costs.

Address known CVEs and improve security.

Benefit from Koalas™ (pandas on PySpark) to boost developer productivity.

Adopt additional optimizations like Apache Gluten™ and Velox.

Stay aligned with the latest open‑source contributions.

Architecture Overview

Spark jobs at Uber are written in Java®, Scala, or Python and are scheduled on either Apache Hadoop® YARN or Kubernetes®. Jobs run either as part of scheduled Piper pipelines or via ad‑hoc CLI/script execution. Code resides in Uber’s monorepo or micro‑repos.

Submission is handled by Drogon, an Apache Livy™ proxy that integrates with Jupyter® notebooks, PySpark, Java applications, and Spark Shell. Jobs are submitted to YARN or Kubernetes based on user preference. Uber also uses an internal shuffle manager, Zeus, with fallback to an external shuffle service (ESS).

Four‑Step Migration Process

Prepare Binaries – Synchronize Uber’s custom Spark branch with the upstream open‑source version and incorporate internal extensions such as task‑level event listeners, column‑level data lineage, and the Zeus shuffle manager.

Achieve Ecosystem Compatibility – Resolve dependency gaps in the Spark ecosystem (e.g., libraries, connectors) before moving to the new version.

Adapt Source Code – Use Uber’s open‑source tool Polyglot Piranha to parse source code into an abstract syntax tree (AST), apply structured search rules, and automatically rewrite patterns that need changes for Spark 3.

Data Validation & Shadow Testing – Run extensive validation and shadow tests to compare behavior between Spark 2.4 and 3.3.

Polyglot Piranha Workflow

Parse source code into an AST and locate target patterns with structured search rules.

Apply a predefined set of transformation rules once patterns are detected.

Define additional rules to adapt code to Spark 3 semantics.

Example: the tool automatically adds legacy configuration required by Spark 3 during SparkConf initialization.

Validation Challenges

More than 40 000 Spark applications make decentralized validation infeasible.

No pre‑production environment; running “dry‑run” jobs in production could affect live data.

Lack of existing test cases and baseline data for validation.

Iron Dome Framework

Because open‑source tools could not meet Uber’s validation needs, the team built Iron Dome , which provides a safe sandbox for shadow testing.

Secure Rewrite – Intercept Spark’s Catalog API and Hadoop’s File Output Committer to rewrite production paths (e.g., /db/tbl/stgdb/tbl) during testing.

Protection Mechanism – Add safeguards at the Hadoop FileSystem layer to prevent accidental writes to production locations.

Verification Infrastructure – Capture tables and paths accessed by a job and emit telemetry to a message queue for comparison with production results.

Task Orchestration – Use Uber’s Cadence workflow engine to automate shadow testing, data verification, and migration tagging for all jobs.

Results and Impact

85% migration rate – Over 20 000 tasks were migrated to Spark 3 within six months using Iron Dome and automated orchestration.

Performance gains – More than 60% of tasks saw efficiency improvements exceeding 10%, saving millions of dollars in compute costs.

Productivity boost – Automation eliminated manual migration effort, saving thousands of engineer hours.

The migration also opened the path for future upgrades, such as support for Kubernetes, JDK 17, and the upcoming Spark 4 release. Uber has open‑sourced several patches and plans to apply the same automation framework to other infrastructure upgrades.

Conclusion

Uber successfully upgraded 100% of its Spark applications to version 3.3, cutting overall runtime and resource consumption by 50%. The custom tools and processes developed for this effort—Polyglot Piranha, Iron Dome, and Cadence‑driven orchestration—provide a reusable blueprint for large‑scale Spark version upgrades.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Data EngineeringkubernetesApache SparkIron DomePolyglot PiranhaSpark 3.3
Past Memory Big Data
Written by

Past Memory Big Data

A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.