Apache Spark at iQIYI: Current Status and Optimization
iQIYI now relies on Apache Spark as its main offline engine, processing more than 200,000 tasks per day for ETL, data synchronization, and analytics. Recent optimizations (dynamic resource allocation, adaptive query execution, compression, rebalance, Z-order clustering, and resource governance) have cut compute usage by roughly 27%, reduced storage by up to 76%, and improved query speed, completing a large-scale migration from Hive and paving the way for Spark 3.4 and Iceberg support.
Apache Spark is the primary offline computing framework used in iQIYI's big‑data platform, supporting data processing, data synchronization and ad‑hoc data analysis.
Data processing: developers submit Spark-Jar or Spark-SQL jobs for ETL.
Data synchronization: the internally developed BabelX tool, built on Spark, enables configurable, fully managed data exchange among 15+ sources (Hive, MySQL, MongoDB, etc.) across multiple clusters and clouds.
Data analysis: analysts use the Magic Mirror platform to submit SQL or metric queries, which are routed through the Pilot gateway to Spark-SQL services.
Currently, more than 200,000 Spark tasks run daily, consuming over half of the company's big-data compute resources. During a recent platform upgrade, the Spark service went through version upgrades, service optimization, migration of tasks to SQL, and resource-cost governance, yielding significant efficiency gains and resource savings.
Key framework optimizations include:
Dynamic Resource Allocation (DRA), rolled out with the Spark 2.4.3 deployment, which scales executors up and down with the workload and reduced task resource consumption by ~20%.
Adaptive Query Execution (AQE) enabled from Spark 3.0, automatically choosing join strategies and merging small partitions, improving overall performance by ~10%.
Dynamic Partition Pruning (DPP) with a rule to disable it when a query contains more than five sub‑queries, avoiding severe SQL‑parsing slowdown.
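The three framework-level features above are all driven by configuration. The fragment below is an illustrative spark-defaults.conf sketch of the relevant built-in Spark settings; the specific values (executor counts, etc.) are assumptions for demonstration, not iQIYI's production values:

```properties
# Dynamic Resource Allocation: scale executors with the workload
spark.dynamicAllocation.enabled            true
spark.dynamicAllocation.minExecutors       1
spark.dynamicAllocation.maxExecutors       200
spark.shuffle.service.enabled              true

# Adaptive Query Execution: runtime join selection and small-partition coalescing
spark.sql.adaptive.enabled                 true
spark.sql.adaptive.coalescePartitions.enabled  true

# Dynamic Partition Pruning (disabled selectively for deeply nested queries)
spark.sql.optimizer.dynamicPartitionPruning.enabled  true
```

The DPP escape hatch described above would flip the last key to false for queries with more than five sub-queries, trading pruning for predictable planning time.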
Additional enhancements:
Concurrent write support via a forceUseStagingDir parameter to avoid data loss when multiple tasks write to different partitions of the same table.
Support for querying sub‑directories by enabling recursiveFileLookup for partitioned tables.
JDBC source improvements: push‑down of shard conditions, multiple write modes (Normal, Upsert, Ignore), silent mode, Map‑type support, and local‑disk write‑size monitoring.
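The concurrent-write idea behind the forceUseStagingDir parameter can be sketched in plain Python: each job writes into a job-private staging directory and only moves finished files into the target partition, so two jobs writing different partitions of the same table never clobber each other's in-progress output. This is an illustration of the mechanism, not iQIYI's actual implementation:

```python
import os
import shutil
import uuid


def write_partition(table_dir: str, partition: str, rows: list) -> None:
    """Write rows to <table_dir>/<partition> via a job-private staging dir.

    Sketch of the staging-directory technique: stage output under a
    unique hidden directory, then move the completed file into the
    target partition so concurrent writers to *different* partitions
    of the same table cannot interfere with each other.
    """
    staging = os.path.join(table_dir, ".staging-" + uuid.uuid4().hex)
    os.makedirs(staging)
    try:
        # Write the output file inside the private staging directory.
        part_file = os.path.join(staging, "part-00000")
        with open(part_file, "w") as f:
            f.write("\n".join(rows))
        # Commit: move the finished file into the target partition.
        dest = os.path.join(table_dir, partition)
        os.makedirs(dest, exist_ok=True)
        shutil.move(part_file, os.path.join(dest, "part-00000"))
    finally:
        # Clean up the staging directory whether or not the commit ran.
        shutil.rmtree(staging, ignore_errors=True)
```

Because each writer commits from its own staging area, a failed job leaves no partial files in the table, which is the data-loss scenario the parameter guards against.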
Resource governance is performed through:
Auditing task metrics via Prometheus and Spark EventLog to identify memory waste and low CPU utilization.
Dynamic allocation and parameter tuning (memory, CPU) that saved ~27% of compute resources; the governance campaign generated over 1,600 tuning tickets to task owners.
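A governance audit of this kind boils down to comparing requested resources against observed usage. The sketch below shows the shape of such a check; the field names and thresholds are assumptions for illustration, not the actual Prometheus/EventLog schema or iQIYI's cutoffs:

```python
def audit_tasks(tasks):
    """Flag Spark applications that over-request resources.

    Each task dict carries (hypothetical) metrics pulled from
    Prometheus / the Spark EventLog. A task is flagged for memory
    waste when peak usage is under half of what it requested, and
    for low CPU utilization when executors were busy < 30% of the
    time they were alive.
    """
    findings = []
    for t in tasks:
        mem_ratio = t["peak_memory_mb"] / t["requested_memory_mb"]
        cpu_util = t["executor_cpu_time_s"] / t["executor_uptime_s"]
        issues = []
        if mem_ratio < 0.5:
            issues.append("memory waste: peak {:.0%} of requested".format(mem_ratio))
        if cpu_util < 0.3:
            issues.append("low CPU utilization: {:.0%}".format(cpu_util))
        if issues:
            findings.append((t["app_id"], issues))
    return findings
```

Each finding would then become a tuning ticket suggesting smaller executors or fewer cores for the flagged application.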
The Spark SQL service has evolved from the native Thrift Server to Kyuubi 0.7 and now Apache Kyuubi 1.4, becoming iQIYI’s main offline engine with ~150 000 SQL tasks per day.
SQL‑service optimizations include:
Switching default compression to Zstandard, achieving a 3.3× compression ratio and 76% storage savings in advertising data.
Inserting a Rebalance stage before writes to control output file size, increasing average file size from 10 MB to 262 MB and eliminating small‑file proliferation.
Enabling Repartition sort inference to keep data distribution consistent, improving compression after Rebalance.
Applying Z‑order clustering, reducing storage by 13% and speeding up queries by 15%.
Final-stage AQE configuration that keeps spark.sql.adaptive.advisoryPartitionSizeInBytes small for earlier stages to preserve parallelism while raising it for the final write stage to produce larger output files, cutting overall execution time by 25% and saving ~9% of resources.
Detecting dynamic‑write single‑partition jobs to avoid creating an oversized shuffle partition.
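The Z-order clustering mentioned above works by interleaving the bits of several column values into a single sort key, so that rows close in any clustered column end up close on disk, which improves both compression and data skipping. A minimal pure-Python sketch of the two-column case (not Spark's actual implementation):

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values into one Z-order key.

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1,
    so the sort order alternates between the two dimensions.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sorting rows by their Z-value clusters rows that are near each
# other in either column, which is what improves compression ratios
# and min/max-based file skipping.
rows = [(3, 7), (0, 0), (5, 2), (1, 6)]
rows.sort(key=lambda r: z_value(*r))
```

In Spark this interleaved key replaces a plain lexicographic sort key during the rebalance/sort before the write.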
Reliability features added:
Large‑query interception based on scanned partitions and data volume.
Data‑inflation monitoring using Spark UI metrics and custom listeners.
Join-skew detection via a TopNAccumulator in SortMergeJoinExec, exposing the most frequent join keys.
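The TopNAccumulator idea is simple to illustrate outside Spark: count join keys as rows stream through the operator, merge per-partition counts, and report the hottest keys. The class below is a plain-Python sketch of that behavior, not the actual accumulator wired into SortMergeJoinExec:

```python
from collections import Counter


class TopNAccumulator:
    """Sketch of a top-N join-key accumulator for skew detection.

    add() is called once per joined row with the join key; merge()
    combines per-partition accumulators (as Spark does on the driver);
    value() reports the N most frequent keys, i.e. the skew suspects.
    """

    def __init__(self, n: int = 5):
        self.n = n
        self.counts = Counter()

    def add(self, key) -> None:
        self.counts[key] += 1

    def merge(self, other: "TopNAccumulator") -> None:
        self.counts.update(other.counts)

    def value(self):
        return self.counts.most_common(self.n)
```

A real implementation would bound memory by capping the number of tracked keys per partition; the unbounded Counter here keeps the sketch short.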
Hive‑to‑Spark migration compatibility work covered:
UDF handling (e.g., reflect returns NULL on error, matching Hive behavior).
Support for Hive UDAF private constructors.
Alignment of built‑in functions such as GROUPING_ID and hash algorithms.
Type‑conversion strictness adjustments (ANSI SQL compliance, legacy mode, map‑key dedup policy).
Hint and DDL syntax compatibility, with custom plugin extensions for partition operations.
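The reflect compatibility fix above follows Hive's error contract: a reflection failure yields NULL rather than failing the query. A simplified Python analogue of that contract (taking a class object instead of Hive's class-name string, purely for illustration):

```python
def hive_reflect(cls, method_name, *args):
    """Mimic Hive's reflect() error behavior: any failure to resolve
    or invoke the method returns NULL (None) instead of raising, so a
    single bad row cannot fail the whole task."""
    try:
        return getattr(cls, method_name)(*args)
    except Exception:
        return None
```

Matching this per-row NULL-on-error semantic in Spark's reflect was one of the behaviors required for the Hive migration to be transparent to users.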
In summary, iQIYI has migrated the vast majority of Hive workloads to Spark, making Spark the primary offline processing engine. Ongoing work includes upgrading to Spark 3.4 to support Iceberg data‑lake features, further performance research on Uniffle remote shuffle integration, and continued optimization of the Spark compute framework.
iQIYI Technical Product Team