Overview of SQL Performance Improvements in Apache Spark 3.0
Apache Spark 3.0 introduces extensive SQL performance enhancements, including a new explain format, expanded join hints, adaptive query execution, dynamic partition pruning, enhanced nested column pruning, improved aggregation code generation, and support for newer Scala and Java versions, all aimed at optimizing query planning and execution.
Alibaba senior technical expert Li Chengxiang presents an overview of SQL performance improvements in Apache Spark 3.0. The content is compiled from the Spark+AI Summit Chinese highlights.
Original video link: https://developer.aliyun.com/live/43188
Today the speaker shares the SQL‑related optimizations in Spark 3.0. Since Spark 2.4, more than a year and a half of work has been done, covering many feature enhancements and performance tweaks; over 50% of the related issues are SQL‑centric. The work is grouped into four categories: developer tools, dynamic optimization, Catalyst optimizer enhancements, and core dependency updates.
Spark 3.0 is a major release that was long in the making and incorporates a massive amount of community work: about 3,400 issues were resolved. Users therefore need to understand which of the new features and improvements matter for their production workloads.
In total the SQL improvements can be divided into seven parts, which belong to the four categories mentioned earlier.
1. New Explain Format – To tune Spark SQL performance, a readable query plan is essential. Spark 2.4’s EXPLAIN output is verbose and hard to read; Spark 3.0 introduces a concise format that groups node information into input, output, condition, etc., and assigns numeric IDs for easy lookup.
In Spark 2.4 the plan is displayed as a complex tree, which is less readable.
Spark 3.0’s new format presents the plan in a compact way, with node numbers linking to detailed information and categorized fields.
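As a sketch of how the two formats are requested (assuming a Spark 3.0+ runtime with an active SparkSession `spark`, and hypothetical tables `sales` and `dates`):

```scala
// Requires a Spark 3.0+ runtime; `sales` and `dates` are hypothetical tables.
val df = spark.sql(
  """SELECT d.year, SUM(s.amount)
    |FROM sales s JOIN dates d ON s.date_id = d.id
    |GROUP BY d.year""".stripMargin)

// Spark 2.4 style: one verbose tree with all node details inlined.
df.explain()

// Spark 3.0: concise numbered tree plus a per-node detail section.
// Equivalent to running `EXPLAIN FORMATTED <query>` in SQL.
df.explain("formatted")
```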
2. All Types of Join Hints – Spark 2.4 only supported the broadcast hint; Spark 3.0 adds hints for the remaining strategies: SHUFFLE_MERGE (sort‑merge join), SHUFFLE_HASH (shuffle hash join), and SHUFFLE_REPLICATE_NL (cartesian product join), so each join strategy can now be requested explicitly.
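In SQL syntax the full set of hints looks like the following sketch (Spark 3.0+, with hypothetical tables `t1` and `t2`); each hint names the join strategy the optimizer should prefer for the hinted relation:

```scala
// Broadcast hint, already available in Spark 2.4:
spark.sql("SELECT /*+ BROADCAST(t1) */ * FROM t1 JOIN t2 ON t1.k = t2.k")

// New in Spark 3.0:
spark.sql("SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 JOIN t2 ON t1.k = t2.k")        // sort-merge join
spark.sql("SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 JOIN t2 ON t1.k = t2.k")         // shuffle hash join
spark.sql("SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 JOIN t2 ON t1.k = t2.k") // cartesian product
```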
3. Adaptive Query Execution (AQE) – AQE enables runtime statistics to choose optimal plans, improving performance by adjusting reducer numbers, selecting the best join strategy, and handling data skew automatically.
Dynamic reducer adjustment: In Spark 2.4 partitions are fixed, leading to imbalanced data distribution.
In Spark 3.0, small partitions are coalesced so each reducer processes a similar amount of data.
Data‑skewed joins in Spark 2.4 cause long processing times for the largest partition.
In Spark 3.0 the skewed keys are split and processed by multiple tasks, greatly speeding up the join.
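The three AQE behaviors above map onto a small set of configuration keys; a minimal sketch of enabling them in Spark 3.0 (where AQE is off by default):

```scala
// AQE master switch plus the two sub-features discussed above.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") // merge small reducer partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")           // split skewed partitions across tasks
```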
4. Dynamic Partition Pruning – This feature avoids reading unnecessary partitions during joins by pushing down dynamic filters.
In Spark 2.4 the entire large table is read.
In Spark 3.0, push‑down with dynamic filters reduces the amount of data read from the large table.
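A typical case is a star‑schema join; in this sketch `fact` is a hypothetical large table partitioned by `date_id` and `dim` is a small dimension table:

```scala
// Dynamic partition pruning is enabled by default in Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

spark.sql(
  """SELECT f.amount
    |FROM fact f JOIN dim d ON f.date_id = d.id
    |WHERE d.year = 2020""".stripMargin)
// At runtime the `d.year = 2020` filter is first evaluated against `dim`,
// and the surviving `id` values are pushed into the scan of `fact`, so only
// the matching `date_id` partitions of the large table are read.
```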
5. Enhanced Nested Column Pruning & Push‑down – Spark 2.4 offered limited support for pruning nested columns and pushing down filters. Spark 3.0 extends this capability to all operators, allowing column pruning and filter push‑down through the entire query plan.
Further optimization in Spark 3.0 enables column pruning to penetrate every operator.
Filter conditions on nested fields could not be fully pushed down in Spark 2.4.
In Spark 3.0, such filters are pushed down to the table scan level.
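A sketch of the nested case, assuming a hypothetical Parquet table `events` with a column `user: struct<id: long, address: struct<city: string>>`:

```scala
import org.apache.spark.sql.functions.col

val q = spark.table("events")
  .filter(col("user.address.city") === "Hangzhou") // filter on a nested field
  .select(col("user.id"))                          // only one leaf field is needed
// In Spark 3.0 the scan reads only user.id and user.address.city from the
// file and pushes the city filter down to the scan; in Spark 2.4 the whole
// `user` struct could be read and the filter evaluated afterwards.
```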
6. Improved Aggregation Code Generation – Spark’s whole‑stage code generation for aggregation is constrained by HotSpot’s 8,000‑byte limit on method bytecode; a complex SQL query can generate a method that exceeds this limit and loses JIT compilation. Spark 3.0 splits large generated methods into smaller ones to stay under the limit.
When a generated method exceeds 8,000 bytes of bytecode, the HotSpot JIT refuses to compile it, and the method falls back to much slower interpreted execution.
Spark 3.0 mitigates this by breaking the method into multiple smaller methods.
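The idea can be illustrated outside of Spark with a rough hand‑written analogy (not Spark's actual generated code): instead of one oversized method that evaluates every aggregate expression inline, the work is delegated to several small per‑expression methods, each of which stays under the JIT's size limit and remains compilable:

```scala
// Rough analogy of the Spark 3.0 method split: each partial aggregate
// computation lives in its own small method instead of one huge body.
object AggSplitSketch {
  // Before: a single method computing every aggregate expression inline.
  // After: one small helper per aggregate expression.
  private def doAggVal1(row: Array[Long]): Long = row(0) + row(1)      // e.g. a SUM
  private def doAggVal2(row: Array[Long]): Long = math.max(row(0), row(1)) // e.g. a MAX

  def consume(row: Array[Long]): (Long, Long) =
    (doAggVal1(row), doAggVal2(row))
}
```

Each helper compiles independently, so no single method approaches the 8,000‑byte limit even when the query has many aggregate expressions.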
7. New Scala and Java Support – Spark 3.0 adds support for Java 11 and Scala 2.12.