
Understanding Adaptive Query Execution and Dynamic Partition Pruning in Apache Spark 3.0

This article explains how Apache Spark 3.0 improves SQL workload performance through Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP), detailing their design principles, runtime optimizations, configuration parameters, and practical examples that demonstrate reduced shuffle partitions, smarter join strategies, and handling of data skew.


Apache Spark has become the de facto distributed computing framework for complex data processing, and Spark 3.0 introduces two major performance features aimed at optimizing Spark SQL workloads: Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP).

Adaptive Query Execution

Earlier versions of Spark SQL used a fixed number of shuffle partitions (default 200), which often led to sub-optimal performance and many small output files. Users could set spark.sql.shuffle.partitions manually, but this approach is cumbersome, error-prone, and applies globally to every shuffle in the application.

AQE solves these problems by collecting precise runtime statistics and adjusting the query plan at the granularity of Query Stages. A Query Stage is defined by a shuffle or broadcast exchange; once a stage finishes, all partition statistics become available, providing an ideal point for runtime optimization.

Built on top of Catalyst, AQE can modify the physical plan during execution. When enabled, it automatically coalesces shuffle partitions based on actual data size, removing the need for the static 200‑partition default.
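For reference, AQE and its partition coalescing are controlled by a handful of configuration keys (names and defaults as of Spark 3.0; the advisory size shown here is the documented default, adjust it to your workload):

```properties
# Enable Adaptive Query Execution (off by default in Spark 3.0)
spark.sql.adaptive.enabled=true
# Let AQE merge small post-shuffle partitions after each query stage
spark.sql.adaptive.coalescePartitions.enabled=true
# Advisory target size for a coalesced shuffle partition
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB
```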

For example, running SELECT max(i) FROM tbl GROUP BY j with an initial shuffle partition count of 5 would normally launch five tasks. AQE detects three tiny partitions and merges them, reducing the final aggregation to three tasks instead of five.
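The merging step above can be illustrated with a simplified pure-Python sketch (this is not Spark's actual implementation, which operates on map-output statistics, but it captures the greedy "merge adjacent partitions until a target size is reached" idea):

```python
def coalesce_partitions(sizes, target):
    """Greedily group adjacent shuffle partitions until each group
    reaches roughly `target`, mimicking AQE's coalescing pass."""
    groups, current, acc = [], [], 0
    for i, size in enumerate(sizes):
        current.append(i)
        acc += size
        if acc >= target:
            groups.append(current)
            current, acc = [], 0
    if current:  # flush any trailing small partitions
        groups.append(current)
    return groups

# Five post-shuffle partitions from the GROUP BY example:
# two reasonably sized, three tiny (sizes in MB).
sizes = [70, 30, 1, 2, 1]
print(coalesce_partitions(sizes, target=30))
# [[0], [1], [2, 3, 4]] -- the three tiny partitions are merged,
# so the final aggregation runs 3 tasks instead of 5.
```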

AQE also dynamically converts Sort‑Merge Joins to Broadcast Joins when runtime statistics show that one side of the join fits in memory, improving join performance and reducing network traffic.
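The decision rule behind that conversion can be sketched as follows (a simplification: the 10 MB default mirrors spark.sql.autoBroadcastJoinThreshold, and the function name is illustrative, not a Spark API):

```python
def choose_join_strategy(left_bytes, right_bytes, broadcast_threshold=10 * 1024**2):
    """Sketch of AQE's runtime join-strategy switch: once a query stage
    finishes, the actual size of each join side is known; if either side
    fits under the broadcast threshold, the planned sort-merge join is
    replaced with a broadcast hash join."""
    if min(left_bytes, right_bytes) <= broadcast_threshold:
        small = "left" if left_bytes <= right_bytes else "right"
        return f"broadcast-hash-join (broadcast {small} side)"
    return "sort-merge-join"

# A 5 GB fact table joined to a dimension table that turned out to be 2 MB:
print(choose_join_strategy(5 * 1024**3, 2 * 1024**2))
# broadcast-hash-join (broadcast right side)
```

The key point is that this decision happens at runtime with real sizes, whereas the static planner must rely on pre-execution estimates that are often far off after filters are applied.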

Furthermore, AQE includes a skew‑join optimizer that detects heavily skewed partitions from shuffle file statistics, splits them into smaller sub‑partitions, and redistributes the work, balancing task execution times.

Key configuration parameters for skew handling are spark.sql.adaptive.skewJoin.enabled, spark.sql.adaptive.skewJoin.skewedPartitionFactor, and spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.
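Putting those two parameters together, the detection rule can be sketched in plain Python (the defaults shown, factor 5 and 256 MB, mirror the Spark 3.0 documentation; the function itself is illustrative, not Spark code):

```python
from statistics import median

def is_skewed(size, all_sizes, factor=5, threshold=256 * 1024**2):
    """A shuffle partition is treated as skewed when it is both
    `factor` times larger than the median partition size and larger
    than an absolute byte threshold."""
    return size > factor * median(all_sizes) and size > threshold

# Shuffle partition sizes in MB: one partition is ~30x the median.
sizes_bytes = [s * 1024**2 for s in [64, 70, 60, 2048, 66]]
print([is_skewed(s, sizes_bytes) for s in sizes_bytes])
# [False, False, False, True, False] -- only the 2 GB partition is
# split into sub-partitions and joined piecewise.
```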

Dynamic Partition Pruning

DPP is the second major optimization in Spark 3.0. When enabled, it prunes unnecessary partitions of fact tables based on filter predicates derived from dimension tables, both at the logical and physical plan levels.

At the logical level, a sub‑query filters the dimension table and applies the resulting predicate before scanning the fact table. At the physical level, the filter is executed on the dimension side and broadcast to the fact side, avoiding scans of irrelevant partitions.
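The effect of that broadcast filter can be sketched with a hypothetical star schema (the table names, the dict-based "partitions", and the helper function are all illustrative, not Spark APIs):

```python
def prune_fact_partitions(fact_partitions, dim_rows, dim_predicate, join_key):
    """Sketch of dynamic partition pruning: evaluate the dimension-side
    filter first, collect the surviving join-key values, and scan only
    the fact-table partitions whose partition value is in that set."""
    surviving_keys = {row[join_key] for row in dim_rows if dim_predicate(row)}
    return {k: v for k, v in fact_partitions.items() if k in surviving_keys}

# Fact table `sales`, partitioned by date_id; small `dates` dimension.
fact = {1: ["sale-a"], 2: ["sale-b"], 3: ["sale-c"], 4: ["sale-d"]}
dates = [
    {"date_id": 1, "year": 2019},
    {"date_id": 2, "year": 2020},
    {"date_id": 3, "year": 2020},
    {"date_id": 4, "year": 2021},
]
# WHERE dates.year = 2020: only fact partitions 2 and 3 are scanned.
print(sorted(prune_fact_partitions(fact, dates, lambda r: r["year"] == 2020, "date_id")))
# [2, 3]
```

Without DPP, the filter on the dimension table would not reach the fact-table scan, and all four partitions would be read before the join discarded the irrelevant rows.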

In TPC‑DS benchmarks, DPP accelerated 60 out of 102 queries by factors ranging from 2× to 18×.

Combined with AQE, GPU acceleration, and Kubernetes orchestration, Spark 3.0 promises substantial performance gains for a growing number of organizations.

Note: This article is translated from "How does Apache Spark 3.0 increase the performance of your SQL workloads" (Cloudera blog).

Tags: Big Data, SQL Optimization, Spark, Adaptive Query Execution, Dynamic Partition Pruning
Written by Big Data Technology Architecture (Exploring Open Source Big Data and AI Technologies).