
Understanding Adaptive Query Execution and Dynamic Partition Pruning in Apache Spark 3.0

This article explains how Apache Spark 3.0 improves SQL workload performance through Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP), detailing their design principles, runtime optimizations, configuration parameters, and practical examples that demonstrate reduced shuffle partitions, smarter join strategies, and handling of data skew.


Apache Spark has become the de facto distributed computing framework for complex data processing, and Spark 3.0 introduces two major performance features aimed at optimizing Spark SQL workloads: Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP).

Adaptive Query Execution

Earlier versions of Spark SQL used a fixed number of shuffle partitions (default 200), which often led to sub-optimal performance and many small output files. Users could set spark.sql.shuffle.partitions manually, but this approach is cumbersome, error-prone, and applies globally to every shuffle in the application.

AQE solves these problems by collecting precise runtime statistics and adjusting the query plan at the granularity of Query Stages. A Query Stage is defined by a shuffle or broadcast exchange; once a stage finishes, all partition statistics become available, providing an ideal point for runtime optimization.

Built on top of Catalyst, AQE can modify the physical plan during execution. When enabled, it automatically coalesces shuffle partitions based on actual data size, removing the need for the static 200‑partition default.
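For reference, AQE and its partition coalescing are controlled by a handful of configuration keys (names and defaults as of Spark 3.0; the advisory size shown here is the documented default, adjust it to your workload):

```properties
# Enable Adaptive Query Execution (off by default in Spark 3.0)
spark.sql.adaptive.enabled=true
# Let AQE merge small post-shuffle partitions after each query stage
spark.sql.adaptive.coalescePartitions.enabled=true
# Advisory target size for a coalesced shuffle partition
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB
```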

For example, running SELECT max(i) FROM tbl GROUP BY j with an initial shuffle partition count of 5 would normally launch five tasks. AQE detects three tiny partitions and merges them, reducing the final aggregation to three tasks instead of five.
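The merging step above can be illustrated with a simplified pure-Python sketch (this is not Spark's actual implementation, which operates on map-output statistics, but it captures the greedy "merge adjacent partitions until a target size is reached" idea):

```python
def coalesce_partitions(sizes, target):
    """Greedily group adjacent shuffle partitions until each group
    reaches roughly `target`, mimicking AQE's coalescing pass."""
    groups, current, acc = [], [], 0
    for i, size in enumerate(sizes):
        current.append(i)
        acc += size
        if acc >= target:
            groups.append(current)
            current, acc = [], 0
    if current:  # flush any trailing small partitions
        groups.append(current)
    return groups

# Five post-shuffle partitions from the GROUP BY example:
# two reasonably sized, three tiny (sizes in MB).
sizes = [70, 30, 1, 2, 1]
print(coalesce_partitions(sizes, target=30))
# [[0], [1], [2, 3, 4]] -- the three tiny partitions are merged,
# so the final aggregation runs 3 tasks instead of 5.
```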

AQE also dynamically converts Sort‑Merge Joins to Broadcast Joins when runtime statistics show that one side of the join fits in memory, improving join performance and reducing network traffic.
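The decision rule behind that conversion can be sketched as follows (a simplification: the 10 MB default mirrors spark.sql.autoBroadcastJoinThreshold, and the function name is illustrative, not a Spark API):

```python
def choose_join_strategy(left_bytes, right_bytes, broadcast_threshold=10 * 1024**2):
    """Sketch of AQE's runtime join-strategy switch: once a query stage
    finishes, the actual size of each join side is known; if either side
    fits under the broadcast threshold, the planned sort-merge join is
    replaced with a broadcast hash join."""
    if min(left_bytes, right_bytes) <= broadcast_threshold:
        small = "left" if left_bytes <= right_bytes else "right"
        return f"broadcast-hash-join (broadcast {small} side)"
    return "sort-merge-join"

# A 5 GB fact table joined to a dimension table that turned out to be 2 MB:
print(choose_join_strategy(5 * 1024**3, 2 * 1024**2))
# broadcast-hash-join (broadcast right side)
```

The key point is that this decision happens at runtime with real sizes, whereas the static planner must rely on pre-execution estimates that are often far off after filters are applied.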

Furthermore, AQE includes a skew‑join optimizer that detects heavily skewed partitions from shuffle file statistics, splits them into smaller sub‑partitions, and redistributes the work, balancing task execution times.

Key configuration parameters for skew handling are spark.sql.adaptive.skewJoin.enabled, spark.sql.adaptive.skewJoin.skewedPartitionFactor, and spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.
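Putting those two parameters together, the detection rule can be sketched in plain Python (the defaults shown, factor 5 and 256 MB, mirror the Spark 3.0 documentation; the function itself is illustrative, not Spark code):

```python
from statistics import median

def is_skewed(size, all_sizes, factor=5, threshold=256 * 1024**2):
    """A shuffle partition is treated as skewed when it is both
    `factor` times larger than the median partition size and larger
    than an absolute byte threshold."""
    return size > factor * median(all_sizes) and size > threshold

# Shuffle partition sizes in MB: one partition is ~30x the median.
sizes_bytes = [s * 1024**2 for s in [64, 70, 60, 2048, 66]]
print([is_skewed(s, sizes_bytes) for s in sizes_bytes])
# [False, False, False, True, False] -- only the 2 GB partition is
# split into sub-partitions and joined piecewise.
```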

Dynamic Partition Pruning

DPP is the second major optimization in Spark 3.0. When enabled, it prunes unnecessary partitions of fact tables based on filter predicates derived from dimension tables, both at the logical and physical plan levels.

At the logical level, a sub‑query filters the dimension table and applies the resulting predicate before scanning the fact table. At the physical level, the filter is executed on the dimension side and broadcast to the fact side, avoiding scans of irrelevant partitions.
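The effect of that broadcast filter can be sketched with a hypothetical star schema (the table names, the dict-based "partitions", and the helper function are all illustrative, not Spark APIs):

```python
def prune_fact_partitions(fact_partitions, dim_rows, dim_predicate, join_key):
    """Sketch of dynamic partition pruning: evaluate the dimension-side
    filter first, collect the surviving join-key values, and scan only
    the fact-table partitions whose partition value is in that set."""
    surviving_keys = {row[join_key] for row in dim_rows if dim_predicate(row)}
    return {k: v for k, v in fact_partitions.items() if k in surviving_keys}

# Fact table `sales`, partitioned by date_id; small `dates` dimension.
fact = {1: ["sale-a"], 2: ["sale-b"], 3: ["sale-c"], 4: ["sale-d"]}
dates = [
    {"date_id": 1, "year": 2019},
    {"date_id": 2, "year": 2020},
    {"date_id": 3, "year": 2020},
    {"date_id": 4, "year": 2021},
]
# WHERE dates.year = 2020: only fact partitions 2 and 3 are scanned.
print(sorted(prune_fact_partitions(fact, dates, lambda r: r["year"] == 2020, "date_id")))
# [2, 3]
```

Without DPP, the filter on the dimension table would not reach the fact-table scan, and all four partitions would be read before the join discarded the irrelevant rows.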

In TPC‑DS benchmarks, DPP accelerated 60 out of 102 queries by factors ranging from 2× to 18×.

Combined with AQE, GPU acceleration, and Kubernetes orchestration, Spark 3.0 promises substantial performance gains for a growing number of organizations.

Note: This article is translated from "How does Apache Spark 3.0 increase the performance of your SQL workloads" (Cloudera blog).

Tags: Big Data, SQL Optimization, Spark, Adaptive Query Execution, Dynamic Partition Pruning
Written by Big Data Technology Architecture (Exploring Open Source Big Data and AI Technologies).