Overview of New Features and Improvements in Apache Spark 3.0
Apache Spark 3.0 introduces a suite of performance enhancements, richer APIs, improved monitoring, better SQL compatibility, new data sources, and ecosystem extensions. Highlights include Adaptive Query Execution, Dynamic Partition Pruning, Join Hints, pandas UDF improvements, and accelerator‑aware scheduling, all aimed at boosting scalability and ease of use for big‑data workloads.
1. Performance
Key performance‑related features added in Spark 3.0 are Adaptive Query Execution (AQE), Dynamic Partition Pruning, query compilation speedup, and Join Hints.
(1) Adaptive Query Execution
AQE addresses limitations of earlier rule‑based and cost‑based optimizers by using runtime statistics to re‑optimize query plans. It can convert Sort‑Merge Joins to Broadcast Hash Joins, reduce the number of reducers based on intermediate data size, and handle skewed joins by splitting large partitions.
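As a minimal sketch, AQE is opt‑in in Spark 3.0 and enabled through session configuration; the keys below are from the Spark 3.0 configuration reference, while the application name is illustrative:

```python
from pyspark.sql import SparkSession

# Build a session with Adaptive Query Execution enabled.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # Master switch for AQE (off by default in Spark 3.0)
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce shuffle partitions based on runtime statistics
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions in sort-merge joins at runtime
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```

With these flags set, the re‑optimizations described above (join strategy switching, reducer coalescing, skew handling) happen automatically at stage boundaries, with no query changes required.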
(2) Dynamic Partition Pruning
Dynamic Partition Pruning uses intermediate query results to avoid reading unnecessary partitions, which is especially effective for star‑schema data warehouses and can accelerate TPC‑DS workloads by 2‑18×.
(3) Join Hints
Join Hints let users influence the join strategy (e.g., Broadcast Hash Join, Sort‑Merge Join, Shuffle Hash Join). While powerful, they must be used cautiously because data characteristics can change, making a previously optimal hint sub‑optimal.
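A brief sketch of the hint syntax, with hint names taken from the Spark 3.0 SQL reference and table names illustrative:

```python
# SQL hint syntax: force a broadcast of the smaller side.
spark.sql("""
    SELECT /*+ BROADCAST(d) */ f.amount, d.year
    FROM   sales f JOIN dates d ON f.date_id = d.date_id
""")

# Equivalent DataFrame API form (sales_df / dates_df are hypothetical):
sales_df.join(dates_df.hint("broadcast"), "date_id")

# Other strategy hints in 3.0: MERGE (sort-merge join), SHUFFLE_HASH,
# and SHUFFLE_REPLICATE_NL.
```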
2. Richer APIs
Spark 3.0 adds several developer‑friendly APIs, including an accelerator‑aware scheduler, a set of built‑in functions, pandas UDF enhancements, and support for DELETE/UPDATE/MERGE in Catalyst.
(1) pandas UDF Enhancements
pandas UDFs now support Python type hints, allowing users to specify input and output types with pandas.Series, simplifying development and improving performance.
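A minimal sketch of the new type‑hint style: the function body is plain pandas, so it can be exercised without a cluster, and wrapping it with `pandas_udf` registers it for use in Spark (the wrapping step is shown as a comment to keep the sketch runnable standalone):

```python
import pandas as pd

# New-style pandas UDF: input and output types are declared with Python
# type hints (pandas.Series -> pandas.Series) rather than the old
# functionType argument.
def multiply(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

# To use it in Spark (requires a running SparkSession):
#   from pyspark.sql.functions import pandas_udf
#   multiply_udf = pandas_udf(multiply, returnType="long")
#   df.select(multiply_udf(df.x, df.y))

# The plain function can be tested directly with pandas:
print(list(multiply(pd.Series([1, 2, 3]), pd.Series([4, 5, 6]))))  # [4, 10, 18]
```

Because the UDF is an ordinary typed Python function, it can be unit‑tested with pandas alone before being deployed to a cluster.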
(2) Accelerator‑Aware Scheduler
The scheduler now understands GPU resources, enabling accelerator‑aware job and stage scheduling and providing a Web UI to monitor GPU usage.
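A configuration sketch: GPU scheduling is driven by resource properties (keys from the Spark 3.0 configuration reference); the discovery‑script path is site‑specific and shown only as an example:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # GPUs available to each executor
    .config("spark.executor.resource.gpu.amount", "2")
    # GPUs required by each task
    .config("spark.task.resource.gpu.amount", "1")
    # Script that reports the GPU addresses on a node (path is an example)
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/scripts/getGpusResources.sh")
    .getOrCreate()
)

# Inside a task, the assigned GPU addresses can be read via the TaskContext:
#   from pyspark import TaskContext
#   gpus = TaskContext.get().resources()["gpu"].addresses
```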
(3) Built‑in Functions
Thirty‑two new built‑in functions (e.g., map_keys, map_values) are added, reducing the need for custom UDFs and improving execution speed.
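As a short sketch of the two functions named above (the literal map is just sample data; running this requires a SparkSession):

```python
# map_keys / map_values extract the keys and values of a map column.
row = spark.sql(
    "SELECT map_keys(map('a', 1, 'b', 2)) AS ks, "
    "       map_values(map('a', 1, 'b', 2)) AS vs"
).first()
# row.ks -> keys   ['a', 'b']
# row.vs -> values [1, 2]
```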
3. Monitoring and Debuggability
New monitoring features include a Structured Streaming UI, DDL/DML enhancements (e.g., EXPLAIN FORMATTED), observable metrics for data quality, and event‑log rollover.
(1) Structured Streaming UI
The UI shows completed and ongoing streaming query metrics such as input rate, processing rate, batch duration, and operation duration.
(2) DDL/DML Enhancements
EXPLAIN now supports a FORMATTED mode that provides a concise tree view followed by detailed explanations for each plan component.
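A usage sketch (the query itself is illustrative):

```python
# EXPLAIN FORMATTED, a new explain mode in Spark 3.0.
spark.sql("""
    EXPLAIN FORMATTED
    SELECT date_id, sum(amount) FROM sales GROUP BY date_id
""").show(truncate=False)
# The output begins with a compact, numbered plan tree, roughly:
#   * HashAggregate (3)
#   +- Exchange (2)
#      +- Scan ... (1)
# followed by a section explaining each numbered operator in detail.
```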
(3) Observable Metrics
Observable metrics allow users to track data‑quality indicators, which is especially important for streaming workloads.
4. SQL Compatibility
Spark 3.0 improves ANSI compliance with features such as ANSI Store Assignment, overflow checking, reserved keyword handling, and a proleptic Gregorian calendar.
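A configuration sketch of the two main switches (keys from the Spark 3.0 configuration reference; the overflow example is illustrative):

```python
# Enable ANSI-compliant behavior, e.g. runtime errors on arithmetic
# overflow instead of silent wrap-around (off by default in 3.0):
spark.conf.set("spark.sql.ansi.enabled", "true")

# Store-assignment policy for INSERTs; "ANSI" (the 3.0 default) rejects
# illegal coercions such as string-to-int instead of writing nulls:
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

# With ANSI mode on, an out-of-range cast fails at runtime rather than
# returning a wrong value, e.g.:
#   spark.sql("SELECT CAST(12345678901 AS INT)")  -- raises an error
```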
5. Built‑in Data Sources
Enhanced Parquet and CSV support adds column pruning and filter push‑down, including for nested columns, and a new binary file data source lets Spark read files such as images as raw binary records.
6. Extensibility and Ecosystem
Spark 3.0 continues to evolve its ecosystem with improvements to the Data Source V2 API and catalog support, Java 11, Hadoop 3, and Hive 3 compatibility. The Koalas project, which provides a pandas‑like API on Spark, also sees rapid adoption.
For more details, refer to the official Spark documentation and the UI guide at https://spark.apache.org/docs/latest/web-ui.html and the SQL reference at https://spark.apache.org/docs/latest/sql-ref.html.