Overview of New Features and Improvements in Apache Spark 3.0
Apache Spark 3.0 introduces a suite of performance enhancements, richer APIs, improved monitoring, better SQL compatibility, new data sources, and ecosystem extensions. Highlights include Adaptive Query Execution, Dynamic Partition Pruning, Join Hints, pandas UDF improvements, and accelerator‑aware scheduling, all aimed at boosting scalability and ease of use for big‑data workloads.
1. Performance
Key performance‑related features added in Spark 3.0 are Adaptive Query Execution (AQE), Dynamic Partition Pruning, query compilation speedup, and Join Hints.
(1) Adaptive Query Execution
AQE addresses limitations of earlier rule‑based and cost‑based optimizers by using runtime statistics to re‑optimize query plans. It can convert Sort‑Merge Joins to Broadcast Hash Joins, reduce the number of reducers based on intermediate data size, and handle skewed joins by splitting large partitions.
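As a minimal sketch, AQE is opt‑in in Spark 3.0 and enabled through session configuration; the keys below are from the Spark 3.0 configuration reference, while the application name is illustrative:

```python
from pyspark.sql import SparkSession

# Build a session with Adaptive Query Execution enabled.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # Master switch for AQE (off by default in Spark 3.0)
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce shuffle partitions based on runtime statistics
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions in sort-merge joins at runtime
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```

With these flags set, the re‑optimizations described above (join strategy switching, reducer coalescing, skew handling) happen automatically at stage boundaries, with no query changes required.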
(2) Dynamic Partition Pruning
Dynamic Partition Pruning uses intermediate query results to avoid reading unnecessary partitions, which is especially effective for star‑schema data warehouses and can accelerate TPC‑DS workloads by 2‑18×.
(3) Join Hints
Join Hints let users influence the join strategy (e.g., Broadcast Hash Join, Sort‑Merge Join, Shuffle Hash Join). While powerful, they must be used cautiously because data characteristics can change, making a previously optimal hint sub‑optimal.
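A brief sketch of the hint syntax, with hint names taken from the Spark 3.0 SQL reference and table names illustrative:

```python
# SQL hint syntax: force a broadcast of the smaller side.
spark.sql("""
    SELECT /*+ BROADCAST(d) */ f.amount, d.year
    FROM   sales f JOIN dates d ON f.date_id = d.date_id
""")

# Equivalent DataFrame API form (sales_df / dates_df are hypothetical):
sales_df.join(dates_df.hint("broadcast"), "date_id")

# Other strategy hints in 3.0: MERGE (sort-merge join), SHUFFLE_HASH,
# and SHUFFLE_REPLICATE_NL.
```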
2. Richer APIs
Spark 3.0 adds several developer‑friendly APIs, including an accelerator‑aware scheduler, a set of built‑in functions, pandas UDF enhancements, and support for DELETE/UPDATE/MERGE in Catalyst.
(1) pandas UDF Enhancements
pandas UDFs now support Python type hints, allowing users to specify input and output types with pandas.Series, simplifying development and improving performance.
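A minimal sketch of the new type‑hint style: the function body is plain pandas, so it can be exercised without a cluster, and wrapping it with `pandas_udf` registers it for use in Spark (the wrapping step is shown as a comment to keep the sketch runnable standalone):

```python
import pandas as pd

# New-style pandas UDF: input and output types are declared with Python
# type hints (pandas.Series -> pandas.Series) rather than the old
# functionType argument.
def multiply(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

# To use it in Spark (requires a running SparkSession):
#   from pyspark.sql.functions import pandas_udf
#   multiply_udf = pandas_udf(multiply, returnType="long")
#   df.select(multiply_udf(df.x, df.y))

# The plain function can be tested directly with pandas:
print(list(multiply(pd.Series([1, 2, 3]), pd.Series([4, 5, 6]))))  # [4, 10, 18]
```

Because the UDF is an ordinary typed Python function, it can be unit‑tested with pandas alone before being deployed to a cluster.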
(2) Accelerator‑Aware Scheduler
The scheduler now understands GPU resources, enabling accelerator‑aware job and stage scheduling and providing a Web UI to monitor GPU usage.
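A configuration sketch: GPU scheduling is driven by resource properties (keys from the Spark 3.0 configuration reference); the discovery‑script path is site‑specific and shown only as an example:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # GPUs available to each executor
    .config("spark.executor.resource.gpu.amount", "2")
    # GPUs required by each task
    .config("spark.task.resource.gpu.amount", "1")
    # Script that reports the GPU addresses on a node (path is an example)
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/scripts/getGpusResources.sh")
    .getOrCreate()
)

# Inside a task, the assigned GPU addresses can be read via the TaskContext:
#   from pyspark import TaskContext
#   gpus = TaskContext.get().resources()["gpu"].addresses
```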
(3) Built‑in Functions
Thirty‑two new built‑in functions (e.g., map_keys, map_values) are added, reducing the need for custom UDFs and improving execution speed.
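As a short sketch of the two functions named above (the literal map is just sample data; running this requires a SparkSession):

```python
# map_keys / map_values extract the keys and values of a map column.
row = spark.sql(
    "SELECT map_keys(map('a', 1, 'b', 2)) AS ks, "
    "       map_values(map('a', 1, 'b', 2)) AS vs"
).first()
# row.ks -> keys   ['a', 'b']
# row.vs -> values [1, 2]
```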
3. Monitoring and Debuggability
New monitoring features include a Structured Streaming UI, DDL/DML enhancements (e.g., EXPLAIN FORMATTED), observable metrics for data quality, and event‑log rollover.
(1) Structured Streaming UI
The UI shows completed and ongoing streaming query metrics such as input rate, processing rate, batch duration, and operation duration.
(2) DDL/DML Enhancements
EXPLAIN now supports a FORMATTED mode that provides a concise tree view followed by detailed explanations for each plan component.
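A usage sketch (the query itself is illustrative):

```python
# EXPLAIN FORMATTED, a new explain mode in Spark 3.0.
spark.sql("""
    EXPLAIN FORMATTED
    SELECT date_id, sum(amount) FROM sales GROUP BY date_id
""").show(truncate=False)
# The output begins with a compact, numbered plan tree, roughly:
#   * HashAggregate (3)
#   +- Exchange (2)
#      +- Scan ... (1)
# followed by a section explaining each numbered operator in detail.
```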
(3) Observable Metrics
Observable metrics allow users to track data‑quality indicators, which is especially important for streaming workloads.
4. SQL Compatibility
Spark 3.0 improves ANSI compliance with features such as ANSI Store Assignment, overflow checking, reserved keyword handling, and a proleptic Gregorian calendar.
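A configuration sketch of the two main switches (keys from the Spark 3.0 configuration reference; the overflow example is illustrative):

```python
# Enable ANSI-compliant behavior, e.g. runtime errors on arithmetic
# overflow instead of silent wrap-around (off by default in 3.0):
spark.conf.set("spark.sql.ansi.enabled", "true")

# Store-assignment policy for INSERTs; "ANSI" (the 3.0 default) rejects
# illegal coercions such as string-to-int instead of writing nulls:
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

# With ANSI mode on, an out-of-range cast fails at runtime rather than
# returning a wrong value, e.g.:
#   spark.sql("SELECT CAST(12345678901 AS INT)")  -- raises an error
```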
5. Built‑in Data Sources
Enhanced Parquet and CSV support adds column pruning and filter push‑down, including for nested columns, and a new binary file data source lets Spark read files such as images as raw binary records.
6. Extensibility and Ecosystem
Spark 3.0 continues to evolve its ecosystem with improvements to the Data Source V2 API and catalog support, Java 11, Hadoop 3, and Hive 3 compatibility. The Koalas project, which provides a pandas‑like API on Spark, also sees rapid adoption.
For more details, refer to the official Spark documentation and the UI guide at https://spark.apache.org/docs/latest/web-ui.html and the SQL reference at https://spark.apache.org/docs/latest/sql-ref.html.