Apache Spark 3.0.0 Release: New Features, Improvements, and Timeline
Apache Spark 3.0.0, released after a 21‑month development cycle that included two preview releases and three release‑candidate votes, resolves over 3,400 issues. Headline features include Dynamic Partition Pruning, Adaptive Query Execution, accelerator‑aware scheduling, DataSource V2, expanded pandas UDFs, new join hints, richer monitoring, SparkR vectorization, Kafka header support, and broader ecosystem integrations.
Apache Spark 3.0.0 was officially released just before the Spark + AI Summit, concluding a 21‑month development effort that included two preview releases and three release‑candidate votes.
2019‑11‑06: First preview release (Preview release of Spark 3.0) [1]
2019‑12‑23: Second preview release (Preview release of Spark 3.0) [2]
2020‑03‑21: RC1 vote [3]
2020‑05‑18: RC2 vote [4]
2020‑06‑06: RC3 vote [5]
The new version adds more than 3,400 resolved issues and brings a host of exciting features:
Dynamic Partition Pruning
Adaptive Query Execution (AQE)
Accelerator‑aware Scheduling (GPU/FPGA support)
DataSource V2 API
Vectorization in SparkR
Support for Hadoop 3, JDK 11, Scala 2.12, etc.
Dynamic Partition Pruning
Runtime‑based partition pruning reduces unnecessary data scans. For example, the following query benefits from this optimization:
SELECT * FROM dim_iteblog
JOIN fact_iteblog
ON (dim_iteblog.partcol = fact_iteblog.partcol)
WHERE dim_iteblog.othercol > 10

By pruning fact_iteblog partitions at runtime using the filter computed on the dimension table, the query's scan size drops dramatically, yielding speedups of up to 33×, with 60 of 102 TPC‑DS queries accelerated by 2‑18×.
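The runtime pruning step can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Spark's implementation; the table data and the names fact_partitions and dim_rows are made up:

```python
# Toy partitioned "fact" table stored as {partition_key: rows}.
fact_partitions = {
    1: [("a", 1), ("b", 1)],
    2: [("c", 2)],
    3: [("d", 3), ("e", 3)],
}
dim_rows = [(1, 5), (3, 42)]  # (partcol, othercol)

# Step 1: evaluate the dimension-side filter (othercol > 10) at runtime.
wanted = {partcol for (partcol, othercol) in dim_rows if othercol > 10}

# Step 2: scan only the fact partitions whose key survived the filter.
scanned = {k: v for k, v in fact_partitions.items() if k in wanted}

print(sorted(scanned))  # only partition 3 is read; 1 and 2 are pruned
```

Because the dimension filter is evaluated first, the fact-table scan never touches partitions 1 and 2 at all.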
Adaptive Query Execution (AQE)
AQE allows Spark to re-optimize execution plans at runtime based on statistics collected during execution, providing three capabilities: dynamic coalescing of shuffle partitions, dynamic join‑strategy selection, and dynamic skew‑join optimization. Enabling it with spark.sql.adaptive.enabled=true can improve queries such as q77 by 8× and q5 by 2× on a 1 TB TPC‑DS benchmark.
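The shuffle-partition coalescing idea can be sketched in plain Python. This is a simplified illustration, not Spark's algorithm; the function name and the 64 MB target are made up (Spark's real knob is spark.sql.adaptive.advisoryPartitionSizeInBytes):

```python
def coalesce_partitions(sizes, target_size):
    """Greedily merge adjacent small post-shuffle partitions so each
    merged group stays at or below target_size (sizes in MB)."""
    merged, current, current_size = [], [], 0
    for s in sizes:
        if current and current_size + s > target_size:
            merged.append(current)
            current, current_size = [], 0
        current.append(s)
        current_size += s
    if current:
        merged.append(current)
    return merged

# Five tiny 20 MB shuffle partitions collapse into two tasks
# instead of five, cutting scheduling overhead.
print(coalesce_partitions([20, 20, 20, 20, 20], target_size=64))
# -> [[20, 20, 20], [20, 20]]
```

At runtime AQE knows the real post-shuffle sizes, so it can pick the task count that a static plan would have to guess.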
Accelerator‑aware Scheduling
Native GPU/FPGA scheduling is added to Spark, with support in YARN and Kubernetes. The implementation requires cluster‑manager upgrades to expose GPU resources and scheduler modifications to allocate GPUs to tasks.
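A minimal configuration sketch for GPU scheduling, assuming a YARN cluster; the amounts and the discovery-script path are illustrative placeholders:

```
# Request 2 GPUs per executor and 1 GPU per task; the discovery script
# reports the GPU addresses visible to each executor (path is a placeholder).
spark.executor.resource.gpu.amount          2
spark.task.resource.gpu.amount              1
spark.executor.resource.gpu.discoveryScript /path/to/getGpusResources.sh
```

With these set, the scheduler only places a task on an executor that still has an unassigned GPU for it.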
DataSource V2
DataSource V2 removes dependencies on higher‑level APIs, enhances extensibility, and supports column pruning, filter push‑down, and transactional writes. It has been stabilized for Spark 3.0 and is a major new feature of this release.
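The column-pruning and filter-push-down contract can be illustrated with a toy source in plain Python. ToySource and its methods are invented for this sketch and are not the real DataSource V2 API:

```python
class ToySource:
    """Toy data source that applies pruning and filtering *before*
    returning rows, mimicking the DataSource V2 push-down idea."""

    def __init__(self, rows):
        self.rows = rows          # list of dicts
        self.columns = None       # columns requested by the engine
        self.predicate = None     # filter pushed down by the engine

    def prune_columns(self, columns):
        self.columns = columns
        return self

    def push_filter(self, predicate):
        self.predicate = predicate
        return self

    def read(self):
        for row in self.rows:
            if self.predicate and not self.predicate(row):
                continue          # filtered at the source, never shipped
            yield {c: row[c] for c in (self.columns or row)}

rows = [{"id": 1, "city": "NYC", "temp": 20},
        {"id": 2, "city": "LA", "temp": 30}]
out = list(ToySource(rows)
           .prune_columns(["city"])
           .push_filter(lambda r: r["temp"] > 25)
           .read())
print(out)  # [{'city': 'LA'}]
```

The benefit is that only one column of one row crosses the source boundary, instead of the whole table.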
Enhanced pandas UDFs
Spark 3.0 introduces new pandas UDF types (iterator‑of‑series to iterator‑of‑series, iterator‑of‑multiple‑series to iterator‑of‑multiple‑series) and three new APIs (grouped map, map, co‑grouped map) with Python type hints.
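The iterator-of-series contract can be sketched without Spark or pandas, using plain lists in place of pandas Series; plus_one and the batch data are illustrative (the real API is pyspark.sql.functions.pandas_udf with an Iterator[pd.Series] -> Iterator[pd.Series] signature):

```python
from typing import Iterator, List

def plus_one(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    # Expensive one-time setup (e.g. loading a model) would go here;
    # the iterator form amortizes it across all batches, which is the
    # main point of the new UDF type.
    for batch in batches:
        yield [x + 1 for x in batch]

batches = [[1, 2], [3, 4]]
print(list(plus_one(iter(batches))))  # [[2, 3], [4, 5]]
```

Each incoming batch is transformed lazily, so state created before the loop is shared by every batch in the partition.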
Join Hints
New join hints SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL give users more control over join strategy selection.
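For example, a hint can force a particular strategy when the optimizer's default choice is poor; the table and column names here are illustrative:

```sql
-- Ask Spark to build the hash side of a shuffle-hash join from t2
-- instead of letting the optimizer choose the strategy.
SELECT /*+ SHUFFLE_HASH(t2) */ *
FROM t1 JOIN t2 ON t1.key = t2.key;
```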
Built‑in Functions
Scala API adds 32 new built‑in and higher‑order functions, including a suite of MAP‑specific functions (transform_key, transform_value, map_entries, map_filter, map_zip_with).
Monitoring Enhancements
Redesigned Structured Streaming UI with aggregate and detailed metrics.
Enhanced EXPLAIN command with FORMATTED mode and plan dump capability.
Observable metrics that emit named events with aggregated data after each query stage.
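The enhanced EXPLAIN mode can be invoked directly from SQL; the query itself is illustrative:

```sql
-- FORMATTED splits the output into a compact plan outline followed by
-- per-operator details, which is much easier to read for large plans.
EXPLAIN FORMATTED
SELECT * FROM sales WHERE quantity > 100;
```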
Better ANSI SQL Compatibility
Efforts (SPARK‑27764) aim to close gaps between Spark SQL and PostgreSQL/ANSI SQL 2011, addressing 231 sub‑issues.
SparkR Vectorization
SparkR now uses Apache Arrow to vectorize data transfer between the JVM and the R process, greatly reducing serialization overhead; performance gains of up to thousands of times are reported for some operations.
Kafka Streaming: includeHeaders
Support for Kafka message headers (KIP‑82) is added. Example usage:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("includeHeaders", "true")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
.as[(String, String, Array[(String, Array[Byte])])]

Additional improvements include GPU support on standalone, YARN, and Kubernetes; removal of Scala 2.11 and Python 2 support; Hadoop 3.2 compatibility; the Spark Graph Cypher query language; and event‑log rolling.