Apache Spark 3.0.0 Release: New Features, Improvements, and Timeline
Apache Spark 3.0.0, released after a 21‑month development cycle that included two preview releases and three release‑candidate votes, resolves over 3,400 issues. Headline features include Dynamic Partition Pruning, Adaptive Query Execution, accelerator‑aware scheduling, DataSource V2, expanded pandas UDFs, new join hints, richer monitoring, SparkR vectorization, Kafka header support, and broader ecosystem integrations.
Apache Spark 3.0.0 was officially released just before the Spark + AI Summit, concluding a 21‑month development effort that included two preview releases and three release‑candidate votes.
2019‑11‑06: First preview release (Preview release of Spark 3.0) [1]
2019‑12‑23: Second preview release (Preview release of Spark 3.0) [2]
2020‑03‑21: RC1 vote [3]
2020‑05‑18: RC2 vote [4]
2020‑06‑06: RC3 vote [5]
The new version adds more than 3,400 resolved issues and brings a host of exciting features:
Dynamic Partition Pruning
Adaptive Query Execution (AQE)
Accelerator‑aware Scheduling (GPU/FPGA support)
DataSource V2 API
Vectorization in SparkR
Support for Hadoop 3, JDK 11, Scala 2.12, etc.
Dynamic Partition Pruning
Runtime‑based partition pruning reduces unnecessary data scans. For example, the following query benefits from this optimization:
SELECT * FROM dim_iteblog
JOIN fact_iteblog
ON (dim_iteblog.partcol = fact_iteblog.partcol)
WHERE dim_iteblog.othercol > 10

By pruning fact_iteblog partitions at runtime using the filter computed on the dimension table, the query's scan size drops dramatically, yielding speedups of up to 33×, with 60 of 102 TPC‑DS queries accelerated by 2‑18×.
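The runtime pruning step can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Spark's implementation; the table data and the names fact_partitions and dim_rows are made up:

```python
# Toy partitioned "fact" table stored as {partition_key: rows}.
fact_partitions = {
    1: [("a", 1), ("b", 1)],
    2: [("c", 2)],
    3: [("d", 3), ("e", 3)],
}
dim_rows = [(1, 5), (3, 42)]  # (partcol, othercol)

# Step 1: evaluate the dimension-side filter (othercol > 10) at runtime.
wanted = {partcol for (partcol, othercol) in dim_rows if othercol > 10}

# Step 2: scan only the fact partitions whose key survived the filter.
scanned = {k: v for k, v in fact_partitions.items() if k in wanted}

print(sorted(scanned))  # only partition 3 is read; 1 and 2 are pruned
```

Because the dimension filter is evaluated first, the fact-table scan never touches partitions 1 and 2 at all.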
Adaptive Query Execution (AQE)
AQE allows Spark to re-optimize execution plans at runtime based on statistics collected during execution, providing three capabilities: dynamic coalescing of shuffle partitions, dynamic join‑strategy selection, and dynamic skew‑join optimization. Enabling it with spark.sql.adaptive.enabled=true can improve queries such as q77 by 8× and q5 by 2× on a 1 TB TPC‑DS benchmark.
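The shuffle-partition coalescing idea can be sketched in plain Python. This is a simplified illustration, not Spark's algorithm; the function name and the 64 MB target are made up (Spark's real knob is spark.sql.adaptive.advisoryPartitionSizeInBytes):

```python
def coalesce_partitions(sizes, target_size):
    """Greedily merge adjacent small post-shuffle partitions so each
    merged group stays at or below target_size (sizes in MB)."""
    merged, current, current_size = [], [], 0
    for s in sizes:
        if current and current_size + s > target_size:
            merged.append(current)
            current, current_size = [], 0
        current.append(s)
        current_size += s
    if current:
        merged.append(current)
    return merged

# Five tiny 20 MB shuffle partitions collapse into two tasks
# instead of five, cutting scheduling overhead.
print(coalesce_partitions([20, 20, 20, 20, 20], target_size=64))
# -> [[20, 20, 20], [20, 20]]
```

At runtime AQE knows the real post-shuffle sizes, so it can pick the task count that a static plan would have to guess.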
Accelerator‑aware Scheduling
Native GPU/FPGA scheduling is added to Spark, with support in YARN and Kubernetes. The implementation requires cluster‑manager upgrades to expose GPU resources and scheduler modifications to allocate GPUs to tasks.
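A minimal configuration sketch for GPU scheduling, assuming a YARN cluster; the amounts and the discovery-script path are illustrative placeholders:

```
# Request 2 GPUs per executor and 1 GPU per task; the discovery script
# reports the GPU addresses visible to each executor (path is a placeholder).
spark.executor.resource.gpu.amount          2
spark.task.resource.gpu.amount              1
spark.executor.resource.gpu.discoveryScript /path/to/getGpusResources.sh
```

With these set, the scheduler only places a task on an executor that still has an unassigned GPU for it.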
DataSource V2
DataSource V2 removes dependencies on higher‑level APIs, enhances extensibility, and supports column pruning, filter push‑down, and transactional writes. It has been stabilized for Spark 3.0 and is a major new feature of this release.
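The column-pruning and filter-push-down contract can be illustrated with a toy source in plain Python. ToySource and its methods are invented for this sketch and are not the real DataSource V2 API:

```python
class ToySource:
    """Toy data source that applies pruning and filtering *before*
    returning rows, mimicking the DataSource V2 push-down idea."""

    def __init__(self, rows):
        self.rows = rows          # list of dicts
        self.columns = None       # columns requested by the engine
        self.predicate = None     # filter pushed down by the engine

    def prune_columns(self, columns):
        self.columns = columns
        return self

    def push_filter(self, predicate):
        self.predicate = predicate
        return self

    def read(self):
        for row in self.rows:
            if self.predicate and not self.predicate(row):
                continue          # filtered at the source, never shipped
            yield {c: row[c] for c in (self.columns or row)}

rows = [{"id": 1, "city": "NYC", "temp": 20},
        {"id": 2, "city": "LA", "temp": 30}]
out = list(ToySource(rows)
           .prune_columns(["city"])
           .push_filter(lambda r: r["temp"] > 25)
           .read())
print(out)  # [{'city': 'LA'}]
```

The benefit is that only one column of one row crosses the source boundary, instead of the whole table.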
Enhanced pandas UDFs
Spark 3.0 introduces new pandas UDF types (iterator‑of‑series to iterator‑of‑series, iterator‑of‑multiple‑series to iterator‑of‑multiple‑series) and three new APIs (grouped map, map, co‑grouped map) with Python type hints.
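The iterator-of-series contract can be sketched without Spark or pandas, using plain lists in place of pandas Series; plus_one and the batch data are illustrative (the real API is pyspark.sql.functions.pandas_udf with an Iterator[pd.Series] -> Iterator[pd.Series] signature):

```python
from typing import Iterator, List

def plus_one(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    # Expensive one-time setup (e.g. loading a model) would go here;
    # the iterator form amortizes it across all batches, which is the
    # main point of the new UDF type.
    for batch in batches:
        yield [x + 1 for x in batch]

batches = [[1, 2], [3, 4]]
print(list(plus_one(iter(batches))))  # [[2, 3], [4, 5]]
```

Each incoming batch is transformed lazily, so state created before the loop is shared by every batch in the partition.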
Join Hints
New join hints SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL give users more control over join strategy selection.
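For example, a hint can force a particular strategy when the optimizer's default choice is poor; the table and column names here are illustrative:

```sql
-- Ask Spark to build the hash side of a shuffle-hash join from t2
-- instead of letting the optimizer choose the strategy.
SELECT /*+ SHUFFLE_HASH(t2) */ *
FROM t1 JOIN t2 ON t1.key = t2.key;
```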
Built‑in Functions
Scala API adds 32 new built‑in and higher‑order functions, including a suite of MAP‑specific functions (transform_key, transform_value, map_entries, map_filter, map_zip_with).
Monitoring Enhancements
Redesigned Structured Streaming UI with aggregate and detailed metrics.
Enhanced EXPLAIN command with FORMATTED mode and plan dump capability.
Observable metrics that emit named events with aggregated data after each query stage.
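The enhanced EXPLAIN mode can be invoked directly from SQL; the query itself is illustrative:

```sql
-- FORMATTED splits the output into a compact plan outline followed by
-- per-operator details, which is much easier to read for large plans.
EXPLAIN FORMATTED
SELECT * FROM sales WHERE quantity > 100;
```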
Better ANSI SQL Compatibility
Efforts (SPARK‑27764) aim to close gaps between Spark SQL and PostgreSQL/ANSI SQL 2011, addressing 231 sub‑issues.
SparkR Vectorization
SparkR now uses Apache Arrow to vectorize data transfer between the JVM and the R process, greatly reducing serialization overhead; performance gains of up to thousands of times are reported for some operations.
Kafka Streaming: includeHeaders
Support for Kafka message headers (KIP‑82) is added. Example usage:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("includeHeaders", "true")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
.as[(String, String, Array[(String, Array[Byte])])]

Additional improvements include GPU support on standalone, YARN, and Kubernetes; removal of Scala 2.11 and Python 2 support; Hadoop 3.2 compatibility; the Spark Graph Cypher query language; and event‑log rolling.