Spark 3.4 New Features Overview: Community Updates, SQL Enhancements, PySpark, Streaming, and AI Ecosystem
This article presents a comprehensive overview of Spark 3.4, covering community growth statistics, major SQL improvements such as default column values and timestamp handling, new PySpark and streaming capabilities, and the emerging AI ecosystem that integrates natural‑language interfaces and Spark AI services.
The presentation begins with an introduction to Databricks, the company founded by the original creators of Apache Spark, and highlights the rapid growth of the Spark community over the past decade, including over a billion Maven downloads, hundreds of thousands of Stack Overflow questions, and contributions from thousands of developers worldwide.
Community Updates showcase key metrics: over 1 billion Maven downloads, more than 100K Stack Overflow questions, 3,600+ GitHub contributors, 100+ data sources, 440K commits, and usage across 200+ countries.
SQL Features introduced in Spark 3.4 include:
Setting default values for table columns, with support for CSV, JSON, ORC, and Parquet formats.
A new TIMESTAMP_NTZ (timestamp without time zone) type to avoid timezone‑dependent semantics.
Lateral column alias references, allowing an alias defined earlier in a SELECT list to be referenced by later expressions in the same SELECT.
Parameterized SQL queries, which bind values separately from the query text to reduce injection risk.
Bloom filter joins enabled by default to reduce shuffle I/O for large joins.
The OFFSET clause for pagination.
Table‑valued generator functions (e.g., EXPLODE) usable directly in the FROM clause, making such queries easier to read.
GROUP BY ALL, which groups by every non‑aggregate expression in the SELECT list, and ORDER BY ALL, which sorts by all output columns.
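A minimal sketch combining several of these SQL features; the table and column names (events, status, score) are illustrative, not from the talk:

```sql
-- Column DEFAULT values (supported for CSV, JSON, ORC, and Parquet tables)
CREATE TABLE events (
  id     BIGINT,
  ts     TIMESTAMP_NTZ,               -- timestamp without time zone
  status STRING DEFAULT 'pending'     -- used when the column is omitted on INSERT
) USING parquet;

INSERT INTO events (id, ts) VALUES (1, TIMESTAMP_NTZ'2023-04-13 10:00:00');
-- status falls back to its DEFAULT, 'pending'

-- Lateral column alias: reuse an alias later in the same SELECT list
SELECT id, status,
       id * 100  AS score,
       score + 1 AS adjusted_score    -- references the alias defined just above
FROM events;

-- GROUP BY ALL / ORDER BY ALL, with OFFSET for pagination
SELECT status, COUNT(*) AS cnt
FROM events
GROUP BY ALL          -- groups by every non-aggregate SELECT expression (status)
ORDER BY ALL          -- sorts by all output columns
LIMIT 10 OFFSET 20;   -- skip the first 20 rows of the result
```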
PySpark and Streaming enhancements feature:
Spark Connect: a thin client that sends unresolved query plans over gRPC to a long‑running Spark server, enabling Spark use from IDEs, Jupyter notebooks, and other remote environments.
Python UDF memory profiling to monitor memory usage per UDF execution.
Asynchronous progress tracking for micro‑batch streaming to overlap log commit work with computation.
Python arbitrary stateful processing APIs for custom streaming state handling.
AI Ecosystem introduces Spark AI, which allows users to describe data processing tasks in natural language; the system generates DataFrames, performs transformations, visualizations, and even creates UDFs. It also provides explain and verification APIs to validate generated pipelines.
The article concludes with a preview of future directions, such as expanding DataFrame operations, improving text‑to‑SQL generation, adding user‑defined table‑valued functions, and facilitating test‑case generation, inviting readers to explore Spark AI at http://pySpark.ai .
DataFunSummit