Spark 3.4 New Features Overview: Community Updates, SQL Enhancements, PySpark, Streaming, and AI Ecosystem
This article presents a comprehensive overview of Spark 3.4, covering community growth statistics, major SQL improvements such as default column values and timestamp handling, new PySpark and streaming capabilities, and the emerging AI ecosystem that integrates natural‑language interfaces and Spark AI services.
The presentation begins with an introduction to Databricks, the company founded by the original creators of Apache Spark, and highlights the rapid growth of the Spark community over the past decade, including over a billion Maven downloads, hundreds of thousands of Stack Overflow questions, and contributions from thousands of developers worldwide.
Community Updates showcase key metrics: over 1 billion Maven downloads, more than 100K Stack Overflow questions, 3,600+ GitHub contributors, 100+ data sources, 440K commits, and usage across 200+ countries.
SQL Features introduced in Spark 3.4 include:
Setting default values for table columns, with support for CSV, JSON, ORC, and Parquet formats.
A new TIMESTAMP_NTZ (timestamp without time zone) type to avoid timezone‑dependent semantics.
Lateral column alias references, allowing an alias defined earlier in a SELECT list to be referenced by later expressions in the same SELECT.
Parameterized SQL queries, which bind values separately from the query text to reduce injection risk.
Bloom filter joins enabled by default to reduce shuffle I/O for large joins.
The OFFSET clause for pagination.
Table‑valued generator functions (e.g., EXPLODE) usable directly in the FROM clause, making such queries easier to read.
GROUP BY ALL, which groups by every non‑aggregate expression in the SELECT list, and ORDER BY ALL, which sorts by all output columns.
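A minimal sketch combining several of these SQL features; the table and column names (events, status, score) are illustrative, not from the talk:

```sql
-- Column DEFAULT values (supported for CSV, JSON, ORC, and Parquet tables)
CREATE TABLE events (
  id     BIGINT,
  ts     TIMESTAMP_NTZ,               -- timestamp without time zone
  status STRING DEFAULT 'pending'     -- used when the column is omitted on INSERT
) USING parquet;

INSERT INTO events (id, ts) VALUES (1, TIMESTAMP_NTZ'2023-04-13 10:00:00');
-- status falls back to its DEFAULT, 'pending'

-- Lateral column alias: reuse an alias later in the same SELECT list
SELECT id, status,
       id * 100  AS score,
       score + 1 AS adjusted_score    -- references the alias defined just above
FROM events;

-- GROUP BY ALL / ORDER BY ALL, with OFFSET for pagination
SELECT status, COUNT(*) AS cnt
FROM events
GROUP BY ALL          -- groups by every non-aggregate SELECT expression (status)
ORDER BY ALL          -- sorts by all output columns
LIMIT 10 OFFSET 20;   -- skip the first 20 rows of the result
```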
PySpark and Streaming enhancements feature:
Spark Connect: a thin client that sends unresolved query plans over gRPC to a long‑running Spark server, enabling Spark use from IDEs, Jupyter notebooks, and other remote environments.
Python UDF memory profiling to monitor memory usage per UDF execution.
Asynchronous progress tracking for micro‑batch streaming to overlap log commit work with computation.
Python arbitrary stateful processing APIs for custom streaming state handling.
AI Ecosystem introduces Spark AI, which allows users to describe data processing tasks in natural language; the system generates DataFrames, performs transformations, visualizations, and even creates UDFs. It also provides explain and verification APIs to validate generated pipelines.
The article concludes with a preview of future directions, such as expanding DataFrame operations, improving text‑to‑SQL generation, adding user‑defined table‑valued functions, and facilitating test‑case generation, inviting readers to explore Spark AI at http://pySpark.ai .
DataFunSummit