Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features
This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.
The session, presented by NVIDIA Deep Learning Architect Zhao Yuanqing, introduces the RAPIDS Accelerator for Apache Spark 3.0, which uses NVIDIA GPUs to accelerate Spark SQL and DataFrame operations without requiring code changes; setting spark.rapids.sql.enabled=true is enough to turn it on.
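Enabling the accelerator is a configuration-only change. A minimal sketch of the relevant spark-defaults.conf entries, assuming a one-GPU-per-executor deployment (the discovery-script path is illustrative; the plugin and cuDF jars must also be on the classpath for your environment):

```properties
# Load the RAPIDS Accelerator plugin and turn on GPU SQL execution
spark.plugins                               com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled                    true

# Resource scheduling: one GPU per executor; the discovery script
# ships with the Spark 3.0 distribution (path may differ per install)
spark.executor.resource.gpu.amount          1
spark.executor.resource.gpu.discoveryScript /opt/spark/examples/src/main/scripts/getGpusResources.sh

# A fractional task amount lets several tasks share one GPU
# (0.25 here means up to 4 concurrent tasks per GPU)
spark.task.resource.gpu.amount              0.25
```

With these settings in place, existing SQL and DataFrame jobs run unmodified; operators the plugin supports are placed on the GPU and the rest fall back to the CPU automatically.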
GPU acceleration is well suited to big‑data scenarios because large data volumes and highly parallel tasks map naturally onto GPU architectures; in benchmarks on a 10 TB TPC‑DS dataset, representative query runtimes dropped from 25 minutes to around 1 minute, with end‑to‑end runs showing up to three‑fold speed‑ups and 55 % cost savings.
In recommendation workloads (e.g., DLRM on the Criteo dataset), moving ETL and training to GPUs cuts total processing time from 144 hours to about half an hour, delivering up to 160× speed‑ups over legacy CPU pipelines and significant cost reductions.
The accelerator supports a wide range of Spark operators; most SQL and DataFrame operators run on GPU transparently, while unsupported operators can be reported via GitHub. High‑cardinality joins, aggregates, sorts, extensive window functions, complex UDFs, and I/O‑heavy formats (Parquet, CSV) benefit most from GPU execution.
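To see which operators in a given plan stayed on the CPU and why, the plugin can print a per‑operator report; a minimal configuration sketch:

```properties
# Log only the operators that could not be placed on the GPU,
# with the reason for each fallback (other values: NONE, ALL)
spark.rapids.sql.explain    NOT_ON_GPU
```

The resulting messages in the driver log are a good starting point for a GitHub issue when a hot operator in your workload is unsupported.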
Shuffle is a major bottleneck in Spark. RAPIDS introduces a GPU‑aware shuffle that bypasses PCIe and the CPU when possible, using NVLink, GPUDirect Storage, and RDMA via the UCX library to achieve up to 30× faster data movement in network‑bound scenarios.
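The UCX‑based shuffle is opt‑in and configured separately from the SQL plugin. A sketch of the relevant settings, assuming the 0.3‑era build for Spark 3.0.1 (the shuffle‑manager class name is build‑specific, so check the release notes for the exact string matching your Spark version):

```properties
# Replace the default shuffle manager with the RAPIDS one
# (class name shown for the Spark 3.0.1 build; version-specific)
spark.shuffle.manager                    com.nvidia.spark.rapids.spark301.RapidsShuffleManager
spark.rapids.shuffle.transport.enabled   true

# UCX transport selection: CUDA copies and IPC within a node,
# RDMA (rc) between nodes, TCP as the fallback
spark.executorEnv.UCX_TLS                cuda_copy,cuda_ipc,rc,tcp
```

When the accelerated transport is unavailable (e.g., no RDMA-capable NIC), UCX falls back to TCP, so the configuration degrades gracefully rather than failing.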
Version 0.2 added multi‑Spark‑version support (including Databricks 7.0ML and Google Dataproc 2.0), optimized small‑file Parquet reads, initial Scala UDF support, and accelerated Pandas UDFs by sharing GPU memory between JVM and Python processes.
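The body of a scalar Pandas UDF is ordinary vectorized pandas code operating on whole columns, which is exactly the shape of data the accelerator can exchange between the JVM and Python workers via shared GPU memory. A minimal CPU‑side sketch of such a function (the Spark wiring via pyspark.sql.functions.pandas_udf is omitted so the snippet runs standalone; the function and column names are illustrative):

```python
import pandas as pd

def discounted_price(price: pd.Series, discount: pd.Series) -> pd.Series:
    """Vectorized body of a scalar Pandas UDF: elementwise over whole columns,
    no per-row Python loop."""
    return price * (1.0 - discount)

prices = pd.Series([100.0, 250.0, 80.0])
discounts = pd.Series([0.10, 0.20, 0.00])
print(discounted_price(prices, discounts).tolist())  # → [90.0, 200.0, 80.0]
```

Because the computation is expressed over entire Series rather than row by row, the same batches can be handed off as columnar buffers, which is what makes the GPU‑memory‑sharing path in 0.2 effective.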
Version 0.3 further adds per‑thread default GPU streams, adaptive query execution on GPU, an upgrade of UCX to 1.9.0, support for list/struct types in Parquet, new window functions (lead/lag), and additional scalar functions (greatest, least).
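lead and lag look forward and backward within an ordered window partition. A small CPU sketch of their semantics using pandas shift (the column names are illustrative; the 0.3 GPU implementation computes the same per‑partition result for Spark's LEAD/LAG):

```python
import pandas as pd

df = pd.DataFrame({"day": [1, 2, 3, 4], "sales": [10, 14, 9, 20]})
# lag(sales, 1): previous row's value within the ordering
df["lag_1"] = df["sales"].shift(1)
# lead(sales, 1): next row's value within the ordering
df["lead_1"] = df["sales"].shift(-1)
print(df)
```

Rows with no neighbor (the first row for lag, the last for lead) get nulls, matching Spark's default behavior when no explicit default value is supplied.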
For more details, the RAPIDS Accelerator source is open‑source on GitHub, and additional documentation and a Chinese Spark e‑book are available through NVIDIA and the DataFunTalk community.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.