
Performance Comparison of SparkR with Vectorized Execution Using Apache Arrow

This article explains how SparkR's performance compares to Spark's other language APIs, shows the slowdown caused by serialization between the JVM and R, and demonstrates how enabling Apache Arrow's vectorized execution in Spark 3.0 can accelerate SparkR operations by more than 40×.


R is one of the most popular languages for data science, widely used for statistical analysis, data processing, and machine learning through its rich package ecosystem. SparkR lets R code scale out on Apache Spark and supports interactive jobs through the R shell.

When SparkR does not need to interact with the R process, its performance is comparable to Scala, Java, and Python APIs. However, performance degrades significantly when SparkR jobs interact with native R functions or data types.

Using Apache Arrow for data exchange between Spark and R can greatly improve performance. This article outlines the interaction between Spark and R in SparkR and compares non‑vectorized and vectorized execution performance.

Spark and R Interaction

SparkR supports a rich set of ML and SQL‑like APIs and also provides APIs for direct interaction with R code, such as seamless conversion between Spark DataFrames and R DataFrames and distributed execution of R built‑in functions on Spark DataFrames.
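As a minimal sketch of these two kinds of APIs (assuming a local SparkR session; `faithful` is one of R's built-in datasets):

```r
library(SparkR)
sparkR.session(master = "local[*]")

# Seamless conversion between a Spark DataFrame and an R data.frame
sdf <- createDataFrame(faithful)   # R data.frame -> Spark DataFrame
rdf <- collect(sdf)                # Spark DataFrame -> R data.frame

# Distributed execution of an R function on each partition of a Spark DataFrame
scaled <- dapply(sdf,
                 function(p) { p$eruptions <- p$eruptions * 2; p },
                 schema(sdf))
head(scaled)
```

It is exactly these R-side paths, createDataFrame(), collect(), dapply(), and gapply(), that require moving data between the JVM and the R process.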

In most cases, the performance of other language APIs in Spark is consistent because the execution stays in the JVM. For example, the following Scala and R calls each take about one second:

// Scala API
// ~1 second
sql("SELECT id FROM range(2000000000)").filter("id > 10").count()

# R API
# ~1 second
count(filter(sql("SELECT * FROM range(2000000000)"), "id > 10"))

When the job requires R built‑in functions or type conversion, performance differs dramatically. The following example shows a Scala implementation that runs in about one second, while the equivalent SparkR code takes roughly fifteen seconds:

// Scala API
val ds = (1L to 100000L).toDS
// ~1 second
ds.mapPartitions(iter => iter.filter(_ < 50000)).count()

# R API
df <- createDataFrame(lapply(seq(100000), function(e) list(value = e)))
# ~15 seconds - 15 times slower
count(dapply(df, function(x) as.data.frame(x[x$value < 50000,]), schema(df)))

Collecting data to the driver also shows a large gap: the Scala version completes in about 0.2 seconds, while SparkR needs around eight seconds:

// Scala API
// ~0.2 seconds
val df = sql("SELECT * FROM range(1000000)").collect()

# R API
# ~8 seconds - 40 times slower
df <- collect(sql("SELECT * FROM range(1000000)"))

The slowdown is caused by serialization/deserialization between the JVM and R, which uses an inefficient row‑wise format that does not exploit modern CPU features such as pipelining or SIMD.

Native Implementation

In Spark 3.0, SparkR introduces a vectorized implementation that leverages Apache Arrow to exchange data in a columnar format, dramatically reducing (de)serialization costs.

The vectorized approach avoids row-wise (de)serialization and exchanges data in Arrow's columnar layout, which modern CPUs can process efficiently with pipelining and SIMD. It is disabled by default; enable it by setting spark.sql.execution.arrow.sparkr.enabled to true. Note that dapplyCollect() and gapplyCollect() are not yet vectorized; use dapply() and gapply() followed by collect() instead.
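As a sketch, the flag can be set when the session is created; the local master and the mtcars sample data below are assumptions for illustration:

```r
library(SparkR)

# Enable Arrow-based vectorized execution at session start
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

df <- createDataFrame(mtcars)   # R -> JVM transfer now uses Arrow's columnar format
res <- collect(dapply(df,
                      function(p) p[p$mpg > 20, ],
                      schema(df)))   # dapply() + collect() take the vectorized path
```

Because it is a spark.sql.* configuration, it can also be toggled at runtime with SET in a SQL statement.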

Benchmark Results

Benchmarks on a 500,000‑record dataset show that enabling vectorized execution improves collect() and createDataFrame() by roughly 17× and 42×, and speeds up dapply() and gapply() by 43× and 33× respectively.
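A simple way to reproduce such a comparison is to time collect() with the flag off and then on (a sketch assuming a local session; absolute numbers will vary with hardware and cluster size):

```r
library(SparkR)
sparkR.session(master = "local[*]")

df <- sql("SELECT * FROM range(500000)")
elapsed <- function(expr) unname(system.time(expr)["elapsed"])

# Row-wise (de)serialization between the JVM and R
sql("SET spark.sql.execution.arrow.sparkr.enabled=false")
t_row <- elapsed(collect(df))

# Arrow-based columnar exchange
sql("SET spark.sql.execution.arrow.sparkr.enabled=true")
t_arrow <- elapsed(collect(df))

cat("speedup:", t_row / t_arrow, "\n")
```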

These results demonstrate that Apache Arrow can provide substantial performance gains when exchanging data between different systems.

Tags: performance, Big Data, Vectorized Execution, Apache Arrow, SparkR
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
