Parallel Execution of Multiple Spark Jobs to Optimize Resource Utilization and Reduce Parquet File Count
This article examines how to run several Spark jobs concurrently on a shared SparkContext, balancing full CPU‑vcore utilization with the need to generate fewer Parquet files, and presents practical experiments, scheduling strategies, and performance results.
When using distributed systems like Apache Spark, efficiently utilizing limited cluster resources while keeping the number of output Parquet files small is a common challenge. This article focuses on one way to balance the two: executing multiple Spark jobs in parallel so that the available CPU‑vcores stay busy without inflating the file count.
Background: In the author's data platform, both streaming and batch tasks write their results to AWS S3 as Parquet files partitioned by data type and daily interval. Two implementation approaches are described:
    for type in types:
        for interval in intervals:
            df.filter(df.type == type) \
              .filter(df.interval == interval) \
              .write.parquet("s3://data/type=%s/interval=%s" % (type, interval))

and

    df.write.partitionBy("type", "interval").mode("append").parquet("s3://data")

The first method creates an explicit Spark job per type‑interval pair, offering better control and fault tolerance; the second appears simpler but does not improve write speed and still writes the files serially within each task.
Resource vs. File Count Trade‑off: Fully utilizing resources (e.g., 12 CPU‑vcores → 12 parallel tasks → 12 output files) can lead to many small Parquet files, hurting downstream query performance. Reducing the file count by writing from fewer partitions lowers downstream cost but leaves some CPU‑vcores idle.
The proposed compromise is to run several write‑jobs concurrently, each using a subset of the CPU‑vcores, thereby keeping both file count low and resource usage high.
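Using the figures from the experiment described below (55 CPU‑vcores shared by 5 concurrent write jobs), the sizing arithmetic behind this compromise is straightforward:

```scala
// Sizing sketch using the cluster figures from the experiment below:
// 55 CPU-vcores, shared by 5 concurrent write jobs.
val totalVcores = 55
val concurrentJobs = 5

// Each job coalesces its output to this many partitions, so each job
// writes this many Parquet files while all vcores stay busy.
val partitionsPerJob = totalVcores / concurrentJobs

println(partitionsPerJob) // 11 files per job, instead of 55 from one full-width job
```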
Feasibility Analysis: Spark's scheduling pipeline involves SparkContext → DAGScheduler → TaskScheduler → SchedulerBackend → Executor. Jobs submitted to a single SparkContext are sequential within a thread but can run in parallel across different threads. Spark supports FIFO and FAIR scheduling modes, with FAIR pools allowing more equitable sharing of resources.
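As a configuration sketch: FAIR mode is enabled by setting `spark.scheduler.mode=FAIR`, and pools can be declared in an allocation file pointed to by `spark.scheduler.allocation.file`. The pool name `writePool` below is chosen purely for illustration:

```xml
<!-- fairscheduler.xml: a minimal allocation file. The pool name
     "writePool" is illustrative, not from the article. -->
<allocations>
  <pool name="writePool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

Each submitting thread then opts into a pool with `sc.setLocalProperty("spark.scheduler.pool", "writePool")` before triggering its action; without this property, jobs fall into the default pool.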
Practical Exploration: The author implemented a test that reads a Parquet dataset once and then launches five threads, each submitting a write job that coalesces the data to 11 partitions (55 total CPU‑vcores ÷ 5 jobs). The Scala code used is:
    import java.util.UUID
    import java.util.concurrent.Executors

    // Read once and spread the data across all 55 CPU-vcores.
    var df = spark.read.parquet("s3://data/type=access/interval=1551484800").repartition(55)
    // df.cache()
    // val c = df.count()   // forces the cache to materialize
    // println(s"${c}")

    // Five threads, each submitting one write job of 11 partitions.
    val jobExecutor = Executors.newFixedThreadPool(5)
    for (_ <- Range(0, 5)) {
      jobExecutor.execute(new Runnable {
        override def run(): Unit = {
          val id = UUID.randomUUID().toString
          df.coalesce(11).write.parquet(s"s3://data/test/${id}")
        }
      })
    }
    jobExecutor.shutdown()

The first experiment failed: because of Spark's lazy evaluation, all five jobs recomputed the same read phase and competed for its tasks, leaving only one active stage at a time. Caching the DataFrame (uncommenting the cache and count lines above) before launching the write jobs resolved this, but introduced higher memory pressure, leading to executor evictions.
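The driver-side pattern itself (a fixed pool of submitter threads that is shut down and awaited so the driver does not move on before the jobs finish) can be sketched without Spark at all. The `AtomicInteger` below is a stand-in for the real write action and is purely illustrative:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Stand-in for the five concurrent write jobs: each Runnable plays
// the role of one df.coalesce(11).write.parquet(...) call.
val completedJobs = new AtomicInteger(0)

val jobExecutor = Executors.newFixedThreadPool(5)
for (_ <- Range(0, 5)) {
  jobExecutor.execute(new Runnable {
    override def run(): Unit = {
      // The real code would trigger the Spark write action here.
      completedJobs.incrementAndGet()
    }
  })
}

// Without these two calls, the driver could proceed (or exit)
// before the submitted jobs have finished.
jobExecutor.shutdown()
jobExecutor.awaitTermination(10, TimeUnit.MINUTES)

println(completedJobs.get()) // 5
```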
Key take‑aways from the experiments:
Multiple jobs can run in parallel if resources are available and the shared data is cached.
Task scheduling strategy (FIFO vs. FAIR) influences how TaskSetManagers are interleaved.
Memory must be sized appropriately for the increased parallelism.
Overall execution time improves: one job takes ~14 s; five sequential jobs take ~70 s; five parallel jobs finish in ~30 s, saving about 57 % of total time.
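The reported savings can be checked directly from the measured times:

```scala
// Timing figures from the article: one job takes ~14 s,
// five parallel jobs finish in ~30 s of wall time.
val sequentialSeconds = 5 * 14                 // five jobs back to back: 70 s
val parallelSeconds = 30
val savedFraction = 1.0 - parallelSeconds.toDouble / sequentialSeconds

println(f"$savedFraction%.2f")                 // ≈ 0.57, i.e. about 57 % saved
```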
In conclusion, parallel execution of multiple Spark write jobs is a viable technique for balancing resource utilization and file‑count reduction, and the author has successfully applied it in production.