Improving Spark Write Performance for Massive Files on Object Storage with Tencent Cloud EMR
Spark's driver-side commit, trash, and move phases were previously single-threaded, and on object storage each per-file rename degrades into a full copy. By parallelizing these phases, the Tencent Cloud EMR case achieved a 1,100% (over tenfold) speedup when writing massive numbers of files, making object storage a viable alternative to HDFS.
With the evolution of big‑data architectures, storage‑compute separation has become a popular way to reduce storage costs and schedule compute resources on demand. Storing data in object storage is cheaper than HDFS, but its write performance for massive files is often much lower.
Tencent Cloud Elastic MapReduce (EMR) provides a managed, elastic Hadoop service that supports Spark, HBase, Presto, Flink, Druid and other big-data frameworks. A recent customer case used the Spark component of EMR for computation while persisting data in object storage. During performance tuning it emerged that Spark's write throughput to object storage was dramatically lower, roughly 29× slower than writing to HDFS.
To investigate, the Spark data-output process was dissected. Each task writes its result to a temporary directory (_temporary/task_[id]) on the executor. After all tasks finish, the driver performs three sequential phases: commitJob (merging temporary files), trashFiles (moving overwritten data to a recycle bin), and moveFiles (moving the merged files to the final Hive table location).
Profiling showed that the executor time was comparable for both storage back‑ends, while the total job duration was dominated by driver‑side operations. All three driver phases were identified as performance hotspots. The root cause is that each phase iterates over files with a single‑threaded loop, and on object storage a rename operation requires copying the entire file rather than a simple metadata update, further amplifying the delay.
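A toy cost model makes the amplification concrete. The throughput and latency figures below are illustrative assumptions, not measured values from the case: the point is only that a sequential loop multiplies the per-file cost by the file count, and that on object storage the per-file cost includes a full copy rather than a metadata update.

```python
def sequential_rename_cost(n_files, file_mb, copy_mb_per_s=100.0, meta_ms=5.0):
    """Toy model of a single-threaded rename loop (illustrative numbers).

    On HDFS a rename is a metadata-only operation; on object storage it is
    typically implemented as copy + delete, so every byte is rewritten.
    Returns (hdfs_seconds, object_storage_seconds).
    """
    hdfs_s = n_files * meta_ms / 1000.0
    object_store_s = n_files * (meta_ms / 1000.0 + file_mb / copy_mb_per_s)
    return hdfs_s, object_store_s
```

With, say, 5,000 files of 100 MB each, the object-storage loop is dominated by copy time, which is why the same sequential code that is tolerable on HDFS becomes a hotspot on object storage.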
Source‑code analysis revealed:
commitJob phase: Spark uses Hadoop's FileOutputCommitter, which in Hadoop 2.x defaults to the version 1 algorithm: a single-threaded traversal and merge of task directories.
TrashFiles phase: Files are moved to a trash directory sequentially, which becomes a bottleneck when many files need to be overwritten.
MoveFiles phase: Similar single‑threaded file moves are performed, suffering the same latency on object storage.
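For context, Hadoop's standard knob for the commitJob behavior is the committer algorithm version; version 2 has tasks commit their output directly, shrinking the driver-side merge. This is a separate, stock mitigation and not the multithreading fix described in this article, but it addresses the same phase. In Spark, Hadoop properties are forwarded via the spark.hadoop. prefix (the application jar name below is a placeholder):

```shell
# Switch FileOutputCommitter from the v1 algorithm (Hadoop 2.x default)
# to v2; config fragment only — your_job.jar is a hypothetical application.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  your_job.jar
```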
To mitigate these issues, the three phases were refactored to employ multithreaded, parallel file handling. Benchmark tests using SparkSQL to write 5,000 files to both HDFS and object storage showed substantial gains: HDFS write performance increased by 41% and object-storage write performance surged by 1,100% (over tenfold).
The study demonstrates that single‑threaded file‑processing logic is a common bottleneck in storage‑compute separation scenarios. Parallelizing commit, trash, and move operations can dramatically improve write performance, making object storage a more viable option for massive‑file workloads.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.