
Performance Tuning of Hive on Spark in YARN Mode

This article explains how to optimize Hive on Spark running on YARN, covering YARN node resource configuration, Spark executor and driver memory settings, dynamic allocation, parallelism, and key Hive parameters to achieve superior performance compared to Hive on MapReduce.


Hive on Spark delivers much better performance than Hive on MapReduce while providing the same functionality, and HiveQL can run unchanged.

This guide focuses on tuning Hive on Spark when it runs in YARN mode, assuming a node with 32 CPU cores and 120 GB memory.

YARN resource configuration

Set the number of vcores and memory available to YARN based on the node’s capacity:

yarn.nodemanager.resource.cpu-vcores=28
yarn.nodemanager.resource.memory-mb=102400

Reserve a few cores and some memory for the OS, the HDFS DataNode and the NodeManager, and give the rest to YARN; here 4 cores and 20 GB are reserved, leaving 28 vcores and 102400 MB (100 GB). Note that yarn-site.xml takes literal values, so write 102400 rather than an arithmetic expression like 100*1024.
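The split is simple arithmetic; as a sanity check, a throwaway helper (illustrative only, not part of any YARN API) reproduces the numbers:

```python
def yarn_resources(total_cores, total_mem_gb, reserved_cores=4, reserved_mem_gb=20):
    """Split node capacity between YARN and system daemons.

    The reservations (4 cores, 20 GB) are illustrative values covering the
    OS, the HDFS DataNode and the NodeManager, as described in the text.
    """
    vcores = total_cores - reserved_cores
    memory_mb = (total_mem_gb - reserved_mem_gb) * 1024
    return vcores, memory_mb

# The 32-core / 120 GB node from this guide:
print(yarn_resources(32, 120))  # (28, 102400)
```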

Spark executor and driver settings

Choose the executor core count (e.g., 4) so that it divides the cores available to YARN evenly (28 cores → 7 executors per node). Divide the node’s YARN memory by the executor count to get the per-executor budget (≈14 GB), then carve out 15–20% of that budget as off-heap overhead:

spark.executor.cores=4
spark.executor.memory=12g
spark.executor.memoryOverhead=2g
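The sizing above can be sketched as a small helper (hypothetical, for illustration only); with the example node it yields 7 executors of roughly 12 GB heap plus 2 GB overhead:

```python
def executor_plan(yarn_vcores, yarn_mem_mb, cores_per_executor=4,
                  overhead_fraction=0.15):
    """Derive the per-node executor count and memory split.

    Executors per node = vcores // cores_per_executor; each executor's
    share of YARN memory is then split into JVM heap plus an off-heap
    overhead of 15-20% (0.15 used here).
    """
    executors = yarn_vcores // cores_per_executor
    mem_per_executor = yarn_mem_mb // executors
    overhead_mb = int(mem_per_executor * overhead_fraction)
    heap_mb = mem_per_executor - overhead_mb
    return executors, heap_mb, overhead_mb

# 28 vcores and 102400 MB of YARN memory per node:
executors, heap_mb, overhead_mb = executor_plan(28, 102400)
print(executors, heap_mb, overhead_mb)  # 7 executors, ~12 GB heap, ~2 GB overhead
```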

Driver memory should be chosen based on total YARN memory (X): 12 GB if X > 50 GB, 4 GB if 12 GB < X < 50 GB, 1 GB if 1 GB < X < 12 GB, otherwise 256 MB.
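The rule of thumb above, expressed as a tiny helper (illustrative only):

```python
def driver_memory(yarn_mem_gb):
    """spark.driver.memory rule of thumb; the argument is the total
    memory available to YARN (X in the text), in GB."""
    if yarn_mem_gb > 50:
        return "12g"
    if yarn_mem_gb > 12:
        return "4g"
    if yarn_mem_gb > 1:
        return "1g"
    return "256m"

print(driver_memory(100))  # "12g" for the 100 GB example node
```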

Number of executors and dynamic allocation

Maximum executors per node = 7; total executors = nodes × 7 (e.g., 40 nodes → 280 executors). Use static allocation for benchmarks, but enable dynamic allocation in multi‑user production environments.
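A dynamic-allocation setup for the multi-user case might look like the following; the min/max values are illustrative (the max matching the 280-executor example), and the external shuffle service must be running on every NodeManager:

```
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=280
```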

Parallelism and reducer settings

Ensure enough tasks are generated to keep all executors busy. Adjust hive.exec.reducers.bytes.per.reducer to control reducer count; Spark is less sensitive to this value than MapReduce.
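Hive’s reducer estimate is roughly input size divided by bytes-per-reducer, capped by hive.exec.reducers.max; a simplified model of that estimate (an illustrative helper, not Hive’s actual code):

```python
import math

def estimated_reducers(input_bytes, bytes_per_reducer=67_108_864,
                       max_reducers=1009):
    """Simplified model of Hive's reducer-count estimate.

    67108864 (64 MB) matches hive.exec.reducers.bytes.per.reducer in the
    settings below; 1009 is the default of hive.exec.reducers.max.
    """
    return min(max_reducers, max(1, math.ceil(input_bytes / bytes_per_reducer)))

# 10 GB of input at 64 MB per reducer:
print(estimated_reducers(10 * 1024**3))  # 160
```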

Hive configuration parameters

Key settings that affect performance include:

hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.merge.mapfiles=true
hive.merge.mapredfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.sparkfiles=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=20000000   // value is in bytes (≈20 MB); increase for Spark, e.g., 200000000 (≈200 MB)
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.optimize.ppd=true

Set hive.auto.convert.join.noconditionaltask.size to a larger value for Spark than you would for MapReduce, because Hive on Spark compares it against the rawDataSize statistic (uncompressed, deserialized size) rather than totalSize (on-disk size) when deciding whether to convert a join to a map join.
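The value can be overridden per session before touching hive-site.xml; 200000000 here is an illustrative starting point, not a recommendation for every workload:

```
-- show the current value, then raise it for this session
set hive.auto.convert.join.noconditionaltask.size;
set hive.auto.convert.join.noconditionaltask.size=200000000;
```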

Pre‑warming YARN containers

Enable hive.prewarm.enabled=true and set hive.prewarm.numcontainers (default 10) to reduce first‑query latency by pre‑starting executors.
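For example, in hive-site.xml or per session:

```
set hive.prewarm.enabled=true;
set hive.prewarm.numcontainers=10;
```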

Tags: Big Data, Performance Tuning, Hive, YARN, Spark, Cluster Configuration
Written by Big Data Technology Architecture — Exploring Open Source Big Data and AI Technologies
