Big Data · 12 min read

ByteDance’s Optimizations to Hadoop YARN: Enhancing Utilization, Multi‑Load Scenarios, Stability, and Multi‑Region Active‑Active

This article describes ByteDance’s four‑year series of customizations to Hadoop YARN—covering utilization improvements, multi‑load scenario optimizations, stability enhancements, and multi‑region active‑active deployment—along with practical production experiences, architectural details, and future work directions.

DataFunTalk

Introduction – The article outlines ByteDance’s four‑year effort to optimize Hadoop YARN for higher utilization, multi‑load workloads, stability, and cross‑region active‑active operation, summarizing the main topics covered.

YARN Overview – YARN (Yet Another Resource Negotiator) is the resource manager for Hadoop clusters, sitting between the physical nodes and distributed compute engines (MapReduce, Spark, Flink). The architecture includes the ResourceManager (cluster brain) and NodeManager (resource provider and container lifecycle manager).
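The split of responsibilities above can be sketched with a toy model: the ResourceManager decides which node gets a container, and the NodeManager owns per-node resources and launches it. This is plain Python for illustration only, not the real YARN API; all class and method names are invented.

```python
# Toy model of the YARN control plane: the ResourceManager (cluster brain)
# grants containers; NodeManagers track per-node resources and run them.

class NodeManager:
    def __init__(self, node_id, memory_mb, vcores):
        self.node_id = node_id
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores
        self.running = []  # container ids launched on this node

    def can_fit(self, memory_mb, vcores):
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores

    def launch(self, container_id, memory_mb, vcores):
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        self.running.append(container_id)

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 0

    def allocate(self, memory_mb, vcores):
        """Grant one container on the first node with room, else None."""
        for node in self.nodes:
            if node.can_fit(memory_mb, vcores):
                cid = f"container_{self.next_id}"
                self.next_id += 1
                node.launch(cid, memory_mb, vcores)
                return cid, node.node_id
        return None

rm = ResourceManager([NodeManager("nm1", 8192, 4), NodeManager("nm2", 4096, 2)])
grant = rm.allocate(6144, 2)  # only nm1 has room for 6 GB / 2 cores
```

A compute engine (MapReduce, Spark, Flink) plays the client role here: it asks the ResourceManager for containers and then runs its tasks inside whatever the NodeManagers launch.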

ByteDance Customizations

Utilization Boost – Multi‑threaded FairScheduler (7× throughput, >3K containers/s), node‑level Dominant Resource Fairness (DRF) for balanced CPU/memory allocation, scaling the cluster to 20 000 nodes, and mixing offline, streaming, and online workloads to raise CPU usage above 90 %.
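Node-level DRF builds on the dominant-share idea from the Dominant Resource Fairness scheduling literature: a job's dominant share is the larger of its CPU and memory fractions of the cluster, and the scheduler serves the job with the smallest one. A minimal sketch of that computation (cluster totals and job demands are made up, not ByteDance's numbers):

```python
# Minimal DRF sketch: compute each job's dominant share and pick the
# job that is furthest behind to receive the next allocation.

CLUSTER = {"cpu": 90, "mem_gb": 180}

def dominant_share(alloc):
    """The larger of a job's CPU and memory fractions of the cluster."""
    return max(alloc["cpu"] / CLUSTER["cpu"],
               alloc["mem_gb"] / CLUSTER["mem_gb"])

def pick_next(allocations):
    """Return the job id whose current dominant share is lowest."""
    return min(allocations, key=lambda j: dominant_share(allocations[j]))

allocs = {
    "jobA": {"cpu": 30, "mem_gb": 30},  # CPU-dominant: 30/90
    "jobB": {"cpu": 10, "mem_gb": 90},  # memory-dominant: 90/180
}
winner = pick_next(allocs)  # jobA, whose dominant share is smaller
```

Applying this per node rather than only cluster-wide is what keeps individual machines from filling up on one dimension (say, memory) while the other sits idle.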

Multi‑Load Scenario Optimizations – Developed a Gang Scheduler with All‑or‑Nothing semantics and millisecond‑level latency for streaming and training jobs, refined CPU sharing policies with CGroup extensions, added GPU‑aware scheduling, and introduced load‑aware container placement to reduce fetch‑failed errors by ~40 %.
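All-or-Nothing semantics mean the scheduler either places every container of a gang or commits nothing, so a streaming or training job never starts half-formed. A minimal placement sketch under that rule (node names and slot counts are illustrative):

```python
# All-or-nothing gang placement sketch: place tentatively against a
# copy of the free capacity; commit only if the whole gang fits.

def gang_place(free, demands):
    """free: {node: free_slots}; demands: slot count per gang container.
    Returns {container_index: node} if the whole gang fits, else None."""
    remaining = dict(free)  # tentative copy; the real capacity is untouched
    placement = {}
    for i, need in enumerate(demands):
        node = next((n for n, s in remaining.items() if s >= need), None)
        if node is None:
            return None  # one container failed, so the whole gang is rejected
        remaining[node] -= need
        placement[i] = node
    return placement
```

A gang of demands `[3, 2, 1]` fits on nodes with `{"n1": 4, "n2": 2}` free slots, while `[3, 3]` does not and returns `None` without touching either node.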

Stability Enhancements – Made HDFS a weak dependency, introduced container tiering and eviction, added cleanup of uncontrolled containers, and implemented a cluster safety mode for disaster recovery.
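Container tiering and eviction can be sketched as follows: when a node exceeds its resource budget, reclaim from the lowest-priority tier first. The tier numbers and the single-number usage model are illustrative assumptions, not ByteDance's actual policy:

```python
# Tier-based eviction sketch: higher tier = more important workload
# (e.g. online services above streaming above offline batch).

def evict_until_fit(containers, capacity):
    """containers: list of (container_id, tier, usage).
    Returns the ids evicted to bring total usage within capacity."""
    total = sum(usage for _, _, usage in containers)
    evicted = []
    # Lowest tier first; within a tier, biggest consumer first,
    # so as few containers as possible are killed.
    for cid, tier, usage in sorted(containers, key=lambda c: (c[1], -c[2])):
        if total <= capacity:
            break
        total -= usage
        evicted.append(cid)
    return evicted

pods = [("online", 2, 40), ("batch1", 0, 30), ("batch2", 0, 20), ("stream", 1, 25)]
victims = evict_until_fit(pods, capacity=70)  # batch containers go first
```

Evicting from the bottom tier is what makes mixing offline and online workloads safe: a load spike in an online service reclaims capacity from batch jobs, which can be rescheduled, rather than from other latency-sensitive containers.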

Cross‑Region Active‑Active – Unified global YARN UI, abandoned strict data locality in multi‑datacenter setups, and added safety‑mode and dynamic resource adjustment to support seamless multi‑region operation.

Future Work – Plans include further improving physical utilization, isolation, kill‑rate control, GPU resource mixing, enriching Gang Scheduler predicates, and lowering scheduling latency.

Team Introduction & Recruitment – The YARN team supports ByteDance’s core products (recommendation, search, advertising) across massive clusters, holds dozens of patents, and is hiring in Beijing and Hangzhou (job link provided).

Community & Resources – The article ends with calls to like, share, and join the DataFunTalk big‑data community, plus links to related ByteDance technical posts.

Tags: big data, resource management, YARN, Hadoop, cluster optimization, ByteDance
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
