Scaling LinkedIn’s Hadoop YARN Cluster Beyond 10,000 Nodes: Challenges and Solutions
This article examines how LinkedIn tackled severe scheduling slowdowns when its Hadoop YARN cluster grew to nearly 10,000 nodes, analyzes the root causes of resource‑manager bottlenecks, and describes the fairness‑redefinition and scheduling‑logic patches that restored throughput and scalability.
YARN currently plays a crucial role in Xiaomi's big-data ecosystem: the largest internal YARN cluster is approaching 6,000 nodes, is expanding rapidly, and will soon hit the performance ceiling of a single-cluster scheduler.
Inspired by a LinkedIn engineering blog, we explore how LinkedIn solved scheduling slowdowns when its Hadoop YARN cluster approached 10,000 nodes.
LinkedIn Hadoop Overview
LinkedIn uses Hadoop as the foundation for large-scale data analysis and machine learning. Overall data volume grows exponentially, and the company doubles its cluster size each year; the largest cluster now has about 10,000 nodes, making it one of the biggest Hadoop deployments in the world. Scaling Hadoop YARN has been a long-standing infrastructure challenge.
YARN Cluster Problems
Unlike HDFS's NameNode, which keeps the entire filesystem namespace in memory on a single node, YARN's ResourceManager is lightweight and maintains only a small amount of metadata. As a result, Hadoop's storage-side scalability problems surfaced earlier (around 2016), while YARN remained a stable component for a long time.
The cluster size doubles annually, and we knew YARN's scalability would eventually become a problem: the ResourceManager's single-threaded scheduler cannot sustain unlimited growth. Nevertheless, we assumed YARN would keep working until a replacement technology emerged, an assumption that held until early 2019.
LinkedIn HDFS storage capacity, NameNode object count, and YARN compute capacity trends
Historically, we deployed two Hadoop clusters in one data center: a primary cluster serving main business traffic (with compute and storage tightly coupled) and a secondary cluster used mainly for storage, leaving its compute resources idle. To improve utilization, we merged the secondary cluster’s compute nodes into the primary cluster as an independent partition (label‑based scheduling).
Two months after the merge, the cluster began to exhibit problems.
Symptoms
After merging, the cluster consisted of two partitions: ~4,000 nodes (primary) and ~2,000 nodes (secondary). Users experienced hour‑level job submission delays even though ample resources were available.
Initial debugging suspected a bug in YARN’s partition handling, but no code issues were found. We suspected that the 50% increase in cluster size overloaded the ResourceManager, causing scheduling performance to lag.
We examined the AggregatedContainerAllocation metric, which reflects container allocation speed. Before the merge, the primary cluster allocated about 500 containers per second, the secondary 250 per second. After merging, the peak reached 600 containers per second, but allocation often dropped to 50 containers per second for several hours.
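Figures like these come from sampling a cumulative allocation counter over time. As a toy illustration (the function and sample values below are hypothetical, not LinkedIn's actual tooling), a containers-per-second rate can be derived from consecutive snapshots of such a metric:

```python
def allocation_rate(samples):
    """Given (timestamp_seconds, cumulative_containers_allocated) snapshots,
    return the average containers/sec between consecutive samples."""
    return [
        (c1 - c0) / (t1 - t0)
        for (t0, c0), (t1, c1) in zip(samples, samples[1:])
    ]

# e.g. a healthy 10-second window followed by a degraded one
print(allocation_rate([(0, 0), (10, 5000), (20, 5500)]))  # [500.0, 50.0]
```

The same cumulative-counter-plus-delta pattern applies to most scheduler throughput metrics.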
Further analysis revealed costly operations, such as DNS lookups, being performed inside synchronized blocks, which limited concurrency. Moving this work out of the synchronized sections improved throughput by about 10%, but user-visible latency remained significant.
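The pattern behind that 10% win is generic: never hold a hot shared lock across a slow external call. A minimal sketch of both variants (the names, and the sleep standing in for a DNS lookup, are hypothetical):

```python
import threading
import time

_lock = threading.Lock()
_cache = {}

def slow_resolve(host):
    """Stand-in for an expensive external call such as a DNS lookup."""
    time.sleep(0.01)
    return f"10.0.0.{len(host) % 256}"

def resolve_holding_lock(host):
    # Anti-pattern: the slow call runs while the shared lock is held,
    # so every other thread needing the lock stalls behind it.
    with _lock:
        if host not in _cache:
            _cache[host] = slow_resolve(host)
        return _cache[host]

def resolve_outside_lock(host):
    # Fix: only the cheap cache accesses happen under the lock;
    # the slow call no longer serializes other threads.
    with _lock:
        cached = _cache.get(host)
    if cached is not None:
        return cached
    value = slow_resolve(host)
    with _lock:
        return _cache.setdefault(host, value)
```

The second form trades a possible duplicate lookup on a cache miss for lock hold times that stay in the microsecond range.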
Redefining Fairness to Relieve Pressure
Audit logs showed that the scheduler frequently allocated containers to a single queue for extended periods before switching to another queue. Even when overall throughput was normal (600 containers/s), some queues still suffered hour‑level delays while others experienced none. This indicated that the scheduler’s queue‑selection logic, based on utilization‑ordered capacity scheduling, was causing imbalance.
Consider two queues, A (10% utilization) and B (20% utilization). The scheduler prefers A because of its lower utilization. In high-throughput scenarios this can turn into a livelock: if A's jobs are short-running and B's are long-running, the scheduler keeps assigning containers to A, whose jobs finish quickly enough that its utilization stays around 10% (or even falls), while B's utilization slowly drifts down from 20% toward 19% as its long jobs complete, so B never becomes the preferred queue and never receives resources.
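This starvation can be reproduced with a toy model (queue names, capacities, and job lifetimes below are illustrative, not LinkedIn's configuration): because A's short jobs complete before the next scheduling decision, utilization-ordered selection chooses A every single time.

```python
def pick_lowest_utilization(util):
    """Capacity-scheduler-style choice: favor the least-utilized queue."""
    return min(util, key=util.get)

# A at 10% and B at 20% of a 100-container cluster
util = {"A": 10.0, "B": 20.0}
grants = {"A": 0, "B": 0}
for _ in range(100):
    q = pick_lowest_utilization(util)
    grants[q] += 1
    if q == "A":
        # A's jobs are short: the container granted now finishes before
        # the next decision, so A's utilization never rises above 10%.
        pass
    else:
        util[q] += 1.0  # B's long-running jobs would hold their containers

print(grants)  # {'A': 100, 'B': 0} -- B is starved despite pending demand
```

Every one of the 100 decisions goes to A; B's demand is never served, matching the hour-level delays observed in the audit logs.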
In LinkedIn's case, the primary partition hosts AI training and long-running Spark jobs, while the secondary partition runs short-lived MapReduce jobs with high container churn. Before the merge, each cluster's workload was homogeneous and a single ResourceManager never had to arbitrate between the two profiles, so the bias went unnoticed; once both workloads shared one scheduler, the fairness issue became evident.
We mitigated the issue by selecting queues at random with equal probability instead of ordering them by utilization. This temporarily alleviated the delays, and the patch was contributed back to Apache Hadoop.
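The mitigation can be sketched as follows (a toy model, not the actual Hadoop patch): with equal-probability choice, a queue whose utilization never rises can no longer monopolize the scheduler, because selection ignores utilization entirely.

```python
import random

random.seed(42)  # deterministic for the example

queues = ["A", "B"]
grants = {"A": 0, "B": 0}
for _ in range(1000):
    # Equal-probability choice, independent of current utilization
    q = random.choice(queues)
    grants[q] += 1

# Both queues now receive a comparable share of scheduling attention
assert grants["A"] > 0 and grants["B"] > 0
print(grants)
```

Randomization trades a little of the capacity scheduler's balancing intent for a hard guarantee that no queue is starved indefinitely.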
Fundamental Cause of Low Efficiency
Although the fairness tweak helped, the core cause of slow scheduling remained unresolved. YARN’s scalability is still a pressing concern that requires deeper investigation.
Comparing pre- and post-merge throughput shows a best case of about 80% of the combined pre-merge rate (≈600 vs. 750 containers/s) and a worst case of only about 7% (≈50 containers/s), pointing to irregularities in the partitioned scheduling logic.
By default, YARN's ResourceManager schedules synchronously, triggered by NodeManager heartbeats: when a node heartbeats, the scheduler tries to place pending containers on it. A container whose partition label does not match the node's label cannot be placed there.
Within a queue, the scheduler considers requests in FIFO order. Suppose a node in the primary partition heartbeats and the scheduler picks queue A, but the first 100 pending requests in queue A all target the secondary partition. The scheduler walks through them one by one, and every attempt fails the label check, so most of the heartbeat's scheduling opportunity is wasted and overall scheduling slows down.
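The cost asymmetry shows up in a small sketch (data structures and function names are illustrative, not YARN's internal API): a FIFO walk pays one failed label check per mismatched request, while grouping requests by partition makes each heartbeat's lookup effectively constant-time.

```python
def fifo_schedule(node_partition, pending):
    """Walk the FIFO request list, failing label checks one by one."""
    attempts = 0
    for req in pending:
        attempts += 1
        if req["partition"] == node_partition:
            return req, attempts
    return None, attempts

def partition_aware_schedule(node_partition, by_partition):
    """Only consider requests targeting the heartbeating node's partition."""
    bucket = by_partition.get(node_partition, [])
    return (bucket[0], 1) if bucket else (None, 0)

# 100 requests for the secondary partition queued ahead of 1 primary request
pending = [{"id": i, "partition": "secondary"} for i in range(100)]
pending.append({"id": 100, "partition": "primary"})

req, attempts = fifo_schedule("primary", pending)
print(attempts)  # 101 label checks to place a single container

by_partition = {"primary": [pending[-1]], "secondary": pending[:-1]}
req, attempts = partition_aware_schedule("primary", by_partition)
print(attempts)  # 1
```

When both partitions are busy and mismatched requests dominate the head of the queue, eliminating the wasted checks is exactly the kind of change that can yield the order-of-magnitude worst-case improvement described below.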
To address this, we modified the logic so that when a node from either partition heartbeats, the scheduler first checks which partition each pending request targets and skips requests bound for the other partition. After this change, average throughput before and after the merge remained similar, but in the worst case, where both partitions are heavily loaded, performance improved nine-fold. This patch was also contributed back to the community.
The next article will explore additional scalability efforts made by LinkedIn.
Translated from the LinkedIn Engineering Blog.
DataFunTalk