
Improving Resource Utilization and Isolation in Bilibili Big Data Clusters with the Amiya Over‑commit Component

By deploying the self‑developed Amiya over‑commit component together with kernel‑level cgroup memory isolation, explicit task priorities, OOM‑priority killing, and asynchronous reclamation, Bilibili's big‑data clusters raised daily resource utilization by about 15%, eliminated OOM kills of the DataNode, drove direct memory‑reclaim counts and latency to zero on test machines, and achieved a further 9% overall utilization gain.

Bilibili Tech

Background: The rapid growth of Bilibili's big‑data services has led to a surge in resource requests. However, cluster utilization remains lower than expected because business teams often request more resources than they actually need and because the system lacks sufficient isolation between high‑ and low‑priority tasks, causing high‑priority jobs to be affected or even killed.

The team introduced a self‑developed over‑commit component called Amiya. Amiya inflates the resource capacity reported to the scheduler based on actual machine load, allowing more tasks to be scheduled, and evicts a controlled number of tasks when usage approaches the machine's real capacity. (See the original article for details.)
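The core idea can be sketched in a few lines. This is an illustrative simplification, not Bilibili's actual implementation: the function names, the 90% headroom factor, and the 5% eviction margin are all assumptions.

```python
# Hypothetical sketch of Amiya's over-commit logic (names and thresholds
# are illustrative assumptions, not the real component): advertise inflated
# capacity to the scheduler based on real load, and trigger eviction of
# low-priority tasks when real usage nears actual machine capacity.

def reported_capacity(physical: float, used: float, headroom_factor: float = 0.9) -> float:
    """Capacity advertised to the scheduler.

    The lower the real usage, the more extra headroom is reported,
    so the scheduler packs additional tasks onto the machine.
    """
    free = max(physical - used, 0.0)
    # Advertise real capacity plus a fraction of the unused headroom.
    return physical + free * headroom_factor

def should_evict(used: float, physical: float, safety_margin: float = 0.05) -> bool:
    """Start controlled eviction once usage approaches real capacity."""
    return used >= physical * (1.0 - safety_margin)

# Example: a 256 GiB machine with only 64 GiB actually in use.
print(reported_capacity(256.0, 64.0))  # advertises well above 256 GiB
print(should_evict(250.0, 256.0))      # usage near capacity: evict
```

The key trade-off is visible here: a larger headroom factor raises density (and the ~15% utilization gain below), but pushes the machine closer to the eviction threshold more often.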

After Amiya went live, overall task density increased and daily resource utilization rose by roughly 15%. The higher utilization, however, exposed stronger resource‑isolation problems, especially for high‑priority tasks.

Requirement Analysis: Monitoring across four dimensions (CPU, network, disk I/O, memory) showed that CPU and network were not bottlenecks. Disk I/O appeared busy but could not be conclusively deemed a bottleneck. Memory usage was high: OOM‑killer events were frequent, and both business and system sides suffered from memory pressure.

Business‑side issues:

No task priority – resources compete without distinction.

The OOM killer could terminate the DataNode, creating a risk of data loss.

System‑side issues:

When memory usage hits the cgroup limit, frequent direct memory reclamation leads to high allocation latency.

Because of the memory bottleneck, the over‑commit component could not increase its over‑commit ratio further.

Solution: The primary remedy is to employ kernel‑level memory isolation via cgroups and to define explicit task priorities.

Task Priority Definition: After discussions with business owners, the following hierarchy was set:

DN (DataNode) – highest priority, as its failure can cause data loss.

NM (NodeManager) – medium priority; its loss is tolerable because tasks can be retried.

Task containers (CNTx) – lowest priority.

The Hadoop cgroup topology was restructured accordingly, following two principles: higher hierarchy = higher priority, and parameters at the same level differentiate priorities.
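A minimal sketch of what such a topology could look like, assuming a cgroup v1 memory hierarchy. The mount path, the `memory.priority` parameter name, and the numeric values are assumptions for illustration (per-cgroup priority knobs like this exist only in some vendor kernels); the real layout is not given in the article.

```python
# Illustrative sketch of the restructured Hadoop cgroup topology:
# DN sits at the top level, NM below it, and task containers (CNTx)
# nested under NM, with a same-level parameter expressing priority.
# Paths and the "memory.priority" knob are assumptions, not the
# article's exact configuration.
from typing import List, Tuple

CGROUP_ROOT = "/sys/fs/cgroup/memory/hadoop"  # assumed mount point

# (relative path, priority) -- higher number = higher priority.
TOPOLOGY = [
    ("dn", 12),          # DataNode: highest, failure risks data loss
    ("nm", 8),           # NodeManager: medium, tasks can be retried
    ("nm/cnt-1", 1),     # task containers: lowest
    ("nm/cnt-2", 1),
]

def render_config(topology) -> List[Tuple[str, str]]:
    """Return (file, value) pairs that would be written to set priorities."""
    return [(f"{CGROUP_ROOT}/{path}/memory.priority", str(prio))
            for path, prio in topology]

for path, value in render_config(TOPOLOGY):
    print(path, "=", value)
```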

OOM‑Kill DN Problem: The OS team enabled the kernel's OOM‑priority feature. When an OOM occurs, the kernel selects a victim from the lowest‑priority cgroup first, protecting the high‑priority DN. Priorities are configured as DN > NM > CNTx.
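The selection rule can be simulated in user space. This is a toy model of the behavior described above, not kernel code; the tie-breaking on memory usage mirrors the usual OOM heuristic but is an assumption here.

```python
# Toy simulation (not kernel code) of OOM-priority victim selection:
# pick from the lowest-priority cgroup first; among equals, the
# heaviest memory user (assumed tie-break, mirroring common OOM logic).

def pick_oom_victim(cgroups: dict) -> str:
    """cgroups maps name -> (priority, memory_usage_gib)."""
    return min(cgroups, key=lambda n: (cgroups[n][0], -cgroups[n][1]))

groups = {
    "dn":    (12, 40),   # highest priority: protected
    "nm":    (8,  20),
    "cnt-1": (1,  30),   # lowest priority, heaviest user
    "cnt-2": (1,  10),
}
print(pick_oom_victim(groups))  # cnt-1 dies first; dn survives
```

With DN > NM > CNTx configured, a DN kill can only happen after every container and the NM are already gone, which matches the "no DN kills observed" result reported below.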

Memory Allocation Latency: The OS team activated the memcg asynchronous‑reclamation feature, setting a waterline at 95% of memory.limit_in_bytes. When usage crosses this threshold, a background thread starts reclaiming memory early, reducing the frequency of direct reclamation.
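The arithmetic of the waterline is simple enough to show directly. The 95% ratio comes from the article; the function names are illustrative.

```python
# Waterline logic for memcg async reclamation, as described above.
# The 95% ratio is from the article; names are illustrative.

WMARK_RATIO = 0.95

def async_reclaim_wmark(limit_in_bytes: int) -> int:
    """Usage level at which background reclamation kicks in."""
    return int(limit_in_bytes * WMARK_RATIO)

def needs_background_reclaim(usage: int, limit: int) -> bool:
    return usage >= async_reclaim_wmark(limit)

limit = 8 << 30  # an 8 GiB memory.limit_in_bytes, for example
print(async_reclaim_wmark(limit))                              # 95% of the limit
print(needs_background_reclaim(7 << 30, limit))                # below waterline
print(needs_background_reclaim(int(7.8 * (1 << 30)), limit))   # above waterline
```

The point of the 5% gap is that reclamation happens on a background thread while allocations still succeed from the remaining headroom, instead of the allocating task stalling in direct reclaim at 100% of the limit.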

Benefit Estimation – DN Kill Frequency: Before enabling the OOM‑priority feature, the DN was killed roughly once per week. In two weeks of testing after enabling it, no DN kills were observed.

Command‑line evidence:

$ dmesg | grep constraint= | grep /hadoop-low | wc -l
45
$ dmesg | grep constraint= | grep /hadoop-high | wc -l
0

Benefit Estimation – Memcg Direct Reclaim: Two machines were tested over six days, alternating the feature on and off. With the feature on, both the count and the latency of direct memory reclamation dropped to zero, confirming the expected effect.
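One way to verify a result like this is to diff the direct-reclaim stall counters in /proc/vmstat before and after a test window. The sketch below is illustrative tooling, not the team's actual measurement script; the embedded samples are fabricated stand-ins for real /proc/vmstat snapshots.

```python
# Illustrative check (not the team's actual tooling): sample the
# allocstall_* counters in /proc/vmstat at the start and end of a test
# window; with async reclamation on, the delta should stay at zero.

SAMPLE_BEFORE = """\
pgscan_direct 1024
allocstall_normal 12
allocstall_movable 3
"""
SAMPLE_AFTER = """\
pgscan_direct 1024
allocstall_normal 12
allocstall_movable 3
"""

def parse_vmstat(text: str) -> dict:
    return {k: int(v) for k, v in (line.split() for line in text.splitlines())}

def direct_reclaim_delta(before: str, after: str) -> int:
    b, a = parse_vmstat(before), parse_vmstat(after)
    return sum(a[k] - b[k] for k in a if k.startswith("allocstall"))

print(direct_reclaim_delta(SAMPLE_BEFORE, SAMPLE_AFTER))  # 0: no direct reclaim
```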

CPU Scheduling Delay: The OS team also evaluated the Group Identity (GI) feature, which reduces CPU wake‑up latency for high‑priority tasks. At 65% CPU load, GI showed negligible benefit. At 83% load, GI kept 99.99% of scheduling delays within 0‑2 ms, whereas without GI the 0‑2 ms proportion fell to 97.1%.

Gray‑scale Testing & Problem Analysis: During the gray‑scale rollout, Spark's ESS (External Shuffle Service) component hit frequent shuffle‑data corruption. Investigation showed the issue was not caused by the new mixed‑kernel features themselves but was amplified by the higher overall load. The root cause was traced to an XFS bug involving partial writes, iomap handling, and memory reclamation.

Key findings from kernel logs and community discussions indicated that the XFS buffer‑IO path lacked proper iomap validity checks.

Problem Fix: A hot‑patch was built that adds a validity_cookie field to struct iomap and introduces a new struct iomap_bili wrapper with a magic number for sanity checking. Sample diff:

@@ -89,6 +98,7 @@ struct iomap {
 	void *inline_data;
 	void *private;		/* filesystem private */
 	const struct iomap_page_ops *page_ops;
+	u64 validity_cookie;	/* used with .iomap_valid() */
 };

Additional code in the original article shows that the KLP (kernel live patching) shadow‑variable API takes a global lock internally, which the final solution avoided.

Landing Effect: After applying the fix, random sampling of machines showed a 9% increase in overall resource utilization while keeping high‑priority tasks safe.

Current Status & Outlook: The mixed‑kernel features are now in gray‑scale testing on thousands of machines. Further work includes tuning the over‑commit ratio, adjusting scheduler parameters, and exploring memory cold‑page compression to squeeze out additional efficiency.

References:

Shuffle data error diagnosis – https://github.com/apache/spark/pull/33451

XFS issue discussion – https://lore.kernel.org/linux-xfs/[email protected]/
