
Amiya: Dynamic Overcommit Component for Bilibili Offline Big Data Cluster

Amiya is a self‑developed dynamic overcommit component for Bilibili's offline big‑data cluster. It inflates the resources reported by under‑utilized nodes and shrinks them as load rises, adding roughly 683 TB of memory and 137 K vCores to the cluster, boosting per‑node memory by 15 % and CPU usage by over 20 % while keeping eviction rates below 3 %.


This article introduces Amiya, a self‑developed component that addresses resource shortages on Bilibili's offline big‑data platform. Over the past year, the offline cluster faced two main challenges: rapid expansion of the node count, leading to high Pending rates, and the need to improve resource utilization without adding physical machines.

Amiya implements dynamic overcommit (超配) on individual physical machines and collaborates with the cloud platform to enable mixed‑deployment (混部). After deployment, Amiya added approximately 683 TB of allocatable Memory and 137 K vCores to the Yarn offline cluster, as shown in Figures 1 and 2.

Architecture (Figure 3) – The system consists of AmiyaContext, StateStoreManager, CheckPointManager, NodeResourceManager, OperatorManager, InspectManager, and AuditManager. Each module’s responsibilities are described, with NodeResourceManager handling the core overcommit logic and OperatorManager providing interaction with Yarn and K8s.

Overcommit Logic – Based on the principle that users request more resources than they actually use, Amiya reports inflated resource amounts to the scheduler when a node’s CPU/Memory usage is low (OverCommit) and reduces reported resources when usage is high (DropOff). The decision process uses thresholds such as OverCommitThreshold, DropOffThreshold, and capacity limits (CPU/MemoryRatio). Three‑level validation (range, magnitude, time interval) prevents excessive oscillation.
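The decision loop described above can be sketched as follows. All names and threshold values here are illustrative assumptions for exposition, not Amiya's actual identifiers or configuration:

```python
# Hypothetical sketch of Amiya's per-node overcommit decision.
# Threshold names and values are assumptions, not Amiya's real config.

OVERCOMMIT_THRESHOLD = 0.60   # below this usage ratio, inflate reported resources
DROPOFF_THRESHOLD = 0.85      # above this usage ratio, shrink reported resources
MEMORY_RATIO = 1.5            # capacity limit: report at most 1.5x physical memory
MAX_STEP = 0.10               # magnitude check: change at most 10% per cycle


def next_reported_memory(physical: float, reported: float, used: float) -> float:
    """Return the memory amount to report to the scheduler next cycle."""
    usage = used / reported
    if usage < OVERCOMMIT_THRESHOLD:
        target = reported * (1 + MAX_STEP)   # OverCommit: grow by one step
    elif usage > DROPOFF_THRESHOLD:
        target = reported * (1 - MAX_STEP)   # DropOff: shrink by one step
    else:
        return reported                      # within band: no change
    # Range check: never report less than physical memory,
    # never more than the capacity limit (MemoryRatio).
    return min(max(target, physical), physical * MEMORY_RATIO)
```

A time-interval check (the third validation level) would additionally skip the adjustment if the last change happened too recently; it is omitted here for brevity.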

Resource‑Limit Optimization – Different machine models (48‑core vs. 96‑core) require distinct CPU/Memory ratios. Experiments showed that raising the memory overcommit to 1.5× physical memory improves CPU utilization on 48‑core nodes, while 96‑core nodes still suffer from memory bottlenecks. Adding an extra 128 GB of memory to 96‑core nodes raised the effective Memory‑to‑CPU ratio and increased CPU usage from roughly 45 % to roughly 70 % (Figures 8–10).
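The effect of adding memory on the per‑vCore ratio is simple arithmetic. The physical‑memory figure below (384 GB for a 96‑core node) is an assumed value for illustration only; the article does not state it:

```python
# Illustrative arithmetic only: the 384 GB physical-memory figure is an
# assumption, not a number from the article.

def mem_per_vcore(memory_gb: float, vcores: int, overcommit: float = 1.0) -> float:
    """Effective memory available per vCore after applying an overcommit factor."""
    return memory_gb * overcommit / vcores

# Assume a 96-core node with 384 GB of physical memory.
before = mem_per_vcore(384, 96)         # 4.0 GB per vCore
after = mem_per_vcore(384 + 128, 96)    # ~5.33 GB per vCore after adding 128 GB
```

With more memory per vCore, fewer containers stall waiting for memory, which is what lets CPU usage climb on the 96‑core nodes.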

Eviction Strategies – Amiya implements three eviction layers: Container eviction (triggered after DropOff), Application eviction (targeting large‑disk jobs when SSD usage exceeds a threshold), and Node eviction using K8s‑style Taints (OOMTaint, HighLoadTaint, HighDiskTaint, LowResourceTaint, NeedToStopTaint). ExtremeKill is introduced to force eviction of the largest memory‑consuming container when no other containers can be removed.
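The ExtremeKill fallback amounts to a simple victim selection. The container record shape `(container_id, memory_bytes, evictable)` below is a hypothetical simplification, not Amiya's data model:

```python
# Hypothetical sketch of the ExtremeKill fallback: when no container is
# evictable through the normal path, force-evict the largest memory consumer.
# The (container_id, memory_bytes, evictable) tuple shape is an assumption.

from typing import Optional

def pick_extreme_kill_victim(containers: list[tuple[str, int, bool]]) -> Optional[str]:
    """Return the container to force-kill, or None if normal eviction suffices."""
    if any(evictable for _, _, evictable in containers):
        return None  # the normal Container-eviction path can handle it
    # No removable containers left: pick the largest memory consumer.
    return max(containers, key=lambda c: c[1])[0]
```

This mirrors the stated intent: ExtremeKill only fires when the DropOff‑triggered eviction path finds nothing it is allowed to remove.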

Mixed‑Deployment Mode – Amiya is deployed as a sidecar inside the NodeManager pod in Yarn‑on‑K8s clusters. It receives the pod’s real resource limits via a Unix domain socket, reads cgroup usage, computes the overcommit target, and updates the NodeManager’s resource allocation (Figures 13‑14).
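The sidecar's sampling step can be sketched as below. The article specifies a Unix domain socket for receiving the pod's limits; this sketch covers only the cgroup read and target computation, assumes a cgroup v2 layout, and uses illustrative helper names throughout:

```python
# Minimal sketch of the sidecar's usage-sampling step. Assumes cgroup v2
# (memory.current file); paths, parameters, and helper names are
# illustrative assumptions, not Amiya's actual API.

from pathlib import Path

def read_cgroup_memory(cgroup_path: str = "/sys/fs/cgroup") -> int:
    """Read the pod cgroup's current memory usage in bytes."""
    return int(Path(cgroup_path, "memory.current").read_text().strip())

def overcommit_target(limit_bytes: int, used_bytes: int,
                      ratio: float = 1.5, threshold: float = 0.6) -> int:
    """Inflate the reported limit only while usage stays below the threshold."""
    if used_bytes / limit_bytes < threshold:
        return int(limit_bytes * ratio)
    return limit_bytes
```

The computed target would then be pushed into the NodeManager's resource allocation, closing the loop shown in Figures 13–14.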

Results – In the offline main cluster (≈5 000 nodes), Amiya contributed about 683 TB of memory and 137 K vCores of additional allocatable resources. Daily per‑node gains were 33.26 GB of memory (+15.62 %) and 18.56 % CPU usage (+22.04 % for the dominant configuration). Eviction rates stayed low (0.56 %–2.73 %). In mixed‑deployment clusters, CPU utilization rose by roughly 10 % after full rollout (Figure 17).

Future Work – Plans include kernel‑level OOM handling, finer‑grained application‑level eviction, and a Master‑Worker architecture for global resource profiling and more flexible max‑ratio overcommit.

Tags: Big Data, YARN, Cluster Management, Bilibili, Amiya, Resource Overcommit
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
