
Optimizing Offline Pod Scheduling with Koordinator and Yarn-Operator

To reduce resource contention and improve offline task reliability, this article examines the challenges of using Koordinator with Hadoop Yarn pods on Kubernetes, proposes real‑time resource reporting and task‑level eviction strategies, details community and custom solutions, and outlines future enhancements with Volcano.

360 Zhihui Cloud Developer

Background

The company's cloud platform uses Koordinator to schedule and manage offline pods, while the big data platform relies on Hadoop Yarn for offline jobs. Initially, Yarn ran as a resident pod in the Kubernetes cluster, with Koordinator applying throttling and eviction to maintain online service stability (one offline pod per node). This approach has two drawbacks:

When an offline pod requests 16 CPU / 32 GB, Koordinator may throttle its usable CPU down to 10 cores, yet the ResourceManager still schedules tasks assuming the full 16 CPU are available, causing resource contention and longer task runtimes.

Koordinator evicts a pod only when both thresholds are met: offline pod resource satisfaction (available offline resources / requested resources) falls below 60%, and actual usage (used cores / limit cores) exceeds 80%. If a pod runs under heavy load for about an hour and Koordinator then throttles it below the satisfaction threshold, the pod is evicted and its running tasks fail, potentially pushing completion time beyond user expectations.
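The dual-threshold eviction condition can be sketched as a small predicate. The 60% and 80% thresholds come from the behavior described above; the function and parameter names are illustrative, not Koordinator's actual API:

```python
def should_evict(available_offline: float, requested: float,
                 used_cores: float, limit_cores: float) -> bool:
    """Evict only when BOTH thresholds are breached (values as described above)."""
    satisfaction = available_offline / requested   # offline resource satisfaction
    usage = used_cores / limit_cores               # actual CPU usage ratio
    return satisfaction < 0.60 and usage > 0.80

# Example: 8 of 16 requested cores available (50% < 60%) while using
# 9 of 10 limit cores (90% > 80%) -> eviction triggers.
```

Note that a pod under heavy load (high usage) is safe as long as its satisfaction stays at or above 60%; it is the combination of throttling and high usage that triggers eviction.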

Proposed Solution

To address these pain points, the ideal solution includes:

Real‑time reporting of the maximum available resources of Yarn pods, enabling the ResourceManager to schedule tasks based on actual resource availability and avoid contention.

Modifying Koordinator’s eviction mechanism to target tasks within an offline pod rather than the pod itself, allowing selective task eviction or slower execution without full pod termination.

Community Solution

The Koordinator community provides a yarn-operator component that reports NodeManager (NM) available resources to the ResourceManager via a gRPC updateNodeResource call.

The current yarn-operator does not support task‑level eviction.

Configuration for the Yarn cluster is mounted as a ConfigMap, allowing the yarn-operator to invoke Yarn‑RM RPC interfaces. When NM resources change, the operator reports the updated resources to the ResourceManager. This approach introduces risks: high request volume can increase RM load, and the lack of task‑level eviction may prolong task runtimes in certain scenarios.

Solution Optimization

The optimized approach involves custom development tailored to the internal Yarn usage, focusing on scenarios where each Kubernetes node maps to a single NM.

1. Resource Reporting

Report resources only when the NM's maximum available resources change by more than 10% (i.e., (recorded NM CPU – current max CPU) / current max CPU > 10%). This reduces reporting frequency.
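The 10%-change check can be sketched as follows. The article states the decrease-side formula; using `abs()` here also covers the case where available resources grow back, which is an assumption on our part:

```python
def should_report(recorded_cpu: float, current_max_cpu: float) -> bool:
    """Report only when the NM's max available CPU changed by more than 10%
    relative to the current maximum. abs() makes the check symmetric
    (the article gives only the decrease direction)."""
    if current_max_cpu <= 0:
        return True  # degenerate case: always report
    change = abs(recorded_cpu - current_max_cpu) / current_max_cpu
    return change > 0.10
```

A 16-to-14-core drop (about 14%) would be reported, while a 16-to-15.5 change (about 3%) would be suppressed, cutting down on RM traffic.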

Introduce an aggregation layer between yarn-operator and the ResourceManager that consolidates NM resource data and reports it periodically, further easing RM pressure.
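A minimal sketch of such an aggregation layer, assuming a batch-report callback stands in for the real RM RPC (the class, method names, and 30-second interval are all illustrative):

```python
import threading
import time

class NMResourceAggregator:
    """Buffers per-NM resource updates and flushes them to the
    ResourceManager in one periodic batch call, so N nodes produce
    one RM request per interval instead of N."""

    def __init__(self, report_batch, interval_s: float = 30.0):
        self._pending = {}            # nm_id -> latest (cpu, mem_mb)
        self._lock = threading.Lock()
        self._report_batch = report_batch
        self._interval = interval_s

    def update(self, nm_id: str, cpu: float, mem_mb: int):
        with self._lock:
            self._pending[nm_id] = (cpu, mem_mb)  # keep only the newest value

    def flush(self):
        with self._lock:
            batch, self._pending = self._pending, {}
        if batch:
            self._report_batch(batch)  # one call covering many NMs

    def run_forever(self):
        while True:
            time.sleep(self._interval)
            self.flush()
```

Keeping only the newest value per NM means bursts of rapid changes on one node collapse into a single entry in the next batch.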

2. Task Eviction

Custom development was done in both the Yarn pod and Koordinator:

A kill-task script embedded in the Yarn pod accepts CPU and memory parameters; internal logic decides which tasks to terminate.

A background service inside the Yarn pod receives eviction requests (CPU and memory) from Koordinator’s koordlet and periodically executes the kill-task script.

The koordlet eviction logic was modified to first invoke the Yarn pod’s service, achieving task‑level eviction before considering pod termination.
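The selection logic inside the kill-task script could look like the following greedy sketch. The field names (`id`, `cpu`, `mem`, `priority`) and the low-priority-first policy are assumptions, not the actual internal implementation:

```python
def pick_tasks_to_kill(tasks, cpu_to_free, mem_to_free):
    """Greedily pick tasks whose combined usage covers the CPU/memory
    amount requested by koordlet, preferring low-priority, high-CPU
    tasks (one plausible policy)."""
    victims = []
    freed_cpu = freed_mem = 0
    for t in sorted(tasks, key=lambda t: (t["priority"], -t["cpu"])):
        if freed_cpu >= cpu_to_free and freed_mem >= mem_to_free:
            break
        victims.append(t["id"])
        freed_cpu += t["cpu"]
        freed_mem += t["mem"]
    return victims
```

The background service would translate each eviction request from koordlet into one such call, kill only the selected tasks, and leave the pod (and its remaining tasks) running.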

Summary

In the internal 360 environment, some Kubernetes clusters exhibit high CPU utilization (25% average, up to 40% during peak hours, and over 70% on certain nodes). Concurrent offline tasks exacerbate resource throttling and pod eviction, degrading offline service quality. By reporting NM available resources, the ResourceManager can schedule tasks based on actual limits, reducing contention. Additionally, task‑level eviction improves offline task success rates under tight resource conditions.

The offline pod deployment also leverages HPA for automatic scaling. To avoid aggressive down‑scaling of pods still running many tasks, the pod-deletion-cost feature can be used to prioritize which pods are removed.
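For example, a pod that is still running many tasks can be annotated with a higher deletion cost so the controller removes cheaper pods first during scale-down. The annotation is a standard Kubernetes feature; the pod name and cost value below are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: yarn-nm-pod        # illustrative name
  annotations:
    # Higher cost = deleted later during scale-down, relative to peer pods
    # in the same ReplicaSet.
    controller.kubernetes.io/pod-deletion-cost: "1000"
```

In practice the background service inside the Yarn pod could update this annotation as its task count changes, steering HPA scale-down toward idle pods.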

Future Plans

The current offline task execution architecture will continue to evolve. By 2025, each task will run in its own pod, and to mitigate scheduler pressure from a large number of offline pods, the platform will integrate the Volcano component, utilizing its scheduler, pod groups, and queues for improved scheduling and resource allocation.

Tags: cloud native, Big Data, Kubernetes, Resource Scheduling, YARN, Koordinator
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
