
AutoKH: A Mixed‑Workload Resource Management Solution on Kubernetes and Hadoop

AutoKH is a cloud‑native mixed‑workload framework that integrates Kubernetes and Hadoop to dynamically schedule online and offline tasks, improve CPU and memory utilization, enforce priority classes, and ensure service stability through operators, CronHPA, and resource‑control components.


Introduction: As the Home infrastructure grows, resource management faces new challenges. Drawing on research into Google Borg, Alibaba Fuxi, and Tencent Caelus, and combining the characteristics of big-data and container workloads, mixed-workload (混部) technology delivers significant improvements in resource utilization.

Definition: Mixed workload (混部) is a method of physical resource reuse that integrates different types of workloads on the same physical nodes to increase utilization while ensuring service stability through strict resource assessment and control.

AutoKH Overview: AutoKH is a mixed online/offline workload solution built on Kubernetes and Hadoop. It schedules resources, sets priorities, isolates workloads, and performs graceful scaling to fully utilize CPU, memory and I/O on shared nodes.

Architecture: The design follows cloud-native principles: zero intrusion into Kubernetes and modular components. It includes two entry points (the A-One pipeline for online apps and the scheduling platform for offline tasks), plus the YarnOperator, a CronHPA controller, the ResourceController, the KH-Agent, the YarnScaleController, and priority classes.

YarnOperator Example:

import kopf

@kopf.on.create("crd.autohome.com.cn", "v1", "nodemanagers")
def create_fn(body, spec, **kwargs):
    # Reconcile a newly created NodeManager custom resource.
    # ... (code omitted for brevity) ...

CronHPA: Implements tide‑type scheduling that expands online pods at 08:00 and releases resources at 23:00, allowing offline Hadoop tasks to reuse the freed capacity.
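
A tide-type schedule of this kind could be expressed as a CronHPA resource along the lines of the following sketch; the field layout follows Alibaba's open-source kubernetes-cronhpa-controller, and the deployment name and replica counts are hypothetical:

```yaml
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: online-app-tide
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: online-app          # hypothetical online service
  jobs:
  - name: scale-out-morning
    schedule: "0 0 8 * * *"   # 08:00 — expand online pods for peak traffic
    targetSize: 20
  - name: scale-in-night
    schedule: "0 0 23 * * *"  # 23:00 — release capacity for offline Hadoop jobs
    targetSize: 5
```

Note the six-field cron expressions (with a leading seconds field), which this controller family uses rather than the standard five-field form.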

ResourceController & KH‑Agent: Provide CRDs (kh and rcs) for custom resource configuration, report node status, and dynamically adjust CPU/MEM limits based on thresholds and tolerance times.
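
The threshold-plus-tolerance logic can be sketched as a small detector: it fires only after node utilization has stayed above a threshold for a tolerance window, so brief spikes do not trigger limit adjustments. The class and parameter names here are illustrative, not AutoKH's actual code:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PressureDetector:
    """Decide when to squeeze offline CPU/MEM limits.

    Fires only after utilization has stayed above `threshold` for at
    least `tolerance_s` seconds, so short spikes are ignored.
    """
    threshold: float      # e.g. 0.80 = 80% node utilization
    tolerance_s: float    # how long pressure must persist
    _since: Optional[float] = field(default=None, init=False)

    def observe(self, now_s: float, utilization: float) -> bool:
        if utilization < self.threshold:
            self._since = None          # pressure cleared; reset the timer
            return False
        if self._since is None:
            self._since = now_s         # pressure just started
        # Adjust offline limits only once pressure outlives the tolerance.
        return now_s - self._since >= self.tolerance_s
```

With `PressureDetector(threshold=0.8, tolerance_s=60)`, a 90% reading at t=0 returns False, and a second 90% reading at t=61 returns True, because the pressure persisted past the tolerance window.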

YarnScaleController: Periodically reads node resource changes from K8s and sends adjustment commands to YARN, handling both expansion and safe contraction with a three‑minute grace period.
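
The asymmetry between expansion and contraction can be sketched as a pure planning function: expansion is applied immediately, while a contraction must persist through the grace period before YARN capacity is lowered, giving running containers time to drain. Function and parameter names are illustrative, not AutoKH's API:

```python
import time
from typing import Optional, Tuple

GRACE_SECONDS = 3 * 60  # three-minute safe-contraction grace period

def plan_yarn_update(prev_vcores: int, curr_vcores: int,
                     shrink_seen_at: Optional[float],
                     now: Optional[float] = None) -> Tuple[str, Optional[float]]:
    """Translate a K8s node-resource change into a YARN adjustment.

    Returns (action, shrink_seen_at): "expand" applies at once,
    "shrink" only after the change has outlived the grace period.
    """
    now = time.monotonic() if now is None else now
    if curr_vcores > prev_vcores:
        return "expand", None                   # grow NodeManager capacity now
    if curr_vcores < prev_vcores:
        if shrink_seen_at is None:
            return "wait", now                  # start the grace timer
        if now - shrink_seen_at >= GRACE_SECONDS:
            return "shrink", None               # safe to lower capacity
        return "wait", shrink_seen_at           # still inside grace period
    return "noop", None                         # no change
```

The controller would carry `shrink_seen_at` between reconciliation ticks; if capacity recovers before the grace period elapses, the pending contraction is simply dropped.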

Environment Integration & Stability: Discusses issues such as calico‑node readiness, ipvs mode, iptables lock, and kernel module loading, with corresponding remediation steps.
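
For the kernel-module issue, a common remediation is to make the ipvs modules load at boot. A typical /etc/modules-load.d entry looks like the following sketch (the module list assumes kube-proxy running in ipvs mode; exact module names can vary by kernel version):

```
# /etc/modules-load.d/ipvs.conf — modules needed by kube-proxy in ipvs mode
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
```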

Resource Priority: Uses Kubernetes PriorityClass (high‑priority‑online, low‑priority‑offline) together with custom controllers to guarantee online services while allowing offline jobs to consume surplus capacity.
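
The two tiers can be sketched as PriorityClass objects; the names follow the article, while the values and descriptions are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-online
value: 1000000            # online services win scheduling and preemption
globalDefault: false
description: "Guaranteed tier for latency-sensitive online services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-offline
value: 1000               # offline jobs only consume surplus capacity
preemptionPolicy: Never   # offline pods never evict other pods
globalDefault: false
description: "Best-effort tier for offline Hadoop jobs."
```

Setting `preemptionPolicy: Never` on the offline class keeps offline jobs strictly opportunistic: they queue for surplus capacity instead of displacing anything.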

CPU Manager & Topology Manager: Shows how static CPU policy, full‑pcpus‑only option, and topology‑manager‑policy enforce CPU pinning, NUMA locality, and mitigate noisy‑neighbor effects.
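
These kubelet settings can be expressed in a KubeletConfiguration fragment; the combination below is one plausible setup for CPU pinning and NUMA locality, not necessarily the article's exact configuration:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                 # pin Guaranteed pods to exclusive cores
cpuManagerPolicyOptions:
  full-pcpus-only: "true"                # allocate whole physical cores, no SMT sharing
topologyManagerPolicy: single-numa-node  # keep CPU/device allocations NUMA-local
reservedSystemCPUs: "0,1"                # keep system daemons off the pinned cores
```

With the static policy, only Guaranteed pods with integer CPU requests get exclusive cores; everything else (including offline burst pods) shares the remaining pool, which is what limits noisy-neighbor interference with pinned online services.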

Results & Outlook: Mixed‑workload deployment has increased node CPU utilization from ~10 % to 50 % across 230+ applications and 400+ pods, and future work will explore second‑level resource prediction, network isolation, eBPF observability, and extending mixed workloads to middleware, databases, and real‑time computing.

Tags: kubernetes, Operator, resource management, Hadoop, Mixed Workload, cpu-manager, CronHPA
Written by HomeTech (HomeTech tech sharing)