How the Linux Kernel Balances CPU Load Across Multicore Systems
This article explains the Linux kernel's CPU load‑balancing mechanism, covering concepts such as CPU load, balancing strategies, scheduling domains, groups, the PELT algorithm, CFS scheduling, trigger points, and practical code examples for both regular and real‑time tasks.
In modern computing, multi‑core CPUs are the norm, providing powerful parallel processing capabilities. However, simply having many cores does not automatically yield performance gains; the operating system must efficiently distribute processes across cores. Linux, known for its open‑source stability, implements a sophisticated CPU load‑balancing mechanism within its kernel to maximize multi‑core performance.
1. CPU Load‑Balancing Mechanism
1.1 What Is CPU Load?
CPU load differs from CPU utilization. Utilization measures the proportion of time the CPU is busy, while load reflects the number of runnable tasks waiting for execution. When utilization reaches 100%, load becomes the more accurate indicator, especially when runqueues contain differing numbers of tasks.
Early kernels measured load by runqueue depth, but this is coarse because a single task can be heavy or light. Modern schedulers sum the load of all tasks on a runqueue, providing a finer‑grained view. The Linux 3.8 kernel introduced the PELT (Per‑Entity Load Tracking) algorithm, which tracks load per scheduling entity rather than per CPU, allowing precise load attribution.
1.2 What Is Balancing?
Balancing does not merely split total load evenly; it must consider each CPU's capacity. For example, a system with six little cores and two big cores cannot distribute a total load of 800 equally (100 per core) because big cores can handle more work.
CPU capacity depends on micro‑architecture and maximum frequency. The kernel normalizes capacities, setting the strongest CPU at 1024 and scaling others accordingly. When balancing CFS tasks, the kernel uses the capacity available after accounting for real‑time (RT) and interrupt workloads.
1.3 What Is Load?
Load measures the amount of ready work that cannot obtain CPU time, analogous to a traffic jam on a lane. Tools like top or w report 1‑, 5‑, and 15‑minute average loads, which should be divided by the number of CPUs for proper interpretation.
A per‑CPU load above roughly 0.7 typically warrants attention, though Linux also counts uninterruptible I/O‑waiting tasks, making raw load sometimes appear higher than the actual performance impact.
<code>for_each_possible_cpu(cpu)
    nr_active += cpu_rq(cpu)->nr_running + cpu_rq(cpu)->nr_uninterruptible;
avenrun[n] = avenrun[n] * exp_n + nr_active * (1 - exp_n);</code>Thus, Linux reports system load, not just CPU load.
1.4 Load‑Balancing Methods
Two approaches exist: pull (idle or lightly loaded CPUs pull tasks from busy CPUs) and push (busy CPUs push tasks to idle ones). Migration costs vary: moving tasks between logical CPUs on the same physical core is cheap, while crossing physical cores, sockets, or NUMA nodes incurs higher overhead.
1.5 Purpose of Load‑Balancing
Effective load‑balancing improves system throughput and stability. Without it, a single overloaded core can degrade overall performance, increase latency, and even cause crashes. Distributing work evenly across cores maximizes resource utilization and reduces power consumption.
1.6 Triggering Load‑Balancing
Load‑balancing is invoked periodically by the scheduler tick and when a CPU becomes idle. The kernel’s scheduler_tick calls trigger_load_balance, which may raise a soft IRQ to start balancing. An idle CPU selected as the "idle load balancer" performs the actual migration.
When multiple idle CPUs exist, the kernel simply picks the first one in the idle mask; on heterogeneous systems, power‑efficiency considerations may influence the choice.
2. Load‑Balancing Implementation Details
2.1 Key Data Structures and Functions
Each CPU has its own runqueue, a list of runnable tasks. The kernel groups CPUs into scheduling domains and scheduling groups based on topology and performance similarity.
Scheduling domains form a hierarchical tree: the lowest‑level domains span the CPUs of a single cluster (their groups each hold one CPU), while higher‑level domains group clusters or NUMA nodes. The struct sched_domain defines parameters such as balance intervals, busy factors, and imbalance thresholds.
<code>struct sched_domain {
struct sched_domain *parent;
struct sched_group *groups;
cpumask_t span;
unsigned long min_interval;
unsigned long max_interval;
unsigned int busy_factor;
unsigned int imbalance_pct;
unsigned long long cache_hot_time;
unsigned int cache_nice_tries;
unsigned int per_cpu_gain;
int flags;
unsigned long last_balance;
unsigned int balance_interval;
unsigned int nr_balance_failed;
};</code>Each struct sched_group represents a set of CPUs that share similar characteristics.
<code>struct sched_group {
struct sched_group *next;
cpumask_t cpumask;
unsigned long cpu_power;
};</code>The core balancing routine is load_balance(), which selects the busiest runqueue in a domain and migrates tasks to lighter‑loaded CPUs. rebalance_tick() walks the domain hierarchy each tick to decide whether to invoke load_balance() based on timing and load thresholds.
2.2 CFS Scheduling
The Completely Fair Scheduler (CFS) orders tasks in a red‑black tree by virtual runtime (vruntime). Tasks that have waited longer have smaller vruntime and are scheduled first, ensuring fairness even with many processes.
2.3 Load‑Balancing Triggers
Balancing occurs when a CPU’s runqueue becomes empty (idle) or periodically via the scheduler tick. The kernel checks each domain’s balance_interval and imbalance percentage to decide whether to migrate tasks.
3. Software Architecture of Load‑Balancing
Linux separates load‑balancing into a core module and class‑specific modules (e.g., CFS, RT, Deadline). For CFS tasks, the kernel tracks per‑task load and per‑group capacity, builds a hierarchical sched_domain tree, and performs migrations based on the calculated imbalance.
4. Practical Load‑Balancing Steps
4.1 How Balancing Works
In a system with four little cores and four big cores, the kernel creates two levels of domains: a base MC (multi‑core) domain for each cluster and a top‑level DIE domain covering all CPUs. Each domain contains groups that form a circular linked list, enabling efficient traversal during balancing.
4.2 Basic Balancing Process
1. Identify the busiest group in the current domain.
2. Select the busiest CPU within that group (if the group contains multiple CPUs).
3. Pull a number of tasks proportional to the imbalance from the busy CPU to the idle CPU.
After balancing the MC domain, the kernel proceeds to higher‑level domains (e.g., DIE) and repeats the process using group‑wide load and capacity metrics.
4.3 Thread Balancing
Real‑time (RT) tasks are assigned to the most suitable cores based on priority, using push_rt_task and pull_rt_task. Normal tasks rely on periodic, idle‑time, and fork/exec‑time balancing.
4.4 Interrupt Balancing
Interrupts (hardware and soft) also contribute to load. The kernel can set interrupt affinity or use RPS (Receive Packet Steering) to distribute network processing across CPUs, preventing a single core from becoming a bottleneck.
4.5 Additional Considerations
Balancing incurs overhead, so each domain defines minimum intervals and imbalance thresholds to limit frequency. Misfit tasks—those poorly matched to a CPU’s capacity—are up‑migrated to stronger cores to improve latency and user experience. Proper balancing also reduces power consumption by allowing more cores to run at lower frequencies.
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.