
Understanding the Linux Completely Fair Scheduler (CFS)

This article gives an in‑depth overview of Linux's Completely Fair Scheduler (CFS): its design goals, core data structures such as the red‑black tree and virtual runtime, the evolution from the O(n) and O(1) schedulers, weight‑based fairness calculations, scheduling classes, and the key kernel functions that implement task selection and timing.

The Completely Fair Scheduler (CFS) is the default process scheduler in the Linux kernel, responsible for allocating CPU time to runnable tasks. It uses a red‑black tree to maintain the run‑queue and a virtual clock to achieve fair time‑slice distribution, dynamically computing each task’s virtual runtime to ensure fairness and efficiency.

CFS Overview

CFS was introduced by Ingo Molnar in kernel 2.6.23 and aims to provide ideal, precise, and completely fair multitasking on real hardware. Unlike traditional schedulers, it does not rely on fixed time‑slices; instead, it distributes CPU usage proportionally based on task weights.

Scheduler Evolution

Linux has employed several milestone schedulers: the O(n) scheduler (kernel 2.4 series), the O(1) scheduler (2.6.0–2.6.22), and CFS (2.6.23–present). Each iteration improved scalability and performance.

O(n) Scheduler

The O(n) scheduler used a single global run‑queue, which meant O(n) traversal cost on every scheduling decision, heavy lock contention on multi‑core systems, mixed placement of real‑time and normal tasks, wasted idle CPU time, and tasks bouncing between CPUs.

O(1) Scheduler

The O(1) scheduler introduced per‑CPU run‑queues, eliminating the global lock and reducing scheduling latency. It kept fixed time‑slices based on static priority and relied on complex heuristics to identify interactive tasks, which improved responsiveness but could misjudge workloads and trade off interaction latency.

RSDL Scheduler

The Rotating Staircase Deadline (RSDL) scheduler, proposed by Con Kolivas, builds on the O(1) data structures while discarding the complex interactive‑index calculations. It spreads a task's time‑slice across a staircase of decreasing priorities, moving the task to the next priority level after each slice until it reaches the expired queue.

Scheduling Classes

Since Linux 2.6.23, scheduling classes modularize the scheduler. The main classes are:

dl_sched_class – Deadline scheduler (SCHED_DEADLINE)
rt_sched_class – Real‑time scheduler (SCHED_FIFO, SCHED_RR)
fair_sched_class – Completely Fair Scheduler (SCHED_NORMAL, SCHED_BATCH)
idle_sched_class – Idle task (SCHED_IDLE)
struct sched_class {
    const struct sched_class *next;
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
    struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
    /* ... other callbacks ... */
};

Normal Process Priority and Virtual Time

CFS maps the traditional nice value (‑20 … 19) to a weight; lower nice means higher weight and thus a larger share of CPU time. The kernel provides a lookup table sched_prio_to_weight for this conversion.

const int sched_prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, 46273, 36291,
    /* -15 */ 29154, 23254, 18705, 14949, 11916,
    /* -10 */  9548,  7620,  6100,  4904,  3906,
    /*  -5 */  3121,  2501,  1991,  1586,  1277,
    /*   0 */  1024,   820,   655,   526,   423,
    /*   5 */   335,   272,   215,   172,   137,
    /*  10 */   110,    87,    70,    56,    45,
    /*  15 */    36,    29,    23,    18,    15,
};

The virtual runtime (vruntime) is calculated as:

vruntime = wall_time * NICE_0_LOAD / weight

where NICE_0_LOAD corresponds to the weight of a nice‑0 task (1024). This ensures that tasks with higher weight progress slower in virtual time, giving them more actual CPU time.

Key Kernel Functions

The scheduling period is computed by __sched_period(), which adapts to the number of runnable tasks (nr_running) and the system's latency settings.

static u64 __sched_period(unsigned long nr_running) {
    if (unlikely(nr_running > sched_nr_latency))
        return nr_running * sysctl_sched_min_granularity;
    else
        return sysctl_sched_latency;
}

Virtual‑time updates use calc_delta_fair(), which calls __calc_delta() to avoid floating‑point arithmetic:

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se) {
    if (unlikely(se->load.weight != NICE_0_LOAD))
        delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
    return delta;
}

The main CFS callbacks are defined in fair_sched_class:

const struct sched_class fair_sched_class = {
    .next          = &idle_sched_class,
    .enqueue_task  = enqueue_task_fair,
    .dequeue_task  = dequeue_task_fair,
    .check_preempt_curr = check_preempt_wakeup,
    .pick_next_task = pick_next_task_fair,
    .task_tick      = task_tick_fair,
    .update_curr    = update_curr_fair,
    /* ... other callbacks ... */
};

During each scheduler tick, task_tick_fair() updates runtime statistics, checks for preemption, and may trigger a reschedule.

Task Enqueue/Dequeue

When a task becomes runnable, enqueue_task_fair() inserts its sched_entity into the CFS red‑black tree; when it blocks, dequeue_task_fair() removes it. Both functions also update load weights and the tree’s left‑most node, which caches the entity with the smallest vruntime.

Task Selection

The pick_next_task_fair() function selects the left‑most entity (minimum vruntime) from the tree, sets it as the current task, and prepares the high‑resolution timer for its remaining slice.

Process Creation

When a new process is forked, task_fork_fair() initializes its vruntime so that the child is placed appropriately in the red‑black tree.

Overall Flow

The scheduler core calls schedule(), which iterates over the scheduling classes in priority order, invoking each class's pick_next_task until one returns a task. For CFS, this ultimately chooses the task with the smallest vruntime, ensuring proportional CPU distribution based on weight.

In summary, the Linux CFS scheduler achieves fair, weight‑based CPU sharing by maintaining a red‑black tree of sched_entity structures, continuously updating virtual runtimes, and selecting the task with the minimal virtual time for execution.

Tags: kernel, Scheduler, Linux, scheduling, Operating System, CFS

Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.