
CFS Group Scheduling: Purpose, Configuration, and Kernel Implementation Details

The article explains why Linux’s Completely Fair Scheduler introduced group scheduling, how Android configures task groups via cpu.shares and Process.java, and details the kernel structures (task_group, sched_entity, cfs_rq) and algorithms for weight calculation, load measurement, propagation, and hierarchical load balancing.

OPPO Kernel Craftsman

This article introduces the CFS (Completely Fair Scheduler) group scheduling feature in the Linux kernel, explaining why it was added, how to configure it, and the key implementation details.

1. Reason for CFS group scheduling

The goal is to allocate a controllable proportion of CPU resources to different task groups under high load, ensuring fairness between users or between foreground and background workloads. For example, without grouping, a user with many tasks could dominate 90% of CPU time while another gets only 10%.

2. Group state on mobile devices

Task groups are represented in the kernel by struct task_group objects and exposed as cgroup directories under /dev/cpuctl. Important points:

Root group's cpu.shares defaults to 1024 and cannot be changed; its load is not updated.

The root group itself is a task group; tasks in the root are already grouped.

Kernel threads run in the root group and obtain time slices directly from the root cfs_rq .

With the default cpu.shares, a nice-0 task in the root carries the same weight (1024) as an entire group entity, so it receives a time slice equal to that of a whole group's tasks combined.

Task and task‑group weights are both based on “shares”, but task weight comes from its priority while group weight comes from the cpu.shares file.

3. Configuring task groups

In Android, Process.java provides setProcessGroup(int pid, int group) to move a process into a specific group (e.g., THREAD_GROUP_TOP_APP or THREAD_GROUP_BACKGROUND). The JSON file task_profiles.json defines AggregateProfiles that map each group to a SCHED_SP_* policy; for example, THREAD_GROUP_TOP_APP maps to SCHED_SP_TOP_APP, whose MaxPerformance attribute places the task in the cpu cgroup's top-app group.

Besides the CPU cgroup, other subsystems (cpuset, blkio, freezer) may also be affected when a group is created.

4. Kernel implementation overview

The implementation relies on several core structures:

struct task_group : represents a CPU cgroup and, when FAIR_GROUP_SCHED is enabled, a CFS task group.

struct sched_entity : used for both task‑level ( tse ) and group‑level ( gse ) entities.

struct cfs_rq : the CFS run‑queue, used both per‑CPU and per‑group.

Key fields:

shares : the task group's weight, set from its cpu.shares file, default 1024 (stored in task_group).

my_q : a sched_entity's pointer to the run-queue it owns when it represents a group; NULL for a task entity.

cfs_rq : holds aggregated load, the runnable count, and the hierarchy load ( h_load ).

5. Task‑group weight calculation

The weight of a group is distributed to each group‑se ( gse ) on a CPU according to the formula:

ge->load.weight = tg->weight * grq->load.weight / Σ grq->load.weight

When a group has no runnable tasks, the denominator uses max(grq->load.weight, grq->avg.load_avg) to avoid division by zero.

Each task‑se ( tse ) then receives a portion of its group‑se’s weight proportionally to its own weight.

6. Load calculation (PELT)

Load is measured on two timelines:

Virtual‑time timeline ( rq->clock_task ) for runnable and util metrics.

PELT timeline ( rq->clock_pelt ) that scales time by CPU capacity and frequency, providing a frequency‑aware load value.

Formulas:

load_avg = runnable% * scale_load_down(load)

runnable_avg = runnable% * SCHED_CAPACITY_SCALE

util_avg = running% * SCHED_CAPACITY_SCALE

These values are stored in struct sched_avg embedded in both sched_entity and cfs_rq .

7. Load propagation

When a task is added or removed, the kernel marks the propagate flag in the affected cfs_rq . The propagation path is:

Update gse and grq load, runnable, and util values.

Push the changes upward through parent cfs_rq structures until the root.

Load propagation differs for different metrics: runnable and util flow from grq → gse , while load flows from gse → grq using load_sum .

8. Hierarchical load (h_load)

To support load‑balancing across CPUs, each cfs_rq maintains a hierarchy load ( h_load ) computed as:

h_load_child = h_load_parent × (child_load / parent_load)

For a task‑se, the formula becomes:

tse_h_load = grq_h_load × (tse_load_avg / grq_load_avg)

This value is used by the load‑balancing code (e.g., detach_tasks() ) to decide how much load to migrate.

9. Summary

The article explains why CFS group scheduling was introduced, how to configure it via cpu.shares , and the essential kernel structures and algorithms that make it work. In older Android versions, background groups were given a very low share (e.g., 52) to heavily favor foreground tasks, while newer releases tend to set all groups to the default 1024 for fairer CPU distribution.

Tags: Load Balancing, cgroup, Linux kernel, CFS scheduling, cpu shares, task groups
Written by OPPO Kernel Craftsman

Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials