CFS Group Scheduling: Purpose, Configuration, and Kernel Implementation Details
This article introduces the CFS (Completely Fair Scheduler) group scheduling feature in the Linux kernel, explaining why it was added, how to configure it, and the key implementation details.
1. Reason for CFS group scheduling
The goal is to allocate a controllable proportion of CPU resources to different task groups under high load, ensuring fairness between users or between foreground and background workloads. For example, without grouping, a user with many tasks could dominate 90% of CPU time while another gets only 10%.
2. Group state on mobile devices
Task groups are represented by struct task_group objects and, on Android, exposed under /dev/cpuctl. Important points:
The root group's cpu.shares defaults to 1024 and cannot be changed; its load is not updated.
The root group itself is a task group; tasks in the root are already grouped.
Kernel threads run in the root group and obtain time slices directly from the root cfs_rq.
With the default cpu.shares of 1024, a nice-0 task in the root carries the same weight as a child group's entity, so it receives a time slice equal to that of the entire group, i.e., the combined slices of all tasks inside it.
Task and task-group weights are both expressed as "shares", but a task's weight is derived from its priority while a group's weight comes from the cpu.shares file.
3. Configuring task groups
In Android, Process.java provides setProcessGroup(int pid, int group) to move a process into a specific group (e.g., THREAD_GROUP_TOP_APP or THREAD_GROUP_BACKGROUND). The JSON file task_profiles.json defines AggregateProfiles that map a group to a SCHED_SP_* policy; for example, THREAD_GROUP_TOP_APP maps to SCHED_SP_TOP_APP, whose MaxPerformance profile joins the task to the cpu cgroup's top-app group.
Besides the CPU cgroup, other subsystems (cpuset, blkio, freezer) may also be affected when a group is created.
4. Kernel implementation overview
The implementation relies on several core structures:
struct task_group: represents a CPU cgroup and, when FAIR_GROUP_SCHED is enabled, a CFS task group.
struct sched_entity: used for both task-level (tse) and group-level (gse) entities.
struct cfs_rq: the CFS run-queue, used both per-CPU and per-group.
Key fields:
shares: the weight of a task group, default 1024.
my_q: a sched_entity's pointer to the run-queue it owns; set only for a gse, NULL for a task.
cfs_rq: holds aggregated load, the runnable count, and the hierarchy load (h_load).
5. Task‑group weight calculation
The weight of a group is distributed to each group‑se ( gse ) on a CPU according to the formula:
ge->load.weight = tg->weight * grq->load.weight / Σ grq->load.weight
When a group has no runnable tasks, grq->load.weight can be 0; the calculation then substitutes max(grq->load.weight, grq->avg.load_avg) so that a newly woken group still receives a sensible weight instead of dividing by zero.
Each task-se (tse) then receives a portion of its group-se's weight in proportion to its own weight.
6. Load calculation (PELT)
Load is measured on two timelines:
Task-clock timeline (rq->clock_task) for runnable and util metrics.
PELT timeline (rq->clock_pelt), which scales time by CPU capacity and frequency, providing a frequency-invariant load value.
Formulas:
load_avg = runnable% * scale_load_down(load)
runnable_avg = runnable% * SCHED_CAPACITY_SCALE
util_avg = running% * SCHED_CAPACITY_SCALE
These values are stored in struct sched_avg, which is embedded in both sched_entity and cfs_rq.
7. Load propagation
When a task is added or removed, the kernel marks the propagate flag in the affected cfs_rq . The propagation path is:
Update gse and grq load, runnable, and util values.
Push the changes upward through parent cfs_rq structures until the root.
Propagation differs per metric: runnable and util are copied from the grq to its gse unchanged, while load is derived from the grq's load_sum (because the gse's weight generally differs from the sum of the task weights below it) and then applied, via the gse, to the parent cfs_rq.
8. Hierarchical load (h_load)
To support load balancing across CPUs, each cfs_rq maintains a hierarchy load (h_load), computed level by level as:
h_load_child = h_load_parent × (child_load / parent_load)
For a task‑se, the formula becomes:
tse_h_load = grq_h_load × (tse_load_avg / grq_load_avg)
This value is used by the load-balancing code (e.g., detach_tasks()) to decide how much load to migrate.
9. Summary
The article explains why CFS group scheduling was introduced, how to configure it via cpu.shares , and the essential kernel structures and algorithms that make it work. In older Android versions, background groups were given a very low share (e.g., 52) to heavily favor foreground tasks, while newer releases tend to set all groups to the default 1024 for fairer CPU distribution.
OPPO Kernel Craftsman
Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials