Understanding CPU Power Management and the schedutil cpufreq Governor
This article covers static and dynamic power, the components of the cpufreq framework (core, governors, drivers, and stats), and the common governors. It then walks through the schedutil governor, introduced in Linux 4.7, which uses scheduler utilization data, fast switching, and tunable parameters to compute and apply per-cluster frequencies with low latency and fine granularity.
1. CPU Power Management Overview
If energy were unlimited, complex power-management control would be unnecessary. In practice it is limited, especially on embedded devices such as smartphones, where performance and battery life are tightly coupled. CPU power management therefore requires close cooperation between hardware and software.
CPU power management mainly addresses two types of power consumption:
Static power: Leakage current in transistors when the SoC is idle. Process‑node improvements (e.g., moving from 7nm FinFET to 5nm EUV) reduce static power by about 30%.
Dynamic power: Consumption that varies with workload. It is controlled by software mechanisms such as cpufreq, cpuidle, and low-power waiting in spinlocks, as well as ASIC techniques like clock gating and power domains.
cpufreq selects a voltage/frequency pair (an operating performance point, OPP) based on CPU load. On ARM platforms with DynamIQ, a cluster can contain heterogeneous cores (e.g., 4 little + 3 medium + 1 big core), and the cores of one type typically share a common voltage/frequency pair.
cpuidle uses C-states (C0-C3) to describe idle power states. On ARM, a core enters an idle state by executing the WFI instruction.
A spinlock uses the WFE instruction to wait in a low-power state until another core signals an event with SEV.
Clock gating is an ASIC feature that disables clocks for idle hardware blocks.
Power domains partition the SoC so that blocks can be powered independently, e.g., separate domains for big/medium cores and little cores on the Pixel 4.
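Why a lower OPP saves energy follows from the usual first-order CMOS model, in which dynamic power scales as P = C * V^2 * f. The sketch below only illustrates that relationship; the capacitance and the two OPPs are invented numbers, not measurements from any real SoC.

```python
def dynamic_power(c_eff_farads: float, volts: float, freq_hz: float) -> float:
    """First-order CMOS dynamic power in watts: P = C * V^2 * f."""
    return c_eff_farads * volts ** 2 * freq_hz


# Hypothetical OPP table: (frequency in Hz, voltage in V). Values are invented.
OPPS = [(1_000_000_000, 0.70), (2_000_000_000, 1.00)]
C_EFF = 1e-9  # assumed effective switched capacitance, in farads

low, high = (dynamic_power(C_EFF, v, f) for f, v in OPPS)
ratio = high / low  # cost of doubling f while raising V from 0.70 V to 1.00 V
```

With these assumed values, doubling the frequency costs roughly four times the power, because the voltage has to rise along with it. That is the reason cpufreq moves frequency and voltage together as one OPP.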
2. cpufreq Framework Overview
The cpufreq framework consists of several core components:
Core: abstracts the generic flow and methods.
Governor: implements the frequency-scaling policy.
Driver: hardware-specific implementation that actually changes the frequency.
Stats: gathers runtime statistics such as time-in-state.
Notifier: a notification chain that informs other drivers about frequency changes.
Sysfs: exposes user-space interfaces for configuring the governor.
The framework reflects common software design ideas such as layering, separation of mechanism and policy, and the observer pattern.
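These design ideas can be shown with a toy model. The classes below are purely illustrative and are not kernel APIs: Driver stands in for the mechanism, the governor function for the policy, and the listener list for the notifier chain (observer pattern).

```python
from typing import Callable, List, Tuple


class Driver:
    """Mechanism: knows how to program the (pretend) hardware."""

    def __init__(self) -> None:
        self.cur_khz = 0

    def target(self, khz: int) -> None:
        self.cur_khz = khz


def performance_governor(load: float, max_khz: int) -> int:
    """Policy: always ask for the maximum frequency, whatever the load."""
    return max_khz


class Core:
    """Generic flow: ask the governor, call the driver, notify observers."""

    def __init__(self, driver: Driver,
                 governor: Callable[[float, int], int], max_khz: int) -> None:
        self.driver = driver
        self.governor = governor
        self.max_khz = max_khz
        self.listeners: List[Callable[[int, int], None]] = []

    def register_notifier(self, fn: Callable[[int, int], None]) -> None:
        self.listeners.append(fn)

    def update(self, load: float) -> None:
        old = self.driver.cur_khz
        new = self.governor(load, self.max_khz)
        if new != old:
            self.driver.target(new)
            for fn in self.listeners:  # observers hear about every transition
                fn(old, new)


events: List[Tuple[int, int]] = []
core = Core(Driver(), performance_governor, max_khz=2_000_000)
core.register_notifier(lambda old, new: events.append((old, new)))
core.update(0.1)  # first update triggers a transition from 0 to max
```

Swapping the governor function changes the policy without touching Driver or Core, which is the point of separating mechanism from policy.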
2.1 Frequency‑scaling Methods and Influencing Factors
If the driver implements setpolicy, the hardware scales frequency on its own within the policy limits and no governor is needed; otherwise the governor must compute a target frequency and invoke one of the driver's target, target_index, or fast_switch methods.
Frequency decisions are primarily driven by CPU load, but thermal limits and user-space policies also affect the policy's min and max values.
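How these constraints combine can be sketched as an intersection of ranges followed by a clamp. The function names and the rule that a tight cap overrides a high floor are assumptions chosen for illustration, not a transcription of kernel code.

```python
def resolve_limits(hw_min: int, hw_max: int,
                   user_min: int, user_max: int,
                   thermal_max: int) -> tuple:
    """Intersect hardware, user-space, and thermal constraints (all in kHz)."""
    pol_max = min(hw_max, user_max, thermal_max)
    pol_min = max(hw_min, user_min)
    pol_min = min(pol_min, pol_max)  # assumption: the cap wins over the floor
    return pol_min, pol_max


def clamp_target(target_khz: int, pol_min: int, pol_max: int) -> int:
    """Whatever the governor asks for is forced into [pol_min, pol_max]."""
    return max(pol_min, min(target_khz, pol_max))


# A thermal cap of 1.5 GHz narrows a 0.3-2.0 GHz part with a 0.6 GHz user floor.
limits = resolve_limits(300_000, 2_000_000, 600_000, 2_000_000, 1_500_000)
```

A load-driven request of 1.8 GHz would then be clamped to the 1.5 GHz thermal cap, regardless of which governor produced it.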
2.2 Common Governors
Performance: always selects the maximum frequency.
Powersave: always selects the minimum frequency.
Userspace: frequency is set via the scaling_setspeed sysfs node.
Ondemand: samples load periodically and jumps to max frequency when a threshold is exceeded.
Conservative: similar to Ondemand but changes frequency gradually.
Interactive: Android-specific, registers an idle notifier and reacts aggressively to CPU-intensive tasks.
schedutil: the focus of this article; integrates directly with the scheduler's load tracking.
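The difference between the sampling-based governors is easiest to see side by side. The following is a simplified model, not the kernel implementation; the 80% and 20% thresholds and the 5% step are assumed values chosen for the example.

```python
def ondemand(load: float, min_khz: int, max_khz: int,
             up_threshold: float = 0.80) -> int:
    """Jump straight to max above the threshold, else scale with load."""
    if load > up_threshold:
        return max_khz
    return min_khz + int(load * (max_khz - min_khz))


def conservative(load: float, cur_khz: int, min_khz: int, max_khz: int,
                 up_threshold: float = 0.80, down_threshold: float = 0.20,
                 step_pct: int = 5) -> int:
    """Move one fixed step at a time instead of jumping."""
    step = max_khz * step_pct // 100
    if load > up_threshold:
        return min(max_khz, cur_khz + step)
    if load < down_threshold:
        return max(min_khz, cur_khz - step)
    return cur_khz


# Same 90% load, starting from 600 MHz on a 0.3-2.0 GHz policy.
jump = ondemand(0.9, 300_000, 2_000_000)
step = conservative(0.9, 600_000, 300_000, 2_000_000)
```

Under the same burst of load, the first model reaches the maximum in one sample while the second climbs by 100 MHz per sample, which is the latency versus smoothness trade-off between the two.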
3. schedutil Overall Framework
The schedutil governor was introduced by Rafael J. Wysocki in 2016 and merged into Linux 4.7. It leverages the scheduler's utilization data (PELT upstream, or WALT on some vendor kernels) to make frequency decisions, eliminating the need for periodic sampling.
Key design points:
Registers a callback with the scheduler to receive load updates instantly.
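Supports fast switching directly from the scheduler's update path (which may run in interrupt context) when the driver implements the fast_switch callback, which the governor reaches through cpufreq_driver_fast_switch.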
Interaction with other kernel components:
Scheduler: provides utilization data via cpufreq_update_util.
SchedTune: uses cgroup-based boost values (later replaced by UClamp).
Tunables: sysfs parameters such as up_rate_limit_us, down_rate_limit_us, hispeed_load, and hispeed_freq allow runtime tuning. These names come from Android and vendor kernels; the upstream governor exposes a single rate_limit_us.
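Since the set of tunables varies between kernels, it can be useful to list whatever a given device actually exposes. The helper below is a small sketch: read_tunables is a hypothetical name, and the assumption is that the governor's nodes live in a schedutil directory under the policy directory.

```python
from pathlib import Path


def read_tunables(policy_dir) -> dict:
    """Return {name: value} for each readable file in <policy_dir>/schedutil."""
    tunables = {}
    gov_dir = Path(policy_dir) / "schedutil"
    if not gov_dir.is_dir():
        return tunables  # governor not active on this policy, or other layout
    for entry in sorted(gov_dir.iterdir()):
        if entry.is_file():
            try:
                tunables[entry.name] = entry.read_text().strip()
            except OSError:
                continue  # some nodes may be restricted or write-only
    return tunables
```

On a device, a call such as read_tunables("/sys/devices/system/cpu/cpufreq/policy0") would be the typical use; that path is the usual upstream location and may differ on vendor kernels.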
3.1 Initialization and Startup
The governor is initialized in cpufreq_set_policy via the sugov_init function, which allocates a struct sugov_policy, attempts to enable fast-switching, and creates a kthread or workqueue for the slow path.
sugov_start registers the core frequency‑selection callback with the scheduler and checks policy limits.
3.2 Frequency‑limit Checks (sugov_limits)
Depending on whether fast‑switching is available, the governor either calls cpufreq_driver_fast_switch (interrupt context) or queues work to be processed by __cpufreq_driver_target (slow path).
4. Core Logic of schedutil
The governor’s main work is performed in sugov_update_shared (multiple CPUs per policy) or sugov_update_single (single‑CPU policy). The workflow includes:
Update iowait boost.
Determine whether a frequency change is needed (sugov_should_update_freq).
Obtain the current utilization (sugov_get_util), applying iowait and SchedTune boosts.
Normalize utilization against the maximum capacity and apply tunable parameters.
Compute the next frequency (get_next_freq).
Commit the frequency change via sugov_update_commit.
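The iowait boost step can be modeled roughly as follows. This is a sketch loosely based on upstream behavior as I understand it (start at one eighth of full scale, double on each further iowait wakeup, halve otherwise); vendor kernels may differ, and the class and method names are hypothetical.

```python
SCHED_CAPACITY_SCALE = 1024
BOOST_MIN = SCHED_CAPACITY_SCALE // 8  # assumed starting boost


class IowaitBoost:
    def __init__(self) -> None:
        self.boost = 0

    def update(self, woke_from_iowait: bool) -> None:
        if woke_from_iowait:
            # Start small, then double on each further iowait wakeup.
            if self.boost:
                self.boost = min(self.boost * 2, SCHED_CAPACITY_SCALE)
            else:
                self.boost = BOOST_MIN
        else:
            # Decay by half; drop the boost entirely once it gets small.
            self.boost //= 2
            if self.boost < BOOST_MIN:
                self.boost = 0

    def apply(self, util: int) -> int:
        """The boost only ever raises the utilization used for selection."""
        return max(util, self.boost)
```

The gradual ramp means a single I/O wakeup barely moves the frequency, while a task that keeps blocking on I/O quickly earns a high one.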
4.1 Frequency Calculation (get_next_freq)
For ARM, the target frequency is approximated by:
freq_next = 1.25 * freq_max * util / capacity_max
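The factor 1.25 provides headroom. Because 1/1.25 = 0.8, the governor requests a higher frequency once utilization exceeds 80% of the capacity available at the current frequency, and it requests freq_max once util / capacity_max reaches 0.8.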
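A worked example makes the formula concrete. The sketch below evaluates it in integer arithmetic and then maps the raw result onto an OPP table; resolve_freq and the table values are hypothetical, and picking the lowest OPP at or above the target is an assumption about how the raw value is rounded.

```python
def get_next_freq(util: int, capacity_max: int, freq_max_khz: int) -> int:
    """freq_next = 1.25 * freq_max * util / capacity_max, integer math."""
    freq = freq_max_khz + (freq_max_khz >> 2)  # 1.25 * freq_max
    return freq * util // capacity_max


def resolve_freq(target_khz: int, opp_table_khz: list) -> int:
    """Pick the lowest OPP that satisfies the target, else the highest OPP."""
    for f in sorted(opp_table_khz):
        if f >= target_khz:
            return f
    return max(opp_table_khz)


OPP_TABLE = [300_000, 1_000_000, 1_500_000, 2_000_000]  # invented values, kHz

raw = get_next_freq(util=512, capacity_max=1024, freq_max_khz=2_000_000)
chosen = resolve_freq(raw, OPP_TABLE)
```

At 50% utilization the raw target is 1.25 GHz, which rounds up to the 1.5 GHz OPP in this invented table. A utilization of 820 out of 1024 already pushes the raw target just past freq_max.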
4.2 Shared‑CPU Frequency Selection (sugov_next_freq_shared)
When several CPUs share a policy, the governor selects the CPU with the highest utilization‑to‑capacity ratio as the reference for the whole cluster.
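The selection can be sketched as a loop over the CPUs in the policy. The function name is hypothetical, and comparing the ratios by cross-multiplication rather than division is an implementation choice made here to stay in integer arithmetic.

```python
def next_freq_shared(cpus, freq_max_khz: int) -> int:
    """cpus: iterable of (util, capacity_max) pairs, one per CPU in the policy."""
    util, cap = 0, 1
    for j_util, j_cap in cpus:
        # j_util / j_cap > util / cap, compared without division.
        if j_util * cap > j_cap * util:
            util, cap = j_util, j_cap
    # Same 1.25x headroom formula, applied to the busiest CPU only.
    return (freq_max_khz + (freq_max_khz >> 2)) * util // cap


cluster = [(100, 1024), (600, 1024), (300, 1024)]
freq = next_freq_shared(cluster, 2_000_000)
```

One busy CPU is enough to raise the frequency for every CPU sharing the policy, since they cannot be clocked independently.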
4.3 Frequency Commit (sugov_update_commit)
If the target frequency differs from the current one and the required time interval has elapsed, the governor either performs a fast switch or queues work for the slow path.
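The commit step combines the rate limits with the choice of path. The model below is a sketch under stated assumptions: separate up and down rate limits as on Android-style kernels, and two lists standing in for the immediate fast switch and the work queued for the slow path. All names are hypothetical.

```python
class Committer:
    def __init__(self, up_rate_limit_us: int, down_rate_limit_us: int,
                 fast_switch: bool) -> None:
        self.up = up_rate_limit_us
        self.down = down_rate_limit_us
        self.fast_switch = fast_switch
        self.cur_khz = 0
        self.last_update_us = None
        self.applied = []  # stand-in for immediate fast switches
        self.queued = []   # stand-in for work handed to the slow path

    def commit(self, now_us: int, next_khz: int) -> bool:
        if next_khz == self.cur_khz:
            return False  # nothing to change
        if self.last_update_us is not None:
            limit = self.up if next_khz > self.cur_khz else self.down
            if now_us - self.last_update_us < limit:
                return False  # too soon after the previous change
        self.cur_khz = next_khz
        self.last_update_us = now_us
        (self.applied if self.fast_switch else self.queued).append(next_khz)
        return True
```

A short up limit paired with a long down limit lets the frequency rise quickly under load but fall slowly, which avoids oscillation on bursty workloads.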
5. Conclusion
The article provides a detailed walkthrough of CPU power‑management mechanisms, the cpufreq framework, and the inner workings of the schedutil governor, illustrating how modern Linux kernels achieve fine‑grained, low‑latency frequency scaling.
Reference materials include kernel documentation, LWN articles, ARM TRM, and various Chinese technical blogs.
OPPO Kernel Craftsman