Root Cause Analysis of Linux Kernel Hard Lockup on CPU 51
This article walks through a real Linux kernel hard-lockup case: what a hard lockup is, how the stack traces and register values were analyzed, how contention on a per-CPU runqueue spinlock was identified, and how an inappropriate GFP flag enabled interrupts at the wrong time and caused a deadlock, together with the eventual fix.
Background
The business team reported a machine crash that produced a vmcore but showed no hardware anomalies; dmesg indicated a hard LOCKUP on CPU 51, which triggered the panic.
<code>[4664383.183725] NMI watchdog: Watchdog detected hard LOCKUP on cpu 51
[4664383.183750] Call Trace:
[4664383.183750] _raw_spin_lock+0x1f/0x30
[4664383.183750] raw_spin_rq_lock_nested+0x13/0x20
[4664383.183750] online_fair_sched_group+0x45/0x120
[4664383.183750] sched_online_group+0xec/0x110
[4664383.183751] sched_autogroup_create_attach+0xc2/0x1d0
[4664383.183751] ksys_setsid+0xe9/0x110
[4664383.183751] __ia32_sys_setsid+0xe/0x20
[4664383.183751] do_syscall_64+0x47/0x140
[4664383.183752] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[4664383.183752] RIP: 0033:0x7f89d3b8afdb
[4664383.183754] Kernel panic - not syncing: Hard LOCKUP</code>Q1: What is a hard lockup?
Linux defines two lockup states:
soft lockup – the CPU stays in kernel mode too long without scheduling; detected by a watchdog kernel thread that checks timestamps; and
hard lockup – the CPU stops responding to high-resolution timer (hrtimer) interrupts; detected by periodic non-maskable interrupts (NMIs) that verify the hrtimer interrupt counter is still increasing.
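The detection idea can be sketched in a few lines (a simplified, hypothetical model; the real per-CPU logic lives in kernel/watchdog.c):

```c
#include <stdbool.h>

/*
 * Simplified sketch of the NMI watchdog's hard-lockup check:
 * each NMI compares the hrtimer interrupt count with the value
 * saved at the previous NMI. If it has not advanced, the CPU is
 * no longer servicing timer interrupts.
 */
static unsigned long saved_hrtimer_count; /* per-CPU in the real kernel */

bool hard_lockup_check(unsigned long hrtimer_count)
{
    bool locked_up = (hrtimer_count == saved_hrtimer_count);
    saved_hrtimer_count = hrtimer_count;
    return locked_up;
}
```

Because the check runs from an NMI, it still fires even when the stuck CPU has ordinary interrupts disabled, which is exactly the situation on CPU 51 below.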
Q2: Why does CPU 51 stop responding to hrtimer interrupts?
Examining the stack on CPU 51 shows it was in online_fair_sched_group with interrupts disabled, spinning to acquire a runqueue lock via raw_spin_rq_lock; with interrupts off, the hrtimer tick could not be serviced, which eventually triggered the hard-lockup detection.
<code>crash> bt
PID: 1000877 TASK: ff1101f9fda14000 CPU: 51 COMMAND: "crond"
#0 [fffffe0000b66960] machine_kexec at ffffffff810625ff
#1 [fffffe0000b669b8] __crash_kexec at ffffffff8113bb72
#2 [fffffe0000b66a88] panic at ffffffff81c51a2b
#12 [fffffe0000b66ef0] end_repeat_nmi at ffffffff81e01400
[exception RIP: native_queued_spin_lock_slowpath+330]
RIP: ffffffff810f080a RSP: ffa000003b62fe28 RFLAGS: 00000046 <=== interrupts disabled</code>RFLAGS 0x46 shows the interrupt flag (bit 9) is cleared.
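The interrupt-enable state can be read straight out of RFLAGS: bit 9 is IF, the interrupt-enable flag. A quick check of the dumped value:

```c
#include <stdbool.h>

/* Bit 9 of RFLAGS is IF, the interrupt-enable flag on x86. */
#define X86_EFLAGS_IF (1UL << 9)

bool irqs_enabled(unsigned long rflags)
{
    return rflags & X86_EFLAGS_IF;
}

/* 0x46 = 0b0100_0110: bit 9 is 0, so interrupts are disabled. */
```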
Q3: What lock is involved?
Register analysis shows RDI = ff1101fb83a2ee40, which is the address of
rq->__lock, the spinlock protecting the per‑CPU runqueue structure
struct rq.
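Because __lock is the first member of struct rq, the lock address equals the address of the runqueue itself, which is why RDI can be matched directly against the runqueues array. A minimal sketch of that layout (member list abridged, types simplified):

```c
#include <stddef.h>

/* Abridged mirror of the struct rq layout shown by crash. */
typedef struct {
    unsigned int raw_lock;
} raw_spinlock_t;

struct rq {
    raw_spinlock_t __lock;   /* offset 0: lock address == struct address */
    unsigned int nr_running; /* offset 4 */
};
```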
<code>crash> struct rq -o
struct rq {
[0] raw_spinlock_t __lock; // first member
[4] unsigned int nr_running;
[8] unsigned int bt_nr_running;
...
}</code>The lock belongs to CPU 56’s runqueue, as confirmed by:
<code>crash> p runqueues |grep ff1101fb83a2ee40
[56]: ff1101fb83a2ee40 <=== CPU 51 is waiting on CPU 56’s runqueue</code>Q4: Who holds the lock?
The stack trace of CPU 56 shows it is in the context-switch path, inside finish_task_switch and subsequently perf_event_task_sched_in, which eventually calls kmem_cache_alloc with GFP_KERNEL. That allocation re-enables interrupts (via local_irq_enable()) while the runqueue lock is still held.
<code>static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
        if (gfpflags_allow_blocking(flags))
                local_irq_enable();     /* interrupts enabled here */
        ...
        if (gfpflags_allow_blocking(flags))
                local_irq_disable();    /* disabled again */
}</code>A timer interrupt arriving on CPU 56 right after local_irq_enable() tries to acquire the same rq->__lock, causing a deadlock.
Q5: Why are interrupts enabled in the middle of a context switch?
The path finish_task_switch → perf_event_task_sched_in → intel_pmu_lbr_add → kmem_cache_alloc(GFP_KERNEL) passes the GFP_KERNEL flag, which includes __GFP_DIRECT_RECLAIM. Because the flags allow blocking, the allocator enables interrupts even though the runqueue lock has not been released.
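The blocking check is a single bit test on the GFP flags. A sketch with illustrative flag values (the real constants live in include/linux/gfp_types.h and may differ by kernel version):

```c
#include <stdbool.h>

/* Illustrative flag bits; actual values are kernel-version dependent. */
typedef unsigned int gfp_t;
#define __GFP_HIGH           ((gfp_t)0x20u)
#define __GFP_IO             ((gfp_t)0x40u)
#define __GFP_FS             ((gfp_t)0x80u)
#define __GFP_DIRECT_RECLAIM ((gfp_t)0x400u)
#define __GFP_KSWAPD_RECLAIM ((gfp_t)0x800u)

#define GFP_KERNEL (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_ATOMIC (__GFP_HIGH | __GFP_KSWAPD_RECLAIM)

/* Mirrors the kernel helper: blocking is allowed iff direct reclaim is. */
static bool gfpflags_allow_blocking(gfp_t flags)
{
    return flags & __GFP_DIRECT_RECLAIM;
}
```

GFP_ATOMIC omits __GFP_DIRECT_RECLAIM, so with it the allocate_slab path above never reaches local_irq_enable() and the runqueue critical section stays interrupt-free.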
Q6: How does this lead to a hard lockup?
With the runqueue spinlock still held when interrupts are re-enabled, the incoming timer interrupt on CPU 56 attempts to acquire the same lock and can never succeed, so CPU 56 deadlocks against itself; CPU 51, spinning on that same lock with interrupts disabled, is where the NMI watchdog detects the hard lockup.
Fix
Changing the allocation flag from GFP_KERNEL to GFP_ATOMIC drops __GFP_DIRECT_RECLAIM, so the allocator no longer enables interrupts in this critical section.
<code>diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
@@ -700,7 +700,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
- cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
+ cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_ATOMIC);
}</code>The upstream fix instead moves the allocation out of the context switch path.
In summary, an inappropriate memory‑allocation flag caused interrupts to be enabled while a per‑CPU runqueue spinlock was held, leading to a deadlock and hard lockup. Adjusting the flag or restructuring the code eliminates the issue.
Tencent Architect
We share technical insights on storage, computing, and access, and explore industry-leading product technologies together.