
Analysis of a Linux Kernel Futex Bug Causing Java and Xtrabackup Hang on Hadoop Nodes

A detailed investigation reveals that a futex bug in Linux kernel 2.6.32-504 causes Java programs on Hadoop and xtrabackup processes to hang, and the issue can be resolved by upgrading to a newer kernel version.

58 Tech

Background – Hadoop operators recently reported that Java programs on newly provisioned machines hung during garbage collection even though memory was plentiful. The hangs also appeared after a monitoring script that periodically ran jstack was disabled, and running jstack or pstack against a frozen process brought it back to life.

Database administrators observed similar hangs in xtrabackup backups. GDB analysis traced the root cause to a kernel bug.

Conclusion – The affected machines run Linux kernel 2.6.32-504, which contains a futex bug: a thread waiting on a non‑shared (process‑private) lock can be left asleep because the wake‑up it is waiting for never arrives.

Reference: Linux commit 76835b0e

The bug manifests when the following conditions are met:

The kernel version is older than 2.6.32-504.23.4.

The program uses non‑shared (private) futex locks.

The system has multiple CPU cores, each with its own cache.

When all conditions hold, the hang can occur, but only intermittently, because it depends on timing.

Solution – Upgrade the kernel to 2.6.32-504.23.4 or later to obtain the fix.
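As a rough aid for auditing a fleet, the version condition above can be sketched in Python. This is a minimal sketch that assumes RHEL/CentOS‑style release strings such as 2.6.32-504.el6.x86_64 and uses only the cutoff named in this article; the names FIXED_RELEASE, parse_release, and predates_fix are illustrative, and other distributions may carry the fix under different version numbers.

```python
import re

# Cutoff taken from this article; treated here as the only criterion (assumption).
FIXED_RELEASE = (2, 6, 32, 504, 23, 4)


def parse_release(release):
    """Turn '2.6.32-504.23.4.el6.x86_64' into (2, 6, 32, 504, 23, 4)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    base = [int(part) for part in match.group(1, 2, 3)]
    build = [int(part) for part in match.group(4).split(".")]
    return tuple(base + build)


def predates_fix(release):
    """True when the release sorts before the fixed kernel named in the article."""
    return parse_release(release) < FIXED_RELEASE
```

On a live host the input would typically come from platform.release() or uname -r. The function reports only the version condition; the other two conditions still have to hold for the hang to appear.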

Root‑cause analysis steps

1. Obtain the process ID – The problematic xtrabackup process has PID 715765.

2. Inspect the kernel call stack – Execute cat /proc/715765/task/*/stack. Most threads are stuck in futex_wait_queue_me, the kernel function that puts a thread to sleep until it is woken.
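With many threads, reading every stack by eye gets tedious. The following sketch tallies the innermost kernel function per thread. It assumes the usual "[<address>] symbol+0xoffset/0xsize" line format of /proc/<pid>/task/<tid>/stack and that the caller has permission to read those files (normally root); top_frame and tally_top_frames are illustrative names.

```python
import collections
import glob
import re

# Matches lines such as "[<ffffffff810b226a>] futex_wait_queue_me+0xba/0xf0".
FRAME = re.compile(r"\[<[0-9a-fx]+>\]\s+([\w.]+)\+0x")


def top_frame(stack_text):
    """Return the innermost kernel symbol of one stack dump, or None."""
    for line in stack_text.splitlines():
        match = FRAME.search(line)
        if match:
            return match.group(1)
    return None


def tally_top_frames(pid):
    """Count threads of `pid` by the kernel function they are blocked in."""
    counts = collections.Counter()
    for path in glob.glob("/proc/%d/task/*/stack" % pid):
        try:
            with open(path) as handle:
                counts[top_frame(handle.read())] += 1
        except OSError:
            continue  # thread exited, or we lack permission to read it
    return counts
```

A result dominated by futex_wait_queue_me matches the picture described in this step.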

3. Analyze user‑space code with GDB – GDB suits this kind of hang because it attaches to the live process and exposes its state.

gdb -p 715765

3.1 Thread information – Threads are mainly waiting on pthread_cond_wait (condition variable) or __lll_lock_wait (lock contention).
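The same overview can be collected without an interactive session. The sketch below builds a batch‑mode GDB invocation (the -p, -batch, and -ex options) and counts the innermost user‑space frame of each thread. The regular expression assumes GDB's usual "#0  0x... in symbol (...)" backtrace layout, and the function names are illustrative.

```python
import collections
import re
import subprocess

# Frame #0 lines look like "#0  0x00007f... in symbol () from lib" or "#0  symbol () at file".
FRAME0 = re.compile(r"^#0\s+(?:0x[0-9a-f]+ in )?(\S+) \(", re.MULTILINE)


def gdb_backtrace_command(pid):
    """Command line that attaches, dumps every thread's backtrace, and detaches."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def innermost_frames(backtrace_text):
    """Count threads by the function at the top of their user-space stack."""
    return collections.Counter(FRAME0.findall(backtrace_text))


def collect(pid):
    """Run GDB against a live process and summarise where its threads wait."""
    result = subprocess.run(gdb_backtrace_command(pid),
                            capture_output=True, text=True, check=False)
    return innermost_frames(result.stdout)
```

Attaching stops the process briefly and, as the closing section notes, can itself wake a thread stuck in the kernel, so it is worth capturing the kernel stacks first.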

3.2 Per‑thread analysis – Three thread types are observed:

Copy thread: data_copy_thread_func

Compression thread: compress_worker_thread_func

IO thread: io_handler_thread

Understanding the xtrabackup workflow clarifies how these threads interact.

3.3 Workflow description – The backup copies database files, optionally compresses them, and uses a control mutex to coordinate between copy and compression threads.
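To make the coordination concrete, here is a heavily simplified Python model of one compression worker. It is not xtrabackup's code: the class CompressWorker and its fields are hypothetical, and only the idea of a per‑worker control mutex held by the copy thread while it hands over data and waits for the result is taken from the description above.

```python
import threading
import zlib


class CompressWorker:
    """Hypothetical stand-in for one compression thread and its locks."""

    def __init__(self):
        self.ctrl_mutex = threading.Lock()      # one copy thread at a time
        self.data_cond = threading.Condition()  # "input ready" / "output ready"
        self.pending = None
        self.result = None
        self.stop = False
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        with self.data_cond:
            while True:
                # Sleep until a copy thread hands over data or asks us to stop.
                self.data_cond.wait_for(
                    lambda: self.pending is not None or self.stop)
                if self.stop:
                    return
                self.result = zlib.compress(self.pending)
                self.pending = None
                self.data_cond.notify_all()

    def compress(self, chunk):
        """Called by a copy thread: hand over a chunk and wait for the result."""
        with self.ctrl_mutex:
            with self.data_cond:
                self.pending = chunk
                self.result = None
                self.data_cond.notify_all()
                self.data_cond.wait_for(lambda: self.result is not None)
                return self.result

    def close(self):
        with self.data_cond:
            self.stop = True
            self.data_cond.notify_all()
        self.thread.join()
```

In a design like this, a single wake‑up that never arrives leaves one thread asleep while it holds the control mutex, and every other copy thread that needs the same worker then queues up behind it.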

3.4 Lock analysis of copy thread #2 – Thread #2 hangs while acquiring ctrl_mutex before feeding data to the first compression thread; it waits for owner thread 715800 (thread #7).

3.5 Lock analysis of copy thread #7 – Thread #7 also hangs at a similar point, waiting for the third compression thread’s mutex. The owner of that mutex is NULL: nobody holds the lock, yet thread #7 is still asleep waiting for it. This points to a kernel‑level failure to wake the waiter on release rather than a classic deadlock.

3.6 How the control mutex is released – The mutex can be released in four places: during compression thread creation, during compression thread runtime, by the copy thread after feeding data, and when destroying the compression thread. Log analysis shows the hang occurs after multiple compression cycles, which rules out the other three cases and leaves the copy thread’s release as the likely culprit.

Based on the observed behavior, the most plausible explanation is that the kernel’s futex implementation fails to wake a waiting thread due to a missing memory barrier (mb) in the non‑shared lock path.

Futex bug analysis

The bug stems from the omission of two mb instructions (memory barriers) in the non‑shared lock branch. Without them, futex_wake can act on a stale view of the waiter state and never wake the sleeping thread.

A memory barrier forces the memory accesses issued before it to become visible to other cores before the accesses issued after it, and it also stops the compiler from reordering operations across it.
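The effect of a missing barrier can be pictured as a small store‑buffering litmus test. The sketch below is a deliberately simplified model, not the kernel code. It assumes the waiter first announces itself (a store) and then re‑reads the futex word (a load), while the waker first updates the futex word (a store) and then checks for waiters (a load). The names waiters, futex, and lost_wakeup_possible are illustrative.

```python
from itertools import permutations

OPS = ("waiter_store", "waiter_load", "waker_store", "waker_load")


def outcomes(with_barriers):
    """Enumerate what the two loads can observe under each allowed ordering."""
    seen = set()
    for order in permutations(OPS):
        if with_barriers:
            # A barrier between store and load keeps each side in program order.
            if order.index("waiter_store") > order.index("waiter_load"):
                continue
            if order.index("waker_store") > order.index("waker_load"):
                continue
        memory = {"waiters": 0, "futex": 0}
        observed = {}
        for op in order:
            if op == "waiter_store":
                memory["waiters"] = 1                     # "I am about to sleep"
            elif op == "waiter_load":
                observed["futex"] = memory["futex"]       # unchanged -> go to sleep
            elif op == "waker_store":
                memory["futex"] = 1                       # lock released
            else:
                observed["waiters"] = memory["waiters"]   # 0 -> skip the wake-up
        seen.add((observed["futex"], observed["waiters"]))
    return seen


def lost_wakeup_possible(with_barriers):
    """Waiter sees the old futex value AND waker sees no waiters."""
    return (0, 0) in outcomes(with_barriers)
```

With barriers the outcome (0, 0) cannot occur, so at least one side always notices the other. Without them, each load may effectively run before the same side's store, the waiter goes to sleep, and the waker skips the wake‑up.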

Inspection of pthread structures shows __kind equals 0, confirming the lock is non‑shared and thus affected by the bug.
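For readers who want to interpret other __kind values, the following sketch decodes the field. It assumes glibc's NPTL conventions, in which the low two bits select the mutex type and bit 128 marks a process‑shared mutex; these are glibc internals rather than a public API and may differ between versions. describe_kind is an illustrative helper.

```python
# glibc-internal values (assumption: NPTL layout; not a public interface).
PTHREAD_MUTEX_PSHARED_BIT = 128
MUTEX_TYPES = {0: "normal", 1: "recursive", 2: "errorcheck", 3: "adaptive"}


def describe_kind(kind):
    """Decode the __kind field of a pthread mutex as printed by GDB."""
    return {
        "type": MUTEX_TYPES[kind & 3],
        "process_shared": bool(kind & PTHREAD_MUTEX_PSHARED_BIT),
    }
```

A value of 0 decodes to a normal, process‑private mutex, which is the case affected here.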

Upgrading the kernel to 2.6.32-504.23.4 resolved the hang for both xtrabackup and Hadoop Java processes.

Conclusion – GDB proved effective for real‑time user‑space analysis, while kernel‑level diagnostics may benefit from tools like SystemTap. Sending a signal (e.g., via jstack, gdb, or pstack) can wake threads stuck in futex_wait_queue_me because the function sets the task state to TASK_INTERRUPTIBLE, allowing interruption by signals.
