How to Automate Coredump Detection and Debugging in OpenBMC
This article explains how the STE team built an integrated workflow for automatic coredump perception, collection, reporting, and analysis in OpenBMC, covering BMC fundamentals, OpenBMC architecture, pain points, offline debugging with IPK packages, and future automation enhancements.
Background
A Baseboard Management Controller (BMC) is an independent microcontroller on a server that provides 24/7 hardware monitoring and management. Its tasks include health monitoring, fan control, remote KVM, event logging, and firmware updates.
1. BMC Overview
Hardware monitoring and diagnostics for CPU, memory, storage, fans, temperature sensors.
Thermal management by adjusting fan speed.
Remote management via KVM for power control and OS installation.
Event logging and alerting.
Firmware update capability for BIOS, CPLD, power, retimer, etc.
2. OpenBMC Introduction
OpenBMC aims to provide an open, flexible, and customizable BMC solution. It is built with Yocto and uses systemd and D-Bus as its user-space foundation.
3. Pain Points
In a 24/7 monitoring environment, rare coredumps can cause unpredictable consequences and are hard to reproduce, consuming significant manpower and time, especially when the affected machines are not directly accessible.
Coredump Automatic Detection and Debugging Practice
The STE team built an integrated workflow that perceives, collects, reports, and automatically analyses coredumps, enabling faster fault detection, log collection, and environment recreation.
Perception
systemd-coredump captures the coredumps of crashed Linux processes and stores them.
An internal debug-collector daemon watches for coredump events and triggers the collection process.
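The watching step can be sketched as a simple directory scan. This is a minimal illustration, not the team's debug-collector. It assumes the collector is pointed at the directory where systemd-coredump stores its files (commonly /var/lib/systemd/coredump) and that file names start with "core.". The function name find_new_coredumps is hypothetical, and a real daemon would more likely react to file-system notifications than poll.

```python
import os


def find_new_coredumps(directory, seen):
    """Return core files in `directory` not yet recorded in `seen` (sorted by name)."""
    new_files = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Assumption: stored cores are named "core.<comm>.<uid>..."; skip anything else.
        if name.startswith("core.") and os.path.isfile(path) and path not in seen:
            seen.add(path)  # remember the file so it is reported only once
            new_files.append(path)
    return new_files
```

Calling the function periodically with the same `seen` set yields each new core file exactly once, which is the point where collection would be triggered.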
Collecting coredumps
When a coredump event is detected inside the BMC firmware, the following logs are gathered:
Coredump file (memory snapshot of the crashed process).
Journal logs containing the PID and related system messages.
os‑release information indicating firmware version.
Additional system and application logs.
After collection, logs are packaged and pushed to a designated server, marked as a coredump, and then forwarded to an alert platform.
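The packaging step might look roughly like the sketch below, which bundles a core file, a journal excerpt, and the os-release contents into one tar.gz. The function names and archive member names (journal.log, os-release) are assumptions made for illustration. The upload and alert-forwarding steps are left out because the article does not describe their interfaces.

```python
import io
import os
import tarfile


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict, dropping optional quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        info[key] = value.strip().strip('"').strip("'")
    return info


def package_dump(core_path, journal_text, os_release_text, out_path):
    """Bundle the core file, a journal excerpt, and os-release into one tar.gz."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(core_path, arcname=os.path.basename(core_path))
        # The text logs are added from memory; member names are hypothetical.
        for name, text in (("journal.log", journal_text), ("os-release", os_release_text)):
            data = text.encode("utf-8")
            member = tarfile.TarInfo(name=name)
            member.size = len(data)
            tar.addfile(member, io.BytesIO(data))
    return out_path
```

Keeping os-release inside the archive lets the analysis side later pick IPK packages that match the firmware version that produced the dump.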
Offline Debugging
Two approaches are described:
Legacy solution: Clone the source code at the coredump‑producing version, compile locally, and manually copy required debug dependencies.
Current solution: Use IPK packages (opkg) to package all required binaries and debug symbols. Yocto can generate these IPK packages and resolve dependencies automatically.
<code>$ bitbake package-index</code>
Prepared IPK packages are stored in $BUILDDIR/tmp/deploy/ipk and can be uploaded to a remote file server for CI integration.
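As a rough sketch of how such a feed could be consumed, the helpers below render an opkg configuration and build an install command that targets a scratch root directory instead of the host system. The feed URL, architecture names, and helper names are hypothetical. The sketch assumes the usual layout of one sub-directory per architecture under the IPK deploy directory and the Yocto convention that debug symbols ship in `<package>-dbg` packages.

```python
def opkg_conf(feed_base_url, archs):
    """Render a minimal opkg config: one feed per architecture directory, plus arch lines."""
    base = feed_base_url.rstrip("/")
    lines = [f"src/gz feed-{arch} {base}/{arch}" for arch in archs]
    # A higher priority number makes opkg prefer that architecture's packages.
    lines += [f"arch {arch} {priority}" for priority, arch in enumerate(archs, start=1)]
    return "\n".join(lines) + "\n"


def opkg_install_cmd(rootfs, conf_path, packages):
    """Build the opkg command that installs `packages` into a scratch rootfs."""
    return ["opkg", "--offline-root", rootfs, "--conf", conf_path, "install", *packages]
```

For example, `opkg_install_cmd("/tmp/root", "/tmp/opkg.conf", ["systemd", "systemd-dbg"])` yields a command that would place the crashed binary and its symbols under /tmp/root. An `opkg update` with the same options would normally run first so the package index is available.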
Debug Execution
Debugging is performed with GDB using the prepared rootfs and the core file.
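A GDB invocation for this step could be assembled as in the sketch below. It is an illustration under assumptions: the helper name is hypothetical, gdb-multiarch is assumed as the debugger binary because the BMC target usually differs from the host architecture, and the rootfs is the directory that the IPK packages were installed into.

```python
import os


def gdb_backtrace_cmd(rootfs, execfn, core_path, gdb="gdb-multiarch"):
    """Build a non-interactive GDB command that prints a backtrace for a core file."""
    # The crashed program's path inside the target (execfn) is re-rooted under rootfs.
    binary = os.path.join(rootfs, execfn.lstrip("/"))
    return [
        gdb, "-q", "-batch",
        # -iex runs before the files are loaded, so shared libraries resolve in rootfs.
        "-iex", f"set sysroot {rootfs}",
        "-ex", "bt",
        binary, core_path,
    ]
```

Dropping `-batch` and the `bt` command from the list gives an interactive session against the same rootfs and core file.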
Automation
Dump analysis service consumes coredump log URLs and runs the debug_dump script.
Alert bot generates one‑click links to the analysis service.
Web UI allows users to trigger the script and perform GDB analysis directly in a browser.
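The one-click link produced by the alert bot can be pictured as a URL that carries the dump location as a query parameter. The sketch below is hypothetical: the service base URL, the /analyze path, and the url parameter name are all assumptions, since the article does not describe the analysis service's interface.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def analysis_link(service_base, dump_url):
    """Build a link that hands the dump URL to a (hypothetical) analysis endpoint."""
    query = urlencode({"url": dump_url})  # percent-encodes the embedded URL
    return f"{service_base.rstrip('/')}/analyze?{query}"


def dump_url_from_link(link):
    """Recover the dump URL on the service side."""
    return parse_qs(urlsplit(link).query)["url"][0]
```

Encoding the dump URL keeps characters such as `/` and `:` from being misread as part of the service's own path.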
<code>~ ./debug_dump.py -u https://<path/to/your/ipk/source>/bmc_dump/obmcdump_coredump_22_67.tar.gz
INFO:debug_dump:Found core execfn /lib/systemd/systemd-journald
INFO:debug_dump:Downloading from https://<path/to/your/ipk/source>/bmc_ipk/ipks.tar to /tmp/ipkdbg_n_zlxpd3/ipks.tar
...
Installing systemd (250.3) on root.
...
Core was generated by `/lib/systemd/systemd-journald'.
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:45
#1 0x76be7fd0 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#2 0x76bd2428 in __GI_abort () at abort.c:79
...</code>
Conclusion
The workflow dramatically improves coredump handling efficiency, reducing manpower and time costs while accelerating fault isolation. Future work includes intelligent analysis with machine learning, expanding automation tools, enhancing real‑time monitoring, and collaborating with the OpenBMC community.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.