How to Tame a Rogue Backup Script That Crushed CPU on a Production Server
A production engineer receives a P2 CPU‑load alert, diagnoses the offending Python backup script by inspecting top and ps outputs, discovers it was compressing a 350 GB log directory, and resolves the issue with a forced kill and post‑mortem best‑practice advice.
Grafana raised a P2 alert: user-profile-service on an 8‑core host had a 5‑minute CPU load average > 10.
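A load average is usually read relative to the number of cores. The following is a minimal sketch of that comparison, assuming a Unix host; the helper names load_per_core and current_load_per_core are hypothetical and do not come from the original incident.

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """Hypothetical helper: normalize a load average by the core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def current_load_per_core() -> float:
    """Read the live 5-minute load average (Unix only) and normalize it."""
    _one, five, _fifteen = os.getloadavg()
    return load_per_core(five, os.cpu_count() or 1)
```

With the figures from the alert, load_per_core(10, 8) gives 1.25, meaning that on average more tasks were runnable (or, on Linux, waiting in uninterruptible I/O) than the host has cores.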
Step 1 – Identify the offending process
SSH to the server and run top:
ssh [email protected]
top
The top row of the process list shows a process with PID 21588 consuming 100 % of one CPU core, with 12.5 GB of virtual memory and 1.1 GB of resident memory, running as user app_dev:
  PID USER     PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
21588 app_dev  20  0 12.5g 1.1g 2212 R 100.0  3.5  2:15.88 /usr/bin/python3 /opt/scripts/dev_backup.sh
Step 2 – Confirm details with ps
Exit top (press q) and run:
ps aux | grep 21588
The output confirms the full command line and shows that the process was started from an interactive SSH session (pts/0), not as a daemon.
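The same check can be scripted. This is a rough sketch, assuming a POSIX-style ps is on the PATH; process_summary and is_interactive_tty are hypothetical names introduced here for illustration.

```python
import subprocess


def is_interactive_tty(tty: str) -> bool:
    """Hypothetical helper: True if the TTY column names a terminal.

    Linux ps prints something like 'pts/0' for an SSH session and '?'
    for a process with no controlling terminal (for example a daemon).
    """
    return tty.startswith(("pts/", "tty"))


def process_summary(pid: int) -> dict:
    """Ask ps for the user, controlling terminal and command line of one PID."""
    out = subprocess.run(
        # A trailing '=' suppresses the column header; one -o per field
        # keeps the invocation portable across ps implementations.
        ["ps", "-o", "user=", "-o", "tty=", "-o", "args=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    user, tty, args = out.split(None, 2)
    return {"user": user, "tty": tty, "args": args,
            "interactive": is_interactive_tty(tty)}
```

Calling process_summary(21588) on the affected host would be expected to report the pts/0 terminal seen above; ps exits non-zero for a PID that no longer exists, so the call raises subprocess.CalledProcessError in that case.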
Step 3 – Inspect the script
Display the script /opt/scripts/dev_backup.sh:
#!/usr/bin/python3
import os, tarfile, time

print("--- Starting user data backup (temporary) ---")
source_dir = "/var/log/app/user_profile"
target_file = f"/tmp/backup-{int(time.time())}.tar.gz"
try:
    # Create a gzip-compressed tar archive of the whole directory
    with tarfile.open(target_file, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    print(f"--- Backup complete: {target_file} ---")
except Exception as e:
    print(f"Backup failed: {e}")
The script uses the Python tarfile module in w:gz mode to compress the directory /var/log/app/user_profile. Gzip compression is CPU‑intensive, and the script runs it in a single process, which matches the one fully loaded core seen in top.
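One knob that affects the CPU cost is the gzip compression level. The sketch below is an assumption-laden variant of the script, not the original author's code: make_backup is a hypothetical function name, and it passes the compresslevel argument that tarfile.open accepts in w:gz mode.

```python
import os
import tarfile


def make_backup(source_dir: str, target_file: str, compresslevel: int = 1) -> str:
    """Hypothetical helper: archive source_dir into a gzip-compressed tarball.

    Level 1 is the fastest setting and uses the least CPU; level 9 produces
    the smallest output. tarfile uses 9 when the argument is omitted.
    """
    with tarfile.open(target_file, "w:gz", compresslevel=compresslevel) as tar:
        # Store paths relative to the directory name, as the original script does
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    return target_file
```

A lower level trades a larger archive for less CPU time per byte; it reduces the load but does not make compressing hundreds of gigabytes cheap.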
Step 4 – Determine data size
Check the size of the source directory:
du -sh /var/log/app/user_profile
The command returns 350 GB, far more than the log files visible at first glance suggest. Compressing 350 GB explains the sustained 100 % CPU usage.
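A script can make the same size check before it starts compressing. This is a minimal sketch, assuming the goal is a rough pre-flight guard; dir_size_bytes is a hypothetical helper. It sums apparent file sizes, so the result can differ somewhat from du, which reports disk blocks used.

```python
import os
import stat


def dir_size_bytes(path: str) -> int:
    """Hypothetical helper: total apparent size of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                st = os.lstat(os.path.join(root, name))  # do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total
```

A backup script could call dir_size_bytes(source_dir) first and refuse to continue above a threshold chosen for the host.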
Step 5 – Terminate the runaway process
In this case a normal kill 21588 (SIGTERM) does not stop the script, apparently because it is busy in kernel‑level I/O. Force termination with SIGKILL:
sudo kill -9 21588
The process disappears from top and the load average drops rapidly.
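The escalation from SIGTERM to SIGKILL can be sketched in code. This is an illustration under stated assumptions, not a tool used in the incident: stop_process is a hypothetical name, the grace period is arbitrary, and the caller needs permission to signal the target (the article uses sudo for this).

```python
import os
import signal
import time


def stop_process(pid: int, grace_seconds: float = 10.0) -> signal.Signals:
    """Hypothetical helper: send SIGTERM, wait, then fall back to SIGKILL.

    Returns the last signal that was sent. A terminated child that its
    parent has not yet reaped (a zombie) still counts as existing here.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the PID still exists
        except ProcessLookupError:
            return signal.SIGTERM  # process exited after the polite request
        time.sleep(0.2)
    try:
        os.kill(pid, signal.SIGKILL)  # last resort: no cleanup is possible
    except ProcessLookupError:
        return signal.SIGTERM  # exited between the last check and the kill
    return signal.SIGKILL
```

Trying SIGTERM first gives a well-behaved process the chance to close files and remove partial output before the forced kill.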
Step 6 – Post‑mortem recommendations
Never run untested scripts on production servers.
Use kill -9 only as a last resort, since it gives the process no chance to clean up.
Run heavy, temporary jobs with low CPU priority using nice:
nice -n 19 python3 /opt/scripts/dev_backup.sh
This sequence demonstrates a systematic approach to diagnosing high CPU load: interpreting top and ps output, verifying script behavior, and safely terminating a misbehaving process.
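The low-priority recommendation can also be applied from inside a script, so it holds even when someone forgets the nice prefix. A minimal sketch, assuming a Unix host; lower_priority is a hypothetical helper built on the standard os.nice call.

```python
import os


def lower_priority(increment: int = 19) -> int:
    """Hypothetical helper: raise this process's niceness, return the new value.

    A higher niceness means a lower CPU scheduling priority. An unprivileged
    process can raise its niceness but cannot lower it again afterwards.
    """
    return os.nice(increment)
```

Calling lower_priority() as the first statement of a backup script would have a similar effect to launching it with nice -n 19.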
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
