
How to Tame a Rogue Backup Script That Crushed CPU on a Production Server

A production engineer receives a P2 CPU‑load alert, diagnoses the offending Python backup script by inspecting top and ps outputs, discovers it was compressing a 350 GB log directory, and resolves the issue with a forced kill and post‑mortem best‑practice advice.

Open Source Linux

Grafana raised a P2 alert: user-profile-service on an 8‑core host had a 5‑minute CPU load average > 10.
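To put the alert in context, load average is usually read relative to the number of cores. The following is a minimal sketch, not the alerting rule itself: the helper names `load_per_core` and `is_overloaded` and the threshold of 1.0 per core are assumptions for illustration.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def is_overloaded(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """True when the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__":
    # Values from the alert: load average 10 on an 8-core host.
    print(load_per_core(10.0, 8))  # 1.25
    # On a live Unix host, os.getloadavg() returns the 1-, 5- and 15-minute values.
    one, five, fifteen = os.getloadavg()
    print(is_overloaded(five, os.cpu_count() or 1))
```

A per-core value above 1 means more tasks are runnable or waiting than the cores can serve at once.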

Step 1 – Identify the offending process

SSH to the server and run top:

ssh <user>@<host>
top

The top process row shows PID 21588, running as user app_dev, consuming 100 % of one CPU core with 12.5 GB of virtual memory and 1.1 GB of resident memory:

  PID USER     PR  NI   VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
21588 app_dev  20   0  12.5g  1.1g  2212 R 100.0  3.5  2:15.88 /usr/bin/python3 /opt/scripts/dev_backup.sh
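When the same check has to be repeated, the row above can be split into named fields. This is a small sketch that assumes the default top column order shown here; `parse_top_row` is a hypothetical helper, not part of any library.

```python
from typing import Dict

TOP_COLUMNS = ["PID", "USER", "PR", "NI", "VIRT", "RES", "SHR",
               "S", "%CPU", "%MEM", "TIME+", "COMMAND"]


def parse_top_row(row: str) -> Dict[str, str]:
    """Split one process row from top into a column-name -> value mapping."""
    # COMMAND may contain spaces, so split at most len(TOP_COLUMNS) - 1 times.
    fields = row.split(None, len(TOP_COLUMNS) - 1)
    if len(fields) != len(TOP_COLUMNS):
        raise ValueError(f"expected {len(TOP_COLUMNS)} fields, got {len(fields)}")
    return dict(zip(TOP_COLUMNS, fields))


# The process row from the top output in this article.
ROW = ("21588 app_dev 20   0 12.5g 1.1g 2212 R 100.0 3.5  2:15.88 "
       "/usr/bin/python3 /opt/scripts/dev_backup.sh")

if __name__ == "__main__":
    proc = parse_top_row(ROW)
    print(proc["PID"], proc["%CPU"], proc["COMMAND"])
```

Limiting the number of splits keeps the full command line together in the last field.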

Step 2 – Confirm details with ps

Exit top (press q) and run:

ps aux | grep 21588

The output confirms the full command line and shows that the process was started from an interactive SSH session (pts/0), not as a daemon.
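The terminal check can be sketched in code as well. In `ps aux` output the TTY column shows `?` for a process without a controlling terminal and a name such as `pts/0` for one started from a terminal session. The helper names and both sample lines below are hypothetical; the numeric values are illustrative, not taken from the incident.

```python
PS_AUX_COLUMNS = ["USER", "PID", "%CPU", "%MEM", "VSZ", "RSS",
                  "TTY", "STAT", "START", "TIME", "COMMAND"]


def parse_ps_aux_line(line: str) -> dict:
    """Split one `ps aux` process line into a column-name -> value mapping."""
    # COMMAND may contain spaces, so it absorbs the remainder of the line.
    fields = line.split(None, len(PS_AUX_COLUMNS) - 1)
    if len(fields) != len(PS_AUX_COLUMNS):
        raise ValueError(f"expected {len(PS_AUX_COLUMNS)} fields, got {len(fields)}")
    return dict(zip(PS_AUX_COLUMNS, fields))


def started_from_terminal(line: str) -> bool:
    """True if the TTY column names a terminal rather than '?'."""
    return parse_ps_aux_line(line)["TTY"] != "?"


if __name__ == "__main__":
    # Hypothetical sample lines for illustration only.
    interactive = ("app_dev 21588 100 3.5 13107200 1153434 pts/0 R+ 10:02 2:15 "
                   "/usr/bin/python3 /opt/scripts/dev_backup.sh")
    daemon = "root 812 0.0 0.1 12345 678 ? Ss 09:00 0:00 /usr/sbin/cron -f"
    print(started_from_terminal(interactive), started_from_terminal(daemon))
```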

Step 3 – Inspect the script

Display the script /opt/scripts/dev_backup.sh (despite the .sh extension, it is a Python script):

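#!/usr/bin/python3
# Ad-hoc backup: tar and gzip the user_profile log directory into /tmp.
import os, tarfile, time
print("--- Starting user data backup (temporary) ---")
source_dir = "/var/log/app/user_profile"
# Timestamped archive name: /tmp/backup-<epoch seconds>.tar.gz
target_file = f"/tmp/backup-{int(time.time())}.tar.gz"
try:
    # "w:gz" compresses with gzip while the archive is written
    with tarfile.open(target_file, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    print(f"--- Backup complete: {target_file} ---")
except Exception as e:
    print(f"Backup failed: {e}")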

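The script uses Python's tarfile module in w:gz mode to compress the directory /var/log/app/user_profile into a single archive under /tmp. Gzip compression is CPU‑intensive and runs in a single thread here, which matches the one core pinned at 100 % in top.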
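To get a feel for how much CPU time gzip costs at different levels, a rough measurement like the following can help. It is a sketch under stated assumptions: `gzip_cost` is a hypothetical helper, the sample data is synthetic, and real logs will compress differently. As far as I know, tarfile's w:gz mode defaults to the highest compression level (9) unless compresslevel is passed.

```python
import gzip
import time


def gzip_cost(data: bytes, level: int):
    """Return (cpu_seconds, compressed_size) for gzip at the given level."""
    start = time.process_time()
    compressed = gzip.compress(data, compresslevel=level)
    return time.process_time() - start, len(compressed)


if __name__ == "__main__":
    # Synthetic log-like sample; timings and sizes depend on the machine and data.
    sample = b"2024-01-01 12:00:00 INFO user_profile request ok id=12345\n" * 200_000
    for level in (1, 6, 9):
        seconds, size = gzip_cost(sample, level)
        print(f"level={level} cpu={seconds:.3f}s size={size} bytes")
```

Lower levels generally trade a larger archive for less CPU time, which can matter on a host that is also serving production traffic.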

Step 4 – Determine data size

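Check the size of the source directory:

du -sh /var/log/app/user_profile

The command reports 350 GB, far more than the visible log files suggest. Gzip‑compressing 350 GB explains the sustained 100 % CPU usage. Reading that much data also generates sustained disk I/O, and on Linux tasks blocked on I/O count toward the load average as well.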
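A size check like this can also run before any archive is created. The sketch below sums apparent file sizes, which can differ somewhat from the disk usage that du reports; `dir_size_bytes`, `safe_to_archive` and the limit are hypothetical names and values chosen for illustration.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the apparent sizes of all files below path (symlinks not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it in this sketch.
                continue
    return total


def safe_to_archive(path: str, limit_bytes: int) -> bool:
    """True if the directory is no larger than the given limit."""
    return dir_size_bytes(path) <= limit_bytes


if __name__ == "__main__":
    limit = 10 * 1024 ** 3  # assumed 10 GiB guard, adjust to the host
    print(safe_to_archive("/var/log/app/user_profile", limit))
```

A backup script that refuses to start above such a limit would have stopped before it began compressing 350 GB.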

Step 5 – Terminate the runaway process

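In this incident, a plain kill 21588 (SIGTERM) did not stop the script, which was busy in kernel‑level I/O at the time. A signal is only acted on once the process returns from the kernel, so delivery can be delayed while a process is blocked in uninterruptible I/O. Force termination with SIGKILL:

sudo kill -9 21588

The process disappears from top and the load average drops rapidly.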
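The escalation from SIGTERM to SIGKILL can be sketched for a job launched from Python. This assumes the job is a child process held as a `subprocess.Popen` object; `stop_process` and the grace period are hypothetical. For an unrelated PID such as 21588, the kill commands above remain the direct route.

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the exit status; on POSIX a negative value -N means the
    process was ended by signal N.
    """
    proc.terminate()  # SIGTERM on POSIX: gives the process a chance to clean up
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be caught or ignored
        return proc.wait()
```

Trying SIGTERM first keeps SIGKILL as the fallback rather than the default.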

Step 6 – Post‑mortem recommendations

Never run untested scripts on production servers.

Use kill -9 only as a last resort, since it gives the process no chance to clean up. In this case, the partially written archive under /tmp has to be removed by hand.

Run heavy, temporary jobs with low CPU priority using nice:

nice -n 19 python3 /opt/scripts/dev_backup.sh

This sequence demonstrates a systematic approach to diagnosing high CPU load, interpreting top and ps output, verifying script behavior, and safely terminating a misbehaving process.
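The nice recommendation can also be applied when a job is launched from Python rather than from the shell. This is a minimal POSIX-only sketch; `run_low_priority` is a hypothetical helper, and it relies on `os.nice` being called in the child just before the command starts.

```python
import os
import subprocess
import sys


def run_low_priority(cmd, niceness: int = 19) -> subprocess.CompletedProcess:
    """Run cmd with its niceness raised by `niceness` (POSIX only)."""
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(niceness),  # executed in the child before exec
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # The child reports its own niceness; os.nice(0) returns it unchanged.
    result = run_low_priority([sys.executable, "-c", "import os; print(os.nice(0))"])
    print(result.stdout.strip())
```

The Python documentation warns that `preexec_fn` is not safe in multi-threaded programs. In that case, prefixing the command with nice -n 19 as shown above is the simpler option.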

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
