Operations 7 min read

Rescuing a Critical CPU Outage: My Step-by-Step Troubleshooting Guide

After a midnight CPU alarm threatened service stability, I walked through rapid diagnosis with top and htop, identified JVM bottlenecks using jstat and async‑profiler, refactored a Java sorting algorithm, added caching, optimized database queries, containerized the service, and set up Prometheus‑Grafana alerts to prevent future incidents.

Efficient Ops

1. Initial Diagnosis: Quickly Locate the Problem

I logged into the server and ran top to view system resource usage. The output showed CPU usage near 100% and a load average far exceeding the number of cores.

$ top

Next, I used htop for more detailed process information and discovered several Java processes consuming most of the CPU.

$ htop
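The "load average versus number of cores" comparison from top can be sketched from inside a JVM as well. This is a minimal illustration rather than part of the original incident work: the class name LoadCheck and the helper loadPerCore are made up for this example, and the load average is only reported on platforms that expose it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

class LoadCheck {
    /** Load per core; values well above 1.0 mean runnable tasks are queuing for CPU time. */
    static double loadPerCore(double loadAverage, int cores) {
        if (cores <= 0) throw new IllegalArgumentException("cores must be positive");
        return loadAverage / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // 1-minute average; negative if unavailable
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("Load average is not available on this platform");
            return;
        }
        System.out.printf("load=%.2f cores=%d load/core=%.2f%n",
            load, cores, loadPerCore(load, cores));
    }
}
```

A load of 16 on an 8-core machine gives a ratio of 2.0, meaning roughly twice as much runnable work as the cores can serve at once.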

2. JVM‑Level Analysis: Finding Hot Methods

Having confirmed that the issue was in the Java application, I inspected the JVM with jstat to check GC activity, sampling every 1000 ms for 10 samples.

$ jstat -gcutil [PID] 1000 10

The output indicated frequent Full GC, a possible cause of high CPU usage.
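A rough in-process counterpart to the jstat check is to sample the JVM's own GC counters and see what share of a time window went to collection. The sketch below assumes it runs inside the application being inspected (it cannot attach to another PID), and the names GcOverhead, totalGcMillis, and gcPercent are illustrative only. The reported collection time is approximate and, for concurrent collectors, is not the same thing as pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcOverhead {
    /** Accumulated GC time in milliseconds across all collectors of this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) total += t;
        }
        return total;
    }

    /** Percentage of a wall-clock window spent in GC. */
    static double gcPercent(long gcMillisBefore, long gcMillisAfter, long windowMillis) {
        if (windowMillis <= 0) throw new IllegalArgumentException("window must be positive");
        return 100.0 * (gcMillisAfter - gcMillisBefore) / windowMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcMillis();
        Thread.sleep(1000); // same 1000 ms interval as the jstat call
        long after = totalGcMillis();
        System.out.printf("GC time in the last second: %.1f%%%n",
            gcPercent(before, after, 1000));
    }
}
```

If 250 ms of a 1000 ms window went to GC, gcPercent returns 25.0; a sustained double-digit share is the kind of signal that frequent Full GC produces.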

I generated a thread dump using jstack and saw many threads in RUNNABLE state executing similar methods.

$ jstack [PID] > thread_dump.txt

To pinpoint hot methods, I ran async‑profiler and produced a flame graph that highlighted a custom sorting algorithm as the main CPU consumer.

$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]
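The pattern spotted in the thread dump (many RUNNABLE threads sitting in similar methods) can be approximated by counting RUNNABLE threads by their top stack frame. This is only a sketch: RunnableFrames and countTopFrames are hypothetical names, and a single snapshot is far coarser than the sampling async‑profiler does. RUNNABLE also includes threads waiting in native I/O, so treat the counts as a hint, not a profile.

```java
import java.util.Map;
import java.util.TreeMap;

class RunnableFrames {
    /** Counts RUNNABLE threads by their top stack frame (class.method). */
    static Map<String, Integer> countTopFrames(Map<Thread, StackTraceElement[]> traces) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : traces.entrySet()) {
            StackTraceElement[] stack = e.getValue();
            if (e.getKey().getState() != Thread.State.RUNNABLE || stack.length == 0) continue;
            String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(frame, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Snapshot of every live thread in this JVM, similar in spirit to one jstack dump
        countTopFrames(Thread.getAllStackTraces())
            .forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

Taking several snapshots a few seconds apart and looking for frames that keep reappearing gives a crude version of what the flame graph shows directly.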

3. Application‑Layer Optimization: Refactoring the Algorithm

The culprit was a custom sort designed for small data sets but now handling large volumes. I rewrote it using Java 8 parallel streams:

List<Data> sortedData = data.parallelStream()
    .sorted(Comparator.comparing(Data::getKey))
    .collect(Collectors.toList());
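To see why a sort built for small inputs falls over at scale, the sketch below compares a quadratic insertion sort with the parallel-stream version. The insertion sort is an assumption chosen for illustration, since the original custom algorithm is not shown here, and the Data class with an int key is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

class SortComparison {
    /** Hypothetical stand-in for the article's Data class. */
    static final class Data {
        private final int key;
        Data(int key) { this.key = key; }
        int getKey() { return key; }
    }

    /** Quadratic insertion sort: fine for tens of elements, costly for large lists. */
    static List<Data> insertionSort(List<Data> input) {
        List<Data> out = new ArrayList<>(input);
        for (int i = 1; i < out.size(); i++) {
            Data current = out.get(i);
            int j = i - 1;
            while (j >= 0 && out.get(j).getKey() > current.getKey()) {
                out.set(j + 1, out.get(j)); // shift larger elements one slot right
                j--;
            }
            out.set(j + 1, current);
        }
        return out;
    }

    /** Same shape as the rewrite above: library sort on a parallel stream. */
    static List<Data> parallelSort(List<Data> input) {
        return input.parallelStream()
            .sorted(Comparator.comparing(Data::getKey))
            .collect(Collectors.toList());
    }

    static List<Integer> keys(List<Data> list) {
        return list.stream().map(Data::getKey).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        List<Data> data = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) data.add(new Data(random.nextInt(1_000_000)));

        long t0 = System.nanoTime();
        List<Data> slow = insertionSort(data);
        long t1 = System.nanoTime();
        List<Data> fast = parallelSort(data);
        long t2 = System.nanoTime();

        System.out.println("same key order: " + keys(slow).equals(keys(fast)));
        System.out.printf("insertion sort: %d ms, parallel stream: %d ms%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

Timings depend on the machine and JIT warm-up, so the numbers are only indicative. Doubling the input roughly quadruples the work for the quadratic version, while the library sort grows as n log n.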

I also added a cache to avoid repeated calculations:

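@Cacheable("sortedData")
public List<Data> getSortedData() {
    // With caching enabled, Spring stores the first result in the "sortedData"
    // cache and serves later calls from it without re-running the sort
    return data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());
}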
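What the cache buys can be shown without Spring: compute once, reuse until the data changes, then evict. The following is a simplified sketch, not how Spring implements @Cacheable. SimpleCache and its methods are invented for this example, with evict playing the role that @CacheEvict plays in Spring.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class SimpleCache {
    private final Map<String, Object> store = new ConcurrentHashMap<>();

    /** Returns the cached value for key, computing it on the first call only. */
    @SuppressWarnings("unchecked")
    <T> T get(String key, Supplier<T> loader) {
        return (T) store.computeIfAbsent(key, k -> loader.get());
    }

    /** Drops the entry so the next get() recomputes it; call this when the data changes. */
    void evict(String key) {
        store.remove(key);
    }

    public static void main(String[] args) {
        SimpleCache cache = new SimpleCache();
        AtomicInteger computations = new AtomicInteger();
        Supplier<String> expensive = () -> {
            computations.incrementAndGet(); // stands in for the costly sort
            return "sorted";
        };

        cache.get("sortedData", expensive);
        cache.get("sortedData", expensive); // served from the cache
        cache.evict("sortedData");
        cache.get("sortedData", expensive); // recomputed after eviction
        System.out.println("computations: " + computations.get()); // prints "computations: 2"
    }
}
```

The eviction step matters as much as the caching: a cached sort result that is never invalidated will keep returning stale data after the underlying list changes.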

4. Database Optimization: Indexes and Query Improvements

During the investigation I found inefficient SQL queries. Using EXPLAIN revealed a full‑table scan on a large table.

EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';

I created an appropriate index and rewrote part of the ORM query to use native SQL:

CREATE INDEX idx_status ON large_table(status);

@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
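The difference between the full-table scan and the indexed lookup can be pictured with an in-memory analogy. This is a loose sketch under stated assumptions: a real database index is typically a B-tree rather than a hash map, and the Row class, the "ARCHIVED" status value, and the row counts are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexSketch {
    /** Hypothetical row of large_table; only the columns needed here. */
    static final class Row {
        final long id;
        final String status;
        Row(long id, String status) { this.id = id; this.status = status; }
    }

    /** Full scan: every row is inspected, like the plan EXPLAIN reported. */
    static List<Row> scan(List<Row> table, String status) {
        List<Row> hits = new ArrayList<>();
        for (Row row : table) {
            if (row.status.equals(status)) hits.add(row);
        }
        return hits;
    }

    /** Built once and maintained on every write: the cost side of an index. */
    static Map<String, List<Row>> buildIndex(List<Row> table) {
        Map<String, List<Row>> index = new HashMap<>();
        for (Row row : table) {
            index.computeIfAbsent(row.status, s -> new ArrayList<>()).add(row);
        }
        return index;
    }

    /** Indexed lookup: goes straight to the matching rows. */
    static List<Row> lookup(Map<String, List<Row>> index, String status) {
        return index.getOrDefault(status, Collections.emptyList());
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        for (long id = 0; id < 100_000; id++) {
            table.add(new Row(id, id % 50 == 0 ? "ACTIVE" : "ARCHIVED"));
        }
        Map<String, List<Row>> index = buildIndex(table);
        System.out.println("scan hits:  " + scan(table, "ACTIVE").size());
        System.out.println("index hits: " + lookup(index, "ACTIVE").size());
    }
}
```

Both paths return the same rows; the scan touches all 100,000 of them while the lookup touches only the matches. The benefit shrinks when most rows share the queried status, in which case a database optimizer may still prefer a scan, so it is worth re-running EXPLAIN after creating the index.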

5. Deployment Optimization: Container Isolation

To prevent a single service from affecting the whole system, I containerized the application with Docker:

FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]

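Using Docker Compose, I limited CPU and memory resources. The deploy.resources limits are applied by Docker Compose v2 and in swarm mode; the legacy docker-compose v1 ignores them unless it is run with --compatibility: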

version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 3G  # keep above the -Xmx2g heap from the Dockerfile, or the container is OOM-killed
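Container limits change what the JVM believes it has to work with, which matters for the parallel-stream sort because parallel streams run on the common fork-join pool. The small probe below is a sketch for checking this from inside the container. ContainerView and toMiB are illustrative names, and the values printed depend on the JVM version and on whether it is container-aware.

```java
import java.util.concurrent.ForkJoinPool;

class ContainerView {
    /** Converts bytes to whole mebibytes for readable output. */
    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // CPUs the JVM believes it may use; a container-aware JVM derives this from the CPU limit
        System.out.println("availableProcessors = " + rt.availableProcessors());
        // Parallelism of the common pool that parallelStream() work is scheduled on
        System.out.println("commonPoolParallelism = " + ForkJoinPool.getCommonPoolParallelism());
        // Upper bound of the heap: -Xmx if set, otherwise derived from the memory the JVM detects
        System.out.println("maxHeapMiB = " + toMiB(rt.maxMemory()));
    }
}
```

With a CPU limit below one core, a parallel stream has little or no parallelism to exploit, so the limit and the sorting strategy are best tuned together.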

6. Monitoring & Alerting: Proactive Protection

Finally, I upgraded the monitoring stack with Prometheus and Grafana and added a smarter alert rule for high CPU usage:

- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
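The arithmetic behind the alert expression can be worked through by hand: rate() turns each core's idle-seconds counter into the fraction of time that core was idle, avg by (instance) averages those fractions, and subtracting from 100 gives the busy percentage. The sketch below mirrors that calculation with made-up sample values. CpuAlertMath is a hypothetical name, and the real rate() additionally handles counter resets and extrapolation, which this ignores.

```java
class CpuAlertMath {
    /**
     * CPU usage in percent from per-core idle-time counters (seconds), sampled twice.
     * Mirrors: 100 - avg(rate(idle)) * 100
     */
    static double usagePercent(double[] idleStart, double[] idleEnd, double windowSeconds) {
        if (idleStart.length == 0 || idleStart.length != idleEnd.length || windowSeconds <= 0) {
            throw new IllegalArgumentException(
                "need matching, non-empty samples and a positive window");
        }
        double idleFractionSum = 0;
        for (int core = 0; core < idleStart.length; core++) {
            // rate(): counter increase per second, i.e. the share of time this core was idle
            idleFractionSum += (idleEnd[core] - idleStart[core]) / windowSeconds;
        }
        double avgIdle = idleFractionSum / idleStart.length; // avg by (instance)
        return 100.0 - avgIdle * 100.0;
    }

    /** The "> 80" threshold from the rule. */
    static boolean breaches(double usagePercent) {
        return usagePercent > 80.0;
    }

    public static void main(String[] args) {
        // Two cores over a 300 s (5m) window: idle for 30 s and 60 s respectively
        double usage = usagePercent(
            new double[] {1000, 2000}, new double[] {1030, 2060}, 300);
        System.out.printf("usage=%.1f%% breaches=%b%n", usage, breaches(usage));
    }
}
```

In this example the cores were idle 10% and 20% of the window, so average idle is 15% and usage works out to about 85%, above the threshold. The rule's for: 5m clause then requires the condition to keep holding for five minutes before the alert fires, which filters out short spikes.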

Conclusion: Crisis and Growth

After nearly four hours of intensive work, the system recovered: CPU usage dropped below 30% and response times returned to the millisecond range. The incident reinforced the importance of regular code reviews, performance and stress testing, and a robust monitoring and alerting system.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
