Investigation and Resolution of CPU Spike in Elasticsearch Cluster Caused by JIT Deoptimization

The team traced near-100% CPU spikes in an Elasticsearch cluster during stress tests to frequent JIT deoptimizations triggered by switch-based experiment routing logic. Replacing the switch with a function map eliminated the deoptimizations and stabilized CPU usage.

HelloTech

Background

During a full-link stress test, a few nodes of the ride-sharing matching Elasticsearch (ES) cluster reached nearly 100% CPU usage. In a second round of testing, with a recently launched H3 recall matching A/B experiment disabled, CPU usage stayed stable at around 35%. Re-enabling the experiment caused the same nodes to spike again, which pointed to a link between the A/B experiment and the CPU anomaly.

Problem Reproduction

The issue was reproduced by switching live traffic to one of the two ES clusters during a low-traffic period and generating load from a pre-release environment. Under load, CPU usage on the affected ES node surged from 10% to 100%.

System load accounted for most of the CPU usage (about 90%); network, disk, and memory metrics remained stable.

Flame graphs generated with Arthas showed that most of the CPU time was being spent in Deoptimization::uncommon_trap.

Background Knowledge: JIT Optimization and Deoptimization

Just-In-Time (JIT) compilation translates hot code paths into native machine code at runtime, applying aggressive optimizations based on assumptions drawn from runtime profiling. When runtime conditions change (e.g., code changes such as newly loaded classes, or previously unseen types), those assumptions stop holding and the compiled code becomes invalid. The JVM then enters Deoptimization::uncommon_trap, abandons the affected compiled code, and resumes execution in the interpreter or a less optimized version until correct native code can be regenerated.

Example code before deoptimization:

static void test(Object input) { 
    if (input == null) { 
        return; 
    } 
    // do something 
}

If input is non-null for many iterations, the JIT can treat the null branch as never taken and compile it as an uncommon trap rather than as real code. When a null appears later, the uncommon trap fires and the method is deoptimized.
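
The scenario above can be sketched as a small standalone program. This is only an illustration under assumptions: the class name, the `processed` counter, and the `run` helper are hypothetical and not part of the original article, and whether a deoptimization actually occurs depends on the JVM version, compiler tiers, and inlining decisions.

```java
public class NullCheckDeoptDemo {
    // Counts non-null inputs so the method has an observable effect.
    static int processed = 0;

    static void test(Object input) {
        if (input == null) {
            return; // never taken during warm-up, so the JIT may compile it as an uncommon trap
        }
        processed++;
    }

    // Calls test() with a non-null input `warmups` times, then once with null.
    static int run(int warmups) {
        processed = 0;
        Object value = new Object();
        for (int i = 0; i < warmups; i++) {
            test(value);
        }
        test(null); // first null input: may hit the trap and cause a deoptimization
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("processed = " + run(200_000));
    }
}
```

Running it with the HotSpot flag `-XX:+PrintCompilation` prints compilation events. A "made not entrant" entry for `test` or for a caller it was inlined into, appearing after the null call, would suggest that a deoptimization took place.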

Frequent deoptimizations can stem from:

Rapid code changes that invalidate optimization assumptions.

Dynamic type variations causing type‑mismatch handling.

Program characteristics such as heavy dynamic code generation or complex polymorphic calls.
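
The second cause, a change in the types seen at a call site, can be illustrated with a minimal sketch. All names here (`TypeProfileDemo`, `Shape`, `Triangle`, `Square`) are hypothetical, and the sketch assumes that the JIT speculates on the single receiver type observed during warm-up; the real behavior varies by JVM and settings.

```java
public class TypeProfileDemo {
    interface Shape { int sides(); }

    static final class Triangle implements Shape {
        public int sides() { return 3; }
    }

    static final class Square implements Shape {
        public int sides() { return 4; }
    }

    // One call site (s.sides()); its type profile is what the JIT may speculate on.
    static int totalSides(Shape[] shapes) {
        int total = 0;
        for (Shape s : shapes) {
            total += s.sides();
        }
        return total;
    }

    // Warms up with Triangle only, then introduces Square at the same call site.
    static int run(int warmups) {
        Shape[] onlyTriangles = { new Triangle(), new Triangle() };
        int sum = 0;
        for (int i = 0; i < warmups; i++) {
            sum += totalSides(onlyTriangles); // the profile sees a single receiver type
        }
        Shape[] mixed = { new Triangle(), new Square() };
        sum += totalSides(mixed); // a new receiver type may invalidate the speculation
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("sum = " + run(100_000));
    }
}
```

If the compiled code assumed that every receiver was a `Triangle`, the first `Square` fails that check and execution falls back through an uncommon trap. A single occurrence is cheap; the cost becomes visible only when such invalidations keep recurring on a hot path.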

Issue Identification & Solution

Analysis of the hot threads during the abnormal period showed that the experiment-group routing logic used a switch statement. The JVM identified this as hot code and JIT-compiled it on the assumption that only the branch seen so far (e.g., experiment 1) would be taken. When traffic for other experiments arrived, that assumption broke and triggered deoptimization.

Solution: Replace the switch-based routing with a map of functions, which removes the conditional branch and prevents the deoptimization.

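// Requires java.util.HashMap, java.util.Map, and java.util.function.Function.
// Build the dispatch table once, mapping each experiment version to its handler.
Map<String, Function<Map<String, ScriptDocValues<?>>, Boolean>> methodMap = new HashMap<>();
methodMap.put("exp1", this::executeForExp1);
methodMap.put("exp2", this::executeForExp2);
methodMap.put("exp3", this::executeForExp3);
// Look up the handler for the configured version; get() returns null for an unknown key.
executeFunction = methodMap.get(h3Version);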
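
A self-contained version of the same idea might look like the sketch below. It is not the team's actual code: `ExperimentRouter`, the handler bodies, the field names `fieldA` and `fieldB`, and the fallback for unknown versions are all assumptions made for illustration, and `Map<String, Object>` stands in for `Map<String, ScriptDocValues<?>>` so that the example compiles without Elasticsearch on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ExperimentRouter {
    // Dispatch table: experiment version -> handler (Object stands in for ScriptDocValues<?>).
    private final Map<String, Function<Map<String, Object>, Boolean>> methodMap = new HashMap<>();

    public ExperimentRouter() {
        methodMap.put("exp1", this::executeForExp1);
        methodMap.put("exp2", this::executeForExp2);
        methodMap.put("exp3", this::executeForExp3);
    }

    // Hypothetical handlers; the real matching logic is not shown in the article.
    private Boolean executeForExp1(Map<String, Object> doc) { return doc.containsKey("fieldA"); }
    private Boolean executeForExp2(Map<String, Object> doc) { return doc.containsKey("fieldB"); }
    private Boolean executeForExp3(Map<String, Object> doc) { return !doc.isEmpty(); }

    // Routes by map lookup instead of a switch; unknown versions fall back to false (an assumption).
    public Boolean route(String h3Version, Map<String, Object> doc) {
        Function<Map<String, Object>, Boolean> fn = methodMap.get(h3Version);
        return fn == null ? Boolean.FALSE : fn.apply(doc);
    }

    public static void main(String[] args) {
        ExperimentRouter router = new ExperimentRouter();
        Map<String, Object> doc = new HashMap<>();
        doc.put("fieldA", 1);
        System.out.println(router.route("exp1", doc));
        System.out.println(router.route("exp2", doc));
    }
}
```

When the experiment version is fixed for the lifetime of the object, the lookup can be done once and the resulting function cached, as the `executeFunction` assignment in the article's snippet suggests.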

After the change was applied, repeated stress testing showed stable CPU usage, and the issue was resolved.
