
Why Did a Metaspace Misconfiguration Crash Our Elastic Cloud Service?

A production incident on an elastic‑cloud deployment revealed that capping the JVM Metaspace at 64 MiB, while the application required around 76 MiB, triggered continuous Full GC, causing stop‑the‑world pauses, service‑wide time‑outs, and a costly rollback.

Efficient Ops

1. Background

The service is deployed on an elastic cloud built on Docker containers. The application, “Maybach”, runs on five machines released in three batches: one machine in each of the first two batches and three machines in the third.

2. Incident Timeline

14:34 – first batch deployed, success at 14:35.

14:45 – second batch deployed, success at 14:46.

Service on the first two machines runs normally, logs are printed, no alerts.

14:58 – third batch deployment starts.

14:59‑15:02 – three machines of the third batch are deployed one by one.

15:02 – business‑level alert: payment success rate drops.

15:04 – other metrics show degradation.

15:06 – callers report connection time‑outs for the newly released service.

The service went down completely after the third batch, and the rollback took half an hour.

3. Investigation

Code changes were examined; the main suspect was a JVM parameter change. Because no GC logs had been configured, the team reproduced the issue in a pre‑release environment.

Using

top -H -p 433

it was observed that the service process (pid 433) was consuming high CPU, with most of it attributed to the thread with nid 0x1de, identified as the JVM’s “VM Thread”. The thread dump also showed many “Gang worker” GC threads, indicating heavy GC activity.

<code>"VM Thread" os_prio=0 tid=0x00007f95f825b800 nid=0x1de runnable
"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x00007f95f801c800 nid=0x1b3 runnable
"Gang worker#1 (Parallel GC Threads)" os_prio=0 tid=0x00007f95f801e000 nid=0x1b4 runnable
... (additional threads omitted) ...
JNI global references: 10200
</code>
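The hex “nid” values in the dump correspond to the decimal thread ids that `top -H` prints, so matching a hot thread to a dump entry is a base conversion. A minimal sketch (478 is simply the decimal form of the 0x1de seen above):

```java
public class NidConverter {
    public static void main(String[] args) {
        // `top -H` reports per-thread CPU usage with decimal thread ids,
        // while jstack prints the same ids as hex "nid" values.
        int tid = 478; // decimal thread id from top -H; 478 == 0x1de
        System.out.println("nid=0x" + Integer.toHexString(tid)); // prints nid=0x1de
    }
}
```

With the hex nid in hand, `grep nid=0x1de` against the `jstack` output pinpoints the busy thread.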

Further GC monitoring revealed continuous Full GC (FGC). Although the young and old generations used little memory, Metaspace was nearly full, with roughly 64 MiB used of a 65 MiB committed capacity. Whenever Metaspace exceeded the configured

-XX:MaxMetaspaceSize=64m

, the JVM entered frequent FGC, causing stop‑the‑world pauses and request time‑outs.

In normal operation the application needs about 76 MiB of Metaspace, which is already above the 64 MiB cap. Full GC cannot reclaim class metadata that is still in use, so every further Metaspace allocation triggers another FGC, stalling the service.
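Even without GC logs, Metaspace occupancy can be read from inside the JVM through the standard `MemoryPoolMXBean` API. A minimal sketch, assuming a HotSpot JVM (where the pool is named "Metaspace"):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceUsage {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // HotSpot exposes Metaspace as a non-heap memory pool.
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage u = pool.getUsage();
                long maxKiB = u.getMax(); // -1 when no MaxMetaspaceSize is set
                System.out.println("Metaspace used=" + u.getUsed() / 1024
                        + " KiB, max=" + (maxKiB < 0 ? "unlimited" : maxKiB / 1024 + " KiB"));
            }
        }
    }
}
```

Running this with `-XX:MaxMetaspaceSize=64m` shows the cap in the `max` field, making the used/max gap (or lack of one) visible at a glance.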

4. Root Cause

The deployment set

-XX:MaxMetaspaceSize=64m

while the application normally needs ~76 MiB. Once Metaspace crossed 64 MiB, the JVM performed continuous FGC, leading to stop‑the‑world pauses and total request time‑outs.

5. Why No Alerts Before the Third Batch?

Before the third batch, most traffic was still directed to the two machines that had not been upgraded, so the limited time‑outs on the upgraded machines did not exceed the alert threshold (more than ten errors per second).

6. Why Full Outage After the Third Batch?

After the third batch, traffic shifted to the upgraded machines, all of which were experiencing stop‑the‑world pauses, causing every request to time out and the entire service to go down.

7. Metaspace Overview

In Java 8, Metaspace replaces PermGen and is allocated from native memory; by default its size is limited only by the host’s available memory. Relevant JVM options include:

-XX:MetaspaceSize

: initial threshold for Metaspace GC (≈20 MiB by default).

-XX:MaxMetaspaceSize

: maximum Metaspace size (unlimited by default).

-XX:MinMetaspaceFreeRatio

and

-XX:MaxMetaspaceFreeRatio

: control the percentage of free space kept after a Metaspace GC, which determines how the Metaspace high‑water mark grows or shrinks.

Tags: JVM, Operations, Metaspace, GC, Incident Analysis, Elastic Cloud
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
