Why Did Our New Deployment Crash? Uncovering Metaspace‑Induced Full‑GC
The article recounts a staged rollout of the Maybach service on elastic cloud, details the timeline of successful and failing deployments, analyzes JVM metrics revealing excessive Metaspace usage that triggered continuous full garbage collections, and explains how this caused system‑wide timeouts and a half‑hour outage.
Background
The service was deployed on an elastic cloud platform, which provisions Docker containers rather than physical servers. The Maybach application was released to five machines in three batches: the first two batches each contained one machine, and the third batch contained three.
Phenomenon Description
14:34 – first batch deployment started; 14:35 – first machine deployed successfully.
14:45 – second batch deployment started; 14:46 – second machine deployed successfully.
After the first two machines were online, the service ran normally, processes were alive, logs were printed, and no alarms were triggered.
14:58 – third batch deployment started.
14:59 – first machine of the third batch deployed successfully.
15:01 – second machine of the third batch deployed successfully.
15:02 – third machine of the third batch deployed successfully.
15:02 – payment success rate alarm triggered.
15:04 – several business metrics showed degradation and latency increase.
15:06 – callers reported that all newly deployed services timed out.
At this point the newly released system was completely down. A rollback was initiated and completed at 15:31, roughly half an hour later, causing significant business loss.
[Figures: post-deployment monitoring of the first machine: RPC call statistics, outbound traffic, and request traffic]
Investigation and Analysis
Reviewing the merge request (MR) records for the release showed that the JVM parameters had been modified. The natural first step was to check the GC logs, but the Maybach service had no gc.log configured, so the team reproduced the issue in a pre-release environment.
Running top to inspect system status revealed that the service process (pid=433) was consuming a large amount of CPU. Its threads were then listed with:
<code>top -H -p 433</code>
The output showed that thread 478 was the CPU hog; jstack reports thread IDs in hexadecimal, and 478 in hex is 0x1de. Examining the process with jstack produced:
<code>"VM Thread" os_prio=0 tid=0x00007f95f825b800 nid=0x1de runnable
"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x00007f95f801c800 nid=0x1b3 runnable
"Gang worker#1 (Parallel GC Threads)" os_prio=0 tid=0x00007f95f801e000 nid=0x1b4 runnable
... (other GC threads) ...
JNI global references: 10200
</code>
The "VM Thread" executes internal VM operations such as garbage collection and safepoint pauses, and the "Gang worker" threads are the parallel GC workers. With these threads dominating the CPU, the process was clearly spending its time collecting garbage. The root cause was identified as continuous Full GC (FGC) driven by Metaspace pressure.
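Note that top -H prints thread IDs in decimal, while jstack's nid field is hexadecimal, so matching the two requires a base conversion. A trivial sketch of the mapping (the helper name is ours):

```java
public class TidToNid {
    // Convert a decimal thread ID (as shown by `top -H`) to the
    // hexadecimal nid format used in jstack output.
    static String toNid(int tid) {
        return "0x" + Integer.toHexString(tid);
    }

    public static void main(String[] args) {
        // Thread 478 from `top -H -p 433` maps to nid=0x1de in the stack dump.
        System.out.println("nid=" + toNid(478)); // prints "nid=0x1de"
    }
}
```

On the shell, the same conversion is printf '%x\n' 478.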
Further Metaspace monitoring showed a total Metaspace capacity of 65536 KB, of which 64098 KB was already used. The JVM had been started with -XX:MaxMetaspaceSize=64m, so as soon as usage crossed 64 MB a Full GC was triggered. In normal operation the service needs about 76 MB of Metaspace, so usage sat permanently above the limit; each collection freed almost nothing, producing back-to-back FGCs, stop-the-world pauses, and ultimately request timeouts.
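The same usage numbers can be read in-process through the standard java.lang.management API. A minimal sketch (the class and method names are ours; "Metaspace" is the HotSpot pool name):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceUsage {
    // Returns current Metaspace usage in KB, or -1 if the pool is not found.
    static long usedKb() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage().getUsed() / 1024;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Metaspace used: " + usedKb() + " KB");
    }
}
```

Externally, jstat -gc &lt;pid&gt; reports the same figures in its MC (capacity) and MU (used) columns.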
The failure cause is therefore clear: the deployment set -XX:MaxMetaspaceSize=64m, but the application normally needs about 76 MB of Metaspace, causing perpetual Full GC and system-wide timeouts.
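A safer configuration sizes Metaspace well above observed steady-state usage and enables GC logging, so the next incident can be diagnosed from logs alone. An illustrative Java 8 flag set; the sizes, log path, and jar name here are assumptions for the sketch, not the team's actual values:

```shell
java -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=256m \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/maybach/gc.log \
     -jar maybach.jar
```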
Why No Alarms Before the Third Batch?
Before the third batch was deployed, most traffic was still directed to the three machines that had not yet been updated, so the two updated machines handled only a small amount of traffic. The alarm threshold (more than ten errors per second) was not reached, so no alerts were generated.
Why Did All Connections Time Out After the Third Batch?
After the third batch went live, all five machines were running the new configuration. The three freshly deployed machines were stalled in FGC pauses, so traffic concentrated on the remaining two; under that sudden surge, and with the same 64 MB Metaspace cap, they timed out as well, causing a complete outage.
Additional Information About Metaspace
Metaspace, introduced in Java 8, replaces PermGen and uses native memory instead of heap memory, so its size is limited only by the host's physical memory.
-XX:MetaspaceSize: initial (and minimum) threshold for Metaspace GC, default around 20 MB.
-XX:MaxMetaspaceSize: maximum Metaspace size; unlimited by default.
-XX:MinMetaspaceFreeRatio: minimum percentage of free Metaspace after GC.
-XX:MaxMetaspaceFreeRatio: maximum percentage of free Metaspace after GC.
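The two free-ratio flags govern how the Metaspace high-water mark is resized after a collection: grow it if less than MinMetaspaceFreeRatio percent would be free, shrink it if more than MaxMetaspaceFreeRatio percent would be free. A simplified sketch of that rule (not HotSpot's exact algorithm; the defaults are 40 and 70):

```java
public class MetaspaceThreshold {
    // Simplified resize rule: after a Metaspace GC, keep free space between
    // minFreeRatio% and maxFreeRatio% of the new threshold. This mirrors the
    // intent of -XX:MinMetaspaceFreeRatio / -XX:MaxMetaspaceFreeRatio but is
    // not HotSpot's exact implementation.
    static long resize(long usedKb, int minFreeRatio, int maxFreeRatio, long currentKb) {
        long minDesired = usedKb * 100 / (100 - minFreeRatio); // grow target
        long maxDesired = usedKb * 100 / (100 - maxFreeRatio); // shrink target
        if (currentKb < minDesired) return minDesired;
        if (currentKb > maxDesired) return maxDesired;
        return currentKb;
    }

    public static void main(String[] args) {
        // 64 MB used at a 64 MB threshold with the default ratios (40/70):
        // the threshold must grow to ~107 MB to leave 40% free.
        System.out.println(resize(65536, 40, 70, 65536) + " KB"); // 109226 KB
    }
}
```

This is why a cap below steady-state usage is so damaging: the JVM wants to raise the threshold after each collection, but MaxMetaspaceSize forbids it, so every allocation of class metadata re-triggers a Full GC.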
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and will accompany you throughout your operations career as we grow together.