
Root Cause Analysis and Mitigation of JVM GC‑Induced OOM and Memory Fragmentation in a Containerized Hotel Pricing Service

This article details how long JVM garbage‑collection pauses and glibc ptmalloc memory‑fragmentation caused container OOM kills in a hotel‑pricing system, and explains the step‑by‑step diagnosis, JVM tuning, Kubernetes health‑check adjustments, and the replacement of ptmalloc with jemalloc to eliminate the issue.

Qunar Tech Salon
The author, a member of the cloud‑native SIG at Qunar, joined the hotel‑pricing center in February 2020 and is responsible for real‑time and offline pricing modules. After the company fully containerized its services in November 2021, 98% of applications had migrated to Kubernetes, but many containers were being repeatedly killed.

Two main failure categories were identified: (1) prolonged JVM garbage‑collection (GC) pauses causing Kubernetes health‑check timeouts, and (2) memory fragmentation leading to OOM kills.

Problem 1 – Long GC

Docker containers restarted up to 29 times in a single day, with container logs showing only generic "Error kill" messages. By inspecting dmesg alongside the container logs, the team discovered that a GC pause of 18 s followed by another of 7 s preceded each kill, causing the pod to fail its 10‑second health check.
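A back‑of‑the‑envelope check makes the failure mode concrete: the observed pauses (18 s followed by 7 s, from the incident logs) comfortably exceed the 10 s health‑check timeout that was in place at the time, so kubelet kills the pod before GC completes.

```shell
# Sanity-check the timing from the incident: total GC pause vs. the
# liveness-probe timeout in effect at the time (values from the article).
pause_total=$((18 + 7))   # two consecutive GC pauses, in seconds
probe_timeout=10          # health-check timeout before the fix
if [ "$pause_total" -gt "$probe_timeout" ]; then
  echo "pause ${pause_total}s exceeds probe timeout ${probe_timeout}s"
fi
```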

Resolution steps included:

Analyzing GC logs with tools such as gceasy.io and adjusting JVM parameters (for example, enlarging the young generation and resizing the heap).
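The tuning described might look like the following flag set for a CMS‑era HotSpot JVM. These values are illustrative assumptions sized under the 12 GB container limit mentioned later, not the flags the team actually shipped:

```
# Illustrative JVM options (values are assumptions, not the team's):
-Xms8g -Xmx8g                          # heap fixed below the 12 GB container limit
-Xmn3g                                 # larger young generation to cut promotions
-XX:+UseConcMarkSweepGC                # CMS collector for the old generation
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
-Xloggc:/var/log/app/gc.log            # GC log, suitable for gceasy.io analysis
```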

Extending Kubernetes health‑check timeout from 10 s/20 s to 2 min.
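In pod‑spec terms, the relaxed check might look like the probe below. Only the 10 s → 2 min change is from the article; the endpoint, port, and other field values are assumptions for illustration:

```yaml
# Illustrative liveness probe with the relaxed timing described.
livenessProbe:
  httpGet:
    path: /healthz      # hypothetical health endpoint
    port: 8080
  timeoutSeconds: 120   # was 10 s; extended to 2 min to ride out GC pauses
  periodSeconds: 30
  failureThreshold: 3
```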

Optimising GC to reduce promotion failures from the young generation into the old generation.

Problem 2 – Memory Fragmentation

OOM kill alerts appeared even though the JVM heap stayed within the container's 12 GB Docker memory limit. Further investigation revealed many 64 MB memory regions allocated by glibc's ptmalloc allocator.

Key findings from pmap analysis:

Address: start address of the mapping
Kbytes: size of the mapping (KB)
RSS: resident set size (KB)
Dirty: dirty pages, shared and private (KB)
Mode: permissions of the mapping: read, write, execute, shared, private (copy‑on‑write)
Mapping: the backing file, [anon] for allocated memory, or [stack]
Offset: offset into the file
Device: device name (major:minor)

Because ptmalloc serves allocations outside the main arena from per‑thread arenas created with mmap, each reserving a fixed 64 MB of address space on 64‑bit systems, a highly threaded process accumulates many such regions, and the resulting fragmentation inflated the container's memory footprint.
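A minimal sketch of the kind of pmap inspection described (not the team's exact commands): scan a process for mappings close to 64 MB (65536 KB), the telltale arena size of glibc ptmalloc on 64‑bit Linux. Here `$$` (the current shell) stands in for the JVM's PID.

```shell
# Inspect a process's address space with pmap; $$ stands in for the JVM PID.
pid=$$

# The largest mappings, sorted by the Kbytes column (field 2).
pmap -x "$pid" | sort -k2 -rn | head -n 5

# Count regions near 64 MB (65536 KB) - candidate ptmalloc arenas.
pmap -x "$pid" | awk '$2 >= 60000 && $2 <= 65536 {n++} END {print n+0}'
```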

The team replaced ptmalloc with jemalloc, a malloc implementation designed to minimise fragmentation and improve concurrency. After a canary (gray‑scale) rollout of jemalloc, the 64 MB mappings disappeared, memory usage stabilized, and no further OOM kills were observed over a week of monitoring.
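A common way to swap the allocator in a container image is to preload jemalloc. The fragment below is a sketch under Debian/Ubuntu assumptions (package name and library path are distro‑dependent), not the team's actual deployment:

```dockerfile
# Illustrative Dockerfile fragment: install jemalloc and preload it so it
# replaces glibc ptmalloc for every process started in the image.
RUN apt-get update && apt-get install -y libjemalloc2
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
```

Whether jemalloc is actually loaded can be confirmed by checking the running JVM's `/proc/<pid>/maps` for a `libjemalloc` entry.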

Additional optimisations included improving container dump reliability and correcting monitoring aggregation that previously double‑counted memory after restarts.

Conclusion

Identifying and addressing both GC‑induced latency and allocator‑induced fragmentation restored stability to the hotel‑pricing service, demonstrating the importance of detailed runtime diagnostics and appropriate memory‑management tooling in cloud‑native operations.

Tags: JVM, Kubernetes, GC, OOM, jemalloc, ptmalloc, memory fragmentation
Written by Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
