
Investigation and Resolution of Random Nacos Service Deregistration in a Spring Cloud Alibaba Microservice Cluster

This article details a week‑long investigation of intermittent Nacos service deregistration in a Spring Cloud Alibaba microservice environment, describing the background architecture, multiple hypothesis tests, diagnostic commands, kernel version mismatch, and the final fix by upgrading the Linux kernel.


Background

Our system runs on 11 Alibaba Cloud servers forming a Spring Cloud Alibaba microservice cluster, with 60 services registered to a single Nacos registry. Traffic flows through nginx → Spring Cloud Gateway → business services. The stack comprises Spring Boot 2.2.5.RELEASE, Spring Cloud Hoxton.SR3, Spring Cloud Alibaba 2.2.1.RELEASE, and Java 1.8.

Incident

During a holiday period, the gateway reported "service not found" errors. Inspection of the Nacos console showed that several services had disappeared from the registry. The issue recurred every few days, each time affecting a random subset of services; manual kill‑and‑restart kept them alive for only 2‑3 days.
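Besides the console, the registry can be queried directly over the Nacos v1 Open API to confirm which instances are still registered. A minimal sketch, where the host, port, and service name are placeholders for your environment:

```shell
# Placeholders: replace with your Nacos address and a service that disappeared.
NACOS="http://nacos-host:8848"
SERVICE="demo-service"
# Print the request; uncomment the curl once the placeholders are filled in.
echo "GET $NACOS/nacos/v1/ns/instance/list?serviceName=$SERVICE"
# curl -s "$NACOS/nacos/v1/ns/instance/list?serviceName=$SERVICE"
```

An empty `hosts` array in the response confirms the deregistration happened server-side rather than being a console display issue.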

Investigation

Hypothesis 1 – Memory Exhaustion: Cloud console metrics were missing, so we logged into the servers and ran free -m , which showed no abnormal memory usage.
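The memory check can be made a bit more precise than eyeballing free -m. On Linux, the "available" figure is the one that matters, because spare RAM is deliberately used for page cache. A sketch:

```shell
# Overall usage in MiB; watch the "available" column, not "free",
# since Linux fills spare RAM with page cache.
free -m
# The kernel's own estimate of memory available to new workloads:
grep -E 'MemAvailable|SwapFree' /proc/meminfo
```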

Hypothesis 2 – CPU Saturation: top displayed normal CPU load.

Hypothesis 3 – Disk Full: du -sh * indicated ample disk space.
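Note that du -sh * sums per-directory usage; free space and inodes are what actually cause write failures, so a fuller check looks like this (the /var/log path is just an example):

```shell
df -h /     # free space on the root filesystem
df -i /     # inode exhaustion also produces "No space left on device"
# Largest directories under a suspect path (example path):
du -sh /var/log/* 2>/dev/null | sort -rh | head -n 5
```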

Hypothesis 4 – Network Issues: Commands such as telnet , mtr -n , and netstat -nat | grep "TIME_WAIT" | wc -l gave only rough insights. Adjusting echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse to enable TIME_WAIT reuse did not resolve the problem.
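For reference, the TIME_WAIT count and the tcp_tw_reuse setting can be inspected as follows; ss is the iproute2 replacement for netstat, and the availability of either tool on the host is an assumption:

```shell
# Count sockets stuck in TIME_WAIT (high counts suggest connection churn):
ss -tan state time-wait | wc -l
# Equivalent with the older net-tools:
# netstat -nat | grep TIME_WAIT | wc -l
# Current setting: 1 allows outbound connections to reuse TIME_WAIT sockets.
cat /proc/sys/net/ipv4/tcp_tw_reuse
```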

Hypothesis 5 – Nacos Server Fault: Server metrics were normal, but Nacos logs showed active deregistration of services without clear cause.

Hypothesis 6 – Microservice Resource Usage: Increasing each service's memory allocation and adding stack traces did not change the behavior.

Further debugging with Arthas revealed that the Nacos heartbeat thread sometimes simply stopped. Attempts to capture stack traces with jstack and jmap also failed: the JVM appeared to have entered a pseudo-dead (frozen) state in which the process was alive but unresponsive.
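A sketch of the stack-capture attempt; the pgrep pattern is a placeholder for your service name, jstack ships with the JDK, and -F forces attachment through the OS when the JVM itself no longer responds:

```shell
# Placeholder pattern: narrow 'java' to your service's jar name in practice.
PID=$(pgrep -f java | head -n 1)
if [ -n "$PID" ]; then
  # A normal attach talks to the JVM; on a frozen process it hangs or fails,
  # so fall back to the force-mode attach.
  jstack "$PID" > /tmp/stack.txt 2>&1 || jstack -F "$PID" > /tmp/stack-forced.txt 2>&1
  echo "dump attempted for PID $PID"
else
  echo "no JVM process found"
fi
```

When even jstack -F produces nothing usable, the freeze is below the JVM, which is what pointed the investigation toward the OS.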

Searching online pointed to a known Linux kernel bug that can freeze JVM processes. Comparing kernel versions across the fleet with uname -r identified the problematic machines as those running an older kernel.
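Comparing kernels across the fleet is a one-liner per host; the hostnames in the loop below are placeholders:

```shell
# Kernel release on this host:
uname -r
# Sweep the fleet over ssh (hostnames are placeholders):
# for h in app1 app2 app3; do printf '%s: ' "$h"; ssh "$h" uname -r; done
```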

Resolution

We upgraded the Linux kernel on the affected servers and rebooted them. After two days of observation, the random Nacos service deregistrations ceased, confirming the kernel bug as the root cause.

Conclusion

The case illustrates the complexity of diagnosing intermittent microservice failures and highlights the importance of low‑level OS stability, especially when relying on heartbeat mechanisms for service discovery.

Tags: microservices, backend development, Nacos, Troubleshooting, Linux Kernel, Spring Cloud
Written by Architect's Guide

Dedicated to sharing programmer-architect skills: Java backend, system, microservice, and distributed architectures, to help you become a senior architect.