
Debugging High CPU Usage in Nacos Config Client and Understanding Raft Network Partition

The article details a sudden CPU spike in a Java backend service using Nacos, walks through diagnosing the offending thread with top and jstack, analyzes Nacos client thread creation and scheduled gray‑config tasks, explains Raft network partition handling, and presents a fix that checks config health and avoids unnecessary thread creation.

Architecture Digest

One afternoon during testing, the author noticed that CPU usage of a Java backend project had surged to about 60%, causing noticeable response delays.

Using top -Hp <pid>, the author identified the high‑CPU thread, and a thread dump taken with jstack <pid> > 1.txt showed a thread named com.alibaba.nacos.client.config.security.updater in TIMED_WAITING inside Nacos client internals.
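Matching the two tools' output takes one conversion step: top -Hp lists thread IDs in decimal, while jstack prints the native thread ID in hexadecimal as nid=0x.... A minimal helper for that conversion might look like the sketch below; the class name and the sample ID 12345 are made up for illustration.

```java
final class ThreadIdHex {
    /** Converts a decimal thread ID from `top -Hp` into the `nid=0x...` form to search for in a jstack dump. */
    static String toNid(long tid) {
        return "nid=0x" + Long.toHexString(tid);
    }

    public static void main(String[] args) {
        // 12345 is a placeholder; pass the real thread ID from top as the first argument.
        long tid = args.length > 0 ? Long.parseLong(args[0]) : 12345L;
        System.out.println(toNid(tid));
    }
}
```

The resulting string can then be searched for in the dump file (1.txt in the author's case) to find the stack of the busy thread.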

The source code shows that Nacos creates a ScheduledThreadPoolExecutor with a custom ThreadFactory that names the thread exactly as above and marks it as a daemon:

this.executorService = new ScheduledThreadPoolExecutor(1, new ThreadFactory() {
    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r);
        t.setName("com.alibaba.nacos.client.config.security.updater");
        t.setDaemon(true);
        return t;
    }
});

During project initialization this executor is created once. However, a scheduled task for gray‑config verification repeatedly calls NacosFactory.createConfigService(properties), which constructs a new Nacos config instance on every run. Each new instance spawns another updater thread, none of the old ones are shut down, and the accumulating threads eventually drive CPU usage up.

The scheduled verification code looks like:

scheduler.schedule("Periodically verify gray Nacos config", () -> loadGrayConfig(grayFileName), 1800, 1800, TimeUnit.SECONDS);

and the method it invokes:

private void loadGrayConfig(String grayFileName) {
    synchronized (this) {
        System.err.println("loadGrayConfig datetime: " + DateUtils.formatDate(new Date()));
        // Refresh cache and reload Nacos content
        grayConfigManager.loadNoCache(grayFileName);
    }
}
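The leak pattern described above can be reproduced in isolation without Nacos. The sketch below is an illustration under assumptions, not the Nacos client code: newClient() is a hypothetical stand-in for one createConfigService call, and the thread name is invented so that it cannot be confused with a real Nacos thread. It shows that each "client" that is created and never shut down leaves one more scheduler thread alive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class UpdaterLeakDemo {
    // Hypothetical thread name, chosen so the demo never collides with a real Nacos thread.
    static final String THREAD_NAME = "demo.config.security.updater";

    /** Stand-in for building one config client: each call owns a fresh single-thread scheduler. */
    static ScheduledThreadPoolExecutor newClient() {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r);
            t.setName(THREAD_NAME);
            t.setDaemon(true);
            return t;
        });
        // Scheduling a periodic task starts the worker thread, which then lives until shutdown.
        executor.scheduleWithFixedDelay(() -> { }, 0, 5, TimeUnit.MILLISECONDS);
        return executor;
    }

    /** Counts live threads that carry the demo updater name. */
    static long liveUpdaterThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> THREAD_NAME.equals(t.getName()))
                .count();
    }

    /** Creates the given number of clients without closing any, then reports the live updater threads. */
    static long leak(int clients) {
        List<ScheduledThreadPoolExecutor> created = new ArrayList<>();
        for (int i = 0; i < clients; i++) {
            created.add(newClient());
        }
        long live = liveUpdaterThreads();
        created.forEach(ScheduledThreadPoolExecutor::shutdownNow); // clean up the demo itself
        return live;
    }

    public static void main(String[] args) {
        System.out.println("updater threads after 5 client creations: " + leak(5));
    }
}
```

With an 1800-second schedule the growth is slow, which would be consistent with the problem only becoming visible after the service had been running for a while.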

Beyond the immediate Nacos issue, the author discusses how Raft behaves under network isolation: a partitioned node repeatedly hits its election timeout and increments its term each time; once it reconnects, its higher term forces the current leader to step down and adopt that term, which triggers a new election. According to the author, Tendermint avoids this with a pre‑vote lock and (term, node‑id) tuples that prevent conflicting commits.
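The term inflation described here can be made concrete with a toy model. The sketch below is not Tendermint's mechanism and not a Raft implementation; it only models the term counter of an isolated follower, with and without the Pre‑Vote extension described for Raft, in which a node bumps its term only after a majority has indicated it would grant a vote. All names are hypothetical.

```java
final class RaftTermSketch {
    /** Term of a fully isolated follower after the given number of election timeouts. */
    static long termAfterIsolation(long startTerm, int timeouts, boolean preVote) {
        long term = startTerm;
        for (int i = 0; i < timeouts; i++) {
            // Fully isolated: no peer is reachable, so a pre-vote round can never reach a majority.
            boolean preVoteGranted = false;
            if (!preVote || preVoteGranted) {
                term++; // the node becomes a candidate and increments its term
            }
        }
        return term;
    }

    /** A leader steps down when it receives a message carrying a term higher than its own. */
    static boolean leaderStepsDown(long leaderTerm, long incomingTerm) {
        return incomingTerm > leaderTerm;
    }

    public static void main(String[] args) {
        long leaderTerm = 3;
        long plain = termAfterIsolation(leaderTerm, 10, false);
        long withPreVote = termAfterIsolation(leaderTerm, 10, true);
        System.out.println("without pre-vote: term=" + plain
                + ", leader steps down=" + leaderStepsDown(leaderTerm, plain));
        System.out.println("with pre-vote:    term=" + withPreVote
                + ", leader steps down=" + leaderStepsDown(leaderTerm, withPreVote));
    }
}
```

In this model the pre‑vote variant keeps the isolated node's term unchanged, so its return does not disturb a healthy leader.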

To fix the CPU problem, the author changed the approach: instead of creating a new Nacos config instance on every scheduled run, the code now checks the health of the existing connection, shuts it down if it is dead, and recreates it only when necessary, which prevents the extra updater threads.
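The source does not show the fixed code, so the following is only a sketch of the check-then-reuse idea. ConfigClient, isHealthy(), and shutdown() are hypothetical stand-ins rather than the Nacos API; in a real client the likely counterparts are ConfigService.getServerStatus() and ConfigService.shutDown(), depending on the client version, but which calls the author actually used is an assumption.

```java
import java.util.function.Supplier;

final class ReusableConfigHolder {
    /** Hypothetical stand-in for a Nacos config service; not the real API. */
    interface ConfigClient {
        boolean isHealthy();
        void shutdown();
    }

    private final Supplier<ConfigClient> factory;
    private ConfigClient current;
    private int created;

    ReusableConfigHolder(Supplier<ConfigClient> factory) {
        this.factory = factory;
    }

    /** Returns the cached client while it is healthy; otherwise closes it and builds a replacement. */
    synchronized ConfigClient get() {
        if (current != null && current.isHealthy()) {
            return current; // reuse: no new executor, no new updater thread
        }
        if (current != null) {
            current.shutdown(); // release the dead client's threads before replacing it
        }
        current = factory.get();
        created++;
        return current;
    }

    /** Number of clients built so far, useful for confirming that reuse is happening. */
    synchronized int createdCount() {
        return created;
    }
}
```

The scheduled gray‑config task would then call get() on a single shared holder instead of building a new config instance on each run, so the number of clients (and updater threads) stays at one while the connection is healthy.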

After applying the fix the CPU usage returned to normal and the test environment stabilized.
