
Debugging Service Discovery Failures in a Go RPCX Microservice Using Zookeeper

This article analyzes a microservice client’s connection‑timeout errors in a Kubernetes cluster, investigates Zookeeper‑based service discovery, uncovers one‑time watch semantics and a concurrency bug in rpcx’s channel handling, and presents a fix to ensure IP list updates.

Xueersi Online School Tech Team

On 2020‑12‑25 a microservice client in a Kubernetes cluster reported connection timeouts when dialing a service, indicating a service discovery update failure.

The environment uses rpcx with Zookeeper as the registry. Initial investigation checked the IP addresses of the failing pods, found they were absent from Zookeeper, and pointed to a stale IP list on the client side.

Network traces with netstat -anp | grep 2181 and tcpdump showed two established TCP connections to Zookeeper, one for registration and one for discovery, both exchanging regular heartbeat packets, so the connections themselves were healthy.

Further analysis reveals that Zookeeper watches are one-time: once a watch fires for a change, it is removed, so subsequent changes go unnotified unless the client re-registers the watch.

Review of the rpcx discovery code shows the use of ChildrenW (watching) and Children (no watch). The watch‑once behavior can cause missed updates during rapid rolling upgrades.

A concurrency bug is identified: the discovery component creates a channel for each watcher and appends it to a slice without synchronization, leading to lost channels when multiple goroutines register concurrently.

// Simplified from rpcx's discovery code: the append to d.chans is not
// synchronized, so concurrent callers can lose each other's channels.
func (d *ZookeeperDiscovery) WatchService() chan []*KVPair {
    ch := make(chan []*KVPair, 10)
    d.chans = append(d.chans, ch) // data race: no lock protects d.chans
    return ch
}

An example Go program reproduces the race condition, showing that concurrent appends to a slice can result in missing elements.

package main

import (
    "fmt"
    "sync"
)

// Two goroutines append to the same slice without synchronization.
// Both may read the same slice header, so one append can overwrite
// the other's and the slice ends up with fewer than two elements.
func main() {
    ok := true
    for i := 0; i < 1000; i++ {
        var arr []int
        var wg sync.WaitGroup
        for j := 0; j < 2; j++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                arr = append(arr, i) // unsynchronized write to arr
            }()
        }
        wg.Wait()
        if len(arr) < 2 { // an element was lost
            fmt.Printf("error:%d\n", i)
            ok = false
            break
        }
    }
    if ok {
        fmt.Println("ok")
    }
}
// sample output from one run (nondeterministic): error:261

Removing the disaster-recovery logic that relied on the unreliable one-time watch resolves the issue; beyond that, any shared slice modified from multiple goroutines must be guarded by proper synchronization such as a mutex.

The article concludes with recommendations to handle Zookeeper watch semantics correctly and to guard shared slice modifications with proper synchronization to avoid similar concurrency bugs.

Tags: debugging, microservices, concurrency, service discovery, Go, Zookeeper, rpcx
Written by

Xueersi Online School Tech Team

The Xueersi Online School Tech Team, dedicated to innovating and promoting internet education technology.
