Debugging Service Discovery Failures in a Go RPCX Microservice Using Zookeeper
This article analyzes a microservice client’s connection‑timeout errors in a Kubernetes cluster, investigates Zookeeper‑based service discovery, uncovers one‑time watch semantics and a concurrency bug in rpcx’s channel handling, and presents a fix that keeps the IP list up to date.
On 2020‑12‑25 a microservice client in a Kubernetes cluster reported connection timeouts when dialing a service, indicating a service discovery update failure.
The environment uses rpcx with Zookeeper as the registry. Initial investigation checks the IP address of failing pods, discovers the IP is not present in Zookeeper, and suspects stale IP lists.
Network traces with netstat -anp | grep 2181 and tcpdump show two established TCP connections to Zookeeper, one for registration and one for discovery, both exchanging regular heartbeat packets.
Further analysis reveals that Zookeeper watches are one‑time; after a change the watch is removed, so subsequent changes are not notified.
Review of the rpcx discovery code shows the use of ChildrenW (watching) and Children (no watch). The watch‑once behavior can cause missed updates during rapid rolling upgrades.
A concurrency bug is identified: the discovery component creates a channel for each watcher and appends it to a slice without synchronization, leading to lost channels when multiple goroutines register concurrently.
func (d *ZookeeperDiscovery) WatchService() chan []*KVPair {
	ch := make(chan []*KVPair, 10)
	d.chans = append(d.chans, ch)
	return ch
}

An example Go program reproduces the race condition, showing that concurrent appends to a slice can lose elements.
package main

import (
	"fmt"
	"sync"
)

func main() {
	ok := true
	for i := 0; i < 1000; i++ {
		var arr []int
		wg := sync.WaitGroup{}
		for j := 0; j < 2; j++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				arr = append(arr, i)
			}()
		}
		wg.Wait()
		if len(arr) < 2 {
			fmt.Printf("error:%d \n", i)
			ok = false
			break
		}
	}
	if ok {
		fmt.Println("ok")
	}
}
//error:261

Removing the disaster‑recovery logic that relied on the unreliable watch resolves the issue, and proper synchronization (e.g., a mutex) should be used when modifying shared slices.
The article concludes with recommendations to handle Zookeeper watch semantics correctly and to guard shared slice modifications with proper synchronization to avoid similar concurrency bugs.
Xueersi Online School Tech Team
The Xueersi Online School Tech Team, dedicated to innovating and promoting internet education technology.