
Root Cause Analysis of Dubbo Connect Timeout in High‑Concurrency Scenarios and Backlog Tuning

This article presents a detailed case study of intermittent Dubbo connect‑timeout errors in a high‑concurrency deployment, describing step‑by‑step diagnostics—from port status checks and registry verification to TCP dump analysis—and explains how adjusting the server’s backlog and accept queue resolved the SYN‑drop issue.

Ctrip Technology

Problem Background

A core service in Ctrip’s vacation division runs on 80 Docker containers (4C8G each) across two data centers, serving over 1,300 client machines. After switching from HTTP to TCP via CDubbo, occasional client connect‑timeout errors appeared during deployments.

Investigation Steps

1. Port Opening Check: Verified that CDubbo opens ports synchronously before registration, ruling out asynchronous port opening.

2. Registry Push Verification: Examined Dubbo logs; port‑opening timestamps (e.g., 16:57:19) preceded the failed connections (16:57:51), indicating the registry was not at fault.

3. Port Closure Hypothesis: Added a shell script to poll the port status every second. The port remained in the LISTEN state, so it was not being closed unexpectedly.
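A check like step 3’s can be sketched in Python (the article used a shell loop; the function name and the 0.1 s timeout here are assumptions). Note that it probes connectability, which is stricter than the LISTEN‑state poll: a port can sit in LISTEN yet still drop SYNs once its accept queue fills, which this check would catch.

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 0.1) -> bool:
    """Return True if a TCP connect to host:port succeeds within `timeout`.

    Stricter than grepping for LISTEN state: a socket can stay in LISTEN
    while the kernel silently drops SYNs because its accept queue is full.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success, an errno otherwise (no exception)
        return s.connect_ex((host, port)) == 0
```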

4. Accept‑Log Enhancement: Inserted logging at Netty’s channelConnected to capture when connections were accepted. Logs showed some connections were accepted while others never arrived at the application at all.
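A rough Python analogue of the accept‑side logging added in step 4 (names and structure here are illustrative, not from the article; the backlog of 50 mirrors Netty 3’s default mentioned later):

```python
import socket
import threading
import time

def run_logging_acceptor(n_accepts: int = 1, backlog: int = 50):
    """Accept `n_accepts` connections, recording a timestamp per accept.

    accept() only fires for connections whose SYN survived the kernel
    queues -- dropped SYNs never show up here, matching the symptom seen
    at channelConnected.
    """
    log = []
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(backlog)
    port = srv.getsockname()[1]

    def acceptor():
        for _ in range(n_accepts):
            conn, peer = srv.accept()
            log.append((time.time(), peer))  # analogue of the added log line
            conn.close()
        srv.close()

    thread = threading.Thread(target=acceptor, daemon=True)
    thread.start()
    return port, log, thread
```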

5. Server‑Side TCP Dump: Captured packets on the server; SYN packets arrived but no SYN‑ACK was returned, suggesting the SYNs were being dropped in the kernel before ever reaching the application’s accept handling.

6. Container‑Side Connectivity Test: Ran a Bash script inside the container to repeatedly telnet to the service port. Intermittent failures (output “0”) confirmed that SYN packets were being dropped inside the container as well.

#!/bin/bash
# Probe the service port every ~0.1 s; grep -c prints 1 when the telnet
# banner appeared (connect succeeded) and 0 when it did not.
for i in $(seq 1 3600)
do
  t=$(timeout 0.1 telnet localhost 20xxx 2>&1 | grep -c 'Escape character is')
  echo "$(date) 20xxx check result: $t"
  sleep 0.005
done

7. SYN Queue Overflow Analysis: netstat -s showed 3220 listen‑queue overflows and 3220 SYN drops, indicating the accept queue was saturated while the SYN queue remained empty.
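The two counters in step 7 come from the kernel’s TcpExt statistics. A Linux‑only sketch of reading them straight from /proc rather than parsing netstat -s output (the function name is mine):

```python
def listen_overflow_counters(path: str = "/proc/net/netstat") -> dict:
    """Return the TcpExt counters behind netstat -s's
    "times the listen queue of a socket overflowed" (ListenOverflows) and
    "SYNs to LISTEN sockets dropped" (ListenDrops) lines.

    /proc/net/netstat holds one header line of names and one line of
    values per prefix ("TcpExt:", "IpExt:"); Linux-specific.
    """
    with open(path) as f:
        tcpext = [line.split()[1:] for line in f
                  if line.startswith("TcpExt:")]
    names, values = tcpext[0], tcpext[1]
    counters = dict(zip(names, map(int, values)))
    return {key: counters[key] for key in ("ListenOverflows", "ListenDrops")}
```

When the two counters rise in lockstep, as the 3220/3220 pair did here, the drops are attributable to accept‑queue overflow rather than SYN‑queue pressure.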

8. Backlog Configuration: Examined ss -lnt output; the accept‑queue size was 50, far below the kernel’s somaxconn (128). Netty 3 defaults to a backlog of 50, whereas Netty 4 uses 1024.
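The interplay in step 8 is that the kernel silently caps whatever backlog the application passes to listen() at net.core.somaxconn, so the effective accept‑queue depth is the minimum of the two. A small sketch (the path is Linux‑specific; the function name is mine):

```python
def effective_backlog(requested: int,
                      somaxconn_path: str = "/proc/sys/net/core/somaxconn") -> int:
    """The kernel clamps a listen() backlog at net.core.somaxconn, so the
    accept queue holds at most min(requested, somaxconn) pending
    connections."""
    with open(somaxconn_path) as f:
        somaxconn = int(f.read().strip())
    return min(requested, somaxconn)
```

This is why raising the application‑side backlog alone is not always enough: on the article’s kernel (somaxconn = 128), a Netty 4 style backlog of 1024 would still yield a queue of only 128 unless somaxconn were raised as well.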

9. Backlog Adjustment Experiment: Tested various backlog values on an 8‑core server with 10 client containers. Results:

Backlog    Connections/s    SYN Drop?
128        3000             No
128        5000             Few
1024       5000             No
1024       10000            No

Increasing the backlog to 1024 eliminated SYN drops even at 10,000 connections per second.
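The qualitative effect of the experiment can be reproduced on a single host with plain sockets: listen with a given backlog, never call accept(), and count how many loopback connects complete before the accept queue fills and further SYNs are silently dropped. This is a sketch under the assumption of Linux loopback behavior (the queue holds roughly backlog + 1 connections); the function name and parameters are mine:

```python
import socket

def count_pending_connects(backlog: int, attempts: int = 30,
                           timeout: float = 0.3) -> int:
    """Count connects that complete against a listener that never accepts.

    Once the accept queue is full, the kernel drops the SYN silently,
    so the client's connect() simply times out -- the same symptom the
    article's clients saw.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(backlog)
    port = srv.getsockname()[1]

    clients, completed = [], 0
    for _ in range(attempts):
        c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        c.settimeout(timeout)
        try:
            c.connect(("127.0.0.1", port))
            completed += 1
        except OSError:  # timed out: the SYN was dropped
            pass
        clients.append(c)

    for c in clients:
        c.close()
    srv.close()
    return completed
```

With a backlog of 2 only about three connects complete before drops begin; with a backlog comfortably above the attempt count, all of them do — the same pattern as the table above, scaled down to one machine.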

Conclusion

The connect‑timeout issue was caused by the server’s accept queue being full, leading the kernel to drop incoming SYN packets. Adjusting the Netty backlog (and consequently the accept queue) to a higher value resolved the problem, demonstrating the importance of proper socket backlog tuning in high‑concurrency backend services.

Dubbo · TCP · High Concurrency · Backlog · Connect Timeout · Network Tuning