Analysis of TCP Connection Failures Caused by ARP Queue Length (unres_qlen) in Linux Kernels
The article investigates intermittent TCP connection failures during application server startup caused by the Linux kernel ARP queue length parameter unres_qlen, reproduces the issue with a concurrent connection test, analyzes kernel internals, and recommends increasing unres_qlen for kernels prior to 3.3.
Background: In a production environment we observed that when an application server starts and creates a connection pool to a backend database, some connections occasionally fail to establish. Investigation revealed that the issue is related to the kernel ARP parameter unres_qlen.
Reproduction environment: OS RHEL 6.6, kernel 2.6.32-504.el6.x86_64. A test program runs on a client machine (10.0.0.102) that concurrently initiates 16 TCP connections with a 500 ms timeout to a server (10.0.0.101).
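The article does not include the test program itself, so the following is only a minimal sketch of what such a test might look like. It assumes one thread per connection and uses the 16 connections and 500 ms timeout described above; the function names and the port passed by the caller are hypothetical, since the article does not state which port was used.

```python
import socket
import threading


def try_connect(host, port, timeout, results, index):
    """Attempt one TCP connection; record True on success, False on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            results[index] = True
    except OSError:
        # socket.timeout is a subclass of OSError, so this covers both
        # connect timeouts and refused/unreachable errors.
        results[index] = False


def run_concurrent_connects(host, port, count=16, timeout=0.5):
    """Start `count` connection attempts at once and return their outcomes."""
    results = [False] * count
    threads = [
        threading.Thread(target=try_connect,
                         args=(host, port, timeout, results, i))
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

On the client one might call run_concurrent_connects("10.0.0.101", 3306), where 3306 is only a placeholder port, immediately after removing the server's ARP entry (for example with arp -d 10.0.0.101), and then count the True values in the result.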
Phenomenon: After clearing the ARP cache on the client, only three of the sixteen connections succeed; the remaining thirteen time out. Running the test again without clearing the cache succeeds, because the ARP entry is now present. Clearing the ARP cache and running the test once more reproduces the timeouts.
Problem analysis: Packet capture on the server shows that only three SYN packets are received; the other thirteen never arrive. Dropwatch logs on the client report 13 packet drops in the kernel function __neigh_set_probe_once, matching the thirteen failed connections. Packets are discarded there when the ARP queue for the unresolved neighbor is already at its limit, neigh->parms->queue_len, which is set through the sysctl net.ipv4.neigh.*.unres_qlen.
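Kernel parameter details: neigh/default/unres_qlen defines the maximum number of packets queued for each unresolved address. The default was 3 before Linux 3.3; from 3.3 onward the parameter is deprecated in favor of unres_qlen_bytes and a larger default (31) applies. When the queue is full, additional SYN packets are dropped. A dropped SYN is only resent after TCP's initial retransmission timeout, which on Linux is at least one second, so a connection with a 500 ms timeout fails before the retry is sent.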
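To see which limit a given host is actually using, the values can be read from procfs. The sketch below assumes the standard Linux layout /proc/sys/net/ipv4/neigh/<interface>/unres_qlen; the helper name read_unres_qlen is made up for illustration.

```python
from pathlib import Path


def read_unres_qlen(neigh_root="/proc/sys/net/ipv4/neigh"):
    """Return {interface_name: unres_qlen} for every entry under neigh_root.

    The "default" entry is included alongside the per-interface entries.
    """
    values = {}
    for entry in sorted(Path(neigh_root).glob("*/unres_qlen")):
        values[entry.parent.name] = int(entry.read_text().strip())
    return values
```

Calling read_unres_qlen() on the client and seeing 3 for the interface that faces the server would be consistent with the behavior described in this article.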
TCP connection establishment process:
1) Application sends SYN.
2) IP layer performs routing.
3) ARP layer queries the next‑hop MAC address; if no ARP entry exists, the SYN is placed in the ARP queue (limited by unres_qlen) and an ARP request is sent.
4) Upon ARP reply, the queued SYN is transmitted.
With unres_qlen set to 3, concurrent connections exceeding this limit lose their SYN packets, leading to timeout failures.
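The arithmetic behind the observed 3 successes and 13 failures can be illustrated with a toy model of the steps above. This is not kernel code: the class and function names are invented, and the bounded deque is just a convenient stand-in for the per-neighbor queue. Which of the queued packets the real kernel evicts is not stated in the article; only the counts matter here.

```python
from collections import deque


class UnresolvedNeighbor:
    """Toy model of a neighbor entry whose MAC address is not yet known."""

    def __init__(self, queue_len):
        # deque(maxlen=n) silently evicts its oldest item when full.
        self.pending = deque(maxlen=queue_len)
        self.dropped = 0

    def send(self, packet):
        """Queue a packet while resolution is pending, counting evictions."""
        if len(self.pending) == self.pending.maxlen:
            self.dropped += 1
        self.pending.append(packet)

    def resolve(self):
        """ARP reply arrived: hand over whatever is still queued."""
        delivered = list(self.pending)
        self.pending.clear()
        return delivered


def simulate(syn_count, queue_len):
    """Return (delivered, dropped) for SYNs sent before ARP resolves."""
    neigh = UnresolvedNeighbor(queue_len)
    for i in range(syn_count):
        neigh.send(f"SYN-{i}")
    delivered = neigh.resolve()
    return len(delivered), neigh.dropped


print(simulate(16, 3))   # (3, 13)
print(simulate(16, 64))  # (16, 0)
```

With a limit of 3 the model delivers three SYNs and loses thirteen, matching the packet capture; with a limit of 64 all sixteen survive.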
Conclusion: In scenarios where applications open many simultaneous TCP connections (e.g., database connection pools) and use short connection‑timeout settings, the default unres_qlen value can cause sporadic connection failures. For kernels earlier than 3.3, increasing unres_qlen (e.g., to 64) resolves the issue.
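One possible way to apply the recommended value is to write it to the procfs entries directly, which requires root. The sketch below is an assumption about how this could be scripted, not the article's procedure; set_unres_qlen is a hypothetical helper. It updates every entry it finds rather than only "default", as a cautious choice in case existing interfaces on an older kernel do not pick up a changed default.

```python
from pathlib import Path


def set_unres_qlen(value, neigh_root="/proc/sys/net/ipv4/neigh"):
    """Write `value` to every unres_qlen entry under neigh_root.

    Returns the names of the entries that were updated. Writing to the
    real procfs path requires root privileges.
    """
    updated = []
    for entry in sorted(Path(neigh_root).glob("*/unres_qlen")):
        entry.write_text(f"{value}\n")
        updated.append(entry.parent.name)
    return updated
```

A change made this way does not survive a reboot. To make it persistent, a line such as net.ipv4.neigh.default.unres_qlen = 64 can be added to /etc/sysctl.conf.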
References:
Understanding RTT impact on TCP retransmissions
Linux kernel IP sysctl documentation
Additional note: The problem can also be reproduced by sending a single large ping packet. Once the packet is split into more IP fragments than unres_qlen allows, the same ARP queue overflow occurs.
Aikesheng Open Source Community