Root Cause Analysis and Resolution of Intermittent Redis Connection Failures
This article presents a detailed investigation of occasional Redis connection errors in a large-scale production environment, covering network packet analysis, TCP backlog behavior, Redis's internal client-cron logic, and jemalloc memory reclamation. The issue was ultimately resolved by adjusting query-buffer handling and upgrading Redis to a newer version.
About the authors: Zhang Yanjun and Shou Xiangchen are senior DBAs at the Ctrip Technical Support Center; both specialize in MySQL and Redis operations, automation, and troubleshooting.
Redis is a widely used open‑source cache database at Ctrip. This case study records an intermittent Redis connection error, exploring its root cause from network and kernel perspectives.
1. Problem Description
In production, a Redis instance occasionally reports connection failures. The error message reads "Unable to Connect redis server", the affected clients show no clear IP pattern, and the issue resolves itself after a short period.
Redis version: 2.8.19 (stable but old). The affected cluster has hundreds of client servers, with connection counts jumping from ~3500 to ~4100.
2. Problem Analysis
Initial checks ruled out port exhaustion and slow queries. Network packet capture revealed a sudden surge in TCP connections and packet loss (e.g., 1699 outgoing packets dropped).
Analysis showed that the server’s accept queue overflowed, causing the server to reset connections.
3. Backlog Overflow
In high‑concurrency short‑connection services, TCP uses two queues: the SYN (half‑open) queue and the accept queue. When the accept queue is full, the server either drops client packets or sends a reset, depending on the tcp_abort_on_overflow setting.
Wireshark captured over 2000 connection attempts per second. Increasing the backlog size from 511 to 2048 did not solve the issue, indicating that the problem lies elsewhere.
4. Network Packet Analysis
Using Wireshark and editcap to segment packets into 30‑second intervals, the team identified a 1.43 s period where the Redis server was blocked, the client connection pool filled, and the accept queue overflowed.
5. Further Analysis – Stack Traces
Repeated pstack captures highlighted the clientsCronResizeQueryBuffer function within serverCron() as a suspect. The function runs periodically to reclaim space in each client’s query buffer.
    #define CLIENTS_CRON_MIN_ITERATIONS 5

    void clientsCron(void) {
        int numclients = listLength(server.clients);
        int iterations = numclients / server.hz;
        mstime_t now = mstime();

        if (iterations < CLIENTS_CRON_MIN_ITERATIONS)
            iterations = (numclients < CLIENTS_CRON_MIN_ITERATIONS) ?
                         numclients : CLIENTS_CRON_MIN_ITERATIONS;

        while (listLength(server.clients) && iterations--) {
            /* Rotate the list so successive calls scan different clients. */
            listRotate(server.clients);
            listNode *head = listFirst(server.clients);
            client *c = listNodeValue(head);
            if (clientsCronHandleTimeout(c, now)) continue;
            if (clientsCronResizeQueryBuffer(c)) continue;
        }
    }

clientsCronResizeQueryBuffer shrinks a client's query buffer when it exceeds PROTO_MBULK_BIG_ARG (32 KB, named REDIS_MBULK_BIG_ARG in the 2.8 series) and is more than twice its recent peak, or when the client has been idle for more than 2 seconds with a buffer larger than 1 KB.
    int clientsCronResizeQueryBuffer(client *c) {
        size_t querybuf_size = sdsAllocSize(c->querybuf);
        time_t idletime = server.unixtime - c->lastinteraction;

        /* Shrink when the buffer is big and far above its recent peak,
         * or when it exceeds 1 KB on a client idle for more than 2 s. */
        if (((querybuf_size > PROTO_MBULK_BIG_ARG) &&
             (querybuf_size / (c->querybuf_peak + 1)) > 2) ||
            (querybuf_size > 1024 && idletime > 2)) {
            if (sdsavail(c->querybuf) > 1024) {
                c->querybuf = sdsRemoveFreeSpace(c->querybuf);
            }
        }
        /* Reset the peak for the next cycle. */
        c->querybuf_peak = 0;
        return 0;
    }

Because each client's query buffer is at least 32 KB when first allocated, many idle connections satisfy the resize condition at the same time, and the resulting burst of reclaim work blocks the single-threaded Redis server.
6. Network Packet Block Analysis
Further packet inspection confirmed that during the 1.43 s block the server sent only a single data (PSH) packet; all other traffic was connection handling (SYN, ACK, RST).
7. Root Cause Identification
The backlog overflow stemmed from the clientsCronResizeQueryBuffer logic: ~50 connections simultaneously met the resize criteria, causing the server to spend time reclaiming buffer space and blocking other clients.
8. Code Modification
The team refined the resize condition to reduce unnecessary operations (in the 2.8 source the 32 KB constant is named REDIS_MBULK_BIG_ARG):

    if (((querybuf_size > REDIS_MBULK_BIG_ARG) &&
         (querybuf_size / (c->querybuf_peak + 1)) > 2) ||
        (querybuf_size > 1024*32 && idletime > 2)) {
        if (sdsavail(c->querybuf) > 1024*32) {
            c->querybuf = sdsRemoveFreeSpace(c->querybuf);
        }
    }

This change raises the idle-shrink threshold from 1 KB to 32 KB, lowering the frequency of costly resizes at the cost of additional memory (roughly 160 MB for 5,000 connections).
9. Jemalloc Purge Investigation
Even after the code tweak, occasional errors persisted. Stack traces showed heavy activity in je_pages_purge(), indicating that jemalloc's memory reclamation was a bottleneck.
Redis 2.8 uses jemalloc 3.6, whose arena_purge performs extensive counting and traversal. Upgrading to Redis 4.0 (jemalloc 4.1) replaces this with a more efficient implementation that avoids costly counters.
    static void arena_purge(arena_t *arena, bool all) {
        chunk_hooks_t chunk_hooks = chunk_hooks_get(arena);
        size_t npurge, npurgeable, npurged;
        arena_runs_dirty_link_t purge_runs_sentinel;
        extent_node_t purge_chunks_sentinel;

        arena->purging = true;
        /* ... simplified logic without the heavy counting of jemalloc 3.6 ... */
        npurge = arena_compute_npurge(arena, all);
        npurgeable = arena_stash_dirty(arena, &chunk_hooks, all, npurge,
                                       &purge_runs_sentinel,
                                       &purge_chunks_sentinel);
        npurged = arena_purge_stashed(arena, &chunk_hooks,
                                      &purge_runs_sentinel,
                                      &purge_chunks_sentinel);
        arena->purging = false;
    }

10. Final Solution
By adjusting the query‑buffer resize logic and upgrading Redis to version 4.0.9 (which includes the newer jemalloc), the intermittent connection timeout issue was eliminated.
11. Summary
Redis’s high‑concurrency usage in production can expose subtle performance problems such as excessive query‑buffer resizing and inefficient memory reclamation. Careful analysis of network traces, server‑side stacks, and source code, combined with targeted code changes and version upgrades, resolves these issues.
We hope this case study helps engineers facing similar Redis reliability challenges.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.