Diagnosing Periodic Redis Latency Caused by BGSAVE and Fork Overhead

The article analyzes a recurring one‑second pause in a Redis 4.0 master‑slave setup, identifies the periodic BGSAVE fork operation as the root cause through metrics like latest_fork_usec, and presents mitigation strategies such as memory limits, activedefrag, and migration to a lower‑RSS instance.

Aikesheng Open Source Community

Apr 25, 2022

In a production Redis 4.0 master‑slave deployment, developers reported a pause of about one second that recurred roughly every 15 minutes and affected both GET and SET commands.

Initial monitoring of QPS and CPU showed no anomalies, and the Redis slowlog contained no matching entries. The evicted_keys metric stayed at zero and expired_keys was high but stable, which ruled out key eviction and expiry spikes as the cause.

The latest_fork_usec metric stood out: it recorded a fork lasting nearly one second approximately every 15 minutes, coinciding with the observed latency. This pointed to the periodic BGSAVE operation, which forks a child process to write the RDB snapshot.
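As a minimal sketch of how this metric can be read, the snippet below parses the text returned by the Redis INFO command and converts latest_fork_usec to seconds. The helper names parse_info and latest_fork_seconds are hypothetical, and the sample text is illustrative rather than taken from the incident.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse the text returned by the Redis INFO command into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def latest_fork_seconds(info_text: str) -> float:
    """Return the duration of the most recent fork, in seconds."""
    return int(parse_info(info_text)["latest_fork_usec"]) / 1_000_000


# Illustrative INFO excerpt (assumed values, not measured ones).
SAMPLE = "# Stats\r\nexpired_keys:120\r\nevicted_keys:0\r\nlatest_fork_usec:1000000\r\n"
print(latest_fork_seconds(SAMPLE))  # 1.0
```

In practice the INFO text would come from redis-cli or a client library; the parsing is kept separate here so it can be checked without a running server.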

Although BGSAVE uses copy‑on‑write and is generally considered low‑impact, the fork operation still duplicates the parent’s page tables. On Linux, the time this takes grows with the size of the page tables, as the fork(2) man page notes:

Under Linux, fork() is implemented using copy‑on‑write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child.

With the Redis instance’s RSS reaching 16 GB and a page‑table size of about 33 MB, the fork took over a second. Redis executes commands on a single main thread, and that same thread calls fork(), so every client request was blocked until the call returned.
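The 33 MB figure is consistent with a rough back‑of‑the‑envelope estimate. The sketch below assumes 4 KiB pages, 8‑byte page‑table entries (as on x86‑64), no huge pages, and reads "16 GB" as 16 GiB; it counts only the lowest‑level entries, so it is an approximation rather than an exact accounting.

```python
PAGE_SIZE = 4096  # bytes per page (assumption: 4 KiB pages, no huge pages)
PTE_SIZE = 8      # bytes per page-table entry (assumption: x86-64)


def leaf_page_table_bytes(rss_bytes: int) -> int:
    """Estimate the size of the lowest-level page tables mapping rss_bytes."""
    pages = -(-rss_bytes // PAGE_SIZE)  # ceiling division
    return pages * PTE_SIZE


rss = 16 * 1024**3                   # 16 GiB resident set
size = leaf_page_table_bytes(rss)
print(size)                          # 33554432 bytes
print(round(size / 1e6, 1))          # 33.6 (decimal MB)
```

Under these assumptions, each 1 GiB of resident memory adds about 2 MiB of page tables that fork must copy.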

Tracing the clone system call with strace -p 20324 -e trace=clone -T confirmed the fork duration: the call took about 1.01 s, matching both latest_fork_usec and the observed latency (≈1.0 s).

# strace -p 20324 -e trace=clone -T
... clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f409d771a10) = 30793 <1.013945> ...
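With the -T flag, strace appends the time spent in each system call in angle brackets at the end of the line. A minimal sketch for extracting the child PID and the duration from such a line follows; the function name parse_clone is hypothetical, and the regular expression only covers successful clone calls of the form shown above.

```python
import re

# Matches "clone(...) = <pid> <seconds>" as printed by strace -T.
CLONE_RE = re.compile(r"clone\(.*\)\s*=\s*(\d+)\s*<(\d+\.\d+)>")


def parse_clone(line: str):
    """Return (child_pid, seconds) for a successful clone line, else None."""
    match = CLONE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


line = (
    "clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, "
    "child_tidptr=0x7f409d771a10) = 30793 <1.013945>"
)
print(parse_clone(line))  # (30793, 1.013945)
```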

Disabling BGSAVE would eliminate the pause but risks data loss if the master fails before a replica takeover. Instead, the team migrated the Redis instance to a new server where RSS was only 8.8 GB and latest_fork_usec dropped to ~0.25 s, substantially reducing the latency.

Redis 4.0 also introduced automatic memory fragmentation reclamation via the activedefrag parameter (disabled by default). After migration, enabling activedefrag kept used_memory_rss_human around 11 GB and latest_fork_usec near 0.76 s.

For future incidents, the recommended approach is to monitor latest_fork_usec, cap the memory of each Redis instance (for example with maxmemory, or by spreading data across a Redis Cluster), and, if necessary, restart the instance during a maintenance window.
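These recommendations can be turned into a simple periodic check. The sketch below is one possible shape, not the team's actual tooling: check_instance and both default limits (0.5 s and 10 GiB) are hypothetical and should be tuned per deployment. The inputs correspond to the latest_fork_usec and used_memory_rss fields of INFO.

```python
def check_instance(latest_fork_usec: int, used_memory_rss: int,
                   max_fork_seconds: float = 0.5,
                   max_rss_gib: float = 10.0) -> list[str]:
    """Return warnings for an instance whose fork time or RSS exceeds a limit."""
    warnings = []
    fork_seconds = latest_fork_usec / 1_000_000
    rss_gib = used_memory_rss / 1024**3
    if fork_seconds > max_fork_seconds:
        warnings.append(
            f"fork took {fork_seconds:.2f} s (limit {max_fork_seconds:.2f} s)")
    if rss_gib > max_rss_gib:
        warnings.append(
            f"RSS is {rss_gib:.1f} GiB (limit {max_rss_gib:.1f} GiB)")
    return warnings


# Figures similar to the original instance trigger both warnings.
for warning in check_instance(1_000_000, 16 * 1024**3):
    print(warning)

# Figures similar to the migrated instance pass both checks.
print(check_instance(250_000, int(8.8 * 1024**3)))  # []
```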

In summary, regular BGSAVE‑induced forks can cause noticeable latency in high‑memory Redis deployments; controlling memory usage and monitoring fork duration are key to mitigating the issue.
