Operations 9 min read

Root Cause Analysis of Redis Slowlog Anomalies in Containerized Environments: Kernel TSC Bug on Skylake‑X Platforms

This article details a real‑world investigation of unexpected Redis slowlog latencies after container migration, tracing the issue through slowlog mechanics, contradictory monitoring data, gettimeofday inaccuracies, NTP drift, and ultimately a kernel bug affecting Skylake‑X CPUs, and offers concrete recommendations for reliable time measurement.

Ctrip Technology
Ctrip Technology
Ctrip Technology
Root Cause Analysis of Redis Slowlog Anomalies in Containerized Environments: Kernel TSC Bug on Skylake‑X Platforms

Author Background : Li Jian, senior software engineer at Ctrip Technology Assurance Center, responsible for containerizing Redis and MySQL services.

Problem Description : After migrating Redis instances to containers (second batch of five hosts), DBA alerts showed slowlog entries exceeding 500 ms, with some reaching 1800 ms, while CAT tracing reported much lower latencies, creating contradictory logs.

Analysis Process :

1. What is slowlog? Redis records operations exceeding the slowlog‑log‑slower‑than threshold; it measures only the execution time of the command itself.

2. Contradictory logs : Slowlog consistently reported 1800 ms, but CAT (Ctrip’s customized tracing tool) showed maximum 367 ms, indicating a discrepancy.

3. Community feedback : Suggestions pointed to NUMA effects, which were later ruled out.

4. Manual timing test : A simple loop sleeping one second and printing gettimeofday() differences revealed occasional extra ~1.8 s, hinting at a system‑clock issue.

5. System clock investigation : The hosts use the Time Stamp Counter (TSC) as the clock source. cat /sys/devices/system/clocksource/clocksource0/current_clocksource confirms TSC. TSC‑derived wall‑clock can drift and be corrected by NTP.

6. NTP synchronization : Manual NTP sync showed large offsets (≈1.7 s) on the problematic hosts, unlike the first‑batch machines.

7. Kernel bug discovery : A kernel commit for versions 4.9‑4.13 introduced a macro INTEL_FAM6_SKYLAKE_X that mis‑calculates the crystal frequency on Skylake‑X CPUs, causing the clock to lose ~1 s every 10 minutes. The second‑batch hosts (Xeon Bronze 3104, Skylake‑X) were affected; the first batch used different CPUs and older kernels (3.10), so they were not impacted.

Resolution : Upgrading the kernel to 4.14 (or later) eliminates the TSC frequency bug.

Recommendations :

For wall‑clock timestamps, prefer clock_gettime(CLOCK_REALTIME) over the deprecated gettimeofday() .

For measuring elapsed time, use clock_gettime(CLOCK_MONOTONIC) (or Java’s System.nanoTime() ) to avoid NTP‑induced drift.

By applying these practices, similar slowlog anomalies can be avoided in future container migrations.

performanceRediscontainerizationtimekeepingkernel bugslowlog
Ctrip Technology
Written by

Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.