Analysis of Didi’s Kubernetes Outage and General Mitigation Strategies

The article reviews Didi’s 12‑hour P0 outage caused by a Kubernetes upgrade failure in a massive cluster, discusses the root causes, and proposes general solutions such as federation, careful upgrade planning, and multi‑master designs to avoid similar incidents.

Code Ape Tech Column

Dec 4, 2023

Hello everyone, I’m Chen.

Around 10 pm on November 27, Didi was hit by a P0-level outage that lasted about 12 hours, until noon on November 28. It reportedly caused losses of more than ten million dollars and affected transactions worth over four hundred million dollars.

Simple Summary of the Crash Cause

Didi's own statement on Weibo attributed the incident to a failure in its underlying system software. As someone who works on low-level systems, I was curious about the details. According to rumors, an improperly executed Kubernetes upgrade brought the cluster down, and the sheer size of the cluster amplified the impact.

Coincidentally, Didi's tech blog had previously published an article titled “DD Elastic Cloud Scheduling Practice Based on K8S”, which describes the upgrade scheme they chose and the reasons behind it.

Didi still runs an old version of Kubernetes, which suggests they have been using Kubernetes for a long time.

General Solutions

Didi's technical article compares two upgrade approaches. Here I will share how I have dealt with similar problems in my own work.

Problem 1: Cluster Size Too Large

The Kubernetes documentation states that a single cluster supports up to 5,000 nodes. Exceeding that figure does not guarantee failure, but if the rumors are accurate, this incident shows how dangerous it is to operate beyond that threshold.
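As a rough illustration of the size problem, the sketch below compares a cluster's counts against the limits listed in the Kubernetes documentation for large clusters (5,000 nodes, 110 pods per node, 150,000 total pods, 300,000 total containers). The `ClusterStats` type and the sample numbers are hypothetical, and the pods-per-node check uses the cluster-wide average as a proxy, since the documented limit applies to each node individually.

```python
from __future__ import annotations

from dataclasses import dataclass

# Limits documented by Kubernetes for a single cluster.
MAX_NODES = 5_000
MAX_PODS_PER_NODE = 110
MAX_TOTAL_PODS = 150_000
MAX_TOTAL_CONTAINERS = 300_000


@dataclass
class ClusterStats:
    """Hypothetical snapshot of a cluster's size."""
    nodes: int
    pods: int
    containers: int


def exceeded_limits(stats: ClusterStats) -> list[str]:
    """Return the names of the documented limits this cluster exceeds."""
    exceeded = []
    if stats.nodes > MAX_NODES:
        exceeded.append("nodes")
    if stats.pods > MAX_TOTAL_PODS:
        exceeded.append("total_pods")
    if stats.containers > MAX_TOTAL_CONTAINERS:
        exceeded.append("total_containers")
    # Average density is only a proxy for the per-node limit.
    if stats.nodes > 0 and stats.pods / stats.nodes > MAX_PODS_PER_NODE:
        exceeded.append("avg_pods_per_node")
    return exceeded


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    print(exceeded_limits(ClusterStats(nodes=6_000, pods=120_000, containers=200_000)))
```

A check like this could run in monitoring so that the team is warned well before a single cluster approaches the documented ceiling.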

General Solution

When a production cluster approaches the size limit, we typically adopt a federation architecture that links multiple clusters into one federated cluster. This lets networking and Kubernetes resources interoperate across clusters, raises the ceiling on business capacity, and spreads risk across clusters. It adds some operational overhead, but it is far safer than endlessly adding nodes to a single cluster.
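To make the risk-spreading idea concrete, here is a minimal sketch of dividing a workload's replicas across member clusters in proportion to a weight. The function name, cluster names, and weights are hypothetical, and this is not the API of any particular federation tool; it only shows the arithmetic such a scheduler might perform.

```python
from __future__ import annotations


def spread_replicas(total: int, weights: dict[str, int]) -> dict[str, int]:
    """Split `total` replicas across clusters in proportion to their weights."""
    weight_sum = sum(weights.values())
    if total < 0 or weight_sum <= 0:
        raise ValueError("total must be >= 0 and weights must sum to > 0")

    # Start with the floor of each cluster's proportional share.
    plan = {name: total * w // weight_sum for name, w in weights.items()}

    # Hand out the leftover replicas by largest remainder (ties broken by name).
    leftover = total - sum(plan.values())
    by_remainder = sorted(
        weights, key=lambda name: (-(total * weights[name] % weight_sum), name)
    )
    for name in by_remainder[:leftover]:
        plan[name] += 1
    return plan


def worst_case_loss(plan: dict[str, int]) -> float:
    """Fraction of replicas lost if the most heavily loaded cluster fails."""
    total = sum(plan.values())
    return max(plan.values()) / total if total else 0.0


if __name__ == "__main__":
    plan = spread_replicas(100, {"east": 2, "west": 1, "north": 1})
    print(plan, worst_case_loss(plan))
```

With a single cluster the worst-case loss is always 100%; with several member clusters it falls to the share held by the largest one.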

Problem 2: Choosing an Upgrade Strategy

Large-scale clusters like Didi's are usually upgraded during nighttime maintenance windows. Direct in-place upgrades are rare. Most teams use a replacement upgrade instead: they bring up a new cluster on the target version, verify it, and then gradually shift traffic to it. In-place upgrades are attempted only when the risk is well understood.

General Solution

Didi's blog suggests that their in-place upgrade had been validated internally, but for most production environments the risk is still too high. I therefore recommend a replacement-upgrade strategy.
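The gradual traffic shift in a replacement upgrade can be sketched as a simple loop. This is only an outline under stated assumptions: `set_new_cluster_weight` and `new_cluster_healthy` are hypothetical callbacks standing in for whatever load balancer and health checks a team actually uses, and the step percentages are arbitrary.

```python
from typing import Callable, Iterable


def shift_traffic(
    steps: Iterable[int],
    set_new_cluster_weight: Callable[[int], None],
    new_cluster_healthy: Callable[[], bool],
) -> bool:
    """Move traffic to the new cluster step by step; roll back on failure.

    Returns True if every step passed its health check, False if the
    shift was rolled back.
    """
    for percent in steps:
        set_new_cluster_weight(percent)
        if not new_cluster_healthy():
            # Send all traffic back to the old, still-running cluster.
            set_new_cluster_weight(0)
            return False
    return True


if __name__ == "__main__":
    applied = []  # records every weight we would have pushed to the balancer
    ok = shift_traffic([5, 25, 50, 100], applied.append, lambda: True)
    print(ok, applied)
```

The point of the design is that the old cluster keeps running until the last step succeeds, so a rollback is a single weight change rather than a second upgrade.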

Problem 3: Control‑Plane Nodes

The rumor that the control-plane nodes crashed seems exaggerated. Large enterprises should run multiple master (control-plane) nodes and avoid placing all of them in the same data center, in line with basic disaster-recovery principles.
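The reason for spreading masters across data centers comes down to quorum arithmetic. The sketch below assumes the control-plane state lives in etcd, which needs a majority of its members to stay available; the data-center names and placements are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def survives_site_loss(placement: list[str]) -> bool:
    """Check whether quorum survives the loss of any single data center.

    `placement` holds one data-center name per etcd member.
    """
    if not placement:
        raise ValueError("placement must contain at least one member")
    per_site = Counter(placement)
    # Losing the most populated site is the worst case.
    remaining = len(placement) - max(per_site.values())
    return remaining >= quorum(len(placement))


if __name__ == "__main__":
    print(survives_site_loss(["dc1", "dc1", "dc1"]))  # all masters in one site
    print(survives_site_loss(["dc1", "dc2", "dc3"]))  # one master per site
```

Note that two data centers are never enough under this model: whichever site holds the majority takes quorum with it when it fails, so at least three sites are needed.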

Ramblings

Many big-tech products have crashed recently, first Alibaba and then Didi. Coming amid a wave of layoffs, the outages have inspired plenty of jokes, the most famous being 开猿节流，降本增笑 ("lay off the code apes to save money; cut costs and add laughs"), a pun on the business slogan 开源节流，降本增效 ("broaden income and curb spending; cut costs and raise efficiency"). Labor is indeed the biggest expense for internet companies, and cutting developers once a product matures may look attractive. But it leaves a company full of PPT experts with few hands-on engineers, which inevitably leads to technical problems. I hope leaders think twice before laying off staff and respect their technical teams.

Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
