What Caused Cloudflare’s 4‑Hour Outage? Lessons on Cable Management and Process Clarity
A four-hour Cloudflare outage was triggered when a cable was mistakenly removed during planned maintenance. Unclear instructions and unlabeled wiring made it worse, highlighting the need for better cable management, clear operational procedures, and robust single-point-of-failure mitigation.
A single point of failure, incorrect work instructions, and unlabeled cables together caused this incident.
Cloudflare admitted that the outage, which lasted more than four hours, was caused by someone pulling a cable that should not have been touched: technicians following incorrect instructions inadvertently disconnected it.
The event began with planned maintenance at one of Cloudflare’s core data centers, where staff were told to “remove all equipment from one rack.”
Cloudflare clarified that the rack contained only old, inactive hardware with no live traffic or data on any servers.
However, the rack also housed a patch panel that provided external connections to other Cloudflare data centers. In just three minutes, the technician who was decommissioning the unused hardware also disconnected the cables in this patch panel.
The patch panel turned out to be a single point of failure: disconnecting it cut multiple redundant fiber links at once. From 15:31 UTC to 19:52 UTC, the Cloudflare dashboard and API were unusable, and the Argo smart-routing feature was also impacted, affecting sites that relied on it.
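To illustrate why redundant links can still share a single point of failure, here is a minimal sketch that looks for components common to every path in a redundancy group. The inventory format, the function name `shared_failure_points`, and the device names are all hypothetical and are not taken from Cloudflare's tooling.

```python
from collections import defaultdict


def shared_failure_points(links):
    """Return, per redundancy group, the components every link passes through.

    `links` is a hypothetical inventory: each entry names its redundancy
    group and the ordered list of components the link traverses.
    """
    paths_by_group = defaultdict(list)
    for link in links:
        paths_by_group[link["group"]].append(set(link["path"]))

    result = {}
    for group, paths in paths_by_group.items():
        # A component present in every path defeats the redundancy.
        common = set.intersection(*paths)
        if len(paths) > 1 and common:
            result[group] = sorted(common)
    return result


# Illustrative data only: two "redundant" links that share one patch panel.
links = [
    {"group": "core-to-remote", "path": ["router-a", "patch-panel-1", "carrier-x"]},
    {"group": "core-to-remote", "path": ["router-b", "patch-panel-1", "carrier-y"]},
]
spof = shared_failure_points(links)
```

With this sample data, the shared patch panel is reported for the group even though the routers and carriers differ, which is the kind of hidden dependency the incident exposed.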
The company’s investigation was delayed because cable labels were unclear, and remote work due to COVID‑19 did not help.
Cloudflare did not blame the technicians. Instead, it said the process must change, and that work instructions should explicitly indicate which cables must not be touched.
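One way to make such instructions explicit is to generate the work order from a cable inventory, so that live cables in the target rack are listed as off-limits. The sketch below assumes a hypothetical inventory with `label`, `rack`, and `live` fields; it is an illustration of the idea, not Cloudflare's actual process.

```python
def build_work_order(rack, cables):
    """Split the cables in `rack` into those to remove and those to leave alone.

    `cables` is a hypothetical inventory of dicts with "label", "rack",
    and "live" keys. Live cables are never placed on the removal list.
    """
    in_rack = [c for c in cables if c["rack"] == rack]
    return {
        "rack": rack,
        "remove": sorted(c["label"] for c in in_rack if not c["live"]),
        "do_not_touch": sorted(c["label"] for c in in_rack if c["live"]),
    }


# Illustrative data only.
cables = [
    {"label": "srv-old-01-eth0", "rack": "R12", "live": False},
    {"label": "pp1-port-03", "rack": "R12", "live": True},
    {"label": "srv-new-07-eth0", "rack": "R13", "live": True},
]
order = build_work_order("R12", cables)
```

A work order built this way replaces a blanket instruction such as "remove all equipment" with two explicit lists, provided the inventory and the physical labels are kept accurate.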
All configuration data remained intact, so customers experienced only service disruption, not data loss.
Chief Technology Officer John Graham‑Cumming apologized for the outage.