Live Streaming Service Overload Incident Caused by Self-Referencing Push Configuration
A sudden surge in live‑stream traffic overloaded the core streaming service because a push configuration mistakenly pointed to the same stream URL, creating a self‑referencing loop that repeatedly generated duplicate streams until the service capacity was exhausted.
Incident Overview
Around noon, the number of stream connections on a live‑streaming platform's core data‑stream service began to climb rapidly and linearly. Within 20 minutes the service reached full capacity, triggering overload protection and preventing new live streams from starting.
Investigation and Resolution
Restarting the core service did not help; the same overload recurred 20 minutes later. Investigation showed that all incoming streams originated from the platform’s own data center and carried identical content, which pointed to a self‑referencing push configuration.
Removing the push configuration and restarting the service restored normal operation.
Root Cause
The live stream was created with a push (relay) configuration whose URL pointed to the same stream (self‑URL).
According to the push logic, this creates an additional stream that pushes back to the core service using the identical URL.
The core service’s implementation allows the second stream with the same URL to pre‑empt the first, while the first remains connected but inactive; this design aims to let the first resume playback when the second disconnects.
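Because each new stream pre‑empts the previous one and becomes the active stream, it triggers the push logic again. The pre‑empted streams stay connected, so connections accumulate with every iteration until the service is saturated.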
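The loop described above can be illustrated with a small simulation. This is only a toy model under stated assumptions, not the platform's actual code: the class CoreService, the function simulate, the capacity value, and the example URLs are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CoreService:
    """Toy model of the core service; every name here is hypothetical."""
    capacity: int
    active: dict = field(default_factory=dict)  # url -> id of the active stream
    connections: int = 0                        # active plus pre-empted streams
    next_id: int = 0

    def publish(self, url: str) -> int:
        if self.connections >= self.capacity:
            raise RuntimeError("overload protection: capacity exhausted")
        self.next_id += 1
        self.connections += 1
        # A newer stream with the same URL pre-empts the older one. The older
        # stream stays connected but inactive, so `connections` never drops.
        self.active[url] = self.next_id
        return self.next_id


def simulate(stream_url: str, push_url: str, capacity: int) -> int:
    """Return how many streams the service accepted before the loop ended."""
    core = CoreService(capacity=capacity)
    core.publish(stream_url)  # the original live stream
    accepted = 1
    url = stream_url
    # Every stream that becomes active on stream_url triggers its push config.
    while url == stream_url:
        try:
            core.publish(push_url)
        except RuntimeError:
            break  # overload protection kicked in
        accepted += 1
        url = push_url  # the loop continues only if the push targets itself
    return accepted


if __name__ == "__main__":
    same = "rtmp://core.example/live/room1"
    other = "rtmp://cdn.example/live/room1"
    print(simulate(same, same, capacity=5))   # self-referencing push
    print(simulate(same, other, capacity=5))  # ordinary push to another target
```

In this model a push to a different target adds exactly one extra stream, while a self-referencing push keeps adding streams until the capacity limit is hit.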
Fix and Recommendations
Reject push configurations whose target URL falls within the live stream's own space.
When a second stream arrives with the same URL, terminate the first stream instead of keeping it connected.
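Both recommendations can be sketched in a few lines. The following is a minimal sketch under assumptions, not the platform's implementation: stream_key, is_self_referencing, Connection, and StreamRegistry are hypothetical names, and the comparison here only checks host and path, whereas a real check would likely need to cover the stream's whole space as the platform defines it.

```python
from urllib.parse import urlsplit


def stream_key(url: str) -> tuple:
    """Reduce a stream URL to (host, path) so equivalent URLs compare equal."""
    parts = urlsplit(url)
    return (parts.hostname or "").lower(), parts.path.rstrip("/")


def is_self_referencing(stream_url: str, push_url: str) -> bool:
    """True if the push target resolves to the same stream it originates from."""
    return stream_key(stream_url) == stream_key(push_url)


class Connection:
    """Hypothetical stand-in for a publisher connection."""

    def __init__(self, name: str):
        self.name = name
        self.closed = False

    def close(self) -> None:
        self.closed = True


class StreamRegistry:
    """Keeps at most one connection per stream URL."""

    def __init__(self):
        self.active = {}  # stream key -> Connection

    def publish(self, url: str, conn: Connection) -> None:
        key = stream_key(url)
        previous = self.active.get(key)
        if previous is not None:
            # Terminate the pre-empted stream instead of leaving it idle.
            previous.close()
        self.active[key] = conn


if __name__ == "__main__":
    src = "rtmp://core.example/live/room1"
    print(is_self_referencing(src, "rtmp://core.example:1935/live/room1/"))
    print(is_self_referencing(src, "rtmp://cdn.example/live/room1"))
```

The validation would run when a push configuration is created, and the registry change bounds the number of connections held per URL.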
Commentary
The incident involved two bugs: (1) the push logic did not forbid self‑referencing streams, a design oversight that went uncaught because negative test cases were missing; (2) the handling of duplicate URLs kept the first stream’s connection open, which was a product decision but introduced risk. Together they caused a severe outage even though the trigger scenario was unlikely.
Byte Quality Assurance Team
World-leading audio and video quality assurance team, safeguarding the AV experience of hundreds of millions of users.