MemoryThrashing: A Solution for Live Streaming Memory OOM Issues
MemoryThrashing is an in-house tool for detecting and analyzing memory thrashing in live streaming applications. It addresses OOM problems by providing efficient monitoring and analysis of memory growth.
Live streaming OOM (Out of Memory) issues are notoriously difficult to diagnose due to the complexity of involved business logic and the time-consuming nature of problem resolution. To proactively identify issues and improve diagnostic efficiency, the MemoryThrashing solution was proposed as a complement to existing tools.
The primary motivation for developing MemoryThrashing stems from the limitations of the existing "MemoryGraph" tool. While "MemoryGraph" can analyze OOM causes through captured memory files, it has significant drawbacks: high performance overhead, low sampling rates, and the inability to easily detect issues without targeting specific users. Additionally, "MemoryGraph" generation may not occur at peak memory usage times, potentially affecting OOM analysis accuracy.
From a business perspective, MemoryThrashing treats memory thrashing as a significant fluctuation in performance data. In memory terms, a "thrash" is rapid growth within a short period, for example from 600MB to 800MB. The tool aims to identify the source of such growth, because memory spikes often lead to OOM issues. Two scenarios are common: memory that is not released after a spike, which leaves the process at a high memory level and at risk of OOM, and memory that spikes and then drops without causing OOM, which may still point to an underlying issue.
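The definition of a "thrash" can be illustrated with a minimal sketch. The article does not say which threshold or time window the tool uses, so `GROWTH_THRESHOLD_MB`, `WINDOW_SECONDS`, and `find_thrash` are hypothetical names and values chosen only to match the 600MB to 800MB example. Python is used here purely for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical values: the article only gives 600MB -> 800MB "within a short
# period" as an example and does not state the real threshold or window.
GROWTH_THRESHOLD_MB = 200.0
WINDOW_SECONDS = 10.0


def find_thrash(
    samples: List[Tuple[float, float]],
    growth_mb: float = GROWTH_THRESHOLD_MB,
    window_s: float = WINDOW_SECONDS,
) -> Optional[Tuple[float, float]]:
    """Return (start_time, end_time) of the first span in which memory grows
    by at least growth_mb within window_s seconds, or None if there is none.

    samples is a list of (timestamp_seconds, memory_mb) sorted by time.
    """
    for j, (t_end, mem_end) in enumerate(samples):
        # Compare the current sample against every earlier sample.
        for t_start, mem_start in samples[:j]:
            within_window = t_end - t_start <= window_s
            grew_enough = mem_end - mem_start >= growth_mb
            if within_window and grew_enough:
                return (t_start, t_end)
    return None


# Memory climbs from 600MB to 800MB between t=5s and t=12s.
samples = [(0, 590), (5, 600), (10, 640), (12, 800), (20, 810)]
print(find_thrash(samples))
```

The quadratic scan is acceptable for a handful of samples. A production monitor would more likely keep a small sliding window of recent readings.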
Two approaches were considered: memory zone traversal and runtime monitoring. Memory zone traversal iterates through memory nodes to count instances per class, while runtime monitoring hooks alloc and dealloc to keep a running instance count for each class. The runtime approach was chosen because it has lower overhead and is free of wild pointer issues, and testing showed negligible impact on the main thread.
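The bookkeeping behind the runtime approach can be sketched as follows. This is not the tool's implementation: the real hooks live in the Objective-C runtime, and `InstanceCounter`, `on_alloc`, `on_dealloc`, and `snapshot` are hypothetical names standing in for whatever the hooked alloc/dealloc paths call.

```python
from collections import Counter
from typing import Dict


class InstanceCounter:
    """Keeps a live instance count per class name.

    on_alloc / on_dealloc are assumed stand-ins for the hooked
    alloc and dealloc calls described in the article.
    """

    def __init__(self) -> None:
        self._live: Counter = Counter()

    def on_alloc(self, class_name: str) -> None:
        self._live[class_name] += 1

    def on_dealloc(self, class_name: str) -> None:
        # An object allocated before monitoring started may be released
        # while monitoring is on; never let the count go negative.
        if self._live[class_name] > 0:
            self._live[class_name] -= 1

    def snapshot(self) -> Dict[str, int]:
        """Return a copy of the current counts, omitting classes at zero."""
        return {name: n for name, n in self._live.items() if n > 0}


counter = InstanceCounter()
for _ in range(3):
    counter.on_alloc("LivexxxA")
counter.on_alloc("LivexxxB")
counter.on_dealloc("LivexxxB")
counter.on_dealloc("LivexxxC")  # allocated before monitoring began
print(counter.snapshot())
```

Each hook only increments or decrements a counter, which is consistent with the low overhead the article reports. A real implementation would also need to make these updates thread-safe.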
Monitoring begins once a user has been in a live room for some time, and changes in the memory value determine when sampling is enabled. Several sampling cycles run before the data is reported, after which monitoring continues. The reported data lists the top 100 instances, so a problematic class such as "LivexxxA" can be identified by its significant growth across sampling periods.
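Reading the reported data amounts to comparing instance counts across sampling cycles. The sketch below ranks classes by growth between the first and last snapshot. The article does not describe the exact ranking rule, so `top_growth` and the sample counts are assumptions for illustration, not real data.

```python
from typing import Dict, List, Tuple


def top_growth(
    snapshots: List[Dict[str, int]], top_n: int = 100
) -> List[Tuple[str, int]]:
    """Rank classes by growth in live instance count between the first and
    last snapshot, returning at most top_n (class_name, growth) pairs.

    Assumes snapshots holds at least one per-class count dictionary.
    """
    first, last = snapshots[0], snapshots[-1]
    growth = {name: count - first.get(name, 0) for name, count in last.items()}
    ranked = sorted(growth.items(), key=lambda kv: kv[1], reverse=True)
    # Keep only classes whose count actually grew.
    return [(name, delta) for name, delta in ranked if delta > 0][:top_n]


# Illustrative counts from three sampling cycles (made-up numbers).
snapshots = [
    {"LivexxxA": 100, "LivexxxB": 50},
    {"LivexxxA": 900, "LivexxxB": 52},
    {"LivexxxA": 2500, "LivexxxB": 51, "LivexxxC": 10},
]
print(top_growth(snapshots, top_n=2))
```

Having several cycles rather than one lets a reader separate steady accumulation, as with "LivexxxA" here, from counts that merely fluctuate.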
MemoryThrashing offers several advantages over existing solutions: multiple sampling capabilities for trend analysis, low performance overhead allowing full-scale online deployment, early detection of memory issues, and simple troubleshooting through object count analysis. However, it has limitations including language restrictions (only Objective-C), inability to analyze memory leaks through reference relationships, and potential method cache impacts from hook-based monitoring.
Practical applications have shown MemoryThrashing's effectiveness in early problem detection. Examples include identifying memory accumulation issues where objects are allocated but not released, and detecting temporary object creation problems in scenarios like live streaming with effects processing. These early detections significantly improve troubleshooting efficiency compared to traditional reactive approaches.
Future plans include enhancing attribution capabilities by adding object reference relationship calculations for top growth points, and implementing CPU monitoring to complement memory analysis. CPU monitoring is particularly relevant as many OOM and ANR issues are accompanied by high CPU usage, allowing for better business process identification through thread names and stack traces.
ByteFE
Cutting‑edge tech, article sharing, and practical insights from the ByteDance frontend team.