Memory Monitoring and Leak Detection Practices in NetEase Cloud Music Android App
The NetEase Cloud Music Android team built a comprehensive memory‑monitoring system—combining LeakCanary and KOOM for leak detection, instrumented image loading for large‑bitmap tracking, periodic heap and thread metrics collection, and automated ticket generation—to identify, rank, and resolve leaks, oversized resources, and thread‑related OOM risks across development and production.
Background
As Cloud Music continues to drive down its crash rate, OOM crashes remain a problem, caused by improper memory usage such as leaks, large objects, and oversized images. Memory issues are hard to detect, reproduce, and diagnose, so developers need monitoring tools to assist them. This article shares Cloud Music's exploration and practice in memory monitoring, covering the aspects below.
Memory Leak Monitoring
A memory leak occurs when objects that are no longer needed remain strongly referenced by longer-lived GC roots, preventing them from being released in time and causing memory problems.
Leaks raise memory peaks and increase the probability of OOM. They are relatively easy to monitor, but developers often skip local detection, so an automated tool is needed to monitor leaks and generate tickets for developers.
Leak Monitoring Solution
LeakCanary is an open-source Java memory leak detection library from Square, used during development to detect common Android leaks.
Its strengths are readable leak traces and suggested fixes. LeakCanary listens to activity and fragment lifecycle events, hands destroyed objects to an ObjectWatcher that holds a weak reference to each of them, and checks five seconds later whether the reference has been cleared; an object whose reference survives is considered retained.
LeakCanary's core flow: lifecycle destroy callback → ObjectWatcher takes a weak reference → wait five seconds and check for retention → dump and analyze the heap if the object is still reachable.
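The weak-reference watch at the heart of this flow can be sketched on a plain JVM. This is a minimal illustration, not LeakCanary's actual implementation: class and method names are hypothetical, and System.gc() is only a hint here, whereas the real tool forces GC more aggressively before deciding an object is retained.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashSet;
import java.util.Set;

/** Minimal sketch of a LeakCanary-style ObjectWatcher (names hypothetical). */
class ObjectWatcherSketch {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Set<WeakReference<Object>> watched = new HashSet<>();

    /** Start watching an object that should soon become unreachable. */
    void watch(Object destroyed) {
        watched.add(new WeakReference<>(destroyed, queue));
    }

    /** Drain collected references, then count objects still strongly reachable. */
    int retainedCount() {
        System.gc(); // a hint only; real tools trigger GC more forcefully
        java.lang.ref.Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            watched.remove(ref);
        }
        int retained = 0;
        for (WeakReference<Object> w : watched) {
            if (w.get() != null) retained++;
        }
        return retained;
    }
}
```

An object that is still strongly referenced when `retainedCount()` runs shows up as retained, which is exactly the signal that triggers a heap dump in the real tool.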
LeakCanary works well in test environments, but it triggers GC and calls Debug.dumpHprofData(), which causes noticeable freezes and makes it unsuitable for production.
KOOM, an open-source framework from Kuaishou, improves on this by using a copy-on-write fork to dump the Java heap without long freezes. KOOM periodically samples heap usage, thread count, and FD count; when a threshold is crossed, it suspends the VM, forks a child process, resumes the parent, dumps the heap in the child, and then parses the HPROF offline with Shark to identify leaks and large objects.
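The periodic threshold check can be sketched as below. This is an illustrative JVM version, not KOOM's code: the threshold values are made up, and `ThreadMXBean` exists on the JVM but not on Android, where one would instead read `/proc/self/status` or use `Thread.activeCount()`.

```java
import java.lang.management.ManagementFactory;

/** Sketch of KOOM-style periodic threshold checks (thresholds illustrative). */
class MemoryThresholdMonitor {
    static final double HEAP_RATIO_THRESHOLD = 0.80; // e.g. 80% of max heap
    static final int THREAD_THRESHOLD = 450;         // below typical device caps

    /** Fraction of the max heap currently in use. */
    static double heapUsageRatio() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }

    /** Live thread count (JVM API; Android would read /proc instead). */
    static int threadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** True when any watched metric crosses its threshold, triggering a dump. */
    static boolean shouldDump() {
        return heapUsageRatio() > HEAP_RATio_or(HEAP_RATIO_THRESHOLD)
                || threadCount() > THREAD_THRESHOLD;
    }

    private static double HEAP_RATio_or(double v) { return v; }
}
```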
Combining both libraries, we built a two‑dimensional monitoring system (online and offline). Detected leaks and large objects are aggregated, ranked, and automatically ticketed for developers.
Overall flow:
Online: strict trigger conditions (memory staying near its ceiling continuously, sudden spikes, thread or FD counts crossing thresholds) initiate a dump. We generate an HPROF file, analyze it for leaks, large objects, large bitmaps, and so on, and report the results. Initially we do not upload the full HPROF file, to reduce impact on users; later we may upload trimmed files.
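The "continuous top" and "spike" conditions can be captured by a small sliding-window check. The sketch below is illustrative only; the window size and thresholds here are invented, not the production values.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative dump trigger: fires on a sharp jump between consecutive
 *  samples, or when usage stays above a ceiling for a full window. */
class DumpTrigger {
    private final Deque<Double> samples = new ArrayDeque<>();
    private final int window;      // samples needed for "continuous top"
    private final double ceiling;  // continuous-top usage threshold
    private final double jump;     // spike delta between adjacent samples

    DumpTrigger(int window, double ceiling, double jump) {
        this.window = window;
        this.ceiling = ceiling;
        this.jump = jump;
    }

    /** Feed one heap-usage ratio; returns true when a dump should trigger. */
    boolean offer(double usageRatio) {
        boolean spike = !samples.isEmpty()
                && usageRatio - samples.peekLast() > jump;
        samples.addLast(usageRatio);
        if (samples.size() > window) samples.removeFirst();
        boolean continuousTop = samples.size() == window
                && samples.stream().allMatch(s -> s > ceiling);
        return spike || continuousTop;
    }
}
```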
Offline: during automated tests, similar conditions trigger dumps and analysis, reporting to backend.
The platform aggregates issues, sorts by leak count, affected users, average leak rate, and creates tickets.
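The ranking step amounts to sorting aggregated issues by the keys listed above. A hypothetical record and comparator (field names are assumptions, mirroring the description, not the platform's schema):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical aggregated-issue record and the described ranking order. */
class LeakIssue {
    final String signature;    // aggregation key, e.g. leak trace hash
    final int leakCount;
    final int affectedUsers;
    final double avgLeakRate;

    LeakIssue(String signature, int leakCount, int affectedUsers, double avgLeakRate) {
        this.signature = signature;
        this.leakCount = leakCount;
        this.affectedUsers = affectedUsers;
        this.avgLeakRate = avgLeakRate;
    }

    /** Sort descending by leak count, then affected users, then avg leak rate. */
    static List<LeakIssue> rank(List<LeakIssue> issues) {
        List<LeakIssue> sorted = new ArrayList<>(issues);
        sorted.sort(Comparator.comparingInt((LeakIssue i) -> i.leakCount).reversed()
                .thenComparing(Comparator.comparingInt((LeakIssue i) -> i.affectedUsers).reversed())
                .thenComparing(Comparator.comparingDouble((LeakIssue i) -> i.avgLeakRate).reversed()));
        return sorted;
    }
}
```

The top-ranked issues are the ones turned into tickets first.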
Currently we monitor leaks of:
Destroyed and finished activities
Fragments whose manager is null
Destroyed windows
Bitmaps exceeding size threshold
Large primitive arrays
Any class with object count exceeding threshold
Cleared ViewModel instances
RootViews removed from window manager
Large Image Monitoring
Bitmaps are the biggest memory consumer in Android apps. Monitoring large images is essential.
Online Large Image Monitoring
We instrument the unified image loading library to capture width, height, file size, and view size. If the image exceeds configured thresholds, we record and report it. We also capture view hierarchy up to 5 levels and compute a “large‑image rate” per page using our custom Shuguang tracing system.
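The core check at the instrumentation point is a comparison of decoded bitmap size against the displaying view. A minimal sketch, assuming a pixel-count heuristic; the factor and the exact rule are illustrative, not the production thresholds:

```java
/** Heuristic run when an image decode completes: flag bitmaps much larger
 *  than the view displaying them (factor is illustrative). */
class LargeImageChecker {
    static boolean isOversized(int bmpW, int bmpH, int viewW, int viewH, double factor) {
        if (viewW <= 0 || viewH <= 0) return false; // view not laid out yet
        long bmpPixels = (long) bmpW * bmpH;
        long viewPixels = (long) viewW * viewH;
        return bmpPixels > viewPixels * factor;
    }
}
```

Flagged images are reported along with the captured view hierarchy so that the offending page and view can be located.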
Local Resource Image Monitoring
We scan local image resources after the mergeResources task, list images exceeding thresholds, and report them for pre‑release fixing via automatic ticket creation.
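The scan itself is a file-size walk over the merged resource directory. A self-contained sketch, assuming a simple byte-size threshold (the real check may also consider dimensions and density buckets):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Sketch of a post-mergeResources scan: list image files above a byte threshold. */
class ResourceImageScan {
    static List<Path> oversized(Path mergedResDir, long maxBytes) throws IOException {
        try (Stream<Path> files = Files.walk(mergedResDir)) {
            return files.filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase();
                        return name.endsWith(".png") || name.endsWith(".jpg")
                                || name.endsWith(".webp");
                    })
                    .filter(p -> p.toFile().length() > maxBytes)
                    .collect(Collectors.toList());
        }
    }
}
```

In practice this would be wired into the build as a Gradle task that runs after mergeResources and feeds the result into ticket creation.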
Memory Size Monitoring
Beyond leaks and large images, we need a memory dashboard to understand overall app memory usage, covering startup memory (PSS), runtime memory (PSS), Java heap, threads, etc.
Startup, Runtime, and Java Memory Monitoring
High startup memory leads to poor user experience. We collect Debug.MemoryInfo from all processes (using getMemoryStat on API 23+) to obtain data similar to Android Memory Profiler.
We capture memory at startup completion and periodically during runtime, reporting to the backend for aggregation. Metrics such as average PSS and Java heap top‑rate (usage >85%) are computed.
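The "top rate" metric mentioned above reduces to a fraction over collected samples. A minimal sketch of the aggregation, assuming usage ratios are reported as values between 0 and 1:

```java
import java.util.List;

/** "Top rate": fraction of runtime samples where heap usage exceeds a threshold
 *  (the article uses 85% for the Java heap). */
class HeapTopRate {
    static double topRate(List<Double> usageRatios, double threshold) {
        if (usageRatios.isEmpty()) return 0.0;
        long over = usageRatios.stream().filter(r -> r > threshold).count();
        return (double) over / usageRatios.size();
    }
}
```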
We also listen to system callbacks like onLowMemory to trigger proactive memory release.
Thread Monitoring
Out‑of‑Memory errors can also stem from thread creation failures, e.g.:
java.lang.OutOfMemoryError: {CanCatch}{main} pthread_create (1040KB stack) failed: Out of memory

Thread count limits vary by device; Huawei EMUI, for example, caps an app at 500 threads.
We monitor Cloud Music’s thread count; when thresholds are exceeded we report and compute a thread top‑rate.
We also hook native thread lifecycle methods (pthread_create, pthread_detach, pthread_join, pthread_exit) to detect thread leaks, reporting them similarly.
Conclusion
Cloud Music’s memory monitoring started later than industry peers, but by building on existing tools and tailoring them to our scenario we have created a continuous, fine‑grained monitoring system. It is an ongoing effort that aims to help developers discover and resolve memory issues efficiently.
References
https://github.com/square/leakcanary
https://github.com/KwaiAppTeam/KOOM
https://juejin.cn/post/7134728428003000356#heading-30
https://blog.yorek.xyz/android/paid/master/memory_2/#_1
NetEase Cloud Music Tech Team