
Memory Monitoring and Leak Detection Practices in NetEase Cloud Music Android App

The NetEase Cloud Music Android team built a comprehensive memory‑monitoring system—combining LeakCanary and KOOM for leak detection, instrumented image loading for large‑bitmap tracking, periodic heap and thread metrics collection, and automated ticket generation—to identify, rank, and resolve leaks, oversized resources, and thread‑related OOM risks across development and production.

NetEase Cloud Music Tech Team

Background

As Cloud Music's overall crash rate continues to fall, OOM crashes caused by improper memory usage, such as leaks, large objects, and oversized images, still remain. Memory issues are hard to detect, reproduce, and diagnose, so developers need monitoring tools to assist them. This article shares Cloud Music's exploration and practice in memory monitoring, covering the aspects below.

Memory Leak Monitoring

A memory leak occurs when objects that are no longer needed are still strongly referenced from longer-lived GC roots, so they cannot be released in time and end up causing memory problems.

Leaks raise memory peaks and increase the probability of OOM. They are relatively easy to monitor, but developers often skip local detection, so an automated tool is needed that monitors leaks continuously and files tickets to the responsible developers.

Leak Monitoring Solution

LeakCanary is an open‑source Java leak detection tool from Square, used during development to detect common Android leaks.

Its advantages are readable results and suggested fixes. LeakCanary works by listening to activity and fragment lifecycle events, handing destroyed objects to an ObjectWatcher that holds a weak reference to each one and checks, after a 5-second delay, whether the GC has cleared the reference.
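The ObjectWatcher idea can be sketched in plain Java: hold a weak reference to a destroyed object and, after the delay, treat anything the GC has not cleared as a suspected leak. This is a minimal illustration of the mechanism, not LeakCanary's actual API; class and method names here are invented.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the ObjectWatcher mechanism: watch a destroyed object via
// a weak reference; if the reference survives a GC, the object is retained.
public class ObjectWatcherSketch {
    private final Map<String, WeakReference<Object>> watched = new ConcurrentHashMap<>();

    // Called when an activity/fragment is destroyed.
    public String watch(Object destroyed) {
        String key = UUID.randomUUID().toString();
        watched.put(key, new WeakReference<>(destroyed));
        return key;
    }

    // Called after the delay (LeakCanary's default is 5 seconds): request a GC
    // and report the object as retained if the weak reference still holds it.
    public boolean isRetained(String key) {
        System.gc();
        WeakReference<Object> ref = watched.get(key);
        return ref != null && ref.get() != null;
    }
}
```

In the real library, retained objects above a count threshold then trigger a heap dump and analysis; the sketch stops at the retained check.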

LeakCanary’s core flow is shown below:

LeakCanary works well in test environments, but it triggers GC and dumps the heap with Debug.dumpHprofData(), which causes noticeable freezes and makes it unsuitable for production.

KOOM, an open-source framework from Kuaishou, addresses this by forking the VM and relying on copy-on-write so the Java heap can be dumped without a long freeze. KOOM periodically samples heap usage, thread count, and FD count; when a threshold is crossed it suspends the VM, forks a child process, and resumes the parent immediately while the child dumps the heap. The resulting HPROF is then parsed offline with Shark to identify leaks and large objects.
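The triggering side of this scheme is simple state logic: poll the metrics and request a dump only on the first poll that crosses a threshold, so one incident does not produce repeated dumps. The sketch below shows that rising-edge logic; the threshold values and names are illustrative, not KOOM's defaults.

```java
// Sketch of threshold-triggered dump logic: dump once per incident, re-arm
// when all metrics drop back below their thresholds.
public class DumpTrigger {
    private final double heapRatioThreshold;
    private final int threadThreshold;
    private final int fdThreshold;
    private boolean dumpedThisIncident = false;

    public DumpTrigger(double heapRatioThreshold, int threadThreshold, int fdThreshold) {
        this.heapRatioThreshold = heapRatioThreshold;
        this.threadThreshold = threadThreshold;
        this.fdThreshold = fdThreshold;
    }

    // Returns true when a (fork-based) heap dump should be taken now.
    public boolean onPoll(double heapRatio, int threadCount, int fdCount) {
        boolean overThreshold = heapRatio >= heapRatioThreshold
                || threadCount >= threadThreshold
                || fdCount >= fdThreshold;
        if (overThreshold && !dumpedThisIncident) {
            dumpedThisIncident = true;   // rising edge: dump once
            return true;
        }
        if (!overThreshold) {
            dumpedThisIncident = false;  // incident over, re-arm for the next one
        }
        return false;
    }
}
```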

Combining both libraries, we built a two‑dimensional monitoring system (online and offline). Detected leaks and large objects are aggregated, ranked, and automatically ticketed for developers.

Overall flow:

Online: strict conditions (memory staying close to the heap limit, sudden spikes, thread or FD counts over thresholds) trigger a dump, producing an HPROF file that is analyzed for leaks, large objects, large bitmaps, and so on, with the analysis results reported. Initially we do not upload the full HPROF, to limit the impact on users; later we may upload trimmed files.

Offline: during automated tests, similar conditions trigger dumps and analysis, reporting to backend.

The platform aggregates the reported issues, sorts them by leak count, affected users, and average leak rate, and creates tickets for the owning developers.
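The ranking step amounts to a multi-key sort over aggregated issues. A sketch of that ordering, with illustrative field names (the production schema is not public):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the server-side ranking: highest-impact issues first, ordered by
// leak count, then affected users, then average leak rate (all descending).
public class IssueRanker {
    public static class Issue {
        public final String signature;     // aggregation key, e.g. leak trace hash
        public final int leakCount;
        public final int affectedUsers;
        public final double avgLeakRate;

        public Issue(String signature, int leakCount, int affectedUsers, double avgLeakRate) {
            this.signature = signature;
            this.leakCount = leakCount;
            this.affectedUsers = affectedUsers;
            this.avgLeakRate = avgLeakRate;
        }
    }

    public static List<Issue> rank(List<Issue> issues) {
        List<Issue> sorted = new ArrayList<>(issues);
        sorted.sort(Comparator
                .comparingInt((Issue i) -> i.leakCount).reversed()
                .thenComparing(Comparator.comparingInt((Issue i) -> i.affectedUsers).reversed())
                .thenComparing(Comparator.comparingDouble((Issue i) -> i.avgLeakRate).reversed()));
        return sorted;
    }
}
```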

Currently we monitor leaks of:

Destroyed and finished activities

Fragments whose manager is null

Destroyed windows

Bitmaps exceeding size threshold

Large primitive arrays

Any class with object count exceeding threshold

Cleared ViewModel instances

RootViews removed from window manager

Large Image Monitoring

Bitmaps are typically the largest memory consumers in an Android app, so monitoring large images is essential.

Online Large Image Monitoring

We instrument the unified image-loading library to capture each image's width, height, and file size along with the size of the target view. If an image exceeds the configured thresholds, we record and report it, capturing up to five levels of the surrounding view hierarchy, and we compute a per-page "large-image rate" using our in-house Shuguang tracing system.
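The detection rule itself is a simple comparison against the view size and a file-size cap. A sketch, with an assumed scale factor and byte threshold (the production config values are not published):

```java
// Sketch of a large-image rule applied at image-load time: an image counts as
// "large" when its decoded pixel count exceeds the target view's pixel count
// by a configurable factor, or its file size exceeds a byte threshold.
public class LargeImageDetector {
    private final double scaleFactor;   // e.g. flag decoded size > 2x view size
    private final long maxFileBytes;    // e.g. flag files over 1 MiB

    public LargeImageDetector(double scaleFactor, long maxFileBytes) {
        this.scaleFactor = scaleFactor;
        this.maxFileBytes = maxFileBytes;
    }

    public boolean isLarge(int imageW, int imageH, int viewW, int viewH, long fileBytes) {
        boolean oversizedForView = viewW > 0 && viewH > 0
                && (long) imageW * imageH > (long) ((long) viewW * viewH * scaleFactor);
        return oversizedForView || fileBytes > maxFileBytes;
    }
}
```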

Local Resource Image Monitoring

After the mergeResources task we scan the merged local image resources, list the images exceeding the size thresholds, and report them through automatic ticket creation so they can be fixed before release.
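The scan is a walk over the merged resource directory collecting image files above a size threshold. In the real build this would run as a Gradle task wired after mergeResources; the sketch below is a plain method so the rule itself is visible, with an illustrative extension list and threshold.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of a post-mergeResources scan: find image resources over a byte cap.
public class ResourceImageScanner {
    private static final Set<String> IMAGE_EXTS = Set.of("png", "jpg", "jpeg", "webp");

    public static List<Path> findOversized(Path resDir, long maxBytes) throws IOException {
        try (Stream<Path> files = Files.walk(resDir)) {
            return files.filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase();
                        int dot = name.lastIndexOf('.');
                        return dot >= 0 && IMAGE_EXTS.contains(name.substring(dot + 1));
                    })
                    .filter(p -> {
                        try { return Files.size(p) > maxBytes; }
                        catch (IOException e) { return false; }
                    })
                    .collect(Collectors.toList());
        }
    }
}
```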

Memory Size Monitoring

Beyond leaks and large images, we need a memory dashboard to understand overall app memory usage, covering startup memory (PSS), runtime memory (PSS), Java heap, threads, etc.

Startup, Runtime, and Java Memory Monitoring

High startup memory degrades the user experience. We collect Debug.MemoryInfo from all processes (using getMemoryStat on API 23+) to obtain data similar to what Android Memory Profiler shows.

We capture memory at startup completion and periodically at runtime, reporting to the backend for aggregation, where metrics such as average PSS and the Java heap top-rate (the share of samples with usage above 85%) are computed.
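The top-rate metric reduces to a ratio over sampled heap-usage ratios. A sketch of both computations; the 85% cut-off is the one quoted above, while the aggregation shape is illustrative:

```java
import java.util.Arrays;

// Sketch of the "Java heap top-rate" metric: the share of samples whose heap
// usage exceeded a threshold fraction of the VM's max heap.
public class HeapTopRate {
    // On device, usedBytes/maxBytes would come from Runtime.getRuntime():
    // totalMemory() - freeMemory(), and maxMemory().
    public static double usageRatio(long usedBytes, long maxBytes) {
        return (double) usedBytes / maxBytes;
    }

    // ratios: one heap-usage ratio per reported sample.
    public static double topRate(double[] ratios, double threshold) {
        if (ratios.length == 0) return 0.0;
        long over = Arrays.stream(ratios).filter(r -> r > threshold).count();
        return (double) over / ratios.length;
    }
}
```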

We also listen to system callbacks like onLowMemory to trigger proactive memory release.

Thread Monitoring

Out‑of‑Memory errors can also stem from thread creation failures, e.g.:

java.lang.OutOfMemoryError: {CanCatch}{main} pthread_create (1040KB stack) failed: Out of memory

Thread count limits vary by device; Huawei EMUI caps at 500 threads.

We monitor Cloud Music’s thread count; when thresholds are exceeded we report and compute a thread top‑rate.
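On the JVM side, thread-count sampling can be as simple as the sketch below. Note that Thread.getAllStackTraces() only sees Java threads; counting native threads as well requires reading the "Threads:" field from /proc/self/status on Linux. The threshold value is illustrative.

```java
// Sketch of thread-count sampling: count live Java threads and flag when a
// configured threshold is exceeded.
public class ThreadCountMonitor {
    public static int javaThreadCount() {
        return Thread.getAllStackTraces().size();
    }

    public static boolean overThreshold(int count, int threshold) {
        return count >= threshold;
    }
}
```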

We also hook native thread lifecycle methods (pthread_create, pthread_detach, pthread_join, pthread_exit) to detect thread leaks, reporting them similarly.

Conclusion

Cloud Music's memory monitoring started later than that of industry peers, but by building on existing tools and tailoring them to our scenarios we have created a continuous, fine-grained monitoring system. This is an ongoing effort aimed at helping developers discover and resolve memory issues efficiently.

References

https://github.com/square/leakcanary

https://github.com/KwaiAppTeam/KOOM

https://juejin.cn/post/7134728428003000356#heading-30

https://blog.yorek.xyz/android/paid/master/memory_2/#_1
