Sliver: A High‑Performance Android Method Trace System for ANR and Jank Diagnosis
This article presents Sliver, an Android‑focused method‑trace framework that combines low‑overhead stack sampling, thread‑suspend techniques, and lock‑information capture to reliably detect and diagnose ANR and jank issues in production environments while maintaining minimal performance impact.
Android ANR and jank problems are critical to user experience; traditional monitoring tools either instrument at method entry/exit or sample stacks, both suffering from latency, accuracy, or stability issues, especially when multiple long‑running messages accumulate.
Sliver was designed to overcome these limitations by capturing method‑trace data directly in production with high performance, low intrusion, and strong compatibility. It takes a sampling‑based approach: it locates the ART runtime’s native stack‑walking function (WalkStack) through xDL symbol lookup and calls it directly, which avoids heavy instrumentation.
The core stack‑capture workflow creates a custom StackVisitor containing only the essential fields (shadow_frame and quick_frame) and a minimal virtual VisitFrame implementation, then invokes WalkStack while the target thread is briefly suspended using ThreadList::SuspendThreadByPeer (or the safer SuspendThreadById when available). This reduces crash risk compared to signal‑based callbacks.
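The ordering of that workflow (suspend, walk with a minimal visitor, resume) can be sketched as below. This is a simulation only and does not call ART: FakeThread, FakeFrame, FrameCollector, ScopedSuspend, and SampleStack are hypothetical stand-ins invented for illustration, and the suspend counter merely models what ThreadList::SuspendThreadByPeer and the matching resume would do.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for runtime internals; these are NOT real ART types.
struct FakeFrame {
    const void* method;  // models the method pointer of one managed frame
};

struct FakeThread {
    std::vector<FakeFrame> frames;  // innermost frame first
    int suspend_count = 0;          // models the runtime's suspend bookkeeping
};

// Minimal visitor: keeps only what sampling needs (method pointers).
class FrameCollector {
public:
    explicit FrameCollector(std::size_t max_depth) : max_depth_(max_depth) {}

    // Returns false to stop the walk once the depth limit is reached.
    bool VisitFrame(const FakeFrame& frame) {
        if (methods_.size() >= max_depth_) return false;
        methods_.push_back(frame.method);
        return true;
    }

    const std::vector<const void*>& methods() const { return methods_; }

private:
    std::size_t max_depth_;
    std::vector<const void*> methods_;
};

// RAII guard so the target is resumed on every exit path.
class ScopedSuspend {
public:
    explicit ScopedSuspend(FakeThread& thread) : thread_(thread) { ++thread_.suspend_count; }
    ~ScopedSuspend() { --thread_.suspend_count; }
    ScopedSuspend(const ScopedSuspend&) = delete;
    ScopedSuspend& operator=(const ScopedSuspend&) = delete;

private:
    FakeThread& thread_;
};

// Suspend the target, walk its frames through the visitor, then resume.
std::vector<const void*> SampleStack(FakeThread& target, std::size_t max_depth) {
    ScopedSuspend guard(target);  // the thread stays suspended only for this scope
    FrameCollector collector(max_depth);
    for (const FakeFrame& frame : target.frames) {
        if (!collector.VisitFrame(frame)) break;
    }
    return collector.methods();
}
```

Keeping the suspended window limited to the walk itself is the point of the guard: any symbol resolution or formatting would happen after the thread is running again.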
To support multi‑threaded tracing, Sliver samples the main thread separately and groups secondary threads among a pool of sampler threads, suspending each thread individually rather than invoking a global SuspendAll, thereby limiting performance loss.
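One way to picture the grouping is a simple partition of secondary thread ids across the sampler pool. The round‑robin policy and the AssignToSamplers helper below are assumptions made for illustration; the article does not say how Sliver actually distributes threads.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: spreads secondary thread ids over `sampler_count`
// sampler threads in round-robin order. The main thread is assumed to be
// handled by its own dedicated sampler and is not part of `tids`.
std::vector<std::vector<std::uint32_t>> AssignToSamplers(
        const std::vector<std::uint32_t>& tids, std::size_t sampler_count) {
    std::vector<std::vector<std::uint32_t>> groups(sampler_count);
    if (sampler_count == 0) return groups;  // nothing to assign to
    for (std::size_t i = 0; i < tids.size(); ++i) {
        groups[i % sampler_count].push_back(tids[i]);
    }
    return groups;
}
```

Each sampler would then suspend, walk, and resume only the threads in its own group, one at a time, so no more than one thread per sampler is paused at any moment.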
Lock information is collected by extending the stack visitor with StackDumpVisitor logic, calling Monitor::FetchState to record lock owners and thread states, which are then attached to the trace.
Trace data is stored in a ring buffer as pairs of timestamps and event values (method pointers for entry, zero for exit). The buffer is later dumped, and method names are resolved via ArtMethod::PrettyMethod. The output can be converted to Perfetto format for visualization.
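A minimal sketch of such a buffer follows, assuming a fixed capacity, a single writer, and overwrite‑oldest behaviour. TraceEvent and TraceRing are hypothetical names, and name resolution and Perfetto conversion are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One record: a timestamp plus an event value.
// value != 0 is a method pointer (entry); value == 0 marks an exit.
struct TraceEvent {
    std::uint64_t timestamp_ns;
    std::uintptr_t value;
};

// Fixed-capacity ring that overwrites the oldest record when full.
// Not thread-safe: this sketch assumes one writer and a dump taken
// while no write is in progress.
class TraceRing {
public:
    explicit TraceRing(std::size_t capacity) : slots_(capacity) {}

    void Push(std::uint64_t timestamp_ns, std::uintptr_t value) {
        if (slots_.empty()) return;
        slots_[next_ % slots_.size()] = TraceEvent{timestamp_ns, value};
        ++next_;
    }

    // Returns the retained records, oldest first.
    std::vector<TraceEvent> Dump() const {
        std::vector<TraceEvent> out;
        if (slots_.empty()) return out;
        const std::size_t count = next_ < slots_.size() ? next_ : slots_.size();
        const std::size_t start = next_ - count;
        out.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            out.push_back(slots_[(start + i) % slots_.size()]);
        }
        return out;
    }

private:
    std::vector<TraceEvent> slots_;
    std::size_t next_ = 0;  // total number of records ever pushed
};
```

Storing raw pointers and deferring name lookup to dump time keeps the per‑event write to two integer stores, which fits the low‑overhead goal described above.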
Performance tests on high‑end and low‑end devices show average stack‑walk times of 23.6 µs and 43.2 µs respectively. At a 10 ms sampling interval, that is roughly 0.24 % and 0.43 % of each interval spent walking the sampled thread's stack (23.6 µs / 10 ms and 43.2 µs / 10 ms), which is consistent with the reported negligible (<0.5 %) overhead. Stability improvements include downgrading thread‑suspend timeouts from fatal‑level errors to warnings and protecting risky offset‑based code with signal handlers.
In production at Xigua Video, Sliver has been integrated into the ANR and jank governance pipeline, automatically dumping traces on incidents and feeding them back to developers, dramatically improving root‑cause identification and reducing investigation effort.
Future work aims to lower adaptation cost further, add native‑stack support for JNI‑heavy workloads, and explore more stable signal‑based sampling while maintaining the current low‑overhead guarantees.
Watermelon Video Tech Team
Technical practice sharing from Watermelon Video