Accelerating Stack Trace for Android Native Memory Leak Detection Using TLS and Compiler Instrumentation
By instrumenting every C++ function with GCC's -finstrument-functions and recording call addresses in per-thread TLS, the team built a fast, lock-free stack-trace mechanism that outperforms libunwind by up to 60×, integrates with Android's malloc_debug, and powers an automated native-memory-leak detection framework with web-based analysis.
Analyzing and locating C++ memory leaks on the Android platform has long been a pain point for developers. Because core features such as map rendering and navigation have high performance requirements, the AMap app contains a large amount of C++ code. Solving this problem is crucial for product quality, and through practice the team has developed a complete solution.
Analyzing and locating a memory leak depends on two things: statistics on allocation-function calls and a stack back-trace for each allocation. Knowing only the allocation point without the call stack makes the problem much harder, and vice versa.
Android’s Bionic malloc_debug module provides good monitoring and statistics for memory‑allocation functions, but efficient stack back‑trace is lacking. Existing Google‑provided back‑trace methods have several drawbacks:
They rely on libunwind, which incurs high overhead when called frequently in native-heavy code, causing UI stutter.
ROM size constraints limit daily development testing.
Command‑line or DDMS operations require manual environment preparation each time and the results are not intuitive, lacking comparative analysis.
Therefore, an efficient stack back‑trace and a systematic Android Native memory‑analysis framework are essential.
The team improved and extended both points, so that automated tests can discover issues promptly, which greatly improves development efficiency and reduces investigation cost.
Stack Trace Acceleration
Android mainly uses libunwind for stack back-trace. It satisfies most cases, but it takes global locks and parses unwind tables, so performance degrades when it is called frequently from multiple threads.
Acceleration principle
The compiler option -finstrument-functions inserts calls to custom functions at the entry and exit of every function. At function entry, __cyg_profile_func_enter is called; at exit, __cyg_profile_func_exit is called. Both hooks receive the address of the current function and the address of its call site, so the call stack can be recorded as the program runs and read back at any time.
Example of instrumentation effect (image omitted).
Functions that should not be instrumented can be marked with __attribute__((no_instrument_function)).
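As a rough stand-in for the omitted image, the sketch below writes out by hand the hook calls that the compiler would otherwise insert when -finstrument-functions is enabled. The functions `leaf` and `outer`, and the depth counters, are hypothetical names used only for illustration; the hooks here just count nesting depth rather than record addresses. In a real instrumented build the calls are generated automatically and are never written manually.

```cpp
#include <cassert>

static int g_depth = 0;      // current nesting depth (single-threaded demo only)
static int g_max_depth = 0;  // deepest nesting observed

extern "C" {
// Demo hooks: they only count nesting instead of recording addresses.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void* this_func, void* call_site) {
    (void)this_func; (void)call_site;
    if (++g_depth > g_max_depth) g_max_depth = g_depth;
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void* this_func, void* call_site) {
    (void)this_func; (void)call_site;
    --g_depth;
    assert(g_depth >= 0);  // every exit must match an earlier entry
}
}

// Hand-written stand-in for an instrumented function: hook, body, hook.
__attribute__((no_instrument_function))
int leaf(int x) {
    __cyg_profile_func_enter(reinterpret_cast<void*>(&leaf), __builtin_return_address(0));
    int result = x * 2;
    __cyg_profile_func_exit(reinterpret_cast<void*>(&leaf), __builtin_return_address(0));
    return result;
}

__attribute__((no_instrument_function))
int outer(int x) {
    __cyg_profile_func_enter(reinterpret_cast<void*>(&outer), __builtin_return_address(0));
    int result = leaf(x) + 1;
    __cyg_profile_func_exit(reinterpret_cast<void*>(&outer), __builtin_return_address(0));
    return result;
}
```

Calling `outer(3)` enters two hooks and leaves two hooks, so the depth returns to zero while the maximum depth seen is two. The hooks themselves carry no_instrument_function; without it, an instrumented build would call the hooks from inside the hooks and recurse without end.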
To record the call information across threads without locking, Thread‑Local Storage (TLS) is used. TLS provides a per‑thread storage area that does not require synchronization.
The implementation consists of the following steps:
Use -finstrument-functions to insert instrumentation code during compilation.
In TLS, store call addresses in an array with a cursor for fast insert/delete/retrieval.
Define the array‑plus‑cursor data structure:
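#include <pthread.h>
#include <stdlib.h>
#include <string.h>
typedef struct {
    void* stack[MAX_TRACE_DEEP];  // recorded call-site addresses, filled from the right end
    int current;                  // cursor: index of the next free slot
} thread_stack_t;

Initialize the TLS key for thread_stack_t: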
static pthread_key_t sBackTraceKey;
static pthread_once_t sBackTraceOnce = PTHREAD_ONCE_INIT;

// Runs at thread exit and releases that thread's thread_stack_t.
static void __attribute__((no_instrument_function)) destructor(void* ptr) {
    if (ptr) {
        free(ptr);
    }
}

// Creates the TLS key; invoked exactly once through pthread_once.
static void __attribute__((no_instrument_function)) init_once(void) {
    pthread_key_create(&sBackTraceKey, destructor);
}

Allocate and store thread_stack_t in TLS:
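// Returns the calling thread's stack record, creating it on first use.
thread_stack_t* __attribute__((no_instrument_function)) get_backtrace_info() {
    thread_stack_t* ptr = (thread_stack_t*)pthread_getspecific(sBackTraceKey);
    if (ptr) return ptr;
    ptr = (thread_stack_t*)malloc(sizeof(thread_stack_t));
    if (!ptr) return NULL;              // allocation failed; callers should check for NULL
    ptr->current = MAX_TRACE_DEEP - 1;  // empty stack: cursor starts at the right end
    pthread_setspecific(sBackTraceKey, ptr);
    return ptr;
}

Implement the instrumentation callbacks to record addresses into TLS: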
extern "C" {
// Entry hook: store the call site at the cursor, then move the cursor left.
void __attribute__((no_instrument_function)) __cyg_profile_func_enter(void* this_func, void* call_site) {
    pthread_once(&sBackTraceOnce, init_once);
    thread_stack_t* ptr = get_backtrace_info();
    if (!ptr) return;
    // Frames beyond the array capacity are not recorded.
    // The 4-byte step back from call_site is a fixed offset (it assumes a 4-byte call instruction, as on ARM).
    if (ptr->current > 0)
        ptr->stack[ptr->current--] = (void*)((char*)call_site - 4);
}

// Exit hook: move the cursor right, clamped at the empty position.
void __attribute__((no_instrument_function)) __cyg_profile_func_exit(void* this_func, void* call_site) {
    pthread_once(&sBackTraceOnce, init_once);
    thread_stack_t* ptr = get_backtrace_info();
    if (!ptr) return;
    if (++ptr->current >= MAX_TRACE_DEEP)
        ptr->current = MAX_TRACE_DEEP - 1;
}
}

The second parameter call_site of __cyg_profile_func_enter is the address of the call site, which is stored in the TLS array; the cursor moves left on entry and right on exit.
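To make the cursor movement concrete, here is a small stand-alone model of the same array-plus-cursor logic. It is only a sketch: `DemoStack`, `on_enter`, `on_exit`, `depth`, `innermost`, and the tiny capacity `kDepth` are hypothetical names chosen for the illustration, TLS is left out, and the fixed 4-byte address adjustment is omitted.

```cpp
#include <cassert>

constexpr int kDepth = 8;  // stand-in for MAX_TRACE_DEEP, kept tiny for the demo

struct DemoStack {
    void* stack[kDepth];
    int current = kDepth - 1;  // index of the next free slot; starts at the right end
};

// Mirrors the entry hook: write at the cursor, then move the cursor left.
inline void on_enter(DemoStack& s, void* call_site) {
    if (s.current > 0) s.stack[s.current--] = call_site;
}

// Mirrors the exit hook: move the cursor right, clamped at the right end.
inline void on_exit(DemoStack& s) {
    if (++s.current >= kDepth) s.current = kDepth - 1;
}

// Number of frames currently recorded.
inline int depth(const DemoStack& s) { return kDepth - 1 - s.current; }

// Most recently entered frame, or nullptr when nothing is recorded.
inline void* innermost(const DemoStack& s) {
    assert(s.current >= 0 && s.current < kDepth);
    return depth(s) > 0 ? s.stack[s.current + 1] : nullptr;
}
```

Because the array fills from right to left, the most recent frame always sits at index current + 1 and older frames follow to its right. Reading the array from current + 1 onward therefore yields the stack innermost-first without any reordering. Note that slot 0 is never written under this scheme, so the usable depth is one less than the array size.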
Provide an interface to retrieve the back‑trace from TLS:
// Copies up to max recorded frames into backtrace, most recent frame first, and returns the number copied.
int __attribute__((no_instrument_function)) get_tls_backtrace(void** backtrace, int max) {
    pthread_once(&sBackTraceOnce, init_once);
    thread_stack_t* ptr = get_backtrace_info();
    if (!ptr || !backtrace) return 0;
    int count = max;
    int depth = MAX_TRACE_DEEP - 1 - ptr->current;  // frames currently recorded
    if (depth < count) {
        count = depth;
    }
    if (count <= 0) return 0;
    memcpy(backtrace, &ptr->stack[ptr->current + 1], sizeof(void*) * count);
    return count;
}

Compile the above logic into a dynamic library, link other modules against it, and add -finstrument-functions to their compile flags. Every instrumented function then has its calls recorded in TLS, and any code can call get_tls_backtrace to obtain the current stack.
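As an illustration of how a caller might consume this interface, the sketch below records a back-trace alongside each allocation. Everything in it is hypothetical: `AllocRecord`, `record_allocation`, `kMaxFrames`, and the `backtrace_fn` alias are names invented for the example. The back-trace source is passed in as a function pointer with the same signature as get_tls_backtrace, so the sketch compiles on its own and the real function can be supplied when linking against the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

constexpr int kMaxFrames = 16;  // assumed per-record frame limit for this sketch

// Same signature as get_tls_backtrace(void** backtrace, int max).
using backtrace_fn = int (*)(void** backtrace, int max);

struct AllocRecord {
    void* ptr;                 // allocated block, or nullptr on failure
    std::size_t size;          // requested size in bytes
    int frame_count;           // number of valid entries in frames
    void* frames[kMaxFrames];  // call stack at allocation time
};

// Allocates size bytes and captures the current call stack through get_bt.
inline AllocRecord record_allocation(std::size_t size, backtrace_fn get_bt) {
    assert(get_bt != nullptr);
    AllocRecord rec{};
    rec.ptr = std::malloc(size);
    rec.size = size;
    // Only pay for the stack capture when the allocation succeeded.
    rec.frame_count = rec.ptr ? get_bt(rec.frames, kMaxFrames) : 0;
    return rec;
}
```

With the library linked in, a call such as `record_allocation(64, get_tls_backtrace)` would fill the record from the calling thread's TLS array. Since that lookup is a bounded memcpy from thread-private memory, it needs no lock even when many threads allocate at once.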
Performance comparison (Google benchmark on Huawei Honor 5S, Android 5.1)
[Chart panels, images omitted: libunwind single-thread; TLS single-thread; libunwind 10 threads; TLS 10 threads]
Charts (images omitted) show that the TLS method is about 10× faster than libunwind in single‑thread mode and 50‑60× faster with 10 threads.
Pros and Cons
Pros: Significant speed boost, meeting the need for frequent stack back‑trace.
Cons: Compiler instrumentation increases binary size; not suitable for production builds, only for memory‑test packages. This can be mitigated by CI pipelines that generate separate test libraries.
Systematic Solution
By addressing the slow stack acquisition, the team built a systematic framework to further improve native memory‑leak detection:
Reuse malloc_debug from Bionic libc for memory monitoring, hooking all allocation functions to the debug module.
Implement get_backtrace_external in malloc_debug using the TLS back‑trace method.
Establish socket communication for external programs to fetch memory data.
Develop a web front‑end to display parsed memory data, using addr2line for address resolution.
Create automated test cases that collect memory information via socket, upload results for analysis, and send evaluation emails. Alerts allow developers to inspect memory curves and call stacks directly on the web UI.
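The article does not show how the collected data is analyzed, so the following is only a hedged sketch of one common approach: keep a table of live allocations and sum the live bytes per distinct call stack, so that stacks whose totals keep growing between snapshots stand out as leak candidates. `LeakTable`, `Stack`, and the method names are hypothetical and are not taken from malloc_debug or from the team's framework.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// A call stack as raw addresses, most recent frame first.
using Stack = std::vector<std::uintptr_t>;

class LeakTable {
public:
    // Record a live allocation together with the stack that created it.
    void on_alloc(void* ptr, std::size_t size, const Stack& stack) {
        assert(ptr != nullptr);
        live_[ptr] = Entry{size, stack};
    }

    // Forget an allocation once it has been freed.
    void on_free(void* ptr) { live_.erase(ptr); }

    // Total live bytes grouped by the call stack that allocated them.
    std::map<Stack, std::size_t> bytes_by_stack() const {
        std::map<Stack, std::size_t> totals;
        for (const auto& kv : live_) totals[kv.second.stack] += kv.second.size;
        return totals;
    }

private:
    struct Entry {
        std::size_t size;
        Stack stack;
    };
    std::map<void*, Entry> live_;  // keyed by block address
};
```

Grouping by stack rather than by individual block keeps the report small. The addresses in each group can then be resolved to file and line with addr2line before being shown in the web UI.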
System effect screenshots (images omitted) demonstrate the end‑to‑end workflow.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.