Implementation of a Main Thread Lag Collection SDK for Android

This article describes the design and implementation of a non‑intrusive Android SDK that monitors main‑thread UI lag by replacing the Looper printer, sampling stack traces, aggregating data on the server, and automatically generating work orders for precise performance optimization.

JD Retail Technology

Author: Duan Yunfei, senior Android engineer at JD.com, responsible for the image framework, performance optimization, and data collection on the Android platform.

The platform team builds an open avatar platform, providing one‑stop technical solutions, with performance monitoring as a key component; this article focuses on the UI‑thread lag collection system.

Large Android projects often suffer from UI lag due to complex business logic, rapid version iteration, massive legacy code, and third‑party libraries, making it difficult to pinpoint the exact cause when the app becomes sluggish.

Typical lag factors include time‑consuming operations on the UI thread, complex or unreasonable layouts with over‑draw, abnormal memory usage causing frequent GC, and incorrect asynchronous implementations. The primary cause is time‑consuming work on the UI thread.

The desired monitoring system should be non‑intrusive, precisely locate the problematic line, and have no impact on app performance.

The overall architecture consists of four parts: (1) a main‑thread lag collection SDK, (2) a performance data reporting SDK, (3) a server that aggregates the collected data, and (4) an automatic work‑order generation component that routes issues to the responsible engineers.

2. Main Thread Lag Collection SDK Implementation

2.1 Monitoring Principle

1. The main thread has exactly one Looper, held in the static field sMainLooper, and every message handled on the main thread is dispatched from Looper.loop():

// Abridged from android.os.Looper
public static void loop() {
    ...
    for (;;) {
        ...
        Printer logging = me.mLogging;
        if (logging != null) {
            // Logged immediately before the message is handled
            logging.println(">>>>> Dispatching to " + msg.target);
        }
        msg.target.dispatchMessage(msg);
        if (logging != null) {
            // Logged immediately after the handler returns
            logging.println("<<<<< Finished to " + msg.target);
        }
    }
    ...
}

2. Replace the main‑thread Printer to intercept start and end timestamps of each message. The replacement can be done via the public API Looper.getMainLooper().setMessageLogging(printer) or via reflection if no API is exposed.

3. A lag is detected when endTime - startTime exceeds a predefined threshold.

4. A sampling thread periodically captures the main thread’s stack trace and CPU information, sleeping briefly between samples to avoid interfering with short‑duration messages.
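
Steps 2 and 3 can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the SDK's actual code: MessagePrinter is a stand-in for android.util.Printer so that the example runs off-device, and LagDetector, its injected clock, and the threshold are hypothetical names chosen for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Stand-in for android.util.Printer, which declares one method: println(String). */
interface MessagePrinter {
    void println(String line);
}

/** Hypothetical detector that times each message between the two lines the loop logs. */
final class LagDetector implements MessagePrinter {
    private final long thresholdMs;
    private final LongSupplier clockMs;                 // injected so that time can be controlled
    private final List<long[]> lagWindows = new ArrayList<>(); // each entry is {startMs, endMs}
    private long startMs = -1;                          // -1 means no message is in flight

    LagDetector(long thresholdMs, LongSupplier clockMs) {
        this.thresholdMs = thresholdMs;
        this.clockMs = clockMs;
    }

    @Override
    public void println(String line) {
        if (line.startsWith(">")) {
            // Dispatch line: a message is about to run.
            startMs = clockMs.getAsLong();
        } else if (line.startsWith("<") && startMs >= 0) {
            // Finish line: the handler has returned.
            long endMs = clockMs.getAsLong();
            if (endMs - startMs > thresholdMs) {
                lagWindows.add(new long[] {startMs, endMs});
            }
            startMs = -1;
        }
    }

    /** Returns the {startMs, endMs} windows of all messages that exceeded the threshold. */
    List<long[]> lagWindows() {
        return lagWindows;
    }
}
```

On a device, an equivalent class would implement android.util.Printer, be installed with Looper.getMainLooper().setMessageLogging(detector), and could read its time from SystemClock.uptimeMillis().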

2.2 Core Flow

Sampling thread: periodically creates lightweight objects (using a custom linked‑list object pool) to record stack traces.

Main thread: when lag is detected, extracts stack information from the sampling pool for the time window T2‑T1 and forwards it to a cache pool.

Cache pool: a memory cache with a timer that checks if data meets upload conditions and then reports it.
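
The sampling pool and the window extraction might look roughly like the sketch below. It assumes a free list of recycled sample objects and a bounded buffer. The names StackSample, SamplePool, SampleBuffer, and drainWindow are hypothetical, and the upload-side cache pool is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sample record; `next` links recycled instances into a free list. */
final class StackSample {
    long timeMs;
    String stack;
    StackSample next;
}

/** Linked-list object pool: recycled samples are pushed onto a free list and reused. */
final class SamplePool {
    private final int maxFree;
    private StackSample freeHead;
    private int freeCount;

    SamplePool(int maxFree) {
        this.maxFree = maxFree;
    }

    synchronized StackSample obtain() {
        if (freeHead == null) {
            return new StackSample();
        }
        StackSample sample = freeHead;
        freeHead = sample.next;
        sample.next = null;
        freeCount--;
        return sample;
    }

    synchronized void recycle(StackSample sample) {
        sample.stack = null;
        sample.timeMs = 0;
        if (freeCount < maxFree) {   // beyond the cap, let the garbage collector take it
            sample.next = freeHead;
            freeHead = sample;
            freeCount++;
        }
    }
}

/** Bounded buffer written by the sampling thread and drained when lag is detected. */
final class SampleBuffer {
    private final int capacity;
    private final SamplePool pool;
    private final Deque<StackSample> samples = new ArrayDeque<>();

    SampleBuffer(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
        this.pool = new SamplePool(capacity);
    }

    /** Sampling thread: snapshot the target thread's current stack. */
    void capture(Thread target, long nowMs) {
        StringBuilder text = new StringBuilder();
        for (StackTraceElement frame : target.getStackTrace()) {
            text.append(frame).append('\n');
        }
        record(nowMs, text.toString());
    }

    synchronized void record(long nowMs, String stack) {
        if (samples.size() == capacity) {
            pool.recycle(samples.pollFirst());   // drop the oldest sample
        }
        StackSample sample = pool.obtain();
        sample.timeMs = nowMs;
        sample.stack = stack;
        samples.addLast(sample);
    }

    /** Empties the buffer and returns the stacks sampled within [t1Ms, t2Ms]. */
    synchronized List<String> drainWindow(long t1Ms, long t2Ms) {
        List<String> stacks = new ArrayList<>();
        StackSample sample;
        while ((sample = samples.pollFirst()) != null) {
            if (sample.timeMs >= t1Ms && sample.timeMs <= t2Ms) {
                stacks.add(sample.stack);
            }
            pool.recycle(sample);
        }
        return stacks;
    }
}
```

Reusing sample objects keeps allocation on the sampling path low, which matters because frequent allocation can itself trigger the GC pauses the SDK is trying to measure.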

2.3 Data Processing

Data is classified into two categories:

Confirmed lag: consecutive samples have identical stack traces, indicating the function has not completed within the interval.

Suspected lag: stack traces differ between intervals.
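
A minimal sketch of this classification follows. It makes one assumption the article does not state: a lag counts as confirmed as soon as any two adjacent samples are identical, and a window with fewer than two samples is treated as suspected. LagKind and LagClassifier are hypothetical names.

```java
import java.util.List;

enum LagKind { CONFIRMED, SUSPECTED }

final class LagClassifier {
    private LagClassifier() {}

    /**
     * Classifies one lag window from its stack samples, given in time order.
     * Assumption: one identical adjacent pair is enough for CONFIRMED.
     */
    static LagKind classify(List<String> stacksInTimeOrder) {
        for (int i = 1; i < stacksInTimeOrder.size(); i++) {
            if (stacksInTimeOrder.get(i).equals(stacksInTimeOrder.get(i - 1))) {
                return LagKind.CONFIRMED;
            }
        }
        return LagKind.SUSPECTED;
    }
}
```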

Stack preprocessing includes aggregating identical stacks by a count field to reduce duplicate storage and filtering for JD‑related package names to mark key lines for server‑side aggregation.
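
The preprocessing step could be sketched as follows. The package prefix passed in is a placeholder, since the article does not list the actual JD package names, and AggregatedStack and StackAggregator are hypothetical. Frames are assumed to be in StackTraceElement.toString() form, one per line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One distinct stack, how often it was sampled, and its first app-owned frame. */
final class AggregatedStack {
    final String stack;
    final String keyLine;   // null when no frame matches the app package prefix
    int count;

    AggregatedStack(String stack, String keyLine) {
        this.stack = stack;
        this.keyLine = keyLine;
    }
}

final class StackAggregator {
    private StackAggregator() {}

    /** Collapses identical stacks into one entry each, preserving first-seen order. */
    static List<AggregatedStack> aggregate(List<String> stacks, String appPackagePrefix) {
        Map<String, AggregatedStack> byStack = new LinkedHashMap<>();
        for (String stack : stacks) {
            AggregatedStack entry = byStack.get(stack);
            if (entry == null) {
                entry = new AggregatedStack(stack, findKeyLine(stack, appPackagePrefix));
                byStack.put(stack, entry);
            }
            entry.count++;
        }
        return new ArrayList<>(byStack.values());
    }

    /** Returns the topmost frame that starts with the given package prefix, or null. */
    static String findKeyLine(String stack, String appPackagePrefix) {
        for (String frame : stack.split("\n")) {
            String trimmed = frame.trim();
            if (trimmed.startsWith(appPackagePrefix)) {
                return trimmed;
            }
        }
        return null;
    }
}
```

Sending the key line alongside the count gives the server a stable grouping key without having to parse the full stack again.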

The collection strategy can be customized by app version, build number, Android OS version, feature flags, network type, and real‑time reporting options, allowing precise targeting of specific users or scenarios.
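
To make the idea concrete, a strategy can be modeled as a set of conditions checked against the current device state. The sketch below covers only three of the listed dimensions. The field names and the convention that an empty set means "no restriction" are assumptions made for illustration.

```java
import java.util.Set;

/** Hypothetical snapshot of the device state that a strategy is checked against. */
final class DeviceContext {
    final String appVersion;
    final int osApiLevel;
    final String networkType;

    DeviceContext(String appVersion, int osApiLevel, String networkType) {
        this.appVersion = appVersion;
        this.osApiLevel = osApiLevel;
        this.networkType = networkType;
    }
}

/** Hypothetical collection rule; an empty set places no restriction on that field. */
final class CollectionStrategy {
    private final Set<String> appVersions;
    private final int minApiLevel;
    private final Set<String> networkTypes;

    CollectionStrategy(Set<String> appVersions, int minApiLevel, Set<String> networkTypes) {
        this.appVersions = appVersions;
        this.minApiLevel = minApiLevel;
        this.networkTypes = networkTypes;
    }

    /** Returns true when lag collection should be enabled for this device. */
    boolean matches(DeviceContext context) {
        return (appVersions.isEmpty() || appVersions.contains(context.appVersion))
                && context.osApiLevel >= minApiLevel
                && (networkTypes.isEmpty() || networkTypes.contains(context.networkType));
    }
}
```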

Aggregated results are visualized on the server (see image below).

2.4 Issues Encountered

1. Printer replacement conflicts: other modules (e.g., WebView) may overwrite the main‑thread Printer. The solution is to provide a hidden “backdoor” that enables the replacement only when needed by H5 developers.

2. Obtaining the current Printer: Looper does not expose a getter, so reflection is used to access the private mLogging field.

import android.os.Looper;
import android.util.Printer;
import java.lang.reflect.Field;

/**
 * Reflectively obtains the Printer currently installed on the main Looper.
 * mLogging is a private framework field, so this lookup may fail on Android
 * versions that rename it or restrict non-SDK access.
 *
 * @return the current Printer, or null if none is set or reflection fails
 */
private static Printer getMainPrinter() {
    try {
        Field loggingField = Looper.class.getDeclaredField("mLogging");
        loggingField.setAccessible(true);
        return (Printer) loggingField.get(Looper.getMainLooper());
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
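
The article does not say what the SDK does with the Printer it retrieves. One plausible use, shown here purely as an assumption and not as the approach described above, is to forward every line to the previous Printer so that both consumers keep working. PrinterLike stands in for android.util.Printer, and ForwardingPrinter is a hypothetical name.

```java
/** Stand-in for android.util.Printer, which declares one method: println(String). */
interface PrinterLike {
    void println(String line);
}

/** Delivers each line to the SDK's own printer first, then to the one it replaced. */
final class ForwardingPrinter implements PrinterLike {
    private final PrinterLike own;
    private final PrinterLike previous;   // may be null when nothing was installed before

    ForwardingPrinter(PrinterLike own, PrinterLike previous) {
        this.own = own;
        this.previous = previous;
    }

    @Override
    public void println(String line) {
        own.println(line);
        if (previous != null) {
            previous.println(line);
        }
    }
}
```

This only helps when the SDK installs its Printer last. If another module overwrites the Printer afterwards, the SDK's printer is dropped unless that module forwards as well.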

3. Summary

The lag collection SDK is a crucial component of JD's APM system, gathering millions of lag samples daily, enabling precise identification of lag sources and reasons, and, through big‑data aggregation, providing a clear view of lag trends across app versions.

However, data collection is only the first step; collaboration among QA, testing, and development teams is required to analyze the data and implement optimizations that ultimately reduce the app’s lag rate.
