MDAP: A Multi‑Dimensional Real‑Time Monitoring and Analysis Platform for Mobile Applications
MDAP is a multi‑dimensional real‑time monitoring platform for mobile apps. It gathers metrics, logs, and traces via lightweight SDKs; processes them in micro‑service back ends built on Flink, Spark, and ClickHouse; and applies intelligent analysis for smoothness scoring, memory‑snapshot optimization, stack de‑obfuscation, crash clustering, and URL templating. Planned work extends it toward end‑to‑end observability and predictive issue detection.
1. Background
As Shopee's business grows, teams need fine‑grained observability of client‑side data such as page conversion rates, user retention, CPU/memory/network usage, crashes, and ANRs (Application Not Responding events). Performance and stability directly affect conversion: Amazon reported that every additional 100 ms of latency cost about 1 % of sales, and Google found that a 0.5 s increase in page load time cut traffic by 20 %. Real‑time monitoring of mobile app performance and stability is therefore critical.
MDAP (Multiple Dimension Analysis Platform) is a multi‑dimensional real‑time monitoring platform that supports custom business metrics and provides specialized monitoring for mobile app performance data.
2. MDAP Architecture
The platform consists of observable data collection on the client side and backend analysis services. The backend follows DDD principles, is micro‑service‑oriented, and can be deployed with Helm on a K8s cluster.
MDAP Backend: processes incoming monitoring data, filters dirty data, parses uploaded files, and extracts issues such as memory leaks, duplicate objects, or oversized objects. It uses Intelligentize to generate data models and automatically tag problem categories.
DI Platform: built on Shopee's Data Infra. It uses Flink for streaming ingestion and light pre‑computation, stores the data, and aggregates it with Spark at hourly, daily, and weekly granularity to serve real‑time metric queries.
Intelligentize: a smart computation service that applies Spark ML (semi‑supervised or unsupervised) to pre‑processed data to generate models for downstream micro‑services.
Boussole: a real‑time analytics engine that pulls data, applies user‑defined metrics and dimensions, aggregates and stores the results, and feeds dashboards and alerts, which reduces repeated ClickHouse queries.
3. Metric Collection and Drill‑Down
3.1 Data Types
Metrics, Logging, and Tracing are the three core data categories. Metrics are atomic time‑series (e.g., CPU, memory, network). Logging captures discrete events such as logs, memory snapshots, and stack traces. Tracing records request/session context and call chains.
MDAP stores metrics in ClickHouse for fast time‑series queries and uses Elasticsearch for logging and tracing to enable fuzzy search and tokenization.
3.2 Terminal Collection Capabilities
SDKs are modular; developers can pick needed modules to keep SDK size and overhead minimal. The SDK also provides encryption, remote configuration, dynamic sampling, and black‑/white‑list capabilities. Compilation plugins (Android Gradle, Webpack, etc.) automate symbol upload and lifecycle instrumentation.
3.3 Data Drill‑Down and Problem Localization
Metrics can be drilled down by dimensions (version, region, device) to locate problematic segments. The UI offers three view types: multi‑dimensional metric view, single‑instance view (detailed stack, logs), and custom view for business‑specific dashboards.
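To make the drill‑down idea concrete, here is a minimal sketch that averages a metric per value of one dimension. The `Sample` struct, the `DrillDown` function, and the "(unknown)" fallback key are hypothetical names chosen for illustration; in MDAP this kind of aggregation would presumably run as a ClickHouse query rather than in application code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One metric sample tagged with its dimensions (e.g. "version", "region").
// Hypothetical structure, not MDAP's actual data model.
struct Sample {
  std::map<std::string, std::string> dimensions;
  double value;
};

// Averages `value` per distinct value of `dimension`.
// Samples that lack the dimension are grouped under "(unknown)".
std::map<std::string, double> DrillDown(const std::vector<Sample>& samples,
                                        const std::string& dimension) {
  std::map<std::string, double> sums;
  std::map<std::string, int> counts;
  for (const Sample& s : samples) {
    auto it = s.dimensions.find(dimension);
    const std::string key =
        (it != s.dimensions.end()) ? it->second : "(unknown)";
    sums[key] += s.value;
    ++counts[key];
  }
  // Turn each sum into an average.
  for (auto& entry : sums) entry.second /= counts[entry.first];
  return sums;
}
```

Calling `DrillDown(samples, "version")` and then `DrillDown(samples, "region")` on the same data is the code equivalent of switching the dimension selector in the multi‑dimensional metric view.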
4. Precise and Efficient Problem Analysis
4.1 Precise Smoothness Analysis
Traditional smoothness metrics (jank rate, FPS) are coarse. MDAP buckets frame rendering times into nine buckets (1 frame, 2 frames, … > 32 frames) and assigns weights to compute a smoothness score. Example: scenario A scores 81.8 % smoothness, scenario B 69.2 %.
When smoothness falls below a threshold, alerts are triggered and users can drill down by version or other dimensions.
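The bucket‑and‑weight scoring described above can be sketched as follows. The article does not give the exact bucket boundaries or weights, so `kBucketUpperBounds`, `kBucketWeights`, and the 60 Hz default are assumptions made purely for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed upper bounds (in vsync intervals) for the first eight buckets;
// the ninth bucket catches frames longer than 32 intervals.
constexpr std::array<int, 8> kBucketUpperBounds = {1, 2, 3, 4, 8, 16, 24, 32};

// Assumed weights: 1.0 means perfectly smooth, 0.0 means effectively frozen.
constexpr std::array<double, 9> kBucketWeights = {1.0, 0.9, 0.8, 0.7, 0.5,
                                                  0.3, 0.2, 0.1, 0.0};

// Maps a frame's render time to a bucket index in [0, 8].
std::size_t BucketIndex(double frame_ms, double vsync_ms = 1000.0 / 60.0) {
  // Number of vsync intervals the frame occupied, rounded up, at least 1.
  int intervals = static_cast<int>(std::ceil(frame_ms / vsync_ms));
  if (intervals < 1) intervals = 1;
  for (std::size_t i = 0; i < kBucketUpperBounds.size(); ++i) {
    if (intervals <= kBucketUpperBounds[i]) return i;
  }
  return kBucketUpperBounds.size();  // More than 32 intervals.
}

// Returns the weighted smoothness score as a percentage in [0, 100].
// An empty input is treated as fully smooth.
double SmoothnessScore(const std::vector<double>& frame_times_ms) {
  if (frame_times_ms.empty()) return 100.0;
  double weighted = 0.0;
  for (double t : frame_times_ms) weighted += kBucketWeights[BucketIndex(t)];
  return 100.0 * weighted / static_cast<double>(frame_times_ms.size());
}
```

Unlike a plain jank rate, this score distinguishes a frame that slips by one interval from one that freezes the screen for half a second, because the two land in buckets with different weights.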
4.2 High‑Performance Memory Snapshot Parsing
Android’s Debug.dumpHprofData pauses all threads (ScopedSuspendAll) and can cause OOM. MDAP’s SDK uses a copy‑on‑write forked process to take snapshots without blocking the main process, reducing snapshot cost from ~10 s to ~0.1 s.
```cpp
// If "direct_to_ddms" is true, the other arguments are ignored, and data is
// sent directly to DDMS.
// If "fd" is >= 0, the output will be written to that file descriptor.
// Otherwise, "filename" is used to create an output file.
void DumpHeap(const char* filename, int fd, bool direct_to_ddms) {
  CHECK(filename != nullptr);
  Thread* self = Thread::Current();
  // Need to take a heap dump while GC isn't running.
  gc::ScopedGCCriticalSection gcs(self,
                                  gc::kGcCauseHprof,
                                  gc::kCollectorTypeHprof);
  ScopedSuspendAll ssa(__FUNCTION__, true /* long suspend */);
  Hprof hprof(filename, fd, direct_to_ddms);
  hprof.Dump();
}
```
MDAP also trims HPROF files by hooking read/write to remove unnecessary sections, reducing upload size.
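The copy‑on‑write idea can be illustrated with plain POSIX `fork()`. This is a simplified sketch, not MDAP's SDK code: `g_heap` is a stand‑in for the application heap, and `SnapshotInChild`, `WaitForSnapshot`, and `ReadSnapshot` are hypothetical helpers. A real Android SDK would also have to coordinate with the runtime around the fork, which is omitted here.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Stand-in for the app's heap; a real SDK would dump the ART heap instead.
static std::vector<int> g_heap = {1, 2, 3};

// Forks a child that writes the current contents of g_heap to `path`.
// Returns the child's pid, or -1 if fork() failed.
pid_t SnapshotInChild(const std::string& path) {
  pid_t pid = fork();
  if (pid == 0) {
    // Child: sees a copy-on-write view of memory as of the fork() call.
    // Using stdio here is acceptable only because this demo is single-threaded.
    int status = 1;
    if (FILE* out = std::fopen(path.c_str(), "w")) {
      for (int v : g_heap) std::fprintf(out, "%d\n", v);
      status = (std::fclose(out) == 0) ? 0 : 1;
    }
    _exit(status);  // Skip atexit handlers inherited from the parent.
  }
  return pid;  // Parent: returns immediately and keeps running.
}

// Waits for the snapshot child; returns true if it exited successfully.
bool WaitForSnapshot(pid_t pid) {
  int status = 0;
  if (pid <= 0 || waitpid(pid, &status, 0) != pid) return false;
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Reads the snapshot back; used here only to check the result.
std::vector<int> ReadSnapshot(const std::string& path) {
  std::vector<int> values;
  if (FILE* in = std::fopen(path.c_str(), "r")) {
    int v = 0;
    while (std::fscanf(in, "%d", &v) == 1) values.push_back(v);
    std::fclose(in);
  }
  return values;
}
```

The parent can keep mutating `g_heap` right after `SnapshotInChild` returns; the child still writes the state as it was at the moment of the fork, which is why the main process is blocked only for the duration of `fork()` itself.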
4.3 Efficient Stack Restoration
Obfuscated or compressed stack traces are restored using symbol files. MDAP implements Go‑based parsers for iOS (atos) and Android symbols, caching KV‑structured symbol data to avoid repeated parsing and to support inline function resolution.
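The caching idea can be sketched as an address‑to‑symbol lookup that parses each build's symbol file only once. MDAP's parsers are written in Go; the C++ below is only an illustration, and `Symbol`, `SymbolCache`, the `loader` callback, and the use of a build ID as cache key are all assumptions. Inline function resolution is not covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <iterator>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One symbol: the address where a function starts and its readable name.
struct Symbol {
  std::uint64_t start;
  std::string name;
};

// Symbols for one build, sorted by start address once loaded.
using SymbolTable = std::vector<Symbol>;

// Caches parsed symbol tables by build ID so each file is parsed only once.
class SymbolCache {
 public:
  // `loader` is a hypothetical callback that parses a build's symbol file.
  explicit SymbolCache(std::function<SymbolTable(const std::string&)> loader)
      : loader_(std::move(loader)) {}

  // Returns the name of the function containing `address`, or "<unknown>".
  std::string Resolve(const std::string& build_id, std::uint64_t address) {
    auto it = tables_.find(build_id);
    if (it == tables_.end()) {
      // Cache miss: parse once, sort by address, and keep the table.
      SymbolTable table = loader_(build_id);
      std::sort(table.begin(), table.end(),
                [](const Symbol& a, const Symbol& b) { return a.start < b.start; });
      it = tables_.emplace(build_id, std::move(table)).first;
    }
    const SymbolTable& table = it->second;
    // First symbol starting after `address`; the one before it contains it.
    auto next = std::upper_bound(
        table.begin(), table.end(), address,
        [](std::uint64_t addr, const Symbol& s) { return addr < s.start; });
    if (next == table.begin()) return "<unknown>";
    return std::prev(next)->name;
  }

 private:
  std::function<SymbolTable(const std::string&)> loader_;
  std::unordered_map<std::string, SymbolTable> tables_;
};
```

Because crashes from the same release arrive in bursts, most lookups hit the cached table and cost only a binary search.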
4.4 Precise Problem Aggregation
MDAP aggregates crash reports by hashing stack traces, but simple hashing can split similar crashes. The platform uses a learned model (WESTSD – Weight‑based Trace Similarity Detection) that combines TF‑IDF, frame position weighting, and machine‑learning to cluster similar stacks more accurately, reducing duplicate analysis.
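To show why weighting helps where exact hashing fails, here is a simplified similarity measure that combines IDF with a position decay. It is not the WESTSD model: the weighted Jaccard formula, the `decay` parameter, and the function names are assumptions chosen to keep the sketch short, and no learned component is included.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using Trace = std::vector<std::string>;  // Frames, top of stack first.

// Smoothed inverse document frequency of each frame across a corpus.
std::map<std::string, double> ComputeIdf(const std::vector<Trace>& corpus) {
  std::map<std::string, int> doc_freq;
  for (const Trace& trace : corpus) {
    std::set<std::string> unique(trace.begin(), trace.end());
    for (const std::string& frame : unique) ++doc_freq[frame];
  }
  std::map<std::string, double> idf;
  const double n = static_cast<double>(corpus.size());
  for (const auto& entry : doc_freq) {
    idf[entry.first] = std::log((1.0 + n) / (1.0 + entry.second)) + 1.0;
  }
  return idf;
}

// Weight of each frame: its IDF damped by distance from the stack top.
std::map<std::string, double> FrameWeights(
    const Trace& trace, const std::map<std::string, double>& idf, double decay) {
  std::map<std::string, double> weights;
  for (std::size_t i = 0; i < trace.size(); ++i) {
    auto it = idf.find(trace[i]);
    const double base = (it != idf.end()) ? it->second : 1.0;
    const double w = base * std::exp(-decay * static_cast<double>(i));
    weights[trace[i]] = std::max(weights[trace[i]], w);
  }
  return weights;
}

// Weighted Jaccard similarity in [0, 1]; `decay` is an assumed tuning knob.
double Similarity(const Trace& a, const Trace& b,
                  const std::map<std::string, double>& idf, double decay = 0.3) {
  const auto wa = FrameWeights(a, idf, decay);
  const auto wb = FrameWeights(b, idf, decay);
  double shared = 0.0, total = 0.0;
  for (const auto& e : wa) {
    auto it = wb.find(e.first);
    const double other = (it != wb.end()) ? it->second : 0.0;
    shared += std::min(e.second, other);
    total += std::max(e.second, other);
  }
  for (const auto& e : wb) {
    if (wa.find(e.first) == wa.end()) total += e.second;
  }
  return total > 0.0 ? shared / total : 1.0;
}
```

Two traces that share the crashing top frame but differ in one intermediate frame hash differently, yet score well above traces that share only common framework frames such as the message loop. Clustering would then group traces whose similarity exceeds a chosen threshold.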
4.5 Automatic URL Template Normalization
MDAP normalizes URLs by separating path and query, computing term frequencies, and applying Gaussian‑based entropy minimization. Low‑frequency tokens are replaced with placeholders using a Markov‑chain‑driven pruning step. Example: /something/aGVsbG8K becomes /something/*. This reduces thousands of distinct URLs to a few hundred templates, enabling clearer API‑level performance analysis.
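The effect of templating can be reproduced with a much simpler rule than the one described above. The sketch below swaps the entropy and Markov‑chain steps for a plain frequency threshold per path depth; `UrlTemplater`, `PathSegments`, and `min_count` are hypothetical names, and the query string is simply dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Splits the path part of a URL (query string dropped) into segments.
std::vector<std::string> PathSegments(const std::string& url) {
  const std::string path = url.substr(0, url.find('?'));
  std::vector<std::string> segments;
  std::stringstream stream(path);
  std::string segment;
  while (std::getline(stream, segment, '/')) {
    if (!segment.empty()) segments.push_back(segment);
  }
  return segments;
}

// Frequency-threshold stand-in for the entropy-based approach in the text.
class UrlTemplater {
 public:
  explicit UrlTemplater(int min_count) : min_count_(min_count) {}

  // Records how often each token appears at each path depth.
  void Observe(const std::string& url) {
    const auto segments = PathSegments(url);
    for (std::size_t depth = 0; depth < segments.size(); ++depth) {
      ++counts_[std::make_pair(depth, segments[depth])];
    }
  }

  // Replaces tokens seen fewer than min_count times at their depth with "*".
  std::string Normalize(const std::string& url) const {
    const auto segments = PathSegments(url);
    std::string result;
    for (std::size_t depth = 0; depth < segments.size(); ++depth) {
      const auto it = counts_.find(std::make_pair(depth, segments[depth]));
      const int seen = (it != counts_.end()) ? it->second : 0;
      result += "/";
      result += (seen >= min_count_) ? segments[depth] : "*";
    }
    return result.empty() ? "/" : result;
  }

 private:
  int min_count_;
  std::map<std::pair<std::size_t, std::string>, int> counts_;
};
```

After observing a few requests under `/something/`, the stable token `something` passes the threshold while each one‑off identifier does not, so every such URL collapses to `/something/*`. A fixed threshold is brittle on low‑traffic endpoints, which is presumably one reason the platform uses a statistical criterion instead.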
5. Future Plans
MDAP will extend end‑to‑end data linkage between front‑end and back‑end, integrate with backend monitoring and logging, and provide a unified observability view.
It will also add intelligent analysis capabilities by correlating development‑stage data (code changes, test cases) with runtime observations to enable proactive issue prediction and root‑cause analysis.
References
Amazon latency study, Gartner APM definition, academic papers on crash clustering (ReBucket, TraceSim, DURFEX), URL normalization research, and datasets (Netbeans, Eclipse).
Shopee Tech Team