Long‑Term Client Crash Governance Mechanism at Qunar: Architecture, Detection, and Resolution Strategies
This article describes Qunar's systematic client crash governance framework, covering background challenges, APM‑based fast problem discovery, multi‑level alerting, common‑issue remediation, code‑level fixes for URL and Bundle size crashes, detection tools, code checks, automated testing, and the measurable improvements achieved in Android and iOS stability.
Author Introduction
Jiang Baogui has been an architect in Qunar's large‑frontend team since 2011, focusing on mobile quality, monitoring system design, and framework upgrades.
Preface
Client crash governance is a systematic practice within infrastructure observability. Its goal is to quickly detect, alert on, and locate client crashes, freezes, and errors, improving user experience and service availability.
Background
Qunar is an online travel platform with dozens of business lines and frequent resource‑package releases (70‑80 per day). Rapid feature iteration, mixed tech stacks, and diverse user network conditions make crash reduction critical.
Solution Overview
During the pandemic, the framework team built a long‑term client quality assurance mechanism with three pillars: rapid problem discovery, precise alerting, and systematic remediation.
1. Fast Problem Discovery
APM collects millions of logs daily; the system extracts key stack information, aggregates by BugId, and classifies new vs. known issues.
Different exception types (Android native, SO libraries, iOS, ReactNative) are de‑obfuscated and symbolized for clear localization.
APM integrates with the MPortal build‑pack platform to map libraries, SO resources, iOS pages, and owners, enabling accurate business‑line alarm contacts.
Noise is reduced by filtering out crashes that match dynamic hook/Xposed keywords.
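The aggregation step above can be illustrated with a minimal sketch. The article does not show how BugId is computed, so the class name `BugIdAggregator`, the SHA‑256 hash, the line-number stripping, and the in-memory set of known ids are all assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: derives a stable BugId from the key parts of a crash. */
final class BugIdAggregator {
    // BugIds seen so far; a real system would persist these (assumption).
    private final Set<String> knownBugIds = new HashSet<>();

    /** Hashes lib, exception type, and key stack frames into a 16-character hex id. */
    static String bugId(String lib, String exceptionType, List<String> keyFrames) {
        StringBuilder key = new StringBuilder(lib).append('|').append(exceptionType);
        for (String frame : keyFrames) {
            // Strip ":<line>)" so the same bug keeps one id across builds (assumption).
            key.append('|').append(frame.replaceAll(":\\d+\\)", ")"));
        }
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(key.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) { // keep the first 8 bytes as a short id
                hex.append(String.format("%02x", hash[i]));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /** Returns true the first time a BugId is seen, false for a known issue. */
    boolean isNew(String bugId) {
        return knownBugIds.add(bugId);
    }
}
```

With this sketch, two reports of the same OkHttp frame that differ only in line number map to one BugId, so the second report is classified as a known issue.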
APM Architecture
Fine‑grained monitoring extracts lib, exception type, and key stack to generate BugId; if the BugId is new and exceeds a user‑impact threshold, alerts are sent via QTalk or phone.
Coarse‑grained monitoring (Watcher) tracks total crash volume; if today's impact exceeds 150% of the 7‑day average, a warning is issued.
Daily, bi‑weekly, and dashboard reports surface top new crashes for each business line.
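The coarse‑grained Watcher rule described above reduces to a simple comparison. The following is a sketch under stated assumptions: the class and method names are hypothetical, and "impact" is treated as a plain daily count because the article does not say whether it means affected users or crash events.

```java
/** Hypothetical sketch of the coarse-grained Watcher rule. */
final class CrashVolumeWatcher {
    // 150% of the 7-day average, the threshold quoted in the text.
    static final double WARN_RATIO = 1.5;

    /** Returns true when today's impact exceeds 150% of the average of the previous days. */
    static boolean shouldWarn(long todayImpact, long[] previousDays) {
        if (previousDays.length == 0) {
            return false; // no baseline yet, so nothing to compare against (assumption)
        }
        long sum = 0;
        for (long day : previousDays) {
            sum += day;
        }
        double average = (double) sum / previousDays.length;
        return todayImpact > average * WARN_RATIO;
    }
}
```

A relative threshold of this kind adapts to each app's normal crash volume, which a fixed absolute limit would not.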
2. General Issue Resolution
Patch delivery for hot‑reload capable frameworks (e.g., ReactNative).
Service‑interface compatibility for API changes (high maintenance cost).
Forced upgrades for critical bugs.
Version‑upgrade policy: forced upgrade for versions older than two years, optional upgrade for versions one to two years old, and minimal prompts for recent releases.
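The version‑upgrade policy can be expressed as a small decision function. This is an illustrative sketch only: the names `UpgradePolicy` and `Action` are hypothetical, and treating exactly two years (or exactly one year) as the older tier is an assumption, since the article does not define the boundaries.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical sketch of the age-based upgrade policy. */
final class UpgradePolicy {
    enum Action { FORCE_UPGRADE, OPTIONAL_UPGRADE, MINIMAL_PROMPT }

    /** Maps the age of the installed version to an upgrade action. */
    static Action actionFor(LocalDate versionReleaseDate, LocalDate today) {
        // Number of complete years between the release date and today.
        long years = ChronoUnit.YEARS.between(versionReleaseDate, today);
        if (years >= 2) {
            return Action.FORCE_UPGRADE;    // two years or older
        }
        if (years >= 1) {
            return Action.OPTIONAL_UPGRADE; // between one and two years old
        }
        return Action.MINIMAL_PROMPT;       // recent release
    }
}
```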
3. Specific Technical Fixes
URL format exception
Problem: malformed URLs (e.g., missing scheme) cause IllegalArgumentException in OkHttp.
Fatal Exception: java.lang.IllegalArgumentException Expected URL scheme 'http' or 'https' but no colon was found
okhttp3.HttpUrl$Builder.parse$okhttp (HttpUrl.kt:1260)
...
Solution: use Android Transform + ASM to inject URL validation into OkHttp's Builder.url() methods.
public Builder url(String str) {
    String url = HttpUtils.checkNullUrl(str); // injected: null check
    if (url == null) {
        throw new NullPointerException("url == null");
    }
    // existing scheme handling
    return url(HttpUrl.get(HttpUtils.checkUrl(url))); // injected: format check
}
The HttpUtils utility class returns a placeholder 404 URL when validation fails.
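The article does not show the utility class itself, so the following is only a sketch of what it might look like. It uses `java.net.URI` from the standard library rather than OkHttp so that it stands alone; the placeholder URL, the empty-string handling, and the validation rules are assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical sketch of the validation helpers called by the injected code. */
final class HttpUtils {
    // Assumed placeholder; ".invalid" is a reserved TLD that never resolves.
    static final String PLACEHOLDER_URL = "https://placeholder.invalid/404";

    /** Replaces a null or blank URL with the placeholder. */
    static String checkNullUrl(String url) {
        return (url == null || url.trim().isEmpty()) ? PLACEHOLDER_URL : url;
    }

    /** Returns the URL if it parses as http(s) with a host, otherwise the placeholder. */
    static String checkUrl(String url) {
        if (url == null) {
            return PLACEHOLDER_URL;
        }
        String trimmed = url.trim();
        try {
            URI uri = new URI(trimmed);
            String scheme = uri.getScheme();
            boolean httpScheme = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            if (httpScheme && uri.getHost() != null) {
                return trimmed;
            }
        } catch (URISyntaxException e) {
            // Malformed input: fall through to the placeholder instead of crashing.
        }
        return PLACEHOLDER_URL;
    }
}
```

Returning a placeholder turns a fatal exception into an ordinary failed request, which the caller's existing error handling can absorb. Note that `java.net.URI` and OkHttp's own parser do not accept exactly the same inputs, so a production check would more likely reuse OkHttp's parsing.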
Bundle size crash (TransactionTooLargeException)
Problem: large serialized Bundle data (>100 KB) during onSaveInstanceState leads to crashes.
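Diagnosis: log the serialized size of the whole Bundle and, when it exceeds 100 KB, the size of each entry, to find the data responsible.
// Logs the total serialized size of the Bundle and, above 100 KB, the size of each entry.
void recodeBundleSize(String activityName, Bundle bundle) {
    Bundle copyBundle = bundle.deepCopy(); // deepCopy() requires API 26+
    int totalSize = getParcelSize(copyBundle);
    Log.d("BundleSize", activityName + " totalSize:" + totalSize);
    if (totalSize > 100 * 1024) {
        for (String itemKey : copyBundle.keySet()) {
            int itemSize = getParcelSize(copyBundle.get(itemKey));
            // Include the key so the oversized entry can be identified from the log.
            Log.d("BundleSize", activityName + " " + itemKey + " itemSize:" + itemSize);
        }
    }
}

// Writes the value into a temporary Parcel and returns the number of bytes written.
int getParcelSize(Object data) {
    Parcel deepData = Parcel.obtain();
    try {
        deepData.writeValue(data);
        return deepData.dataPosition();
    } finally {
        deepData.recycle(); // always return the Parcel to the pool
    }
}
Analysis showed that state saved by deep View hierarchies caused the crash; the affected pages were optimized, reducing this crash to single‑digit occurrences.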
4. Detection Tools
Integrated LeakCanary for JS memory leaks, runtime warnings for URL misconfiguration, and custom alerts for missing listeners.
5. Code Checks
Adopted SwiftLint, Sonar, ESLint, etc., to block commits with high‑severity issues; enforced pre‑release gating and gray‑release validation.
6. Automated Testing
Connected build pipelines to the TARS‑UI automated test system for end‑to‑end verification of main flows.
Results
Android crash rate dropped from 0.15% to ~0.02%; iOS from 0.1% to <0.02%, outperforming industry averages. The long‑term mechanism now provides rapid detection, standardized remediation, and continuous quality assurance.
Future Outlook
Continue to enhance observability, AI‑driven root‑cause analysis, and a knowledge base that auto‑suggests solutions for recurring issues, further improving development efficiency and user experience.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.