
Design and Practice of OPPO Big Data Diagnostic Platform

This article presents the background, technical architecture, feature set, workflow, and practical results of OPPO's big data diagnostic platform, illustrating how intelligent, non‑intrusive task analysis improves efficiency, stability, and cost across massive offline and real‑time workloads.


In the context of massive data volumes and numerous tasks in the big data industry, intelligent and automated abnormal task identification and analysis are crucial; this article shares the design and practice of OPPO's big data diagnostic platform.

Background: OPPO's big data environment spans more than 100 million data records, over 20 system components, millions of offline tasks, thousands of real‑time tasks, and more than 1,000 analysts and developers. This scale brings challenges such as uneven developer skill levels, long task chains, zombie tasks, and complex operations.

Industry comparison: Existing open‑source tools such as Dr. Elephant provide some diagnostics but suffer from limited scheduler compatibility, sparse Spark metrics, and stability issues, which prompted OPPO to build its own platform.

Technical solution – Platform features: The platform performs non‑intrusive, real‑time diagnosis without modifying existing schedulers; supports OPPO's proprietary scheduler as well as open‑source schedulers (DolphinScheduler, Airflow); handles multiple versions of Flink, Hadoop, and Spark; covers over 40 anomaly types; and allows rules and thresholds to be customized.
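To make the "customizable rules and thresholds" idea concrete, here is a minimal sketch of how such rule configuration could look. All names (`Rule`, `RULES`, `evaluate`) and the default threshold values are illustrative assumptions, not Compass's actual API.

```python
# Hypothetical sketch of customizable diagnostic rules; not the platform's real interface.
from dataclasses import dataclass

@dataclass
class Rule:
    name: str          # anomaly type, e.g. "cpu_waste"
    metric: str        # metric the rule inspects
    threshold: float   # user-tunable trigger value

# Default thresholds a team could override per workspace or per task
RULES = [
    Rule("cpu_waste", "cpu_utilization", 0.2),          # flag if utilization < 20%
    Rule("data_skew", "max_over_median_ratio", 5.0),    # slowest task runs 5x the median
]

def evaluate(metrics: dict) -> list[str]:
    """Return the names of rules whose thresholds are violated."""
    findings = []
    for rule in RULES:
        value = metrics.get(rule.metric)
        if value is None:
            continue
        # "cpu_waste" fires below its threshold; "data_skew" fires above it
        if rule.name == "cpu_waste" and value < rule.threshold:
            findings.append(rule.name)
        elif rule.name == "data_skew" and value > rule.threshold:
            findings.append(rule.name)
    return findings
```

Keeping thresholds in data rather than code is what allows users to tune the platform's 40+ anomaly types without redeploying the diagnostic engine.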

System architecture: It follows a three‑layer design: an external system adaptation layer that collects metrics from Yarn, schedulers, compute engines and clusters; a diagnostic layer handling data collection, metadata linking, model standardization, anomaly detection and a portal; and a common foundation component layer.

Process stages: First, metadata from workflow systems and engine metrics are collected; then the data are linked into a standardized model; finally, a knowledge base and heuristic rules are applied to detect anomalies, combined with cluster and runtime status to produce diagnostic results.
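The three stages above can be sketched as a small pipeline. Every function and field name here (`collect`, `link`, `detect`, `task_id`, `runtime_s`) is an assumption for illustration; the real platform's model and rules are richer.

```python
# Illustrative sketch of the collect -> link -> detect stages; names are assumed.

def collect(workflow_meta: list[dict], engine_metrics: list[dict]) -> tuple:
    """Stage 1: gather scheduler metadata and compute-engine metrics."""
    return workflow_meta, engine_metrics

def link(workflow_meta: list[dict], engine_metrics: list[dict]) -> list[dict]:
    """Stage 2: join both sources on a shared task id into one standardized model."""
    by_id = {m["task_id"]: m for m in engine_metrics}
    return [{**task, **by_id.get(task["task_id"], {})} for task in workflow_meta]

def detect(tasks: list[dict]) -> list[tuple]:
    """Stage 3: apply heuristic rules from the knowledge base to each task."""
    findings = []
    for t in tasks:
        if t.get("runtime_s", 0) > 3600:    # example rule: task runs over an hour
            findings.append((t["task_id"], "long_tail"))
        if t.get("peak_mem_mb", 0) > t.get("mem_limit_mb", float("inf")):
            findings.append((t["task_id"], "oom_risk"))
    return findings
```

In the real system the detection stage is also combined with cluster and runtime status before producing the final diagnostic result.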

Practice effects – Interactive design: A unified, concise web UI presents task issues at a glance and offers guidance suggestions. The platform provides rich diagnostic types: efficiency analysis (long‑tail tasks, HDFS stalls, speculative execution), stability analysis (full‑table scans, data skew, shuffle failures, OOM), real‑time analysis (empty runs, insufficient parallelism, back‑pressure), and cost analysis (CPU/memory waste, long‑term failures).
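As a flavor of the stability checks listed above, a data‑skew heuristic can be as simple as comparing the slowest task in a stage to the median. This is a minimal sketch; the 5x ratio is an assumed, user‑tunable threshold, not the platform's documented default.

```python
# Minimal data-skew heuristic sketch; the 5x ratio is an assumed default.

def is_skewed(task_durations_ms: list[float], ratio: float = 5.0) -> bool:
    """Flag a stage as skewed when its slowest task runs `ratio`x the median task."""
    if not task_durations_ms:
        return False
    ordered = sorted(task_durations_ms)
    median = ordered[len(ordered) // 2]
    return median > 0 and max(ordered) / median >= ratio
```

A skewed stage typically points at a hot key or an uneven partitioning scheme, which is why the platform pairs detection with mitigation advice.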

Representative cases: An efficiency case shows long‑tail task analysis with recommendations for data skew or slow reads; a cost case demonstrates CPU‑waste detection with threshold‑based alerts; a stability case illustrates data‑skew detection and mitigation advice; and a real‑time case highlights Flink parameter‑waste analysis with optimization suggestions.
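The cost case's threshold‑based CPU‑waste alert can be sketched as below. The function names and the 80% waste threshold are illustrative assumptions rather than the platform's actual implementation.

```python
# Hedged sketch of a threshold-based CPU-waste alert; names and defaults are assumed.

def cpu_waste_ratio(requested_vcore_s: float, used_vcore_s: float) -> float:
    """Fraction of requested CPU time (vcore-seconds) that went unused."""
    if requested_vcore_s <= 0:
        return 0.0
    return max(0.0, 1.0 - used_vcore_s / requested_vcore_s)

def should_alert(requested_vcore_s: float, used_vcore_s: float,
                 waste_threshold: float = 0.8) -> bool:
    """Alert when more than `waste_threshold` of the requested CPU is wasted."""
    return cpu_waste_ratio(requested_vcore_s, used_vcore_s) > waste_threshold
```

Wiring such a check to scheduler metadata lets the platform recommend right‑sizing resource requests, which is where most of the cost savings come from.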

Summary and planning: The platform enables intelligent diagnosis for scheduling and compute engines, helping users quickly locate and optimize tasks, thereby reducing costs and improving efficiency. Future work includes incorporating data‑mining algorithms to broaden detection, supporting additional engines, and expanding the knowledge base.

Open‑source invitation: The project, named “Compass,” is open‑sourced on GitHub, supporting DolphinScheduler, Airflow, multiple Spark/Hadoop versions, 14 anomaly types, and customizable rule/threshold configurations, inviting community participation.

Tags: Big Data, data platform, diagnostics, performance analysis, OPPO, task optimization
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
