
OPPO Real-Time Computing Platform Architecture and Practices

This article details OPPO’s real-time computing platform architecture, covering its background, open‑source and self‑developed components, job lifecycle, SQL IDE, diagnostic and monitoring mechanisms, SLA guarantees, practical applications such as real‑time warehousing and dashboards, and future plans for lakehouse integration and cloud‑native deployment.

DataFunTalk

OPPO, a top‑3 smartphone manufacturer, runs a massive big‑data platform that stores over 600 PB of data and processes daily increments of billions of rows, leveraging a combination of open‑source technologies (Flink, Spark, Trino, Yarn) and self‑developed ingestion, real‑time, batch, interactive analysis, and data‑quality systems.

The real‑time computing platform is built on Flink and supports both SQL and JAR jobs. Its architecture includes an interactive IDE, Data API, Open API, Job Gateway for compilation and submission to Yarn or K8s clusters, Backend for monitoring, MetaData for job metadata, and an intelligent monitoring service, all designed for high availability and multi‑version Flink support.

Job development follows a lifecycle: developers write SQL or JAR jobs in the IDE, which validates syntax and permissions via the API, compiles the job in the Gateway, and submits it to the chosen cluster. The Backend periodically reconciles job status between metadata and cluster state, handling restarts and automatic recovery.
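The periodic reconciliation step can be sketched as a small loop that compares the expected state recorded in metadata with the state observed on the cluster and triggers recovery when they diverge. This is a minimal illustration, not the platform's actual code; the state names and the `restart_fn` callback are assumptions.

```python
from enum import Enum

# Hypothetical state model; the real platform tracks more states than these.
class JobState(Enum):
    RUNNING = "RUNNING"
    FAILED = "FAILED"
    STOPPED = "STOPPED"

def reconcile(meta_jobs, cluster_states, restart_fn):
    """Compare the expected state in job metadata with the observed cluster
    state; recover jobs that should be running but are not."""
    recovered = []
    for job_id, expected in meta_jobs.items():
        observed = cluster_states.get(job_id, JobState.STOPPED)
        if expected is JobState.RUNNING and observed is not JobState.RUNNING:
            restart_fn(job_id)  # e.g. resubmit through the Job Gateway
            recovered.append(job_id)
    return recovered
```

A job marked RUNNING in metadata but FAILED (or missing) on the cluster is resubmitted; jobs intentionally stopped by the user are left alone.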

The SQL IDE provides a one‑stop interface showing job metadata, a code editor with formatting and auto‑completion, parameter configuration, version management, and a debugging console. Currently over 3,000 jobs run on the platform, with more than 80% developed in SQL.

A dedicated job‑diagnosis system collects metrics and logs throughout the job lifecycle, analyses failures, presents readable feedback and tuning suggestions, and stores analysis results in a database and Elasticsearch for traceability, integrating with alert callbacks from the intelligent monitoring platform.
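The core of such a diagnosis step can be approximated as pattern matching over failure logs that maps known error signatures to tuning suggestions. The rule table below is a hypothetical sketch (the real system maintains its own rules); the Flink configuration keys mentioned in the advice strings are standard ones.

```python
import re

# Hypothetical rule table mapping known log signatures to advice.
DIAGNOSIS_RULES = [
    (re.compile(r"OutOfMemoryError"),
     "Increase taskmanager.memory.process.size."),
    (re.compile(r"NoResourceAvailableException"),
     "Queue lacks free slots; lower parallelism or request more quota."),
    (re.compile(r"checkpoint.*(expired|timed out)", re.I),
     "Checkpoint timeout; raise the checkpoint interval or timeout."),
]

def diagnose(log_text):
    """Return human-readable tuning suggestions for patterns found in a job log."""
    hits = [advice for pattern, advice in DIAGNOSIS_RULES
            if pattern.search(log_text)]
    return hits or ["No known failure pattern matched; inspect the full log."]
```

Matched results, like the real system's analyses, would then be persisted (e.g. to a database and Elasticsearch) so that a job's failure history stays traceable.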

Link monitoring tracks latency across the OBUS → Kafka → Flink pipeline, recording timestamps at each stage and exposing a custom metric in Flink. Alerts can be configured for latency spikes, and the collected data feeds a real‑time SLA report that compares business‑defined tolerance with observed delays.
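Given timestamps recorded at each stage, per-hop latency and the SLA check reduce to simple arithmetic. The sketch below assumes each event carries a millisecond timestamp per stage (stage names here are illustrative):

```python
def stage_latencies(timestamps_ms):
    """Per-hop latency from timestamps recorded at each pipeline stage,
    e.g. OBUS -> Kafka -> Flink. Relies on dict insertion order (Py 3.7+)."""
    stages = list(timestamps_ms.items())
    return {f"{a}->{b}": tb - ta
            for (a, ta), (b, tb) in zip(stages, stages[1:])}

def sla_met(timestamps_ms, tolerance_ms):
    """Compare end-to-end delay against the business-defined tolerance."""
    values = list(timestamps_ms.values())
    return (values[-1] - values[0]) <= tolerance_ms
```

In practice the end-to-end delay would be exposed as a custom Flink metric and aggregated into the SLA report, with alerts firing when the tolerance is exceeded.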

Practical applications include a real‑time data warehouse (ODS → DWD → aggregation) and real‑time dashboards for e‑commerce events. The classic pipeline uses Canal → Kafka → Flink, while a newer, shorter Flink CDC pipeline reads MySQL binlogs directly, chosen based on maturity, latency, and data volume considerations.
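In the classic Canal → Kafka → Flink pipeline, the Flink job's first step is flattening change events into rows. The sketch below assumes a simplified Canal-style JSON envelope (`type`, `table`, `data`); real Canal messages carry additional fields such as `database`, `ts`, and `old`:

```python
import json

def binlog_to_rows(message: str):
    """Flatten a simplified Canal-style JSON change event into
    (operation, table, row) tuples, the shape an ODS -> DWD job
    might consume from Kafka."""
    event = json.loads(message)
    op, table = event["type"], event["table"]
    return [(op, table, row) for row in event.get("data", [])]
```

The Flink CDC route removes this hop entirely by reading MySQL binlogs directly, at the cost of a younger ecosystem, which is why the article frames the choice around maturity, latency, and data volume.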

Future plans focus on lakehouse integration using Iceberg for near‑real‑time warehousing and on cloud‑native support with K8s scheduling, aiming for elastic scaling and resource sharing across large internal clusters.

The Q&A section clarifies metadata management (MySQL vs. the Flink Hive Catalog), schema evolution for Kafka tables, join strategies, and the ongoing effort to enable per‑job SQL submissions on K8s.

Tags: Cloud Native, Big Data, Flink, Platform Architecture, Real-Time Computing, Job Monitoring
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
