
Accelerating Data Production and Consumption in Baidu's Performance Platform

Baidu's Performance Platform accelerates data production and consumption through a unified stream-batch architecture built on TM and Spark, the Turing data warehouse, tiered service grading, data and compliance governance, and self-service analytics. Together these cut end-to-end latency from minutes or days to milliseconds while handling hundreds of billions of records per day, and improved SLA adherence, data accuracy, and user satisfaction.

Baidu Geek Talk

The Performance Platform, operated by the Performance Middle‑Platform team, is a one‑stop solution for APP performance tracking. It provides comprehensive, professional, and real‑time performance analysis services and toolchains for Baidu’s mobile products, covering data management, ingestion, transmission, and application.

Data acceleration is crucial for report building, decision analysis, and conversion strategy effectiveness. This article introduces practical acceleration methods for both data production and consumption, presenting five optimization categories and 18 specific techniques.

Key Terminology

• Turing: Next-generation data-warehouse platform with improved real-time computing, storage, query engine, and resource scheduling.

• UDW (Universal Data Warehouse): Baidu's early data warehouse, providing unified, high-quality user-behavior data.

• TM: Distributed workflow-oriented computation system offering high reliability, high throughput, and near-real-time stream processing.

• Yimai: Self-service analytics tool integrating multiple data sources for analysts, PMs, operations, and RDs.

• AFS: Baidu's append-only distributed file system, used for offline computing, AI training, and data backup.

Business Overview

The platform covers most of Baidu’s key products, including Baidu APP, Mini‑Programs, Matrix APP, etc., handling data at the scale of hundreds of billions of records per day, with end‑to‑end ingestion latency measured in milliseconds and serving over 600 million users.

Challenges

1. Data-technology infrastructure lag: outdated storage, slow query engines, lack of real-time compute, and weak resource elasticity.
2. Absence of service grading: no clear distinction between core and non-core data, leading to SLA violations.
3. Redundant or duplicate data reporting, causing resource waste and inflated data volumes.
4. Compliance difficulties: unified data export, multi-system reporting, and legal requirements for data security.
5. Low user satisfaction due to high latency (minutes to days) and slow delivery of data-driven features.

Optimization Paths

3.1 New vs. Old Infrastructure: Adopt a unified stream-batch approach, using TM for real-time parsing and Spark for dynamic indexing, replacing the old static QE imports into UDW. Nested fields are flattened early, building large wide tables in Turing and eliminating intermediate tables.
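The early-flattening step can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code; the field names and the dotted-key convention are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Flatten turns nested JSON objects into a single-level "wide" record,
// joining nested keys with a dot, so one large flat table can be built
// early in the pipeline instead of per-query intermediate tables.
func Flatten(prefix string, in map[string]interface{}, out map[string]interface{}) {
	for k, v := range in {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		if child, ok := v.(map[string]interface{}); ok {
			Flatten(key, child, out) // recurse into nested objects
		} else {
			out[key] = v // leaf value lands in the wide record
		}
	}
}

func main() {
	// Hypothetical performance log record with nested fields.
	raw := []byte(`{"app":"baidu","perf":{"fcp_ms":123,"net":{"rtt_ms":40}}}`)
	var rec map[string]interface{}
	json.Unmarshal(raw, &rec)

	wide := map[string]interface{}{}
	Flatten("", rec, wide)
	fmt.Println(wide["perf.net.rtt_ms"]) // 40 (JSON numbers decode as float64)
}
```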

3.2 Data Service Grading: Introduce a tiered service model to improve efficiency, quality, and resource allocation.
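A tiered model of this kind typically maps each data stream to a grade with an SLA target and a resource guarantee. The tiers, SLA values, and stream names below are hypothetical; the article does not publish the platform's actual grading table.

```go
package main

import (
	"fmt"
	"time"
)

// Tier expresses a service grade: core streams get tighter latency SLAs
// and guaranteed resources; non-core streams can be degraded first.
type Tier struct {
	Name       string
	LatencySLA time.Duration // end-to-end ingestion target
	Guaranteed bool          // reserved compute/storage quota
}

// Illustrative grading table (values are assumptions).
var tiers = map[string]Tier{
	"P0": {Name: "core", LatencySLA: 500 * time.Millisecond, Guaranteed: true},
	"P1": {Name: "important", LatencySLA: 5 * time.Minute, Guaranteed: true},
	"P2": {Name: "non-core", LatencySLA: 24 * time.Hour, Guaranteed: false},
}

// GradeOf returns the tier for a data stream, defaulting to non-core so
// ungraded streams never compete with core SLAs.
func GradeOf(stream string) Tier {
	grades := map[string]string{
		"crash_events": "P0", // hypothetical stream-to-grade mapping
		"startup_perf": "P0",
		"debug_logs":   "P2",
	}
	if g, ok := grades[stream]; ok {
		return tiers[g]
	}
	return tiers["P2"]
}

func main() {
	fmt.Println(GradeOf("crash_events").Name) // core
}
```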

3.3 Data Governance: Implement multi-dimensional clustering, heatmaps, log call-chain analysis, and other tools for rapid issue localization.

3.4 Compliance Governance: Align with the Data Security Law, enforce data classification, and adopt API-based data access to improve both compliance and performance.
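API-based access in place of raw exports can be sketched as a gate that checks each requested field's classification level before serving it. The levels, field names, and `Fetch` function are illustrative assumptions, not the platform's actual interface.

```go
package main

import (
	"errors"
	"fmt"
)

// Level is a data-classification grade; ordering matters (higher = more
// sensitive). The three levels here are illustrative.
type Level int

const (
	Public Level = iota
	Internal
	Sensitive
)

// Hypothetical classification of wide-table fields.
var fieldLevel = map[string]Level{
	"page_load_ms": Public,
	"device_model": Internal,
	"user_id":      Sensitive,
}

// Fetch serves fields through an API instead of bulk export, rejecting any
// field above the caller's clearance so raw data never leaves governed paths.
func Fetch(fields []string, clearance Level) ([]string, error) {
	var allowed []string
	for _, f := range fields {
		lvl, ok := fieldLevel[f]
		if !ok || lvl > clearance {
			return nil, errors.New("access denied for field: " + f)
		}
		allowed = append(allowed, f)
	}
	return allowed, nil
}

func main() {
	if _, err := Fetch([]string{"user_id"}, Internal); err != nil {
		fmt.Println(err) // access denied for field: user_id
	}
}
```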

3.5 Metadata Query Protocol (code example)

// Schema describes the metadata-query protocol's capabilities:
// 1. Data query: multi-engine / standard SQL support (e.g. palo/mysql/clickhouse/es-sql), described by the Query struct.
// 2. Data aggregation: single-key combination, merging results when len(Schema.Query) > 1.
// 3. Data caching: a two-level cache encapsulated in the Cache struct.
type Schema struct {
    Query []Query `json:"query"`
    Cache Cache   `json:"cache"`
}

// Query describes a single data-query capability.
type Query struct {
    SQL    string `json:"sql"`
    Prod   string `json:"prod"` // RPC name = meta_{engine}_{prod}.toml
    Engine string `json:"engine"`
    Cache  Cache  `json:"cache"`
}

// Cache is referenced but not defined in the original excerpt; a minimal
// placeholder (fields illustrative) so the snippet compiles:
type Cache struct {
    Enabled    bool `json:"enabled"`
    TTLSeconds int  `json:"ttl"`
}
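The aggregation capability (merging when `len(Schema.Query) > 1`) could work along these lines. The merge key and `Row` type are assumptions for illustration, not the protocol's actual implementation.

```go
package main

import "fmt"

// Row is one result record keyed by column name.
type Row map[string]interface{}

// MergeByKey combines result sets returned by multiple engines on a shared
// key column, as a Schema with more than one Query would require. Later
// result sets add their columns to the matching row; insertion order of
// keys is preserved.
func MergeByKey(key string, results ...[]Row) []Row {
	merged := map[interface{}]Row{}
	var order []interface{}
	for _, rows := range results {
		for _, r := range rows {
			k := r[key]
			if _, ok := merged[k]; !ok {
				merged[k] = Row{}
				order = append(order, k)
			}
			for col, v := range r {
				merged[k][col] = v
			}
		}
	}
	out := make([]Row, 0, len(order))
	for _, k := range order {
		out = append(out, merged[k])
	}
	return out
}

func main() {
	// Hypothetical results from two engines, joined on "date".
	mysqlRows := []Row{{"date": "2023-01-01", "pv": 100}}
	ckRows := []Row{{"date": "2023-01-01", "p90_ms": 250}}
	fmt.Println(MergeByKey("date", mysqlRows, ckRows))
}
```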

3.6 Self‑Service Data Construction

• Real-time (in-memory) self-service: Parse UBC log schemas, flatten protocols, build wide tables in memory, apply aggregation templates (PV, UV, quantiles), and configure UI charts and alerts.

• Near-real-time & offline (disk) self-service: A layered data architecture reduces the data volume at each tier, improving report latency. Users create wide tables, documentation, and self-service dashboards to meet requirements quickly.
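Aggregation templates such as PV, UV, and quantiles over an in-memory wide table can be sketched as below. The `Event` shape and the nearest-rank quantile method are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Event is a simplified in-memory wide-table row (shape is illustrative).
type Event struct {
	UserID    string
	LatencyMs float64
}

// PVUV counts page views (every event) and unique visitors (distinct users).
func PVUV(events []Event) (pv int, uv int) {
	seen := map[string]bool{}
	for _, e := range events {
		pv++
		seen[e.UserID] = true
	}
	return pv, len(seen)
}

// Quantile returns the q-th latency quantile using the nearest-rank method.
func Quantile(events []Event, q float64) float64 {
	if len(events) == 0 {
		return 0
	}
	lat := make([]float64, len(events))
	for i, e := range events {
		lat[i] = e.LatencyMs
	}
	sort.Float64s(lat)
	idx := int(math.Ceil(q*float64(len(lat)))) - 1
	if idx < 0 {
		idx = 0
	}
	return lat[idx]
}

func main() {
	ev := []Event{{"u1", 100}, {"u1", 300}, {"u2", 200}}
	pv, uv := PVUV(ev)
	fmt.Println(pv, uv)            // 3 2
	fmt.Println(Quantile(ev, 0.5)) // 200
}
```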

Future Outlook

• Enhance data timeliness by defining an SLA for each pipeline stage and tracking data funnels.
• Improve data accuracy through case-based validation between expected and actual data.
• Continue to empower business growth on a foundation of data security and compliance.
