
Ensuring Data Accuracy and Reliability in Baidu's Log Middle Platform

This article describes the architecture of Baidu's log middle platform, its data lifecycle management, integration status, terminology, and service overview. It then covers the core challenges of ensuring data accuracy and the optimizations (persistence‑first storage, service decomposition, and SDK reporting) used to approach 100% “no repeat, no loss” reliability.

Architect

1. Overview

1.1 Mid‑platform Positioning

The log middle platform is a one‑stop service for managing the lifecycle of event data. It supports fast collection, transmission, management, and query analysis of logs, and is suited to product operation analysis, R&D performance analysis, and operations management. It helps APP and server teams explore their data, extract value from it, and anticipate future trends.

1.2 Integration Status

The log middle platform covers most key internal products, including full logging for the Baidu APP, mini‑programs, and matrix APPs. Its current status:

Integration: Almost all internal APPs, mini‑programs, incubated APPs, and acquired external APPs are covered.

Service Scale: Billions of log entries per day, a peak load of several million queries per second (QPS), and service stability of 99.9995%.

1.3 Terminology

Client: Software directly used by users, deployed on phones or PCs, e.g., Baidu APP, mini‑programs.

Server: Services responding to client requests, typically deployed on cloud servers.

Log Middle Platform: Refers to the endpoint log platform, covering the full lifecycle of log data, including SDK, server, and management platform.

Logging SDK: Collects, packages, and reports log events; varies by client type (APP, H5) and scenario (general, performance, mini‑program).

Logging Server: Core log receiving service on the server side.

Feature/Model Service: Forwards, in real time, the tracking points that require strategy or model computation to the downstream recommendation platform.

1.4 Service Overview

The log service consists of four parts: a foundation layer, a management platform layer, business data capabilities, and business support. The Baidu client log reporting specification was released in June 2021.

Foundation Layer: Supports APP‑SDK, JS‑SDK, performance SDK, general SDK, enabling rapid integration; uses big‑data infrastructure to distribute data to downstream applications.

Platform Layer: Manages metadata, controls the entire log lifecycle; supports real‑time and offline forwarding with flow control and monitoring, ensuring 99.995% stability.

Business Capability: Logs are output to data warehouse, performance platform, recommendation platform, growth platform, aiding product decision analysis, quality monitoring, and growth strategies.

Business Support: Covers key APPs, newly incubated matrix APPs, and generic components.

2. Core Goals of Log Middle Platform

The platform must guarantee data accuracy, preventing duplication (“no repeat”) and loss (“no loss”). Achieving near‑100% “no repeat, no loss” faces many challenges.
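One way to make the two goals measurable is to compare the event IDs a client produced with the IDs that arrived downstream. The sketch below is only an illustration: the `audit` function and the idea that every event carries a unique ID are assumptions, not something the article specifies.

```python
from collections import Counter


def audit(sent_ids, received_ids):
    """Compare produced and received event IDs (hypothetical helper).

    "lost" lists IDs that were sent but never received ("no loss" violated);
    "repeated" maps IDs received more than once to their count ("no repeat" violated).
    """
    received = Counter(received_ids)
    # dict.fromkeys keeps first-seen order while removing duplicate sent IDs.
    unique_sent = list(dict.fromkeys(sent_ids))
    lost = [event_id for event_id in unique_sent if received[event_id] == 0]
    repeated = {event_id: n for event_id, n in received.items() if n > 1}
    loss_rate = len(lost) / len(unique_sent) if unique_sent else 0.0
    return {"lost": lost, "repeated": repeated, "loss_rate": loss_rate}


report = audit(["a", "b", "c"], ["a", "a", "c"])
```

Here `report["lost"]` contains `"b"` and `report["repeated"]` records that `"a"` arrived twice, so both goals were missed for this small sample.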

2.1 Log Platform Architecture

Log data flows from client production through online services to real‑time or offline downstream forwarding, passing several stages.

Data usage types:

Real‑time, near‑real‑time stream (message queue): high timeliness, strict accuracy; used by R&D and trace platforms.

Real‑time, pure real‑time stream (RPC proxy): second‑level timeliness, tolerates some loss; used by recommendation systems.

Offline: large tables, day‑level or hour‑level timeliness, strict accuracy.

Other: mixed timeliness and accuracy requirements.

2.2 Problems

Monolithic module: the logging server handles all processing logic, causing heavy coupling.

Multiple functions: integration and persistence, business logic, and various kinds of forwarding (RPC, MQ, PB storage).

Many fan‑out streams: more than 10 business fan‑out flows.

Direct MQ integration: the logging server writes to the message queue directly, which risks message loss and cannot meet “no repeat, no loss”.

Lack of business tiering: core and non‑core services are coupled, affecting each other.

3. Implementing “No Repeat, No Loss”

3.1 Theory of Data Loss Prevention

Client side: environmental factors (white screen, crash, non‑persistent process) cause some loss.

Access layer: server failures (restart, crash) cause loss.

Computation layer: stream processing must ensure strict “no repeat, no loss”.
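A common way to get “no repeat” on top of a transport that may redeliver data is to make the consumer idempotent. The sketch below assumes each event carries a unique `event_id` field and uses a hypothetical `Deduplicator` class; the article does not say how its stream layer removes duplicates, so treat this as one possible approach rather than the platform's method.

```python
class Deduplicator:
    """Drops events whose event_id has already been processed (illustrative only)."""

    def __init__(self):
        # Unbounded in this sketch; a real system would bound it by time window
        # or keep it in checkpointed state.
        self._seen = set()

    def process(self, events):
        fresh = []
        for event in events:
            event_id = event["event_id"]  # assumed unique per logical event
            if event_id in self._seen:
                continue  # a redelivered copy: skip it
            self._seen.add(event_id)
            fresh.append(event)
        return fresh


dedup = Deduplicator()
first = dedup.process([{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}])
second = dedup.process([{"event_id": "b"}, {"event_id": "c"}])
```

After these calls, `first` holds the events `a` and `b`, and `second` holds only `c`, because `b` was already seen in the earlier batch.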

3.2 Architecture Optimizations

3.2.1 Logging Server Refactoring (Access Layer)

Prioritize persistence before business processing.

Decompose monolithic service to reduce complexity.

Maintain flexibility and ease of use while supporting stream processing.

3.2.1.1 Persistence First

Persist data at the access layer before any business handling, so that a failure in later processing cannot lose a log that has already been received.

Real‑time stream: avoid writing to the MQ directly; write to disk and let Minos forward the data to the MQ, which keeps the delay to minute level at most with minimal loss.
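The persist‑then‑forward idea can be sketched as follows. This is a minimal illustration under stated assumptions: `PersistFirstReceiver`, `forward_pending`, the spool file layout, and the `publish` callback are all hypothetical, and the forwarder merely stands in for the role the article assigns to Minos, whose real interface is not shown here.

```python
import json
import os


class PersistFirstReceiver:
    """Appends each event to a local spool file before running business logic."""

    def __init__(self, spool_path, handler):
        self.spool_path = spool_path
        self.handler = handler  # hypothetical business-processing callback

    def receive(self, event):
        record = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
        # Step 1: write and fsync, so the event survives a process crash
        # (a disk failure is out of scope for this sketch).
        with open(self.spool_path, "ab") as spool:
            spool.write(record)
            spool.flush()
            os.fsync(spool.fileno())
        # Step 2: business handling; if it fails, the event is still on disk.
        try:
            self.handler(event)
            return True
        except Exception:
            return False


def forward_pending(spool_path, offset_path, publish):
    """Publish spooled events that have not been forwarded yet; return the count."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path, encoding="utf-8") as f:
            offset = int(f.read().strip() or 0)
    forwarded = 0
    with open(spool_path, "rb") as spool:
        spool.seek(offset)
        while True:
            line = spool.readline()
            if not line:
                break
            if not publish(json.loads(line)):
                break  # queue unavailable: stop and retry from the same offset later
            offset = spool.tell()
            forwarded += 1
            # Commit the offset only after a successful publish (at-least-once).
            with open(offset_path, "w", encoding="utf-8") as f:
                f.write(str(offset))
    return forwarded
```

Because the offset is committed only after a successful publish, a crash between the two steps republishes an event rather than dropping it. That trades possible duplicates for no loss, which is why a deduplicating stage is still needed downstream.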

3.2.1.2 Service Decomposition & Function Offloading

Separate real‑time stream, high‑timeliness, and other services to isolate resources and meet different SLA requirements.

Real‑time stream services: access → fan‑out → business → downstream.

High‑timeliness services: dedicated RPC forwarding for recommendation, SLA >99.95%.

Other services: monitoring, VIP, gray‑release with relaxed timeliness and loss requirements.

Technical stack: stream compute architecture for end‑to‑end “no repeat, no loss”.
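The resource isolation described above can be illustrated with a router that gives each service class its own bounded queue, so that a backlog in one class cannot consume capacity reserved for another. The `Tier` names, the `tier` field on events, and the capacities are assumptions made for this sketch, not details from the article.

```python
from collections import deque
from enum import Enum


class Tier(Enum):
    REALTIME_STREAM = "realtime_stream"
    HIGH_TIMELINESS = "high_timeliness"
    OTHER = "other"


class TieredRouter:
    """Routes events into per-tier queues with independent capacity limits."""

    def __init__(self, capacity_per_tier):
        self.capacity = capacity_per_tier
        self.queues = {tier: deque() for tier in Tier}
        self.rejected = {tier: 0 for tier in Tier}

    def route(self, event):
        # Events without an explicit tier fall into the relaxed "other" class.
        tier = Tier(event.get("tier", Tier.OTHER.value))
        queue = self.queues[tier]
        if len(queue) >= self.capacity[tier]:
            self.rejected[tier] += 1  # only this tier is affected by its own backlog
            return False
        queue.append(event)
        return True


router = TieredRouter({Tier.REALTIME_STREAM: 1000, Tier.HIGH_TIMELINESS: 1000, Tier.OTHER: 1})
router.route({"tier": "other", "id": 1})
router.route({"tier": "other", "id": 2})            # rejected: "other" is full
router.route({"tier": "realtime_stream", "id": 3})  # still accepted
```

In a real deployment the tiers would be separate services with their own machines and SLAs; the single‑process queues here only show the isolation property.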

3.2.2 Logging SDK Reporting Optimization (Client Side)

Increase reporting opportunities: scheduled tasks, trigger on business events, threshold‑based batch sending.

Increase message batch size within safe limits to improve first‑time delivery.

Optimizations improved client‑side convergence by over 2%.
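The three reporting opportunities listed above (scheduled tasks, business events, and a batch threshold) can be sketched as a small buffer with three flush triggers. The `BatchReporter` class, its parameters, and the `send` callback are hypothetical; a real SDK would be written in the client's own language and would also persist the buffer locally.

```python
import time


class BatchReporter:
    """Buffers events and flushes on a timer, a business event, or a size threshold."""

    def __init__(self, send, max_batch=50, interval_s=30.0, clock=time.monotonic):
        self.send = send  # hypothetical upload callback: send(batch, reason) -> bool
        self.max_batch = max_batch
        self.interval_s = interval_s
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def log(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.max_batch:
            self.flush("threshold")

    def tick(self):
        # Called periodically by a scheduler; flushes once the interval has elapsed.
        if self.buffer and self.clock() - self.last_flush >= self.interval_s:
            self.flush("timer")

    def on_business_event(self, name):
        # For example, the app moving to the background is a good moment to report.
        if self.buffer:
            self.flush("event:" + name)

    def flush(self, reason):
        batch, self.buffer = self.buffer, []
        self.last_flush = self.clock()
        if batch and not self.send(batch, reason):
            # Upload failed: keep the events for the next reporting opportunity.
            self.buffer = batch + self.buffer
```

More triggers mean fewer events sit in memory when the process exits, which is the client‑side loss the optimization targets. A larger `max_batch` reduces the number of requests, but it has to stay within whatever payload limit the server accepts.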

4. Outlook

Future work includes addressing disk failures to ensure strict data non‑loss and further enhancing the log middle platform’s reliability for accurate downstream analytics.
