
Scalable, High‑Accuracy Event Logging Monitoring for Baidu's Log Platform

Baidu's log platform processes billions of page-view events per day. To monitor them accurately with minute-level latency, it uses a downstream streaming-task architecture that maps a limited set of custom dimensions, relies on watermarks to guarantee data completeness, trims raw records, aggregates them into 5-minute windows, and writes concise metrics to Elasticsearch, achieving high accuracy, configurability, and low cost.

Baidu Tech Salon

The article introduces Baidu's log platform, which handles billions of page‑view (PV) events daily, and discusses the challenges of accurately monitoring such massive event streams while supporting customizable field extraction and drill‑down.

It first explains the UBC (User Behavior Collection) protocol, the three log types (UBC client logs, UBC server logs, UBC H5 logs), and the distinction between event‑type and stream‑type logging, as well as the concepts of public parameters (system‑level fields) and business parameters (custom fields defined per UBC ID).

Based on these concepts, the article derives three core monitoring requirements: minute‑level latency, appropriate statistical metrics (PV for events, PV + duration for streams), and flexible filtering on both public and business parameters.

The current real‑time log pipeline is described: client SDKs send logs to dedicated ingestion servers, logs are persisted, forwarded to a message queue, and processed by streaming jobs. The authors argue that using this online path for monitoring would either couple monitoring tightly with business logic or suffer from data loss/latency, so they propose a downstream streaming‑task‑based monitoring approach that sacrifices sub‑second latency for minute‑level latency but gains maintainability.

Key technical solutions include:

Dimension mapping: limiting the number of custom filter dimensions to six and supporting 1‑to‑1 and many‑to‑1 mappings to prevent dimension explosion.
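The mapping step can be sketched as follows. This is an illustrative Python sketch, not Baidu's implementation; the field names and config shape are assumptions. It shows the two supported mapping styles (1-to-1 copies one raw field; many-to-1 combines several) and the hard cap of six dimensions that prevents dimension explosion.

```python
# Hypothetical sketch of the dimension-mapping step: each UBC ID's
# monitoring config names at most six filter dimensions, each produced
# from one raw field (1-to-1) or several raw fields joined (many-to-1).

MAX_DIMENSIONS = 6

def map_dimensions(raw_event: dict, mapping: dict) -> dict:
    """Project a raw log record onto the configured monitoring dimensions."""
    dims = {}
    for dim_name, source in mapping.items():
        if len(dims) >= MAX_DIMENSIONS:
            break  # hard cap prevents dimension explosion
        if isinstance(source, str):          # 1-to-1: copy one raw field
            dims[dim_name] = raw_event.get(source, "unknown")
        else:                                # many-to-1: join several raw fields
            dims[dim_name] = "|".join(str(raw_event.get(f, "")) for f in source)
    return dims

event = {"os": "android", "app_ver": "13.2", "channel": "store", "page": "feed"}
mapping = {"os": "os", "version_channel": ["app_ver", "channel"]}
print(map_dimensions(event, mapping))
# {'os': 'android', 'version_channel': '13.2|store'}
```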

Watermark (water level) usage: using processing timestamps as the time axis and advancing a window only when a watermark indicates its data is complete, eliminating the data shift (counts landing in the wrong window) caused by upstream service hiccups.
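A minimal sketch of watermark-gated emission, assuming a simple in-memory aggregator (again, not Baidu's implementation): counts accumulate per 5-minute window keyed by processing timestamp, and a window is emitted only once the watermark fully covers it, so late or stalled upstream data cannot shift counts between windows.

```python
# Minimal watermark sketch: a window is emitted only after the watermark
# guarantees all of its data has arrived.
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute windows

class WatermarkAggregator:
    def __init__(self):
        self.windows = defaultdict(int)  # window start -> PV count

    def add(self, event_ts: int):
        """Bucket an event into its 5-minute window by processing timestamp."""
        self.windows[event_ts - event_ts % WINDOW_SECONDS] += 1

    def advance_watermark(self, watermark_ts: int):
        """Emit (and drop) every window fully covered by the watermark."""
        closed = sorted(w for w in self.windows if w + WINDOW_SECONDS <= watermark_ts)
        return [(w, self.windows.pop(w)) for w in closed]

agg = WatermarkAggregator()
for ts in (10, 20, 310, 330):
    agg.add(ts)
print(agg.advance_watermark(300))  # [(0, 2)] -- only the first window is complete
print(agg.advance_watermark(600))  # [(300, 2)]
```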

Cost reduction via data trimming: after mapping, raw fields are discarded, reducing record size from ~10 KB to ~0.2 KB (≈98 % reduction).
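The trimming step itself is simple, as this illustrative sketch shows (field names are assumptions): once the dimensions are mapped, the bulky raw payload is dropped and only the dimension values plus the metric fields travel downstream.

```python
# Illustrative trimming sketch: keep only mapped dimensions and metric
# fields; everything else in the ~10 KB raw record is discarded.
import json

def trim(raw_event: dict, dims: dict) -> dict:
    """Keep only the metric fields plus the mapped dimensions."""
    return {
        "ubc_id": raw_event.get("ubc_id"),
        "ts": raw_event.get("ts"),
        "duration": raw_event.get("duration", 0),  # stream-type logs only
        **dims,
    }

raw = {"ubc_id": 123, "ts": 1700000000, "duration": 4200,
       "content": "x" * 10000}                     # stand-in for the raw payload
slim = trim(raw, {"os": "android"})
print(len(json.dumps(raw)), "->", len(json.dumps(slim)))  # ~10 KB -> well under 1 KB
```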

Time‑window aggregation: aggregating trimmed data into 5‑minute windows, storing only count (PV) and sum (duration), which compresses billions of raw records to under 100 k aggregated rows (≈99.98 % reduction).
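The aggregation can be sketched like this (a simplified Python sketch with a single assumed dimension, `os`): trimmed records are grouped by (window, dimension values) and reduced to just a count (PV) and a sum (duration), which is what collapses billions of rows into a small set of aggregates.

```python
# Sketch of the 5-minute aggregation: group trimmed records by
# (window, dims) and keep only count (PV) and sum (duration).
from collections import defaultdict

WINDOW = 300  # seconds

def aggregate(records):
    out = defaultdict(lambda: [0, 0])  # (window, dims) -> [pv, duration_sum]
    for r in records:
        key = (r["ts"] - r["ts"] % WINDOW, (r["os"],))  # assumed single dim: "os"
        out[key][0] += 1                   # PV count
        out[key][1] += r.get("duration", 0)  # duration sum
    return dict(out)

recs = [
    {"ts": 5,   "os": "android", "duration": 30},
    {"ts": 90,  "os": "android", "duration": 50},
    {"ts": 120, "os": "ios",     "duration": 10},
]
print(aggregate(recs))
# {(0, ('android',)): [2, 80], (0, ('ios',)): [1, 10]}
```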

The architecture consists of a metadata management platform, streaming processing tasks, a monitoring message queue, and monitoring aggregation tasks that output results to Elasticsearch and backup storage.

Finally, the article reflects on the importance of measurement in software engineering, summarizing how the proposed monitoring solution achieves high accuracy, configurability, and low operational cost, and outlines future plans to continue improving reliability and usability.

Big Data · stream processing · real-time analytics · Watermark · log monitoring · dimension mapping · UBC
Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
