JD Big Data Log Lifecycle and Alerting Best Practices
This article surveys JD's big-data log lifecycle from end to end: background, platform capabilities, log collection methods, processing functions, storage strategies, query mechanisms, DSL extensions, data delivery, and alerting techniques. The goal is to help engineers build efficient, reliable log management solutions.
The article introduces the speaker, Wang Qiu, a software development engineer at JD Technology, and outlines the purpose of sharing JD's big‑data log lifecycle and alerting guide.
It explains the background: JD's intelligent city operating system generates massive, heterogeneous logs from micro‑services, requiring a complete management system for low‑cost ingestion, processing, and issue detection, especially for private government projects.
The platform capabilities are described, including core functions such as log collection, storage, processing, delivery, and alerting. Various log collection techniques (non-intrusive and intrusive) are compared, and three major log solutions (ELK, Splunk, and Graylog) are evaluated, with Graylog chosen for its lower operational cost and built-in alerting.
Non‑intrusive collection methods like Filebeat, Logstash, and Flume are discussed, highlighting Filebeat's low resource usage and back‑pressure support. The GELF log format is introduced, and both GELF‑Kafka and GELF‑UDP ingestion methods are explained.
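To make the GELF-UDP path concrete, here is a minimal Python sketch of composing a GELF 1.1 message and shipping it as a single gzip-compressed UDP datagram. The server name and port are placeholders (12201 is Graylog's conventional GELF port), and the `service` extra field is illustrative; it is not the author's SDK.

```python
import gzip
import json
import socket
import time

def make_gelf_message(host, short_message, level=6, **extra):
    """Build a GELF 1.1 payload; custom fields must carry an underscore prefix."""
    msg = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity: 6 = informational
    }
    for key, value in extra.items():
        msg["_" + key] = value  # GELF reserves unprefixed names for itself
    return msg

def send_gelf_udp(msg, server="graylog.example.com", port=12201):
    """Gzip the JSON payload and send it as one UDP datagram."""
    payload = gzip.compress(json.dumps(msg).encode("utf-8"))
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (server, port))
    finally:
        sock.close()
```

For messages larger than one datagram, GELF-UDP defines a chunking scheme; the GELF-Kafka path avoids that concern because Kafka handles large payloads and buffering.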
The article details the log lifecycle: from ingestion layers and memory caches to Kafka‑based local caches, followed by business‑level processing that cleans, transforms, and indexes logs into Elasticsearch, supporting both real‑time and offline analytics.
SDK encapsulation is covered, showing how Logback, Log4j, and Log4j2 can embed GELF metadata, and how agents or sidecars can provide non‑intrusive collection with resource controls and automatic recovery.
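The talk's SDK examples target Logback/Log4j appenders; as a language-neutral analogy, the sketch below shows the same idea in Python: a custom logging handler that wraps every record in GELF-style metadata. The `app` and `env` static fields and the in-memory buffer are assumptions for illustration; a real appender would ship records over UDP or Kafka.

```python
import logging

class GelfMetadataHandler(logging.Handler):
    """Attach GELF-style metadata to every record, the way a Logback/Log4j
    GELF appender embeds static fields. Buffers in memory for demonstration."""

    def __init__(self, app, env):
        super().__init__()
        self.app = app
        self.env = env
        self.buffer = []

    def emit(self, record):
        self.buffer.append({
            "version": "1.1",
            "host": record.name,
            "short_message": record.getMessage(),
            "level": record.levelno,
            "_app": self.app,   # static metadata, akin to MDC/appender fields
            "_env": self.env,
            "_logger": record.name,
        })

logger = logging.getLogger("order-service")
handler = GelfMetadataHandler(app="order-service", env="prod")
logger.addHandler(handler)
logger.warning("stock check slow")
```

The point of the encapsulation is that application code keeps calling the ordinary logging API; the metadata and transport live entirely inside the handler/appender.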
Data processing functions—including data normalization, desensitization, and filtering—are provided via over 400 built‑in functions (e.g., GROK) and customizable scripts.
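A minimal sketch of two of those function classes, GROK-style extraction and desensitization, assuming a simplified nginx-like access line and 11-digit phone numbers (the regexes are illustrative, not the platform's built-ins):

```python
import re

# A GROK-style pattern for a simplified access-log line (illustrative).
ACCESS_RE = re.compile(
    r'(?P<client_ip>\d+\.\d+\.\d+\.\d+) - \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+)" (?P<status>\d{3})'
)

# Mask the middle four digits of an 11-digit phone number.
PHONE_RE = re.compile(r"\b(\d{3})\d{4}(\d{4})\b")

def parse_and_mask(line):
    """Normalize one raw line into typed fields, masking phone numbers."""
    m = ACCESS_RE.match(line)
    if not m:
        return None  # filtering: drop lines that do not match the pattern
    fields = m.groupdict()
    fields["status"] = int(fields["status"])            # normalization
    fields["path"] = PHONE_RE.sub(r"\1****\2", fields["path"])  # desensitization
    return fields
```

Returning `None` for non-matching lines doubles as the filtering stage: downstream indexing simply skips dropped records.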
Storage strategies involve dynamic index management, hot‑cold architecture, and automated lifecycle policies using Curator to migrate aged data to cold nodes.
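The hot-cold migration logic can be sketched as the decision Curator makes per daily index. The index naming scheme (`app-logs-YYYY.MM.DD`) and the 7-day/90-day cutoffs below are assumptions for illustration, not JD's actual policy:

```python
from datetime import date

def classify_indices(index_names, today, hot_days=7, delete_days=90):
    """Decide a lifecycle action per daily index, mirroring a Curator action
    file: keep recent indices on hot nodes, relocate older ones to cold
    nodes, and delete expired ones."""
    actions = {}
    for name in index_names:
        day_str = name.rsplit("-", 1)[-1]            # e.g. '2024.03.01'
        y, m, d = (int(p) for p in day_str.split("."))
        age = (today - date(y, m, d)).days
        if age > delete_days:
            actions[name] = "delete"
        elif age > hot_days:
            actions[name] = "move-to-cold"  # e.g. shard allocation to cold box_type
        else:
            actions[name] = "keep-hot"
    return actions
```

In practice the "move-to-cold" step is an Elasticsearch shard-allocation setting that Curator applies, so the migration needs no data copy by the application.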
Query capabilities are presented, featuring SQL/DSL‑style queries, contextual searches, clustering, and third‑party integrations via HTTP/RPC.
The DSL extension supports fuzzy, wildcard, and aggregation queries, enabling flexible metric extraction.
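A sketch of how such a DSL layer might translate a user expression into an Elasticsearch query body: wildcard characters select a `wildcard` query, anything else falls back to `fuzzy`, and an optional field adds a `terms` aggregation. The routing rules here are an assumed simplification, not JD's actual grammar:

```python
def build_query(field, pattern, agg_field=None):
    """Translate a simple user expression into an Elasticsearch query body.
    '*' or '?' in the pattern triggers a wildcard query; otherwise fuzzy."""
    if "*" in pattern or "?" in pattern:
        query = {"wildcard": {field: {"value": pattern}}}
    else:
        query = {"fuzzy": {field: {"value": pattern, "fuzziness": "AUTO"}}}
    body = {"query": query}
    if agg_field:
        # Top-10 bucket aggregation for metric extraction.
        body["aggs"] = {"by_" + agg_field: {"terms": {"field": agg_field, "size": 10}}}
    return body
```

Keeping the translation in one place lets third-party callers over HTTP/RPC use the simple expression language without knowing Elasticsearch DSL.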
Data delivery mechanisms export processed logs to Kafka for downstream real‑time or batch computation, illustrated with a use case of government collaboration software feeding logs into Flink SQL and MySQL for dashboard monitoring.
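The delivery step reduces to serializing each processed log into a Kafka record whose flat JSON value a Flink SQL JSON source can map column-by-column. The sketch below builds that (topic, key, value) triple; the topic name, field set, and key-by-service partitioning are illustrative assumptions, and a real producer client would do the actual send:

```python
import json

def to_delivery_record(log, topic="cleaned-logs"):
    """Serialize one processed log into the (topic, key, value) triple a
    Kafka producer would send. A flat JSON value lets Flink SQL's JSON
    format map each field straight to a table column."""
    value = {
        "ts": log["ts"],
        "service": log["service"],
        "level": log["level"],
        "message": log["message"],
    }
    key = log["service"].encode("utf-8")  # partition by service for ordering
    return topic, key, json.dumps(value, separators=(",", ":")).encode("utf-8")
```

Downstream, a Flink SQL job can declare the same four fields in a Kafka source table and aggregate into MySQL for the dashboard.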
Alerting is described with event definitions, keyword and threshold‑based alerts, webhook notifications, and integration with messaging platforms.
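A keyword-plus-threshold rule can be sketched as a pure function over recent events: fire when the keyword appears at least `threshold` times inside a trailing window, and emit a webhook-style payload. The payload shape and parameter names are assumptions for illustration:

```python
import time

def evaluate_alerts(events, keyword, threshold, window_seconds, now=None):
    """Keyword + threshold rule: fire when `keyword` appears in at least
    `threshold` events within the trailing window. Returns a webhook-style
    payload, or None when the rule does not fire."""
    now = time.time() if now is None else now
    hits = [e for e in events
            if keyword in e["message"] and now - e["ts"] <= window_seconds]
    if len(hits) < threshold:
        return None
    return {
        "alert": f"keyword '{keyword}' seen {len(hits)} times in {window_seconds}s",
        "severity": "critical",
        "samples": [e["message"] for e in hits[:3]],  # context for the notifier
    }
```

The returned dict is what a notifier would POST to a webhook endpoint or forward into a messaging platform.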
The article concludes with a thank‑you note and calls for audience engagement.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.