How Visualized Full‑Chain Log Tracing Transforms Complex Business Systems
This article explains a new visualized full‑chain log tracing solution that organizes business logs by logical flow, dynamically links them during execution, and provides a visual, searchable view of the entire business process, dramatically improving issue localization in large‑scale distributed systems.
1. Background
1.1 Business systems are becoming increasingly complex
The rapid development of internet products, together with shifting business environments and user demands, has produced increasingly intricate business requirements. Systems now support a growing number of scenarios and increasingly complex logic, especially under microservice architectures, where a single business flow requires coordination across many services.
1.2 Challenges of business tracing
Business tracing is essential for reproducing the execution context of a request, analyzing logic, and locating problems. Traditional log‑based ELK solutions and distributed session‑tracking approaches struggle to keep up as business logic grows more complex.
1.2.1 Traditional ELK solution
Logs record discrete events during program execution and are collected in Elasticsearch (ES). Developers must log extensively, then filter and stitch logs manually, which is time‑consuming and error‑prone.
Key pain points:
Log collection is cumbersome: ES can search logs, but logs are often unstructured text, making comprehensive collection difficult.
Log filtering is hard: Overlapping business logic produces interleaved logs that are hard to separate.
Log analysis is slow: Discrete logs must be manually correlated with code to reconstruct the execution scene.
1.2.2 Distributed session‑tracking solution
Based on Google’s Dapper paper and open‑source Zipkin, this approach assigns a globally unique traceId to a request and builds a call chain. It excels at analyzing call behavior but cannot fully describe business logic, especially when multiple parallel calls or conditional branches are involved.
Limitations:
Cannot trace multiple call chains simultaneously.
Cannot depict the full business logic panorama.
Focuses on system‑wide call paths, adding noise when only the current business logic matters.
1.2.3 Summary
Both ELK and session‑tracking are inadequate for modern, complex business tracing. A new solution is needed that centers on business logic, organizes logs by logical flow, and visualizes the execution scene.
2. Visualized Full‑Chain Log Tracing
2.1 Design Idea
The solution organizes logs by business logic during execution, producing a visual representation of the execution scene.
Problem 1: How to efficiently organize business logs?
Define a "logic node" for each independent business unit (local method or RPC) and combine nodes into a "logic chain" that represents a complete business scenario. Tracing a business execution is equivalent to reproducing a specific instance of a logic chain.
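As an illustration, the following is a minimal Java sketch of how logic nodes and a logic chain might be modeled. The names `NodeType`, `LogicNode`, `LogicChain`, and the `then` builder method are hypothetical and are not taken from the original implementation; only the serial case is shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical node kinds: an independent business unit is a local method or an RPC.
enum NodeType { LOCAL, RPC }

// A logic node: one independent business unit.
final class LogicNode {
    final String name;
    final NodeType type;

    LogicNode(String name, NodeType type) {
        this.name = name;
        this.type = type;
    }
}

// A logic chain: the ordered nodes that make up one complete business scenario.
final class LogicChain {
    final String scenario;
    private final List<LogicNode> nodes = new ArrayList<>();

    LogicChain(String scenario) {
        this.scenario = scenario;
    }

    // Append a node and return the chain so that definitions read top to bottom.
    LogicChain then(String name, NodeType type) {
        nodes.add(new LogicNode(name, type));
        return this;
    }

    List<LogicNode> nodes() {
        return Collections.unmodifiableList(nodes);
    }
}

class LogicChainDemo {
    public static void main(String[] args) {
        LogicChain chain = new LogicChain("realTimeInput")
                .then("contentStore", NodeType.LOCAL)
                .then("picProcess", NodeType.RPC);
        for (LogicNode node : chain.nodes()) {
            System.out.println(chain.scenario + ": " + node.name + " (" + node.type + ")");
        }
    }
}
```

Tracing one execution then amounts to recording, for each node of such a chain, what happened in that particular run.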
Problem 2: How to dynamically link business logs?
A unique identifier (business ID + scenario ID + execution ID) is propagated through threads and network calls. Logs are thereby "colored" with the identifier and dynamically attached to the executing node, gradually forming a complete chain.
2.2 General Solution
The solution consists of four steps:
Link definition – describe the logic chain using a DSL (JSON/XML) that specifies nodes, their types, and execution rules (serial, parallel, conditional).
Link coloring – assign a unique link identifier at the start of execution and propagate it to each node.
Link reporting – report node logs and business logs in a structured format.
Link storage – store link, node, and business logs in a tree‑structured model, backed by a scalable store such as HBase, for later reconstruction.
2.2.1 Link Definition
A DSL describes nodes (rpc, local, fork, join, decision) and their relationships. Example DSL (excerpt):
<code>[
  {"nodeName":"A","nodeType":"rpc"},
  {"nodeName":"Fork","nodeType":"fork","forkNodes":[
    [{"nodeName":"B","nodeType":"rpc"}],
    [{"nodeName":"C","nodeType":"local"}]
  ]},
  {"nodeName":"Join","nodeType":"join","joinOnList":["B","C"]},
  {"nodeName":"D","nodeType":"decision","decisionCases":{
    "true":[{"nodeName":"E","nodeType":"rpc"}],
    "defaultCase":[{"nodeName":"F","nodeType":"rpc"}]
  }}
]
</code>
2.2.2 Link Coloring
Two steps:
Determine a unique link identifier (business ID + scenario ID + execution ID).
Propagate the identifier so that each node’s logs are attached to the correct link.
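The two steps can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the article's implementation: `LinkContext`, `buildLinkId`, and `wrap` are hypothetical names, the underscore separator is an assumption, and only thread‑to‑thread propagation is shown. Propagation across network calls would additionally need the identifier to travel in the RPC metadata.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical holder for the link identifier of the current execution.
final class LinkContext {
    private static final ThreadLocal<String> LINK_ID = new ThreadLocal<>();

    private LinkContext() {}

    // Step 1: compose the identifier from business ID, scenario ID, and execution ID.
    static String buildLinkId(String businessId, String scenarioId, String executionId) {
        return businessId + "_" + scenarioId + "_" + executionId;
    }

    static void set(String linkId) { LINK_ID.set(linkId); }
    static String get() { return LINK_ID.get(); }
    static void clear() { LINK_ID.remove(); }

    // Step 2: capture the caller's identifier now and restore it wherever the task runs.
    static <T> Supplier<T> wrap(Supplier<T> task) {
        final String captured = get();
        return () -> {
            String previous = get();
            set(captured);
            try {
                return task.get();
            } finally {
                // Leave the executing thread as it was found.
                if (previous == null) clear(); else set(previous);
            }
        };
    }
}

class LinkColoringDemo {
    public static void main(String[] args) {
        LinkContext.set(LinkContext.buildLinkId("1001", "realTimeInput", "exec-1"));
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // The worker thread sees the same identifier as the submitting thread.
            Supplier<String> task = LinkContext.wrap(() -> LinkContext.get());
            String seenByWorker = CompletableFuture.supplyAsync(task, pool).join();
            System.out.println(seenByWorker);
        } finally {
            pool.shutdown();
            LinkContext.clear();
        }
    }
}
```

Wrapping the task at submission time matters because a plain `ThreadLocal` is not visible to pooled worker threads, so any log written there would otherwise lose its link identifier.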
2.2.3 Link Reporting
Report two types of logs:
Node logs: start/end timestamps, status, input/output.
Business logs: level, timestamp, and data relevant to business logic.
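A structured format for the two log types might look like the sketch below. The field names (`logType`, `linkId`, `node`, and so on) and the `StructuredLogs` class are assumptions made for illustration; the article does not specify the actual schema. For brevity every value is emitted as a JSON string and only quotes and backslashes are escaped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helpers that render node logs and business logs as single-line JSON.
final class StructuredLogs {
    private StructuredLogs() {}

    // Node log: start/end timestamps, status, input/output.
    static String nodeLog(String linkId, String node, String status,
                          long startMs, long endMs, String input, String output) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("logType", "node");
        fields.put("linkId", linkId);
        fields.put("node", node);
        fields.put("status", status);
        fields.put("start", Long.toString(startMs));
        fields.put("end", Long.toString(endMs));
        fields.put("input", input);
        fields.put("output", output);
        return toJson(fields);
    }

    // Business log: level, timestamp, and data relevant to business logic.
    static String businessLog(String linkId, String node, String level,
                              long timestampMs, String data) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("logType", "business");
        fields.put("linkId", linkId);
        fields.put("node", node);
        fields.put("level", level);
        fields.put("timestamp", Long.toString(timestampMs));
        fields.put("data", data);
        return toJson(fields);
    }

    // Simplified serializer: insertion order is kept, all values are quoted strings.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    private static String escape(String s) {
        if (s == null) return "";
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}

class StructuredLogDemo {
    public static void main(String[] args) {
        String linkId = "1001_realTimeInput_exec-1";
        System.out.println(StructuredLogs.nodeLog(linkId, "picProcess", "COMPLETED",
                100L, 250L, "contentId=1001", "ok"));
        System.out.println(StructuredLogs.businessLog(linkId, "picProcess", "WARN",
                180L, "picProcess failed"));
    }
}
```

Because both log types carry the link identifier and the node name, a downstream consumer can attach each record to the right node without parsing free text.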
2.2.4 Link Storage
Store logs in a tree‑structured model (link → node → business) using a scalable store such as HBase.
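One way to picture the link → node → business layout in a sorted key‑value store is through hierarchical row keys. The sketch below is an assumption for illustration only: it uses an in‑memory `TreeMap` in place of HBase, the `#` separator and key layout are hypothetical, and identifiers are assumed not to contain `#`. It shows how a prefix scan over sorted keys returns a link or a node together with everything beneath it.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory stand-in for a sorted key-value store; not the HBase API.
final class LinkStoreSketch {
    private static final char SEP = '#';
    private final TreeMap<String, String> rows = new TreeMap<>();

    // Keys end with the separator so that "L1" never matches "L10" in a prefix scan.
    static String linkKey(String linkId) {
        return linkId + SEP;
    }

    static String nodeKey(String linkId, String node) {
        return linkKey(linkId) + node + SEP;
    }

    static String businessKey(String linkId, String node, int seq) {
        // Zero-padded sequence keeps business logs in emission order under their node.
        return nodeKey(linkId, node) + String.format("%06d", seq);
    }

    void putLink(String linkId, String scenario) {
        rows.put(linkKey(linkId), scenario);
    }

    void putNode(String linkId, String node, String nodeLog) {
        rows.put(nodeKey(linkId, node), nodeLog);
    }

    void putBusiness(String linkId, String node, int seq, String businessLog) {
        rows.put(businessKey(linkId, node, seq), businessLog);
    }

    // Prefix scan: every row whose key starts with the given prefix, in key order.
    SortedMap<String, String> scan(String prefix) {
        return rows.subMap(prefix, prefix + Character.MAX_VALUE);
    }
}

class LinkStoreDemo {
    public static void main(String[] args) {
        LinkStoreSketch store = new LinkStoreSketch();
        store.putLink("L1", "realTimeInput");
        store.putNode("L1", "contentStore", "COMPLETED");
        store.putBusiness("L1", "contentStore", 1, "content stored");
        store.putNode("L1", "picProcess", "FAILED");
        store.putBusiness("L1", "picProcess", 1, "picProcess failed");

        // Reconstruct the whole link, then a single node with its business logs.
        store.scan(LinkStoreSketch.linkKey("L1"))
             .forEach((k, v) -> System.out.println(k + " -> " + v));
        System.out.println(store.scan(LinkStoreSketch.nodeKey("L1", "picProcess")).size());
    }
}
```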
3. Meituan‑Dianping Content Platform Practice
3.1 Business Characteristics and Challenges
The platform handles millions of content items daily, supporting numerous scenarios (real‑time ingestion, manual operation, distribution recomputation, etc.) and billions of logical node executions. Logs are scattered across services, making collection and reconstruction extremely difficult.
3.2 Practice and Results
3.2.1 Practice
Implemented a log pipeline: log_agent → Kafka → Flink → HBase, supporting high‑volume log ingestion and processing.
Developed a custom TraceLogger library (built on SLF4J) that abstracts log collection, node reporting, and exception handling, minimizing code changes.
Example of replacing a traditional log call with a full‑chain log call:
<code>// Before
LOGGER.error("update struct failed, param:{}", GsonUtils.toJson(structRequest), e);
// After
TraceLogger.error("update struct failed, param:{}", GsonUtils.toJson(structRequest), e);
</code>
Node logging can be done via API calls or AOP annotations, e.g.:
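<code>public Response realTimeInputLink(long contentId) {
    // Start the link: mark this execution with its link identifier
    TraceUtils.passLinkMark("contentId_type_uuid");
    // Local node reported via API calls: report start, run, then report completion
    TraceUtils.reportNode("contentStore", contentId, StatusEnums.RUNNING);
    Response structResp = contentStore(contentId); // assumes contentStore returns its result
    TraceUtils.reportNode("contentStore", structResp, StatusEnums.COMPLETED);
    // Remote call: node reporting is handled by the @TraceNode annotation
    Response processResp = picProcess(contentId);
    return processResp;
}

@TraceNode(nodeName = "picProcess")
public Response picProcess(long contentId) {
    // Business log, attached to the picProcess node
    TraceLogger.warn("picProcess failed, contentId:{}", contentId);
    return null; // placeholder: the real processing logic is not shown in this excerpt
}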
</code>
3.2.2 Results
The platform now provides:
Link query: real‑time retrieval of all logical chains for a given content ID.
Link visualization: graphical view of the full business logic panorama with node status.
Node detail view: input, output, and associated business logs for any executed node.
Issue‑resolution time dropped from hours to under five minutes, and testing efficiency improved significantly.
4. Summary and Outlook
Observability is increasingly critical for complex distributed systems. The visualized full‑chain log tracing solution combines logging, metrics, and tracing to provide end‑to‑end visibility, low integration cost, broad coverage, and high operational efficiency. Future work will extend the observability stack with alerting, dashboards, and deeper analysis for complex business systems.
Sanyou's Java Diary