Building a Dynamic Grafana Dashboard for Push System TraceId Visualization
This article describes how to use Grafana's Flowcharting plugin and Prometheus metrics to create a dynamic, interactive dashboard that visualizes each logical node of a push notification pipeline, enabling rapid traceId-based troubleshooting and reducing manual investigation effort.
Background: The Zhuanzhuan push system is a self-developed service that routes push notifications through multiple MQ hops, filtering, do-not-disturb policies, and vendor channels before issuing HTTP requests to devices. Frequent reports of undelivered pushes required time-consuming manual tracing across many clusters.
Idea Origin: By leveraging traceId propagation (generated by Radar + Zipkin) and Prometheus's low-overhead counters, the team aimed to visualize both normal and error nodes in the push flow, allowing instant identification of failure points.
What Is a Dynamic View: Grafana's Flowcharting plugin (built on draw.io) lets users draw complex diagrams—architecture, UML, workflows—and bind them to live data, supporting status monitoring, interactive links, conditional styling, and regex-based transformations.
Building the Dashboard:
4.1 Draw the View: A flowchart of all push-pipeline nodes was created, with green nodes for normal flow and yellow nodes for exceptions, each assigned a unique status code.
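Keeping the node names and status codes in one central definition helps the diagram, the metrics, and the status-code documentation stay in sync. A minimal sketch of such a registry (the node names and codes below are illustrative, not the system's real ones):

```java
// Illustrative only: real node names and status codes are internal to the push system.
enum PushNode {
    MQ_RECEIVED(100),    // normal: message consumed from MQ
    VENDOR_SENT(101),    // normal: handed off to a vendor channel
    HTTP_DELIVERED(102), // normal: HTTP request issued to the device
    FILTERED(200),       // exception: dropped by filtering rules
    DND_BLOCKED(201);    // exception: blocked by do-not-disturb policy

    private final int statusCode;

    PushNode(int statusCode) {
        this.statusCode = statusCode;
    }

    int statusCode() {
        return statusCode;
    }
}
```

Each enum constant maps directly to one shape in the draw.io diagram, so a node can carry its status code both in the metric and in the linked documentation.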
4.2 Report Data: A shared Counter was defined in a common JAR:
    private static final Counter NODE_COUNTER = Counter.build()
            .name("push_link_graph_node_monitor")
            .help("push pipeline node monitoring")
            .labelNames("nodeName", "traceId")
            // builder option from the shared JAR, not part of the stock Prometheus client builder
            .disableAutoCreateGraph(true)
            .register();

And a helper method to emit metrics:
    public static void reportNodeInfoStrWithTraceId(String nodeName, String traceId) {
        try {
            if (StringUtils.isBlank(traceId)) {
                // fall back to the traceId propagated by Radar
                traceId = com.bj58.zhuanzhuan.radar.util.RadarUtils.getTraceId();
            }
            NODE_COUNTER.labels(nodeName, traceId).inc();
        } catch (Exception e) {
            // swallow deliberately: metric reporting must never break the business flow
        }
    }

4.3 Create the Grafana Dashboard: A new dashboard was created; basic information (name, tags, time range) and a manually entered traceId variable were configured.
4.4 Import the Diagram: The drawn XML was copied from draw.io and pasted into the Flowcharting panel.
4.5 Write PromQL to fetch node counts per traceId:
    increase(push_link_graph_node_monitor{traceId="${traceId}"}[$__rate_interval])

4.6 Define Mappings: Color/tooltip, label/text, link, and event/animation mappings were configured so that, for example, a node whose count falls below a threshold turns red or starts flashing.
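Beyond the per-node query above, a couple of variants can be useful when wiring the mappings. The metric name is from the source; the aggregations and the use of Grafana's built-in $__range variable are suggestions, not from the original article:

```promql
# Total hits per node for one traceId over the whole dashboard range
sum by (nodeName) (increase(push_link_graph_node_monitor{traceId="${traceId}"}[$__range]))

# Only the nodes a given traceId actually touched (non-zero series)
increase(push_link_graph_node_monitor{traceId="${traceId}"}[$__range]) > 0
```

The filtered form keeps untouched nodes out of the result, which makes the "where did this push stop" question answerable at a glance.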
Results and Benefits:
5.1 Outcomes: The flowchart with real-time data highlights the exact node where a push failed; clicking a node opens its status-code documentation, enabling instant diagnosis (e.g., APNs 400 BadDeviceToken).
5.2 Impact: Average investigation time dropped from over 15 minutes per incident to near zero, and daily support inquiries were essentially eliminated, dramatically cutting manpower costs.
Broader Adoption: Any service with a process flow can adopt this approach to build similar dynamic views for faster troubleshooting.
Acknowledgements: Thanks to colleagues Wang Jianxin and Zhao Hao for their guidance on data collection and dynamic view construction.
References: Includes internal design documents, a Prometheus-Grafana case study, the Flowcharting repository, and introductory PromQL tutorials.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.