
How ICBC’s SRE Team Built a Panoramic Monitoring System for Digital Ops Transformation

The Software Development Center of the Industrial and Commercial Bank of China (ICBC) built an SRE panoramic monitoring view system that unifies data channels, standardizes metrics, offers multi-dimensional dashboards, and introduces an intelligent Ops Assistant. The team credits it with faster fault detection and response and more efficient cross-team operations.

Efficient Ops

ICBC's Software Development Center built the SRE panoramic monitoring system around four goals: unified channel access, standardized data, rich dashboards, and hierarchical drill-down. The system integrates all types of monitoring data on a data-standard architecture designed for high availability. According to the team, it has significantly improved emergency response, cross-team inspections, and dev-ops collaboration.
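To make "unified channel access" and "standardized data" concrete, here is a minimal sketch of mapping records from different monitoring channels onto one standard metric shape. The channel names (`host_agent`, `apm`), their field names, and the `StandardMetric` schema are all hypothetical; the article does not describe ICBC's actual data standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardMetric:
    """Hypothetical standard record shared by every dashboard."""
    app: str
    name: str
    value: float
    unit: str
    timestamp: int  # epoch seconds


def normalize(channel: str, record: dict) -> StandardMetric:
    """Convert one raw record from a named channel into the standard shape."""
    if channel == "host_agent":
        # Assumed format: values already in the target unit, timestamp in seconds.
        return StandardMetric(
            app=record["application"],
            name=record["metric"],
            value=float(record["val"]),
            unit=record.get("unit", "count"),
            timestamp=int(record["ts"]),
        )
    if channel == "apm":
        # Assumed format: latency in microseconds, timestamp in milliseconds.
        return StandardMetric(
            app=record["svc"],
            name=record["indicator"],
            value=record["reading_us"] / 1000.0,  # microseconds -> milliseconds
            unit="ms",
            timestamp=record["time_ms"] // 1000,  # milliseconds -> seconds
        )
    raise ValueError(f"unknown channel: {channel}")
```

Once every channel passes through one function like this, downstream views only need to understand a single record type, which is what makes drill-down across layers practical.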

01 Custom Multi‑Dimensional Monitoring Views for Rapid Fault Location

Overall Layer – Business Operations Observation Center: Aggregates central monitoring, key departmental metrics, fast navigation, four-level production alerts, and operational guarantees. It gives a quick read on overall production status and lets users drill down to block-level views.

Business Block Layer – Business Perspective Runtime Information: Shows business operation indicators, key technical metrics, and static link diagrams through application-level navigation, block-level overview screens, critical business scenarios, and transaction link maps, so that block-level runtime data is faster and easier to reach.

Development Department Layer – Full-View Application Trend Perception: Tailored to each department's applications. It displays large transactions and slow SQL in databases, batch job status and trends, and key production applications, strengthening monitoring and oversight of departmental applications.

Application Layer – Cross-Team Aggregated Runtime Information: Aggregates node metrics per application, draws intelligent baselines for batch jobs on the data middle platform, and raises early warnings based on dynamic baselines, improving operational awareness for both developers and ops teams.
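The article does not say how the dynamic baselines are computed. As one possible illustration, the sketch below uses a rolling mean plus or minus `k` standard deviations over recent batch-job durations and flags any run that falls outside that band. The function names, the window size, and the choice of a mean/standard-deviation band are assumptions, not ICBC's method.

```python
from statistics import mean, stdev


def dynamic_baseline(history, k=3.0):
    """Return (low, high) bounds from recent values; needs at least 2 points."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def early_warnings(series, window=5, k=3.0):
    """Return indexes of points that fall outside the baseline built
    from the `window` points immediately before them."""
    flagged = []
    for i in range(window, len(series)):
        low, high = dynamic_baseline(series[i - window:i], k)
        if not (low <= series[i] <= high):
            flagged.append(i)
    return flagged


# Hypothetical batch-job durations in seconds; the run at index 6 is abnormal.
durations = [100, 102, 98, 101, 99, 100, 180, 101]
print(early_warnings(durations))
```

Because the band is recomputed from the most recent window, it follows gradual drift in job duration instead of relying on a fixed threshold. A production version would likely also account for daily or weekly seasonality.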

02 Introducing the "Ops Assistant" for Faster Access to Production Information

The Ops Assistant is integrated into the panoramic view and uses intelligent Q&A to provide real-time production information such as emergency plans, on-call contacts, and CMDB configurations. It links monitoring with emergency response and helps developers analyze fault causes, track remediation progress, and understand recovery status.
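The article does not describe how the Ops Assistant is implemented. As a rough illustration of the lookup behind such a Q&A entry point, the sketch below routes a question to one of three knowledge sources by keyword. Simple keyword matching stands in for the "intelligent Q&A" component, and the application name, knowledge entries, and keyword lists are all hypothetical.

```python
# Hypothetical knowledge sources, keyed by intent and then by application name.
KNOWLEDGE = {
    "emergency_plan": {
        "payment-gateway": "Switch traffic to the standby cluster, then restart failed nodes.",
    },
    "on_call": {
        "payment-gateway": "on-call engineer A (placeholder)",
    },
    "cmdb": {
        "payment-gateway": {"nodes": 12, "database": "placeholder-db"},
    },
}

# Keywords that select each intent; a real assistant would use an NLU model.
INTENT_KEYWORDS = {
    "emergency_plan": ("emergency", "plan"),
    "on_call": ("on-call", "contact"),
    "cmdb": ("cmdb", "config"),
}


def answer(question: str):
    """Return the matching knowledge entry, or None if nothing matches."""
    text = question.lower()
    intent = next(
        (name for name, words in INTENT_KEYWORDS.items()
         if any(word in text for word in words)),
        None,
    )
    if intent is None:
        return None
    app = next((name for name in KNOWLEDGE[intent] if name in text), None)
    return KNOWLEDGE[intent].get(app) if app else None
```

For example, `answer("Who is the on-call contact for payment-gateway?")` returns the placeholder on-call entry, while a question that names no known application returns `None`.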

Through resource centralization and dev-ops co-construction, ICBC has achieved near-complete coverage of SRE panoramic monitoring views across key product lines and at both the department level and the development-department level. The views supply detailed data for fault prediction and post-mortem analysis and help meet the "1-5-10" emergency response requirements.
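The article does not define "1-5-10". The sketch below assumes a common reading: detect a fault within 1 minute, locate it within 5 minutes, and recover within 10 minutes, each measured from the moment the fault began. The function name and the timeline fields are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed targets, each measured from the start of the fault.
TARGETS = {
    "detected": timedelta(minutes=1),
    "located": timedelta(minutes=5),
    "recovered": timedelta(minutes=10),
}


def check_1_5_10(started, detected, located, recovered):
    """Return, per stage, whether it finished within its assumed target."""
    marks = {"detected": detected, "located": located, "recovered": recovered}
    return {
        stage: (marks[stage] - started) <= limit
        for stage, limit in TARGETS.items()
    }


# Hypothetical incident timeline.
start = datetime(2024, 1, 1, 10, 0, 0)
result = check_1_5_10(
    started=start,
    detected=start + timedelta(seconds=40),
    located=start + timedelta(minutes=4),
    recovered=start + timedelta(minutes=12),
)
print(result)  # {'detected': True, 'located': True, 'recovered': False}
```

Running a check like this over post-mortem records is one way the detailed data from the monitoring views could be turned into a compliance figure.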

Future plans include expanding monitoring scope, enhancing support for innovation‑transformation components, improving root‑cause location in fault scenarios, and developing more comprehensive inspection mechanisms to further strengthen the SRE panoramic monitoring system’s role in ensuring production safety.

Tags: monitoring, operations, Observability, SRE, digital transformation, ICBC
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
