Design and Implementation of 360 Container Platform Monitoring System
The article describes how 360 built a Kubernetes‑based container platform monitoring system using Prometheus, ELK, Grafana and custom components, detailing its architecture, monitoring dimensions, log collection, alerting, selection rationale, high‑availability design, and future evolution for scalable cloud‑native operations.
Background
360 launched a container cloud platform that brings convenience to development teams but also introduces operational challenges. The legacy monitoring system, built on Open‑Falcon, could not keep up with dynamic services created by Kubernetes, prompting the development of a new monitoring system that supports service discovery.
360 currently operates five Kubernetes clusters across Beijing, Shanghai, and Shenzhen, plus several GPU clusters.
The container platform brings several benefits:
Resource savings: with containers, a single machine can host dozens of services.
Efficiency: resources can be allocated without waiting for budget approval, and elastic scaling absorbs traffic spikes while releasing idle capacity.
High availability: the platform ensures the expected number of service instances are always running.
Reduced operational burden: developers can build images and deploy themselves without relying on ops.
However, the platform also introduces challenges, such as the need for a monitoring system that can dynamically detect where service instances are scheduled.
360 Container Platform Monitoring Dimensions
The monitoring system observes technical metrics at three levels:
Container – the smallest granularity.
Pod – a group of one or more containers.
Application – may consist of multiple Pods.
In addition to built‑in metrics, the system supports custom monitoring via Prometheus SDKs embedded in applications or sidecar exporters that expose process metrics to Prometheus.
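The sidecar-exporter approach above can be sketched as a small HTTP endpoint serving the Prometheus text exposition format. This is a minimal stdlib-only illustration, not 360's actual exporter: the metric name `app_peak_rss_kilobytes` and the port are assumptions, and a production setup would more likely use an official Prometheus client SDK.

```python
# Minimal sketch of a sidecar-style exporter exposing a process metric in the
# Prometheus text exposition format, using only the Python standard library.
# Metric name and port are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer
import resource


def render_metrics() -> str:
    # Read a real process metric: peak resident set size (KB on Linux).
    peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    lines = [
        "# HELP app_peak_rss_kilobytes Peak resident set size of this process.",
        "# TYPE app_peak_rss_kilobytes gauge",
        f"app_peak_rss_kilobytes {peak_rss_kb}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To serve on the Pod network, run:
#   HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()
# Prometheus would then scrape http://<pod-ip>:9100/metrics.
```

In practice an SDK-based exporter also handles counters, histograms, and label escaping; the point here is only that "expose text over HTTP" is the entire contract Prometheus requires.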
Monitoring System Design
Architecture diagram (image omitted for brevity).
Log Monitoring
The log monitoring component is built on an ELK stack with custom development. A Log Controller watches Kubernetes Deployment resources, detects when Pods become Ready, constructs the absolute log file path on the host, and pushes the configuration to 360’s internal configuration center QConf.
Each Node runs a customized Logstash that periodically pulls the latest configuration from QConf, collects logs, and forwards them to Qbus (a Kafka‑based product). Users can enable Elasticsearch services on the private HULK cloud platform for log search and analysis.
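The controller's path-construction step can be sketched as a pure function. The directory convention `/home/k8s-logs/<namespace>/<pod-name>/` and the QConf key layout below are assumptions for illustration; the article does not specify 360's actual layout, and the configuration-center client is stubbed as a callable.

```python
# Sketch of the path-construction step a Log Controller might run once a Pod
# turns Ready. The host directory layout and QConf key format are assumed
# conventions, not 360's actual ones; qconf_put stands in for the real client.
def host_log_path(namespace: str, pod_name: str, log_file: str,
                  root: str = "/home/k8s-logs") -> str:
    """Return the absolute path of a Pod's log file on the host."""
    return f"{root}/{namespace}/{pod_name}/{log_file}"


def publish_to_qconf(qconf_put, namespace: str, pod_name: str,
                     log_file: str) -> str:
    # Push the resolved path under a per-Pod key so each Node's Logstash can
    # pull the latest collection targets for the Pods scheduled onto it.
    path = host_log_path(namespace, pod_name, log_file)
    qconf_put(f"/logcollect/{namespace}/{pod_name}", path)
    return path
```

Keeping the path logic pure makes it easy to unit-test independently of the Kubernetes watch loop and the configuration center.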
Data Dashboard
The dashboard uses Grafana with LDAP authentication and displays metrics such as memory, traffic, GPU usage, etc. It shows application‑level, Pod‑level, and container‑level baseline metrics (images omitted).
Prometheus collects metrics via HTTP pull; any component exposing an exporter endpoint can be monitored without additional SDKs.
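A scrape job of this kind is typically wired up through Kubernetes service discovery. The fragment below is a common community pattern, not 360's actual configuration; the `prometheus.io/*` annotations are the conventional opt-in labels.

```yaml
# Illustrative Prometheus scrape job: discover Pods via Kubernetes service
# discovery and scrape only those annotated prometheus.io/scrape: "true".
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```

Because targets are discovered dynamically, newly scheduled Pods start being scraped without any change to the Prometheus configuration, which is exactly the property the legacy Open-Falcon setup lacked.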
Alerting
Alerting is based on Prometheus Alertmanager, extended with 360’s own Qalram system and integrated with the internal instant‑messaging tool Lianxin for real‑time notifications.
Monitoring System Selection Comparison
During platform construction, 360 evaluated open‑source solutions such as Heapster+InfluxDB and Prometheus. Heapster lacked business‑level monitoring and scalability, leading the team to choose the cloud‑native Prometheus solution, which better fits dynamic container environments.
Prometheus Application Practice
Each data center runs its own Prometheus instance, and two upper‑level Prometheus instances are deployed for high availability. The upper‑level instances aggregate data from the per‑data‑center instances, filter out unneeded metrics, and forward alerts. Long‑term data is stored in InfluxDB via remote write.
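This two-tier layout maps onto Prometheus federation plus remote write. The fragment below is a sketch under assumed hostnames and an illustrative `match[]` selector; InfluxDB 1.x exposes the remote-write endpoint shown.

```yaml
# Illustrative upper-level Prometheus config: federate the per-datacenter
# instances, keep only the series of interest, and remote-write to InfluxDB.
# Hostnames and the match[] selector are assumptions.
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job=~"kubernetes-.*"}'
    static_configs:
      - targets:
          - prometheus-bj.example.internal:9090
          - prometheus-sh.example.internal:9090

remote_write:
  - url: http://influxdb.example.internal:8086/api/v1/prom/write?db=prometheus
```

Running two identical upper-level instances against the same federation targets gives HA cheaply: both hold the same aggregated view, and Alertmanager deduplicates the alerts they each fire.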
Alert rules are templated; users fill in threshold values. Business‑specific alerts require writing PromQL expressions. Custom alerts are stored in QConf and watched by agents that reload Prometheus when changes occur.
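A templated rule of this shape might look as follows. The metric, labels, and the `90` threshold (the value a user would fill in) are illustrative, not 360's actual template.

```yaml
# Illustrative templated alerting rule; the threshold (90) is the
# user-supplied value, metric and label names are assumptions.
groups:
  - name: app-baseline
    rules:
      - alert: PodMemoryUsageHigh
        expr: sum by (namespace, pod) (container_memory_working_set_bytes{container!=""}) / 1e9 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} memory usage above threshold for 5 minutes"
```

Storing such rules in QConf and having an agent reload Prometheus on change means users never touch the Prometheus servers directly; they only edit threshold values or, for business-specific alerts, the PromQL expression itself.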
Why Use Open‑Source Systems
Prometheus is the de facto standard for monitoring in large‑scale Kubernetes environments, and building on it reduces manpower costs by leveraging community support rather than maintaining a bespoke system.
Future Directions of the Monitoring System
As cluster size grows, a single Prometheus may become a bottleneck; the plan is to deploy multiple Prometheus instances per cluster, each handling a subset of jobs, and to add remote storage for long‑term data. Exploration of AIOps for fault localization and root‑cause analysis is also underway.
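One common way to split load across multiple Prometheus instances is `hashmod` relabeling, where each instance keeps only its own shard of the discovered targets. This is a sketch of that general technique, not a description of 360's planned configuration; each instance would run with a different `regex` (its shard number).

```yaml
# Illustrative sharding config: hash each target address modulo the number of
# shards and keep only this instance's shard (here shard 0 of 3).
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```

An upper-level instance can then federate the shards for global queries, while a remote-storage backend holds data beyond Prometheus's local retention window.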
How Ordinary Teams Can Build a Monitoring System
Monitoring solutions should be tailored to company size and scenario. Small companies can use public‑cloud monitoring services or deploy a lightweight HA monitoring stack. Architecture evolves with business growth.
Author Bio
Wang Xigang – Currently works on operations development at 360, with extensive experience in PaaS design, Kubernetes, Docker, and is responsible for the 360 HULK container cloud platform.
360 Tech Engineering
The official technology channel of 360, aiming to build a professional technology‑sharing platform for the brand.