
How to Build an Automated Kubernetes Inspection Platform with Bash and Prometheus

This article explains how to design and implement a Kubernetes platform inspection system that combines Bash scripts and Prometheus queries to monitor cluster health, core component status, and node resources, providing actionable alerts and a flexible automation framework.


What Is Platform Inspection

Platform inspection is a monitoring tool that evaluates the health of underlying systems, quickly identifying potential risks and offering remediation suggestions.

The tool scans various aspects of a cluster, including performance bottlenecks, component statuses, resource usage, and configuration issues, to improve stability and availability.

Why Inspection Matters

Even with metrics, logs, traces, Grafana, and alerts, inspection adds value by:

Supplementing monitoring for items like certificate expiration, Pod CIDR usage, Etcd and Velero backup status, which are easier to view via scripts than exporters.

Checking the health of Prometheus, VictoriaMetrics, and other components to ensure metrics are being collected.

Providing proactive problem discovery through centralized checks instead of inspecting each Grafana panel individually.

Kubernetes Inspection Key Metrics

The metrics are divided into three categories:

Cluster Overview

Core Component Status

Node Status

The PromQL expressions and Bash scripts below are examples; adapt the label selectors (cluster, zone, label_env), file paths, and thresholds to your own environment.

Cluster Overview

Inspection Item: Node Usage

Description: Checks how many worker nodes are Ready and schedulable, i.e. whether the cluster still has usable capacity.

Source: bash

<code>#!/bin/bash
set -o errexit
set -o nounset
# Total worker nodes vs. nodes that are Ready and schedulable.
node_sum=$(kubectl get nodes | awk 'NR>1' | grep -cv master || true)
node_ready=$(kubectl get nodes | awk 'NR>1' | grep -v master | grep -w Ready | grep -cv SchedulingDisabled || true)
echo "| ${node_ready}/${node_sum}"
if [[ $node_ready -eq $node_sum ]]; then
  echo "success"
else
  echo "warning"
fi</code>

Inspection Item: Pod Remaining Capacity

Description: Determines how many Pod slots remain available for scheduling.

Source: prometheus

<code>sum(kube_node_status_capacity{resource='pods'} * on(node) group_left(label_env) kube_node_labels{label_env=~"prod",cluster="core",zone=~"shanghai"} unless on(node) kube_node_role) -
sum(kube_pod_info * on(node) group_left(label_env) kube_node_labels{label_env=~"prod",cluster="core",zone=~"shanghai"} unless on(node) kube_node_role)</code>

Threshold:

["<",90]
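Thresholds are stored as two-element JSON arrays: an operator and a limit. A minimal Go sketch of how such a threshold might be evaluated against a queried value (the function name `evalThreshold` is an assumption for illustration, not the platform's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// evalThreshold applies a stored threshold such as ["<",90] to a queried
// value and reports whether the check trips (i.e. signals a problem).
func evalThreshold(raw string, value float64) (bool, error) {
	var parts []json.RawMessage
	if err := json.Unmarshal([]byte(raw), &parts); err != nil || len(parts) != 2 {
		return false, fmt.Errorf("bad threshold %q", raw)
	}
	var op string
	var limit float64
	if err := json.Unmarshal(parts[0], &op); err != nil {
		return false, err
	}
	if err := json.Unmarshal(parts[1], &limit); err != nil {
		return false, err
	}
	switch op {
	case "<":
		return value < limit, nil
	case ">":
		return value > limit, nil
	}
	return false, fmt.Errorf("unknown operator %q", op)
}

func main() {
	tripped, _ := evalThreshold(`["<",90]`, 42)
	fmt.Println(tripped) // → true: 42 remaining Pod slots is below 90
}
```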

Inspection Item: Pod CIDR Usage

Description: Shows the number of IPs left for Pods.

Source: bash

<code>#!/bin/bash
set -o errexit
set -o nounset
pod_ip_free=$(calicoctl ipam show | grep '%' | awk '{print $12}')
echo "| remaining IPs: ${pod_ip_free}"
if [[ $pod_ip_free -gt 500 ]]; then
  echo "success"
elif [[ $pod_ip_free -gt 100 ]]; then
  echo "warning"
else
  echo "error"
fi</code>

Inspection Item: Cluster CPU Usage

Source: prometheus

<code>(1 - avg(label_replace(rate(node_cpu_seconds_total{mode="idle",cluster="core",zone=~"shanghai"}[60s]),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai"} * on(node) group_left(internal_ip) kube_node_info)) * 100</code>

Threshold:

[">",50]

Inspection Item: Cluster Memory Usage

Source: prometheus

<code>(1 - sum(label_replace(node_memory_MemAvailable_bytes{cluster="core",zone=~"shanghai"},"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai"} * on(node) group_left(internal_ip) kube_node_info) / sum(label_replace(node_memory_MemTotal_bytes{cluster="core",zone=~"shanghai"},"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai"} * on(node) group_left(internal_ip) kube_node_info)) * 100</code>

Threshold:

[">",85]

Inspection Item: Certificate Expiration

Source: bash

<code>#!/bin/bash
set -o errexit
set -o nounset
ct=$(date -d "$(openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates | awk -F '=' '/notAfter/{print $2}' | awk '{print $1,$2,$3,$4}')" +%s)
dt=$(date +%s)
expired=$(( (ct-dt)/(60*60*24) ))
echo "| expires in ${expired} days"
if [[ $expired -gt 60 ]]; then
  echo "success"
elif [[ $expired -gt 15 ]]; then
  echo "warning"
else
  echo "error"
fi</code>

Inspection Item: Etcd Backup Status

Source: bash

<code>#!/bin/bash
set -o nounset
# Pass if an etcd backup file was written within the last two hours.
result=$(find /var/lib/docker/etcd_backup/ -mmin -120)
if [[ -n ${result} ]]; then
  echo "| OK"
  echo "success"
else
  echo "| abnormal"
  echo "error"
fi</code>

Inspection Item: Velero Backup Status

Source: bash

<code>#!/bin/bash
set -o nounset
current_date=$(date +%F)
backup_date=$(velero backup get | grep core-shanghai | awk '{print $5}' | sort -nr | head -1)
# A backup counts as fresh if it is less than two days old.
if [[ -n ${backup_date} ]] && [[ $(date -d "${backup_date} +2 days" +%F) > ${current_date} ]]; then
  echo "| OK"
  echo "success"
else
  echo "| abnormal"
  echo "error"
fi</code>

Core Component Status

etcd

Inspection Item: Insufficient etcd Nodes

Source: prometheusOr

<code>sum by(job) (up{job=~".*etcd.*",cluster="core",zone="shanghai"} == bool 1) < ((count by(job) (up{job=~".*etcd.*",cluster="core",zone="shanghai"}) + 1) / 2)</code>

Threshold: yes

Inspection Item: etcd Leader Presence

Source: prometheusOr

<code>etcd_server_has_leader{job=~".*etcd.*",cluster="core",zone="shanghai"} == 1</code>

Threshold: no

Inspection Item: Frequent etcd Leader Switches

Source: prometheusOr

<code>rate(etcd_server_leader_changes_seen_total{job=~".*etcd.*",cluster="core",zone="shanghai"}[15m]) > 3</code>

Threshold: yes

Inspection Item: etcd Request Success Rate

Source: prometheus

<code>100 - max(sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK",cluster="core",zone="shanghai"}[1m])) by (grpc_service) / sum(rate(grpc_server_started_total{grpc_type="unary",cluster="core",zone="shanghai"}[1m])) by (grpc_service) * 100.0)</code>

Threshold:

["<",99]

Inspection Item: etcd Disk WAL Latency

Source: prometheus

<code>max(histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{cluster="core",zone="shanghai"}[1m])) by (instance,le))) * 1000</code>

Threshold:

[">",10]

kube-apiserver

Inspection Item: apiserver Health

Source: prometheus

<code>sum(up{job="apiserver",cluster="core",zone="shanghai"}) / count(up{job="apiserver",cluster="core",zone="shanghai"}) * 100</code>

Threshold:

["<",90]

Inspection Item: apiserver QPS

Source: prometheus

<code>sum(rate(apiserver_request_total{cluster="core",zone="shanghai"}[1m]))</code>

Threshold:

[">",3000]

Inspection Item: apiserver Request Success Rate

Source: prometheus

<code>apiserver_request:availability30d{verb="all",cluster="core",zone="shanghai"} * 100</code>

Threshold:

["<",99]

Inspection Item: apiserver Request Latency

Source: prometheus

<code>max(cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{cluster="core",zone="shanghai"})</code>

Threshold:

[">",1]

Node Status

kubelet

Inspection Item: Unready Nodes

Source: prometheusList

<code>sum by(node) (kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true",cluster="core",zone="shanghai"}) == 0</code>

Inspection Item: High PLEG Relist Duration

Source: prometheusList

<code>histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",metrics_path="/metrics",cluster="core",zone="shanghai"}[1m])) by (node,le)) * 1000 > 1000</code>

Resource Usage

Inspection Item: Nodes with CPU > 50%

Source: prometheusList

<code>(1 - avg by(internal_ip) (label_replace(rate(node_cpu_seconds_total{mode="idle",cluster="core",zone=~"shanghai"}[60s]),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai",label_env=~"prod"} * on(node) group_left(internal_ip) kube_node_info)) * 100 > 50</code>

Inspection Item: Nodes with Memory > 80%

Source: prometheusList

<code>sum by(internal_ip) (label_replace(1 - (node_memory_MemAvailable_bytes{cluster="core",zone=~"shanghai"} / node_memory_MemTotal_bytes{cluster="core",zone=~"shanghai"}),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai",label_env=~"prod"} * on(node) group_left(internal_ip) kube_node_info) * 100 > 80</code>

Inspection Item: Disk / Usage > 80%

Source: prometheusList

<code>sum by(internal_ip) (label_replace(100 - ((node_filesystem_avail_bytes{job="node-exporter",mountpoint="/",fstype!="rootfs",cluster="core",zone="shanghai"} * 100) / node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!="rootfs",cluster="core",zone="shanghai"}),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai",label_env=~"prod"} * on(node) group_left(internal_ip) kube_node_info) > 80</code>

Inspection Item: PID Usage > 80%

Source: prometheusList

<code>label_replace(node_processes_threads{cluster="core",zone="shanghai"} / on(instance) min by(instance) (node_processes_max_processes or node_processes_max_threads{cluster="core",zone="shanghai"}),"internal_ip","$1","instance","(.+):(\\d+)") * 100 > 80</code>

Inspection Item: FD Usage > 70%

Source: prometheusList

<code>sum by(internal_ip) (label_replace(node_filefd_allocated{job="node-exporter",cluster="core",zone="shanghai"} * 100 / node_filefd_maximum{job="node-exporter",cluster="core",zone="shanghai"},"internal_ip","$1","instance","(.+):(\\d+)") ) > 70</code>

Inspection Item: Time Sync Issues

Source: prometheusList

<code>min_over_time(node_timex_sync_status{cluster="core",zone="shanghai"}[5m]) == 0 and node_timex_maxerror_seconds{cluster="core",zone="shanghai"} >= 16</code>

Inspection Item: DockerHung Pods

Source: prometheusList

<code>sum by(node) (rate(problem_counter{reason="DockerHung",cluster="core",zone="shanghai"}[1m])) > 0</code>

Automated Inspection Platform

Each item's "action source" field is one of bash, prometheus, prometheusOr, or prometheusList. Bash scripts reside on the K8s master node and print a result line followed by a status line (success, warning, or error). prometheus items query a single value and compare it against the stored threshold; prometheusList items report every series the expression returns; for prometheusOr items, the yes/no threshold indicates whether returned series signal a problem (yes) or whether an empty result does (no).
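The two-line contract for bash items can be parsed with a small helper. A minimal Go sketch, where the names (`InspectionRule`, `parseBashOutput`) and struct fields are illustrative assumptions rather than the platform's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// InspectionRule mirrors one row of the rules table (illustrative fields).
type InspectionRule struct {
	Name         string
	ActionType   string // bash | prometheus | prometheusOr | prometheusList
	ActionDetail string // PromQL expression or script name
	Threshold    string // JSON array such as ["<",90]
}

// parseBashOutput splits the two-line contract used by bash items:
// a "| value" result line followed by a success/warning/error status line.
func parseBashOutput(out string) (value, status string) {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	if len(lines) < 2 {
		return "", "error"
	}
	return strings.TrimPrefix(lines[0], "| "), lines[len(lines)-1]
}

func main() {
	value, status := parseBashOutput("| 5/6\nwarning")
	fmt.Println(value, status) // → 5/6 warning
}
```

Treating malformed output as an error status means a script that crashes before printing both lines still surfaces as a failed check rather than a silent gap.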

All execution commands and script names are stored in a MySQL table; adding a new inspection item only requires inserting a rule into the table.

Inspection platform schema
Inspection platform UI
Note: PromQL must be URL‑encoded.
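Queries go to the Prometheus HTTP API's /api/v1/query endpoint as a query-string parameter, which is why the expression must be escaped. A minimal Go sketch (the helper name `buildQueryURL` and the Prometheus address are illustrative):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL assembles an instant-query URL for the Prometheus HTTP API,
// URL-encoding the PromQL expression. Double quotes become %22, which is
// exactly the %22core%22-style form the placeholders take in the rules table.
func buildQueryURL(base, promql string) string {
	return base + "/api/v1/query?query=" + url.QueryEscape(promql)
}

func main() {
	fmt.Println(buildQueryURL("http://prometheus:9090", `up{job="apiserver"}`))
	// → http://prometheus:9090/api/v1/query?query=up%7Bjob%3D%22apiserver%22%7D
}
```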

Core pseudo‑code (simplified):

<code>var mu sync.Mutex

type ScannerRequest struct {
    CheckKeys []int `json:"check_keys"`
    SelectedCluster int `json:"selected_cluster"`
}

func (s *ScannerController) ScannerStart(g *gin.Context) {
    mu.Lock()
    defer mu.Unlock()
    s.store.UpdateAllStatus()
    var r ScannerRequest
    if err := g.ShouldBindJSON(&r); err != nil {
        v2api.AbnormalJsonResponse(g, "", "body parse error: "+err.Error())
        return
    }
    // Load cluster info from JSON strings into a map
    // Determine which scanner items to run based on CheckKeys
    // For each item launch a goroutine that:
    //   - Retrieves action_type, action_detail, threshold from DB
    //   - Replaces placeholders (%22core%22, %22shanghai%22) with actual cluster name/zone
    //   - Executes the appropriate logic (prometheus, prometheusOr, prometheusList, bash)
    //   - Updates DB with value and status (success, warning, error)
    v2api.NormalJsonResponse(g, "inspection started", "")
}

// Helper functions for Prometheus queries, SSH execution, etc.
</code>
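The "launch a goroutine per item" step in the pseudo-code is a fan-out/fan-in pattern: run each check concurrently, collect statuses under a mutex, and wait for all of them before reporting. A self-contained sketch under those assumptions (names are illustrative, not the platform's code):

```go
package main

import (
	"fmt"
	"sync"
)

// runAll executes one inspection function per item concurrently and
// gathers each item's status into a map guarded by a mutex.
func runAll(items []string, run func(item string) string) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string)
	)
	for _, item := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			status := run(item) // e.g. query Prometheus or run a script over SSH
			mu.Lock()
			results[item] = status
			mu.Unlock()
		}(item)
	}
	wg.Wait() // fan-in: block until every check has reported
	return results
}

func main() {
	r := runAll([]string{"node_usage", "pod_capacity"},
		func(string) string { return "success" })
	fmt.Println(len(r)) // → 2
}
```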

Page display screenshots illustrate the UI of the inspection platform.

UI screenshot 1
UI screenshot 2
Tags: monitoring, automation, kubernetes, Prometheus, Bash, platform inspection
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
