
How to Build an Automated Kubernetes Inspection Platform with Bash and Prometheus

This article explains how to design and implement a Kubernetes platform inspection system that combines Bash scripts and Prometheus queries to monitor cluster health, core component status, and node resources, providing actionable alerts and a flexible automation framework.


What Is Platform Inspection

Platform inspection is a monitoring tool that evaluates the health of underlying systems, quickly identifying potential risks and offering remediation suggestions.

The tool scans various aspects of a cluster, including performance bottlenecks, component statuses, resource usage, and configuration issues, to improve stability and availability.

Why Inspection Matters

Even with metrics, logs, traces, Grafana, and alerts, inspection adds value by:

Supplementing monitoring for items like certificate expiration, Pod CIDR usage, Etcd and Velero backup status, which are easier to view via scripts than exporters.

Checking the health of Prometheus, VictoriaMetrics, and other components to ensure metrics are being collected.

Providing proactive problem discovery through centralized checks instead of inspecting each Grafana panel individually.

Kubernetes Inspection Key Metrics

The metrics are divided into three categories:

Cluster Overview

Core Component Status

Node Status

The PromQL expressions and Bash scripts below are examples; adapt the label selectors (cluster, zone, label_env), file paths, and thresholds to your own environment.

Cluster Overview

Inspection Item: Node Usage

Description: Checks how many worker nodes are Ready and schedulable, i.e. whether the cluster still has usable capacity.

Source: bash

<code>#!/bin/bash
set -o errexit
set -o nounset
# Total worker nodes vs. nodes that are Ready and schedulable.
node_sum=$(kubectl get nodes | awk 'NR>1' | grep -cv master || true)
node_ready=$(kubectl get nodes | awk 'NR>1' | grep -v master | grep -w Ready | grep -cv SchedulingDisabled || true)
echo "| ${node_ready}/${node_sum}"
if [[ $node_ready -eq $node_sum ]]; then
  echo "success"
else
  echo "warning"
fi</code>

Inspection Item: Pod Remaining Capacity

Description: Determines how many Pod slots remain available for scheduling.

Source: prometheus

<code>sum(kube_node_status_capacity{resource='pods'} * on(node) group_left(label_env) kube_node_labels{label_env=~"prod",cluster="core",zone=~"shanghai"} unless on(node) kube_node_role) -
sum(kube_pod_info * on(node) group_left(label_env) kube_node_labels{label_env=~"prod",cluster="core",zone=~"shanghai"} unless on(node) kube_node_role)</code>

Threshold:

["<",90]
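Thresholds are stored as two-element JSON arrays: an operator and a limit. A minimal Go sketch of how such a threshold might be evaluated against a queried value (the function name `evalThreshold` is an assumption for illustration, not the platform's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// evalThreshold applies a stored threshold such as ["<",90] to a queried
// value and reports whether the check trips (i.e. signals a problem).
func evalThreshold(raw string, value float64) (bool, error) {
	var parts []json.RawMessage
	if err := json.Unmarshal([]byte(raw), &parts); err != nil || len(parts) != 2 {
		return false, fmt.Errorf("bad threshold %q", raw)
	}
	var op string
	var limit float64
	if err := json.Unmarshal(parts[0], &op); err != nil {
		return false, err
	}
	if err := json.Unmarshal(parts[1], &limit); err != nil {
		return false, err
	}
	switch op {
	case "<":
		return value < limit, nil
	case ">":
		return value > limit, nil
	}
	return false, fmt.Errorf("unknown operator %q", op)
}

func main() {
	tripped, _ := evalThreshold(`["<",90]`, 42)
	fmt.Println(tripped) // → true: 42 remaining Pod slots is below 90
}
```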

Inspection Item: Pod CIDR Usage

Description: Shows the number of IPs left for Pods.

Source: bash

<code>#!/bin/bash
set -o errexit
set -o nounset
pod_ip_free=$(calicoctl ipam show | grep '%' | awk '{print $12}')
echo "| remaining IPs: ${pod_ip_free}"
if [[ $pod_ip_free -gt 500 ]]; then
  echo "success"
elif [[ $pod_ip_free -gt 100 ]]; then
  echo "warning"
else
  echo "error"
fi</code>

Inspection Item: Cluster CPU Usage

Source: prometheus

<code>(1 - avg(label_replace(rate(node_cpu_seconds_total{mode="idle",cluster="core",zone=~"shanghai"}[60s]),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai"} * on(node) group_left(internal_ip) kube_node_info)) * 100</code>

Threshold:

[">",50]

Inspection Item: Cluster Memory Usage

Source: prometheus

<code>(1 - sum(label_replace(node_memory_MemAvailable_bytes{cluster="core",zone=~"shanghai"},"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai"} * on(node) group_left(internal_ip) kube_node_info) / sum(label_replace(node_memory_MemTotal_bytes{cluster="core",zone=~"shanghai"},"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai"} * on(node) group_left(internal_ip) kube_node_info)) * 100</code>

Threshold:

[">",85]

Inspection Item: Certificate Expiration

Source: bash

<code>#!/bin/bash
set -o errexit
set -o nounset
ct=$(date -d "$(openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates | awk -F '=' '/notAfter/{print $2}' | awk '{print $1,$2,$3,$4}')" +%s)
dt=$(date +%s)
expired=$(( (ct-dt)/(60*60*24) ))
echo "| expires in ${expired} days"
if [[ $expired -gt 60 ]]; then
  echo "success"
elif [[ $expired -gt 15 ]]; then
  echo "warning"
else
  echo "error"
fi</code>

Inspection Item: Etcd Backup Status

Source: bash

<code>#!/bin/bash
set -o nounset
# Pass if an etcd backup file was written within the last two hours.
result=$(find /var/lib/docker/etcd_backup/ -mmin -120)
if [[ -n ${result} ]]; then
  echo "| OK"
  echo "success"
else
  echo "| abnormal"
  echo "error"
fi</code>

Inspection Item: Velero Backup Status

Source: bash

<code>#!/bin/bash
set -o nounset
current_date=$(date +%F)
backup_date=$(velero backup get | grep core-shanghai | awk '{print $5}' | sort -nr | head -1)
# A backup counts as fresh if it is less than two days old.
if [[ -n ${backup_date} ]] && [[ $(date -d "${backup_date} +2 days" +%F) > ${current_date} ]]; then
  echo "| OK"
  echo "success"
else
  echo "| abnormal"
  echo "error"
fi</code>

Core Component Status

etcd

Inspection Item: Insufficient etcd Nodes

Source: prometheusOr

<code>sum by(job) (up{job=~".*etcd.*",cluster="core",zone="shanghai"} == bool 1) < ((count by(job) (up{job=~".*etcd.*",cluster="core",zone="shanghai"}) + 1) / 2)</code>

Threshold: yes

Inspection Item: etcd Leader Presence

Source: prometheusOr

<code>etcd_server_has_leader{job=~".*etcd.*",cluster="core",zone="shanghai"} == 1</code>

Threshold: no

Inspection Item: Frequent etcd Leader Switches

Source: prometheusOr

<code>rate(etcd_server_leader_changes_seen_total{job=~".*etcd.*",cluster="core",zone="shanghai"}[15m]) > 3</code>

Threshold: yes

Inspection Item: etcd Request Success Rate

Source: prometheus

<code>100 - max(sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK",cluster="core",zone="shanghai"}[1m])) by (grpc_service) / sum(rate(grpc_server_started_total{grpc_type="unary",cluster="core",zone="shanghai"}[1m])) by (grpc_service) * 100.0)</code>

Threshold:

["<",99]

Inspection Item: etcd Disk WAL Latency

Source: prometheus

<code>max(histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{cluster="core",zone="shanghai"}[1m])) by (instance,le))) * 1000</code>

Threshold:

[">",10]

kube-apiserver

Inspection Item: apiserver Health

Source: prometheus

<code>sum(up{job="apiserver",cluster="core",zone="shanghai"}) / count(up{job="apiserver",cluster="core",zone="shanghai"}) * 100</code>

Threshold:

["<",90]

Inspection Item: apiserver QPS

Source: prometheus

<code>sum(rate(apiserver_request_total{cluster="core",zone="shanghai"}[1m]))</code>

Threshold:

[">",3000]

Inspection Item: apiserver Request Success Rate

Source: prometheus

<code>apiserver_request:availability30d{verb="all",cluster="core",zone="shanghai"} * 100</code>

Threshold:

["<",99]

Inspection Item: apiserver Request Latency

Source: prometheus

<code>max(cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{cluster="core",zone="shanghai"})</code>

Threshold:

[">",1]

Node Status

kubelet

Inspection Item: Unready Nodes

Source: prometheusList

<code>sum by(node) (kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true",cluster="core",zone="shanghai"}) == 0</code>

Inspection Item: High PLEG Relist Duration

Source: prometheusList

<code>histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",metrics_path="/metrics",cluster="core",zone="shanghai"}[1m])) by (node,le)) * 1000 > 1000</code>

Resource Usage

Inspection Item: Nodes with CPU > 50%

Source: prometheusList

<code>(1 - avg by(internal_ip) (label_replace(rate(node_cpu_seconds_total{mode="idle",cluster="core",zone=~"shanghai"}[60s]),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai",label_env=~"prod"} * on(node) group_left(internal_ip) kube_node_info)) * 100 > 50</code>

Inspection Item: Nodes with Memory > 80%

Source: prometheusList

<code>sum by(internal_ip) (label_replace(1 - (node_memory_MemAvailable_bytes{cluster="core",zone=~"shanghai"} / node_memory_MemTotal_bytes{cluster="core",zone=~"shanghai"}),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai",label_env=~"prod"} * on(node) group_left(internal_ip) kube_node_info) * 100 > 80</code>

Inspection Item: Disk / Usage > 80%

Source: prometheusList

<code>sum by(internal_ip) (label_replace(100 - ((node_filesystem_avail_bytes{job="node-exporter",mountpoint="/",fstype!="rootfs",cluster="core",zone="shanghai"} * 100) / node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!="rootfs",cluster="core",zone="shanghai"}),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai",label_env=~"prod"} * on(node) group_left(internal_ip) kube_node_info) > 80</code>

Inspection Item: PID Usage > 80%

Source: prometheusList

<code>label_replace(node_processes_threads{cluster="core",zone="shanghai"} / on(instance) min by(instance) (node_processes_max_processes or node_processes_max_threads{cluster="core",zone="shanghai"}),"internal_ip","$1","instance","(.+):(\\d+)") * 100 > 80</code>

Inspection Item: FD Usage > 70%

Source: prometheusList

<code>sum by(internal_ip) (label_replace(node_filefd_allocated{job="node-exporter",cluster="core",zone="shanghai"} * 100 / node_filefd_maximum{job="node-exporter",cluster="core",zone="shanghai"},"internal_ip","$1","instance","(.+):(\\d+)") ) > 70</code>

Inspection Item: Time Sync Issues

Source: prometheusList

<code>min_over_time(node_timex_sync_status{cluster="core",zone="shanghai"}[5m]) == 0 and node_timex_maxerror_seconds{cluster="core",zone="shanghai"} >= 16</code>

Inspection Item: DockerHung Pods

Source: prometheusList

<code>sum by(node) (rate(problem_counter{reason="DockerHung",cluster="core",zone="shanghai"}[1m])) > 0</code>

Automated Inspection Platform

Each item's "action source" field is one of bash, prometheus, prometheusOr, or prometheusList. Bash scripts reside on the K8s master node and print a result line followed by a status line (success, warning, or error). prometheus items query a single value and compare it against the stored threshold; prometheusList items report every series the expression returns; for prometheusOr items, the yes/no threshold indicates whether returned series signal a problem (yes) or whether an empty result does (no).
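The two-line contract for bash items can be parsed with a small helper. A minimal Go sketch, where the names (`InspectionRule`, `parseBashOutput`) and struct fields are illustrative assumptions rather than the platform's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// InspectionRule mirrors one row of the rules table (illustrative fields).
type InspectionRule struct {
	Name         string
	ActionType   string // bash | prometheus | prometheusOr | prometheusList
	ActionDetail string // PromQL expression or script name
	Threshold    string // JSON array such as ["<",90]
}

// parseBashOutput splits the two-line contract used by bash items:
// a "| value" result line followed by a success/warning/error status line.
func parseBashOutput(out string) (value, status string) {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	if len(lines) < 2 {
		return "", "error"
	}
	return strings.TrimPrefix(lines[0], "| "), lines[len(lines)-1]
}

func main() {
	value, status := parseBashOutput("| 5/6\nwarning")
	fmt.Println(value, status) // → 5/6 warning
}
```

Treating malformed output as an error status means a script that crashes before printing both lines still surfaces as a failed check rather than a silent gap.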

All execution commands and script names are stored in a MySQL table; adding a new inspection item only requires inserting a rule into the table.

Inspection platform schema
Inspection platform UI
Note: PromQL must be URL‑encoded.
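Queries go to the Prometheus HTTP API's /api/v1/query endpoint as a query-string parameter, which is why the expression must be escaped. A minimal Go sketch (the helper name `buildQueryURL` and the Prometheus address are illustrative):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL assembles an instant-query URL for the Prometheus HTTP API,
// URL-encoding the PromQL expression. Double quotes become %22, which is
// exactly the %22core%22-style form the placeholders take in the rules table.
func buildQueryURL(base, promql string) string {
	return base + "/api/v1/query?query=" + url.QueryEscape(promql)
}

func main() {
	fmt.Println(buildQueryURL("http://prometheus:9090", `up{job="apiserver"}`))
	// → http://prometheus:9090/api/v1/query?query=up%7Bjob%3D%22apiserver%22%7D
}
```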

Core pseudo‑code (simplified):

<code>var mu sync.Mutex

type ScannerRequest struct {
    CheckKeys []int `json:"check_keys"`
    SelectedCluster int `json:"selected_cluster"`
}

func (s *ScannerController) ScannerStart(g *gin.Context) {
    mu.Lock()
    defer mu.Unlock()
    s.store.UpdateAllStatus()
    var r ScannerRequest
    if err := g.ShouldBindJSON(&r); err != nil {
        v2api.AbnormalJsonResponse(g, "", "body parse error: "+err.Error())
        return
    }
    // Load cluster info from JSON strings into a map
    // Determine which scanner items to run based on CheckKeys
    // For each item launch a goroutine that:
    //   - Retrieves action_type, action_detail, threshold from DB
    //   - Replaces placeholders (%22core%22, %22shanghai%22) with actual cluster name/zone
    //   - Executes the appropriate logic (prometheus, prometheusOr, prometheusList, bash)
    //   - Updates DB with value and status (success, warning, error)
    v2api.NormalJsonResponse(g, "inspection started", "")
}

// Helper functions for Prometheus queries, SSH execution, etc.
</code>
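The "launch a goroutine per item" step in the pseudo-code is a fan-out/fan-in pattern: run each check concurrently, collect statuses under a mutex, and wait for all of them before reporting. A self-contained sketch under those assumptions (names are illustrative, not the platform's code):

```go
package main

import (
	"fmt"
	"sync"
)

// runAll executes one inspection function per item concurrently and
// gathers each item's status into a map guarded by a mutex.
func runAll(items []string, run func(item string) string) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string)
	)
	for _, item := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			status := run(item) // e.g. query Prometheus or run a script over SSH
			mu.Lock()
			results[item] = status
			mu.Unlock()
		}(item)
	}
	wg.Wait() // fan-in: block until every check has reported
	return results
}

func main() {
	r := runAll([]string{"node_usage", "pod_capacity"},
		func(string) string { return "success" })
	fmt.Println(len(r)) // → 2
}
```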

Page display screenshots illustrate the UI of the inspection platform.

UI screenshot 1
UI screenshot 2
Tags: monitoring, automation, kubernetes, Prometheus, Bash, platform inspection
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
