How to Monitor etcd in Kubernetes: Metrics, Prometheus, and Sysdig
This article explains what etcd is, outlines common failure points, and provides step‑by‑step instructions for collecting etcd metrics via curl, configuring Prometheus scraping, creating alerts, and using Sysdig Monitor to observe key health indicators in a Kubernetes environment.
etcd is a distributed key‑value store that backs the Kubernetes control plane, storing cluster state such as pods, secrets, and deployments. It offers a simple JSON/gRPC API and relies on the Raft consensus algorithm for consistency and fault tolerance.
The article first introduces etcd’s purpose and the three possible node roles (follower, candidate, leader) and explains how leader election and log commitment work.
It then lists common failure points for an etcd cluster, emphasizing that a loss of the leader or a complete outage can render the entire Kubernetes cluster unusable.
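Before wiring up metrics, a quick manual check of member health and leader status can be done with etcdctl. A hedged sketch, assuming the kops-style certificate paths and the client port used later in this article (adjust endpoints and flags for your environment):

```shell
# Show each endpoint's status; the "IS LEADER" column identifies the leader.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --insecure-skip-tls-verify \
  --cert=/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt \
  --key=/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key \
  endpoint status -w table

# Report whether each endpoint is healthy and responding.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --insecure-skip-tls-verify \
  --cert=/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt \
  --key=/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key \
  endpoint health
```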
Collecting etcd metrics
By default etcd exposes Prometheus‑compatible metrics on port 4001 of the master node. Access requires client‑certificate authentication. Example curl commands:
curl https://localhost:4001/metrics -k --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key

curl https://[master_ip]:4001/metrics -k --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key

The response contains a long list of metric families with HELP/TYPE comments and histogram buckets, e.g. # HELP etcd_disk_backend_snapshot_duration_seconds.
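The response is plain Prometheus text exposition format, so a few lines of Python are enough to pull one value out of a saved /metrics dump. A minimal sketch; the parse_metric helper and the sample text are illustrative, not part of any etcd or Prometheus tooling:

```python
# Minimal sketch: extract one sample value from Prometheus text-format
# output such as etcd's /metrics response. parse_metric is a hypothetical
# helper written for this example.

def parse_metric(text: str, name: str):
    """Return the first sample value for the named metric, or None."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # The metric name ends at the first '{' (labels) or space (value).
        metric = line.split("{", 1)[0].split(" ", 1)[0]
        if metric == name:
            return float(line.rsplit(" ", 1)[1])
    return None

# Illustrative sample mirroring the kind of output shown above.
sample = """\
# HELP etcd_server_has_leader Whether or not a leader exists. 1 = exists, 0 = none
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
"""

print(parse_metric(sample, "etcd_server_has_leader"))  # 1.0
```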
Prometheus configuration
To scrape etcd metrics, create a secret with the client certificates and mount it into the Prometheus server deployment:
kubectl -n monitoring create secret generic etcd-ca --from-file=etcd-clients-ca.key --from-file=etcd-clients-ca.crt

kubectl -n monitoring patch deployment prometheus-server -p '{"spec":{"template":{"spec":{"volumes":[{"name":"etcd-ca","secret":{"defaultMode":420,"secretName":"etcd-ca"}}]}}}}'

kubectl -n monitoring patch deployment prometheus-server -p '{"spec":{"template":{"spec":{"containers":[{"name":"prometheus-server","volumeMounts":[{"mountPath":"/opt/prometheus/secrets","name":"etcd-ca"}]}]}}}}'

Then add a scrape_configs job for etcd, using TLS settings that point to the mounted certificates:
scrape_configs:
  - job_name: etcd
    scheme: https
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
        separator: '/'
        regex: 'kube-system/etcd-manager-main.+'
        action: keep
    ...
    tls_config:
      insecure_skip_verify: true
      cert_file: /opt/prometheus/secrets/etcd-clients-ca.crt
      key_file: /opt/prometheus/secrets/etcd-clients-ca.key

Key alerting expressions include node availability, leader presence, leader change frequency, and proposal failure rates, e.g.:
sum(up{job="etcd"})

# HELP etcd_server_has_leader Whether or not a leader exists. 1 = exists, 0 = none
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1

# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 1

rate(etcd_server_proposals_failed_total{job=~"etcd"}[15m]) > 5

Other important metrics cover proposal counts (applied, committed, failed, pending) and disk latency histograms such as etcd_disk_backend_commit_duration_seconds and etcd_disk_wal_fsync_duration_seconds. Histogram quantiles can be used to assess 99th-percentile latency.
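These expressions can be collected into a Prometheus alerting-rules file. A minimal sketch; the rule names, the `for:` durations, and every threshold other than the proposal-failure rate quoted above are illustrative assumptions to tune for your cluster:

```yaml
groups:
  - name: etcd-health
    rules:
      - alert: EtcdInsufficientMembers
        expr: sum(up{job="etcd"}) < 2        # fewer nodes than quorum needs
        for: 5m
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0    # a member reports no leader
        for: 1m
      - alert: EtcdFrequentLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
      - alert: EtcdHighProposalFailureRate
        expr: rate(etcd_server_proposals_failed_total{job=~"etcd"}[15m]) > 5
```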
Monitoring etcd with Sysdig Monitor
Deploy a Prometheus instance (namespace monitoring) and then configure the Sysdig agent to scrape the etcd job. Example deployment steps:

kubectl create ns monitoring

helm install -f values.yaml prometheus -n monitoring stable/prometheus

Provide a values.yaml with appropriate scrape annotations, then create a ConfigMap for the Sysdig agent that adds a sysdig_sd_configs job filtering for etcd pods:
apiVersion: v1
kind: ConfigMap
metadata:
  name: sysdig-agent
  namespace: sysdig-agent
data:
  prometheus.yaml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        honor_labels: true
        metrics_path: '/federate'
        metric_relabel_configs:
          - regex: 'kubernetes_pod_name'
            action: labeldrop
        params:
          'match[]':
            - '{sysdig="true"}'
        sysdig_sd_configs:
          - tags:
              namespace: monitoring
              deployment: prometheus-server

With these configurations, Sysdig Monitor collects all etcd metrics, allowing you to set alerts on the most critical health indicators before a cluster-wide failure occurs.
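Applying the ConfigMap and reloading the agent might look like the following sketch; the local file name and the agent pod label are assumptions specific to your install:

```shell
# Apply the ConfigMap shown above (saved locally under an assumed name).
kubectl apply -f sysdig-agent-configmap.yaml

# Restart the agent pods so they pick up the new prometheus.yaml;
# the label selector here is an assumption, check your DaemonSet's labels.
kubectl -n sysdig-agent delete pod -l app=sysdig-agent
```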
Conclusion
etcd is a simple yet powerful component required for any Kubernetes deployment. Although Raft provides resilience against many transient issues, proactive monitoring and alerting of etcd health—node availability, leader status, proposal failures, and disk latency—are essential to keep the control plane reliable.
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.