How to Monitor etcd in Kubernetes: Metrics, Prometheus, and Sysdig
This article explains what etcd is, outlines common failure points, and provides step‑by‑step instructions for collecting etcd metrics via curl, configuring Prometheus scraping, creating alerts, and using Sysdig Monitor to observe key health indicators in a Kubernetes environment.
etcd is a distributed key‑value store that backs the Kubernetes control plane, storing cluster state such as pods, secrets, and deployments. It offers a simple JSON/gRPC API and relies on the Raft consensus algorithm for consistency and fault tolerance.
The article first introduces etcd’s purpose and the three possible node roles (follower, candidate, leader) and explains how leader election and log commitment work.
It then lists common failure points for an etcd cluster, emphasizing that a loss of the leader or a complete outage can render the entire Kubernetes cluster unusable.
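Before wiring up metrics, a quick manual check of member health and leader status can be done with etcdctl. A hedged sketch, assuming the kops-style certificate paths and the client port used later in this article (adjust endpoints and flags for your environment):

```shell
# Show each endpoint's status; the "IS LEADER" column identifies the leader.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --insecure-skip-tls-verify \
  --cert=/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt \
  --key=/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key \
  endpoint status -w table

# Report whether each endpoint is healthy and responding.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --insecure-skip-tls-verify \
  --cert=/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt \
  --key=/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key \
  endpoint health
```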
Collecting etcd metrics
By default etcd exposes Prometheus‑compatible metrics on port 4001 of the master node. Access requires client‑certificate authentication. Example curl commands:
curl https://localhost:4001/metrics -k --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key

curl https://[master_ip]:4001/metrics -k --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key

The response contains a long list of metric families with HELP/TYPE comments and histogram buckets, e.g. # HELP etcd_disk_backend_snapshot_duration_seconds.
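The response is plain Prometheus text exposition format, so a few lines of Python are enough to pull one value out of a saved /metrics dump. A minimal sketch; the parse_metric helper and the sample text are illustrative, not part of any etcd or Prometheus tooling:

```python
# Minimal sketch: extract one sample value from Prometheus text-format
# output such as etcd's /metrics response. parse_metric is a hypothetical
# helper written for this example.

def parse_metric(text: str, name: str):
    """Return the first sample value for the named metric, or None."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # The metric name ends at the first '{' (labels) or space (value).
        metric = line.split("{", 1)[0].split(" ", 1)[0]
        if metric == name:
            return float(line.rsplit(" ", 1)[1])
    return None

# Illustrative sample mirroring the kind of output shown above.
sample = """\
# HELP etcd_server_has_leader Whether or not a leader exists. 1 = exists, 0 = none
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
"""

print(parse_metric(sample, "etcd_server_has_leader"))  # 1.0
```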
Prometheus configuration
To scrape etcd metrics, create a secret with the client certificates and mount it into the Prometheus server deployment:
kubectl -n monitoring create secret generic etcd-ca --from-file=etcd-clients-ca.key --from-file=etcd-clients-ca.crt

kubectl -n monitoring patch deployment prometheus-server -p '{"spec":{"template":{"spec":{"volumes":[{"name":"etcd-ca","secret":{"defaultMode":420,"secretName":"etcd-ca"}}]}}}}'

kubectl -n monitoring patch deployment prometheus-server -p '{"spec":{"template":{"spec":{"containers":[{"name":"prometheus-server","volumeMounts":[{"mountPath":"/opt/prometheus/secrets","name":"etcd-ca"}]}]}}}}'

Then add a scrape_configs job for etcd, using TLS settings that point to the mounted certificates:
scrape_configs:
  - job_name: etcd
    scheme: https
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
        separator: '/'
        regex: 'kube-system/etcd-manager-main.+'
        action: keep
    ...
    tls_config:
      insecure_skip_verify: true
      cert_file: /opt/prometheus/secrets/etcd-clients-ca.crt
      key_file: /opt/prometheus/secrets/etcd-clients-ca.key

Key alerting expressions include node availability, leader presence, leader change frequency, and proposal failure rates, e.g.:
sum(up{job="etcd"})

# HELP etcd_server_has_leader Whether or not a leader exists. 1 = exists, 0 = none
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1

# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 1

rate(etcd_server_proposals_failed_total{job=~"etcd"}[15m]) > 5

Other important metrics cover proposal counts (applied, committed, failed, pending) and disk latency histograms such as etcd_disk_backend_commit_duration_seconds and etcd_disk_wal_fsync_duration_seconds. Histogram quantiles can be used to assess 99th-percentile latency.
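These expressions can be collected into a Prometheus alerting-rules file. A minimal sketch; the rule names, the `for:` durations, and every threshold other than the proposal-failure rate quoted above are illustrative assumptions to tune for your cluster:

```yaml
groups:
  - name: etcd-health
    rules:
      - alert: EtcdInsufficientMembers
        expr: sum(up{job="etcd"}) < 2        # fewer nodes than quorum needs
        for: 5m
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0    # a member reports no leader
        for: 1m
      - alert: EtcdFrequentLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
      - alert: EtcdHighProposalFailureRate
        expr: rate(etcd_server_proposals_failed_total{job=~"etcd"}[15m]) > 5
```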
Monitoring etcd with Sysdig Monitor
Deploy a Prometheus instance (namespace monitoring) and then configure the Sysdig agent to scrape the etcd job. Example deployment steps:

kubectl create ns monitoring

helm install -f values.yaml prometheus -n monitoring stable/prometheus

Provide a values.yaml with appropriate scrape annotations, then create a ConfigMap for the Sysdig agent that adds a sysdig_sd_configs job filtering for etcd pods:
apiVersion: v1
kind: ConfigMap
metadata:
  name: sysdig-agent
  namespace: sysdig-agent
data:
  prometheus.yaml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        honor_labels: true
        metrics_path: '/federate'
        metric_relabel_configs:
          - regex: 'kubernetes_pod_name'
            action: labeldrop
        params:
          'match[]':
            - '{sysdig="true"}'
        sysdig_sd_configs:
          - tags:
              namespace: monitoring
              deployment: prometheus-server

With these configurations, Sysdig Monitor collects all etcd metrics, allowing you to set alerts on the most critical health indicators before a cluster-wide failure occurs.
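Applying the ConfigMap and reloading the agent might look like the following sketch; the local file name and the agent pod label are assumptions specific to your install:

```shell
# Apply the ConfigMap shown above (saved locally under an assumed name).
kubectl apply -f sysdig-agent-configmap.yaml

# Restart the agent pods so they pick up the new prometheus.yaml;
# the label selector here is an assumption, check your DaemonSet's labels.
kubectl -n sysdig-agent delete pod -l app=sysdig-agent
```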
Conclusion
etcd is a simple yet powerful component required for any Kubernetes deployment. Although Raft provides resilience against many transient issues, proactive monitoring and alerting of etcd health—node availability, leader status, proposal failures, and disk latency—are essential to keep the control plane reliable.
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.