
Essential Kubernetes Production Checklist for Web Services

This comprehensive, step-by-step checklist guides teams through documentation, application design, security, CI/CD, Kubernetes configuration, monitoring, testing, and 24/7 support so they can reliably run web services with HTTP APIs in production on Kubernetes.

Running applications in production can be tricky. This article presents a thorough checklist for deploying web services (applications exposing an HTTP API) on Kubernetes.

General

Application name, description, purpose, and owning team are clearly documented (e.g., via a service tree).

Define the application's criticality level (e.g., marking business‑critical apps as "critical path" services).

The development team has sufficient Kubernetes knowledge/experience, such as understanding stateless services.

A 24/7 on‑call team is identified and notified.

An upgrade plan exists, including potential rollback steps.

Application

The code repository contains clear instructions on development, configuration, and changes (crucial for emergency fixes).

Dependencies are pinned so that patch releases do not unintentionally pull in new library versions.

OpenTracing/OpenTelemetry semantic conventions are followed.

All outbound HTTP calls define timeouts.
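
As a minimal sketch of this item, using only Python's standard library (the in‑process server, the `/health` path, and the 5‑second value are illustrative, not prescriptive):

```python
import http.server
import threading
import urllib.request

# Minimal in-process server so the example is self-contained.
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def fetch(url: str) -> bytes:
    # Always pass an explicit timeout: without one, a hung upstream
    # can block the calling thread indefinitely.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read()

body = fetch(f"http://127.0.0.1:{server.server_port}/health")
server.shutdown()
```

The same principle applies to any HTTP client library: the default is usually "wait forever," so every outbound call site must set an explicit value.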

HTTP connection pools are sized appropriately for expected traffic.

Thread pools or non‑blocking asynchronous code are correctly implemented and configured.

Redis and database connection pools have correct sizes.

Retry and back‑off strategies are implemented for dependent services.
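
A common pattern here is exponential backoff with jitter. The following is a sketch, not a production implementation; the attempt counts and delays are illustrative, and in practice only transient errors (timeouts, 5xx) should be retried:

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.2, max_delay=5.0):
    """Retry fn() with capped exponential backoff and full jitter.

    Any exception is treated as retryable to keep the sketch short;
    real code should whitelist transient errors only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            # cap the backoff, then sleep a random fraction of it
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

The jitter matters: without it, many clients that failed at the same moment retry at the same moment, hammering the recovering dependency in waves.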

A rollback mechanism is defined based on business requirements.

Rate‑limiting or throttling mechanisms are in place (often provided by the underlying infrastructure).

Application metrics are exposed for collection (e.g., scraped by Prometheus).

Application logs are written to stdout/stderr.

Logs follow best practices (structured logging, meaningful messages), have clearly defined levels, and debug logging is disabled by default in production.
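
One way to satisfy the structured-logging and stdout items together is a JSON formatter on Python's standard `logging` module (the logger name "checkout" and the field names are illustrative):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    # One JSON object per line: trivial for log collectors to parse.
    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)  # containers log to stdout
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # debug disabled by default in production

logger.info("order created")
```

Setting the level to INFO at startup (and making it configurable via an environment variable) keeps debug logging off in production while still allowing it to be enabled for troubleshooting.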

The application crashes (exits) on fatal or unrecoverable errors rather than hanging in a broken state or deadlock, so that Kubernetes can restart the container.

Design and code are reviewed by senior engineers.

Security & Compliance

The application runs as a non‑privileged (non‑root) user.

The container file system is read‑only where possible.

HTTP requests are authenticated and authorized (e.g., using OAuth).

Denial‑of‑service mitigation mechanisms are in place (e.g., ingress rate limiting, WAF).

Security audits have been performed.

Automated vulnerability scanning for code and dependencies is enabled.

Processed data is understood, classified (e.g., PII), and documented.

A threat model has been created and risks recorded.

Other applicable organizational rules and compliance standards are followed.

Continuous Integration / Continuous Delivery

Every change triggers an automated pipeline.

Automated tests are part of the delivery pipeline.

Production deployments require no manual steps.

All relevant team members can deploy and roll back.

Production deployments include smoke tests and optional automatic rollbacks.

Lead time from code commit to production is short (e.g., 15 minutes or less, including test execution).

Kubernetes

The development team has received Kubernetes training and understands related concepts.

Kubernetes manifests use the latest API versions (e.g., apps/v1 for Deployments).

Containers run as non‑root users with read‑only file systems.

Appropriate readiness probes are defined.

Liveness probes are omitted or used only with a clear justification.
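
A sketch of the probe items above as a Pod-template fragment (container name, image, port, and `/health` path are illustrative):

```yaml
containers:
  - name: api
    image: registry.example.com/api:1.2.3
    readinessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3
    # No livenessProbe: a failing liveness probe restarts the container,
    # which can amplify an outage (e.g., restart storms when a shared
    # dependency is slow). Add one only with a clear justification.
```

A failing readiness probe merely removes the Pod from Service endpoints, which is usually the safer behavior.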

Deployments have at least two replicas.

Horizontal Pod Autoscaling (HPA) is configured when appropriate.
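
A minimal `autoscaling/v2` HPA sketch (the name, replica bounds, and 70% target are illustrative and must be tuned from load tests):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2          # keep the two-replica floor from above
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```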

Memory and CPU requests are set based on performance and load testing.

Memory limits equal memory requests to avoid over‑consumption.

CPU limits are either unset or their throttling impact is well understood.
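
The three resource items above can be sketched as a container `resources` fragment; the numbers here are placeholders and must come from your own performance and load tests:

```yaml
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    memory: 512Mi   # limit == request avoids memory overcommit on the node
    # No cpu limit: avoids CFS throttling latency spikes. Set one only
    # if you understand and accept the throttling behavior.
```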

Application runtime settings (e.g., JVM heap, single‑threaded runtime, non‑container‑aware runtimes) are correctly configured for the container environment.

Each container runs a single application process.

The application can handle graceful shutdowns and rolling updates without interruption.

If graceful termination is not handled, a Pod lifecycle hook (e.g., preStop with "sleep 20") is used.
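
As a fragment of the container spec (the 20‑second value is the example from the item above, and must stay below `terminationGracePeriodSeconds`, which defaults to 30):

```yaml
lifecycle:
  preStop:
    exec:
      # Keep serving while Service endpoints and load balancers
      # deregister the Pod, then let SIGTERM proceed.
      command: ["sleep", "20"]
```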

All required Pod labels are set.

The application is configured for high availability: Pods are spread across failure domains or deployed to multiple clusters.
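
Spreading across failure domains can be sketched with `topologySpreadConstraints` in the Pod template (the `app: api` label is illustrative):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api
```

With `DoNotSchedule`, Pods that would unbalance the zones stay pending; `ScheduleAnyway` is the softer alternative.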

Kubernetes Services use correct label selectors (e.g., matching not only "app" but also "component" and "environment" for future scaling).
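
A Service sketch selecting on all three labels (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    component: backend      # distinguishes future workloads sharing "app"
    environment: production
  ports:
    - port: 80
      targetPort: 8080
```

Selecting only on `app` would silently start routing traffic to any future Deployment that reuses the label, e.g., a batch worker.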

Optional: Tolerations are used as needed (e.g., binding Pods to specific node pools).

Monitoring

Metrics for the four golden signals are collected.

Application metrics are collected (e.g., scraped by Prometheus).

Databases (e.g., PostgreSQL) are monitored.

Service Level Objectives (SLOs) are defined.

Monitoring dashboards exist (e.g., Grafana) and can be provisioned automatically.

Alert rules are defined based on impact rather than root cause.

Testing

Chaos/breakpoint testing is performed.

Load testing reflects expected traffic patterns.

Backup and restore procedures for data stores (e.g., PostgreSQL) are tested.

24/7 Service Team

All relevant 24/7 service teams are notified of releases (e.g., SRE, incident commanders).

The on‑call team has sufficient knowledge of the application and business context.

The team possesses necessary production access (e.g., kubectl, kube‑web‑view, application logs).

The team has expertise to troubleshoot production issues in the tech stack (e.g., JVM).

The team is trained and confident in executing standard operations (scaling, rollback, etc.).

Monitoring alerts are set up to page the 24/7 team.

Automatic escalation rules are in place (e.g., escalating after 10 minutes without acknowledgment).

Post‑incident analysis and knowledge sharing processes exist.

Regular application‑operation reviews are conducted (e.g., reviewing SLO violations).

Tags: observability, Kubernetes, DevOps, web services, production, checklist
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
