
Lessons Learned from Two Years of Running Kubernetes in Production

This article recounts a two‑year journey of migrating from Ansible‑managed EC2 deployments to Kubernetes, detailing the motivations, migration strategy, operational challenges, tooling choices, resource management, security, cost considerations, and the development of custom controllers and CRDs to run production workloads reliably.

Cloud Native Technology Community

About two years ago the team decided to abandon Ansible‑based deployments on EC2 and adopt containerization with Kubernetes for application orchestration, eventually migrating most of their infrastructure to Kubernetes despite the technical and cultural challenges involved.

The authors explain that while serverless and containers are attractive for new projects, adopting Kubernetes requires sufficient bandwidth, expertise, and a DevOps mindset; otherwise the migration effort can be daunting.

Key motivations for the migration included the need for a continuous‑integration infrastructure that could quickly rebuild and test many microservices, reducing the bottlenecks caused by shared pre‑release environments.

They describe their CI pipeline that can spin up an integrated environment for 21 microservices in eight minutes, allowing developers to test changes in isolation and dramatically shortening the overall test cycle.

The article emphasizes that out‑of‑the‑box Kubernetes is insufficient; a production‑grade platform needs additional components such as metrics (Prometheus), logging (Grafana Loki), configuration and secret management (Consul, Vault, side‑car templates), and CI/CD tools (Jenkins, Tekton, Argo Workflows).
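For the secret-management piece, one common pattern (not spelled out in the article, so treat this as an illustrative assumption) is the Vault Agent sidecar injector, which is driven entirely by pod annotations. A minimal sketch, with a hypothetical role and secret path:

```yaml
# Sketch only: the role name and secret path are invented for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        # Ask the Vault injector to add an agent sidecar to this pod
        vault.hashicorp.com/agent-inject: "true"
        # Vault role the pod authenticates as
        vault.hashicorp.com/role: "example-service"
        # Render this secret to /vault/secrets/db-creds inside the pod
        vault.hashicorp.com/agent-inject-secret-db-creds: "secret/data/example-service/db"
    spec:
      containers:
        - name: app
          image: example-service:latest
```

The application then reads the rendered file instead of embedding credentials in its image or environment.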

Operating the clusters proved complex: setting up autoscaling, networking, and resource requests/limits required careful tuning, and upgrades remained non‑trivial even on managed services. The team adopted GitOps practices with eksctl, Terraform, and automated pipelines to simplify cluster provisioning and updates.
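With GitOps, the cluster definition itself lives in version control and is applied by a pipeline rather than by hand. As a hedged sketch of what such a checked-in eksctl `ClusterConfig` might look like (cluster name, region, and node-group sizes are invented, not taken from the article):

```yaml
# Illustrative eksctl ClusterConfig; values are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster
  region: ap-south-1
  version: "1.27"
managedNodeGroups:
  - name: workers
    instanceType: m5.large
    minSize: 2
    maxSize: 10
    desiredCapacity: 3
```

A pipeline step such as `eksctl create cluster -f cluster.yaml` (or Terraform applying equivalent resources) then makes the Git repository the single source of truth for cluster state.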

Resource management lessons highlighted the balance between requests and limits to avoid pod eviction while maintaining high utilization, especially in production versus non‑production environments.
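The requests/limits trade-off shows up directly in the pod spec. Under node memory pressure, pods using more than their requests are evicted first, while pods whose requests equal their limits get the Guaranteed QoS class and the strongest protection; setting requests below limits allows overcommit, which suits non-production clusters. A sketch with illustrative values (not the article's actual numbers):

```yaml
# Illustrative container resources; the specific values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: example-app:latest
      resources:
        requests:
          cpu: "250m"      # what the scheduler reserves on a node
          memory: "256Mi"
        limits:
          cpu: "500m"      # throttled above this
          memory: "512Mi"  # OOM-killed above this
```

In production, moving memory requests closer to limits trades some utilization for predictability; in shared pre-production clusters, lower requests pack more pods per node.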

Security and governance were addressed by using Open Policy Agent to enforce policies such as preventing public ELBs, and by adopting a least‑privilege approach.
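The article names Open Policy Agent but not the exact deployment form, so as an assumption this sketch uses OPA Gatekeeper, where the "no public ELBs" rule becomes an admission-time constraint. The template name, package, and message are invented; the `aws-load-balancer-internal` annotation is the real AWS mechanism for internal ELBs:

```yaml
# Hypothetical Gatekeeper ConstraintTemplate rejecting public LoadBalancer Services.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sinternallb
spec:
  crd:
    spec:
      names:
        kind: K8sInternalLB
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sinternallb

        violation[{"msg": msg}] {
          input.review.object.kind == "Service"
          input.review.object.spec.type == "LoadBalancer"
          # Deny unless the Service is explicitly marked internal
          not input.review.object.metadata.annotations["service.beta.kubernetes.io/aws-load-balancer-internal"]
          msg := "LoadBalancer Services must be internal; add the aws-load-balancer-internal annotation"
        }
```

Enforcing this at admission time means a misconfigured Service is rejected before an internet-facing ELB is ever provisioned.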

Cost analysis showed significant savings from better resource utilization and spot instances, though cross‑AZ data transfer costs increased, prompting consideration of service meshes.

Finally, the team built custom CRDs, operators, and controllers to automate tasks like LoadBalancer‑to‑Ingress conversion and DNS record creation, further streamlining operations on their "Grofers Kubernetes Platform".
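The article does not publish the platform's actual CRDs, but to make the idea concrete, a custom resource for automated DNS record creation might be defined roughly like this (group, kind, and fields are all hypothetical):

```yaml
# Hypothetical CRD; the group "platform.example.com" and all field names are invented.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: dnsrecords.platform.example.com
spec:
  group: platform.example.com
  scope: Namespaced
  names:
    kind: DNSRecord
    plural: dnsrecords
    singular: dnsrecord
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                hostname:
                  type: string   # e.g. the record to create
                target:
                  type: string   # e.g. the Service or Ingress it points at
```

A controller watching these objects would reconcile each `DNSRecord` into an actual record in the DNS provider, which is the same pattern the team applied to LoadBalancer-to-Ingress conversion.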

Tags: CI/CD, observability, Kubernetes, DevOps, cloud, infrastructure, production
Written by

Cloud Native Technology Community

The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
