What Anthropic’s SRE Team Learned: 23 Practical Ops Tips for Scalable AI Infrastructure
This article shares an Anthropic SRE engineer's insights on 23 actionable practices, ranging from schema migration and Karpenter node management to OpenTelemetry adoption, Helm chart storage, and Terraform versus CloudFormation, with concrete recommendations for building reliable, cost‑effective AI and cloud‑native platforms.
As AI models scale up, the cost of training them is projected to reach $5‑10 billion by 2025‑2026, which makes robust SRE practices essential. Here, an Anthropic engineer shares the tools and techniques that have proven valuable in high‑scale AI infrastructure.
Below are the key recommendations.
1. Schema migration via Diff
Recommended
Store the entire schema in Git and generate the migration SQL from it by diffing. This reduces the risk that comes with hand‑written migrations, where a single faulty script can delete critical data.
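A minimal sketch of the diff idea, under stated assumptions: the schema is modeled as a plain dictionary (table name to column name to SQL type), and `diff_schema` is a hypothetical helper, not the tooling the author used. It ignores column type changes, indexes, and constraints. Destructive statements are emitted as comments so that a person has to opt in before anything is dropped.

```python
# Hypothetical in-memory schema model: table name -> {column name -> SQL type}.
Schema = dict[str, dict[str, str]]


def diff_schema(live: Schema, desired: Schema) -> list[str]:
    """Return SQL statements that move `live` toward `desired`.

    Destructive statements (DROP ...) are returned as SQL comments so they
    are never executed without a human uncommenting them first.
    """
    statements: list[str] = []
    for table, columns in sorted(desired.items()):
        if table not in live:
            cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols});")
            continue
        # Columns present in the desired schema but missing from the live one.
        for name, sql_type in columns.items():
            if name not in live[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type};")
        # Columns that exist live but were removed from the desired schema.
        for name in live[table]:
            if name not in columns:
                statements.append(f"-- REVIEW: ALTER TABLE {table} DROP COLUMN {name};")
    for table in sorted(live):
        if table not in desired:
            statements.append(f"-- REVIEW: DROP TABLE {table};")
    return statements


# Example data (made up for illustration).
live = {"users": {"id": "BIGINT", "legacy_flag": "BOOLEAN"}}
desired = {
    "users": {"id": "BIGINT", "email": "TEXT"},
    "orders": {"id": "BIGINT", "user_id": "BIGINT"},
}
migration = diff_schema(live, desired)
for statement in migration:
    print(statement)
```

Because the desired schema lives in Git, the generated statements can be reviewed in a pull request like any other change.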
2. Use Karpenter for node management
Recommended
If you run EKS without full Fargate adoption, Karpenter is a reliable, cost‑effective autoscaler, outperforming Cluster Autoscaler and SpotInst.
3. Ubuntu for development servers
Recommended
Choosing Ubuntu provides broad package support and improves developer efficiency compared to mirroring the Kubernetes node OS.
4. AppSmith for internal tooling
Recommended
A self‑hosted AppSmith instance offers a simple UI for engineers to automate tasks like restarts, deployments, and diagnostics, at no cost.
5. Helm
Recommended
Helm v3 is stable for packaging and versioning Kubernetes objects; its Go templating is powerful, though templates can be hard to debug.
6. Store Helm charts in OCI (ECR)
Recommended
Moving from S3‑based plugins to OCI storage simplifies lifecycle management and improves reliability.
7. Bazel (optional)
Not sure whether to recommend
Bazel is favored by many engineers but can be overly complex for Go services; GitHub Actions may be more approachable.
8. Adopt OpenTelemetry early
Strongly regret not using earlier
Switching from direct DataDog API pushes to OpenTelemetry avoids future migration pain; it excels at distributed tracing.
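The migration pain comes from application code calling one vendor's API directly. The sketch below illustrates only the decoupling idea. It does not use the OpenTelemetry SDK, and every name in it (`MetricExporter`, `InMemoryExporter`, `Telemetry`) is hypothetical. Application code records measurements against a neutral interface, so switching backends means swapping one exporter object and leaving call sites alone.

```python
from typing import Protocol


class MetricExporter(Protocol):
    """Anything that can ship a single measurement to some backend."""

    def export(self, name: str, value: float, tags: dict[str, str]) -> None: ...


class InMemoryExporter:
    """Stand-in backend; a real one would forward to a vendor or a collector."""

    def __init__(self) -> None:
        self.points: list[tuple[str, float, dict[str, str]]] = []

    def export(self, name: str, value: float, tags: dict[str, str]) -> None:
        self.points.append((name, value, dict(tags)))


class Telemetry:
    """Vendor-neutral facade that application code depends on."""

    def __init__(self, exporter: MetricExporter) -> None:
        self._exporter = exporter

    def record(self, name: str, value: float, **tags: str) -> None:
        self._exporter.export(name, value, tags)


def handle_request(telemetry: Telemetry) -> None:
    # Application code never mentions a specific vendor.
    telemetry.record("http.requests", 1.0, route="/health")


old_backend = InMemoryExporter()
new_backend = InMemoryExporter()

handle_request(Telemetry(old_backend))  # before the switch
handle_request(Telemetry(new_backend))  # after the switch: same call site
print(new_backend.points)
```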
9. Choose Renovatebot over Dependabot
Recommended
Renovate offers flexible configuration despite higher setup complexity, making it the better choice overall.
10. Kubernetes as the platform
Recommended
Kubernetes integrates well with AWS services, though its flexibility introduces many possible misuse patterns that require careful design.
11. Purchase dedicated IP blocks
Recommended
Owning a larger CIDR for partner whitelisting reduces operational overhead as systems scale.
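A small sketch of why one owned block lowers overhead, using Python's standard `ipaddress` module. The block `198.51.100.0/24` is a documentation range chosen for illustration and the egress addresses are made up. Partners allowlist the single CIDR once, and any address later assigned from inside it is already covered.

```python
import ipaddress

# Assumption: an illustrative documentation range, not a real allocation.
owned_block = ipaddress.ip_network("198.51.100.0/24")


def is_covered(address: str, block: ipaddress.IPv4Network = owned_block) -> bool:
    """Return True if `address` falls inside the block partners already allow."""
    return ipaddress.ip_address(address) in block


# Hypothetical egress addresses handed out as the platform grows.
egress_addresses = ["198.51.100.10", "198.51.100.77", "198.51.100.200"]

# The partner-facing allowlist stays at one entry regardless of growth.
partner_allowlist = [str(owned_block)]

for address in egress_addresses:
    print(address, "covered" if is_covered(address) else "needs a partner change")
```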
12. Use Flux for GitOps
No regrets
Flux (v1, later upgraded to v2) proved a solid choice; ArgoCD is also viable.
13. SealedSecrets for Kubernetes secrets
Strongly not recommended
SealedSecrets makes secret updates harder for developers and gives up AWS‑native secret rotation.
14. ExternalSecrets for secret sync
Recommended
ExternalSecrets syncs AWS secrets to Kubernetes smoothly and works well with Terraform.
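As a rough illustration of what such a sync declaration can look like, the sketch below builds an `ExternalSecret` manifest as a Python dictionary and prints it as JSON. It assumes the External Secrets Operator and its `external-secrets.io/v1beta1` API, which may differ from the version the author ran. The store name, secret path, and key names are hypothetical.

```python
import json


def external_secret(name: str, store: str, aws_secret: str, keys: list[str]) -> dict:
    """Build an ExternalSecret manifest that copies `keys` from one AWS secret."""
    return {
        # Assumption: the apiVersion depends on the installed operator version.
        "apiVersion": "external-secrets.io/v1beta1",
        "kind": "ExternalSecret",
        "metadata": {"name": name},
        "spec": {
            "refreshInterval": "1h",  # how often the operator re-reads AWS
            "secretStoreRef": {"name": store, "kind": "ClusterSecretStore"},
            "target": {"name": name},  # name of the Kubernetes Secret to create
            "data": [
                {"secretKey": key, "remoteRef": {"key": aws_secret, "property": key}}
                for key in keys
            ],
        },
    }


# Hypothetical names for illustration only.
manifest = external_secret(
    name="payments-db",
    store="aws-secrets-manager",
    aws_secret="prod/payments/db",
    keys=["username", "password"],
)
print(json.dumps(manifest, indent=2))
```

Kubernetes accepts JSON manifests, so the printed output could be applied with `kubectl apply -f -`. The AWS secret itself can remain managed by Terraform.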
15. ExternalDNS for DNS management
Recommended
ExternalDNS reliably syncs Kubernetes services with Route53.
16. cert‑manager for SSL certificates
Recommended
cert‑manager integrates easily with Let's Encrypt, though occasional issues with legacy stacks may still call for paid certificates.
17. Bottlerocket for EKS (regret)
Strongly regret
Network CSI issues and difficult debugging led to reverting to the standard EKS‑optimized AMI.
18. Choose Terraform over CloudFormation
Recommended
Terraform’s HCL is easier to read and extends well to other SaaS providers.
19. Consider code‑centric IaC (Pulumi, CDK)
No regrets
HCL's constraints keep infrastructure code simple; Pulumi‑style solutions can be useful, but Terraform remains sufficient for most cases.
20. Service mesh (Istio, Linkerd) – optional
No regrets
Service meshes are powerful but often over‑engineered; simplicity is preferred.
21. Nginx Ingress in EKS
No regrets
Nginx is a mature, stable load balancer.
22. Distribute scripts with Homebrew
Recommended
Homebrew works well for delivering scripts and binaries to Linux and macOS engineers.
23. Use Go for services
Recommended
Go’s simplicity and performance make it ideal for network‑I/O‑bound services.