What Anthropic’s SRE Team Learned: 23 Practical Ops Tips for Scalable AI Infrastructure

This article shares an Anthropic SRE engineer's insights on 23 actionable practices, ranging from schema migration and Karpenter node management to OpenTelemetry adoption, Helm chart storage, and Terraform versus CloudFormation, with concrete recommendations for building reliable, cost‑effective AI and cloud‑native platforms.

Efficient Ops

Jun 3, 2025

As AI models scale rapidly, their training costs are projected to reach $5‑10 billion by 2025‑2026, which makes robust SRE practices essential. An Anthropic engineer shares the tools and techniques that have proven valuable in the company's high‑scale AI infrastructure.

Below are the key recommendations.

1. Schema migration via Diff

Store the entire schema in Git and generate the migration SQL from it by diffing, rather than writing migrations by hand. This reduces risk, because a faulty migration can delete critical data.
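To make the diff idea concrete, here is a minimal sketch, not the tooling the author used. It assumes a schema can be represented as a plain dictionary of tables and columns; the `diff_schema` helper and the table names are hypothetical. It compares the schema checked into Git against the current one, generates the SQL, and flags destructive statements for review. Column type changes are deliberately ignored to keep the sketch short.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: table name -> {column name: SQL type}
Schema = Dict[str, Dict[str, str]]


def diff_schema(current: Schema, desired: Schema) -> List[Tuple[str, bool]]:
    """Return (sql, is_destructive) statements that move `current` to `desired`."""
    statements: List[Tuple[str, bool]] = []

    # Tables present only in the desired schema are created.
    for table in sorted(desired.keys() - current.keys()):
        cols = ", ".join(f"{name} {sqltype}" for name, sqltype in desired[table].items())
        statements.append((f"CREATE TABLE {table} ({cols});", False))

    # Tables present in both are compared column by column.
    for table in sorted(desired.keys() & current.keys()):
        have, want = current[table], desired[table]
        for col in sorted(want.keys() - have.keys()):
            statements.append((f"ALTER TABLE {table} ADD COLUMN {col} {want[col]};", False))
        for col in sorted(have.keys() - want.keys()):
            # Dropping a column loses data, so mark it for human review.
            statements.append((f"ALTER TABLE {table} DROP COLUMN {col};", True))

    # Tables present only in the current schema would be dropped.
    for table in sorted(current.keys() - desired.keys()):
        statements.append((f"DROP TABLE {table};", True))

    return statements


# `current` stands in for the live database, `desired` for the schema in Git.
current = {"users": {"id": "BIGINT", "legacy_flag": "BOOLEAN"}}
desired = {
    "users": {"id": "BIGINT", "email": "TEXT"},
    "orders": {"id": "BIGINT", "user_id": "BIGINT"},
}

plan = diff_schema(current, desired)
for sql, destructive in plan:
    print(("REVIEW " if destructive else "OK     ") + sql)
```

Separating destructive statements from additive ones is the point of the sketch: a reviewer can approve the additive changes quickly and give the data-losing ones a closer look.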

2. Use Karpenter for node management

If you run EKS without full Fargate adoption, Karpenter is a reliable, cost‑effective autoscaler, outperforming Cluster Autoscaler and SpotInst.

3. Ubuntu for development servers

Choosing Ubuntu provides broad package support and improves developer efficiency compared to mirroring the Kubernetes node OS.

4. AppSmith for internal tooling

A self‑hosted AppSmith instance offers a simple UI for engineers to automate tasks like restarts, deployments, and diagnostics, at no cost.

5. Helm

Helm v3 is stable for packaging and versioning Kubernetes objects; its Go templating is powerful, although templates can be hard to debug.

6. Store Helm charts in OCI (ECR)

Moving from S3‑based plugins to OCI storage simplifies lifecycle management and improves reliability.

7. Bazel (optional)

Bazel is favored by many engineers but can be over‑complex for Go services; GitHub Actions may be more approachable.

8. Adopt OpenTelemetry early

Switching early from direct Datadog API pushes to OpenTelemetry avoids a painful migration later; OpenTelemetry is particularly strong at distributed tracing.

9. Choose Renovatebot over Dependabot

Renovate offers flexible configuration despite higher setup complexity, making it the better choice overall.

10. Kubernetes as the platform

Kubernetes integrates well with AWS services, though its flexibility introduces many possible misuse patterns that require careful design.

11. Purchase dedicated IP blocks

Owning a larger CIDR for partner whitelisting reduces operational overhead as systems scale.
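As a rough illustration of why one owned block is easier to operate than a list of individual addresses, the sketch below checks whether egress IPs fall inside a single CIDR that a partner could allowlist once. The block and the addresses are assumptions for illustration only (they come from the reserved documentation ranges), and `covered_by_allowlist` is a hypothetical helper.

```python
import ipaddress

# Hypothetical owned block; 203.0.113.0/24 is a reserved documentation range.
OWNED_BLOCK = ipaddress.ip_network("203.0.113.0/24")


def covered_by_allowlist(egress_ips):
    """Map each egress IP to whether it is inside OWNED_BLOCK."""
    return {ip: ipaddress.ip_address(ip) in OWNED_BLOCK for ip in egress_ips}


# Two addresses inside the block and one outside it.
result = covered_by_allowlist(["203.0.113.10", "203.0.113.200", "198.51.100.7"])
for ip, covered in result.items():
    print(f"{ip}: {'covered' if covered else 'needs a new allowlist entry'}")
```

Any new service that takes an address from the owned block is already covered by the partner's existing rule, while an address outside it means another round of coordination with each partner.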

12. Use Flux for GitOps

Flux (first v1, later upgraded to v2) proved a solid choice; ArgoCD is also viable.

13. SealedSecrets for Kubernetes secrets

SealedSecrets complicates secret updates for developers and gives up AWS‑native secret rotation automation.

14. ExternalSecrets for secret sync

ExternalSecrets syncs AWS secrets to Kubernetes smoothly and works well with Terraform.

15. ExternalDNS for DNS management

ExternalDNS reliably syncs Kubernetes services with Route53.

16. cert‑manager for SSL certificates

cert‑manager provides straightforward Let’s Encrypt integration; occasional issues with legacy stacks may still require paid certificates.

17. Bottlerocket for EKS (regret)

Network CSI issues and difficult debugging led to reverting to the standard EKS‑optimized AMI.

18. Choose Terraform over CloudFormation

Terraform’s HCL is easier to read and extends well to other SaaS providers.

19. Consider code‑centric IaC (Pulumi, CDK)

Terraform’s HCL limits how much logic a configuration can express; code‑centric tools such as Pulumi or CDK can be useful when that becomes a constraint, but Terraform remains sufficient for most cases.

20. Service mesh (Istio, Linkerd) – optional

Service meshes are powerful but often over‑engineered; simplicity is preferred.

21. Nginx Ingress in EKS

Nginx is a mature, stable load balancer.

22. Distribute scripts with Homebrew

Homebrew works well for delivering scripts and binaries to Linux and macOS engineers.

23. Use Go for services

Go’s simplicity and performance make it ideal for network‑I/O‑bound services.
