How Habby Cut MTTR by 80% with Amazon DevOps Agent: A Game‑Industry Smart Ops Blueprint
Habby, a global casual‑game publisher, faced traffic spikes, multi‑account complexity, rapid releases and a small ops team. It integrated Amazon DevOps Agent with Grafana, Lark and GitHub to automate incident triage, on‑demand tasks and proactive prevention. As a result, MTTR fell from 2 hours to 20 minutes, alert fatigue decreased and system reliability improved.
Game Industry Operations Challenges
Game traffic is highly volatile: major releases, limited‑time events and global player bases cause short‑term surges and 24‑hour wave‑like loads. Multi‑account, multi‑service architectures (EKS, Lambda, DynamoDB, ElastiCache, API Gateway) make root‑cause tracing across accounts difficult, and small SRE teams suffer from "alert fatigue".
Frequent code and configuration changes mean teams must quickly determine whether a recent change caused an incident.
Amazon DevOps Agent Overview
Amazon DevOps Agent is an AI‑driven agent that automatically responds to and helps prevent incidents. It triages incidents, evaluates severity, collects logs, metrics and recent code changes, generates root‑cause analysis (RCA) reports and mitigation plans, and can be guided by operators during an investigation.
Core Capabilities
Autonomous Event Response : When a configured alert fires, the Agent starts an investigation, performs triage, assesses impact, and builds a detailed RCA.
On‑Demand DevOps Tasks : A conversational AI assistant in the Agent Space web app lets users query resources, run health checks, generate reports, and keep conversation history.
Proactive Event Prevention : Analyzes historical incidents to suggest observability improvements, infrastructure tweaks, pipeline enhancements and architectural resilience.
Habby's DevOps Agent Solution
Habby built a two‑part solution:
Alert‑triggered Lambda processes Grafana alerts, posts interactive Lark cards, and invokes the DevOps Agent webhook.
Lark Bot acts as a conversational front‑end, calling the Agent Chat API for real‑time interaction.
The architecture integrates telemetry sources (CloudWatch, Grafana, Prometheus), CI/CD pipelines (GitHub/GitLab), communication channels (Slack, Teams, ServiceNow, Lark) and custom skills.
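A conversational front‑end of this kind has to keep each chat tied to one ongoing Agent conversation so that follow‑up questions retain context. The sketch below shows one minimal way to do that in Python. The `SessionMap` class and the `start_conversation` callable are illustrative assumptions, not part of Habby's code or of the Agent API, and a production bot would likely persist the mapping outside process memory.

```python
from typing import Callable, Dict


class SessionMap:
    """Remember which Agent conversation belongs to which chat."""

    def __init__(self, start_conversation: Callable[[], str]) -> None:
        # start_conversation is a hypothetical callable that opens a new
        # Agent conversation and returns its identifier.
        self._start = start_conversation
        self._by_chat: Dict[str, str] = {}

    def get_or_create(self, chat_id: str) -> str:
        """Return the conversation ID for chat_id, creating one on first use."""
        if chat_id not in self._by_chat:
            self._by_chat[chat_id] = self._start()
        return self._by_chat[chat_id]

    def reset(self, chat_id: str) -> None:
        """Forget a chat's conversation so the next message starts fresh."""
        self._by_chat.pop(chat_id, None)
```

Calling `get_or_create` with the same chat ID twice returns the same conversation ID, which is what lets a second question in the same chat build on the first. `reset` gives users a way to start over when an investigation is finished.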
Implementation Details
Key steps include:
Route Grafana alerts to SNS; a Lambda function parses each alert, sends a Lark card and triggers the Agent webhook with an HMAC‑SHA256‑signed payload.
Store webhook URLs and secrets in AWS Secrets Manager and cache them in Lambda.
Use Lark’s WebSocket SDK (lark‑oapi) for a persistent bot connection without exposing public endpoints.
Map each Lark chat_id to a DevOps Agent executionId for context‑aware multi‑turn dialogs.
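The parsing and signing steps in the list above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the SNS message body is assumed to be JSON, and the header names and the timestamp‑prefixed signing string are hypothetical. The actual webhook contract should be taken from the DevOps Agent documentation.

```python
import hashlib
import hmac
import json
from typing import Dict, Tuple


def extract_alert(sns_event: dict) -> dict:
    """Pull the alert JSON out of an SNS-triggered Lambda event."""
    # SNS delivers the published message as a string in Records[].Sns.Message.
    # Assumption: Grafana publishes the alert as a JSON document.
    return json.loads(sns_event["Records"][0]["Sns"]["Message"])


def sign_payload(payload: dict, secret: str, timestamp: int) -> Tuple[bytes, Dict[str, str]]:
    """Serialize payload and return (body, headers) carrying an HMAC-SHA256 signature."""
    # Compact, key-sorted JSON so the signed bytes are deterministic.
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    # Assumption: the signature covers "<timestamp>.<body>"; the real webhook
    # may define a different signing string.
    message = str(timestamp).encode("utf-8") + b"." + body
    signature = hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Timestamp": str(timestamp),  # hypothetical header name
        "X-Signature": signature,       # hypothetical header name
    }
    return body, headers
```

The returned `body` must be sent byte for byte as signed; re‑serializing the payload before the HTTP call could change the bytes and invalidate the signature.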
def lambda_handler(event, context):
    """Lambda entry point triggered by SNS.

    1. Extract message from SNS
    2. Parse Grafana alert
    3. Push Lark card (non-blocking)
    4. Build and call DevOps Agent webhook
    """
    # ... construct payload, sign with HMAC-SHA256, call webhook ...

Agent‑side code uses boto3 to send messages and collect streamed responses:
import boto3

# AWS_REGION and AGENT_SPACE_ID are configuration constants defined elsewhere.
devops = boto3.client("devops-agent", region_name=AWS_REGION)

def ask_devops_agent(session_key: str, query: str) -> str:
    """Send a query to the Agent and collect the final reply."""
    execution_id = get_or_create_execution(session_key)
    resp = devops.send_message(agentSpaceId=AGENT_SPACE_ID, executionId=execution_id, content=query)
    # aggregate contentBlockDelta events ...
    return "".join(blocks[max(blocks)])

Benefits Achieved
Previously, manual handling took 1.5‑2 hours of investigation per incident, and MTTR for off‑hours alerts reached 4‑6 hours. After adopting DevOps Agent:
Automated data collection and RCA generation in minutes.
Human verification time reduced to 15‑30 minutes.
Average MTTR dropped to ~20 minutes – an 80% reduction.
Alert fatigue decreased as the Agent aggregates related alerts into single incidents and auto‑downgrades low‑priority alerts.
Operational efficiency improved: no need to switch consoles, all information is available via Agent Space or Lark bot.
System reliability grew through weekly preventive suggestions and knowledge‑base skills.
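The headline percentage can be checked with a few lines of arithmetic. The sketch below uses the figures quoted in the article summary (about 2 hours before, about 20 minutes after); the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Return the percentage drop from before_minutes to after_minutes."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) / before_minutes * 100


# Figures quoted in the article: about 2 hours before, about 20 minutes after.
reduction = percent_reduction(before_minutes=120, after_minutes=20)
print(f"MTTR reduction: {reduction:.1f}%")  # prints "MTTR reduction: 83.3%"
```

With those inputs the reduction comes out slightly above the rounded 80% figure, so the headline number appears to be a conservative rounding.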
Phased Rollout Guide
Phase 1 : Deploy a single Agent Space in a production account, integrate CloudWatch alarms, enable chat‑based triage.
Phase 2 : Add Grafana telemetry, connect GitHub for change‑to‑incident correlation, enable multi‑account monitoring.
Phase 3 : Create custom Skills to codify investigation expertise, review and act on weekly protection recommendations.
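Change‑to‑incident correlation in Phase 2 boils down to asking which deployments touched the affected service shortly before the incident began. The sketch below illustrates that idea only; the `Change` record, the two‑hour lookback window and the function name are assumptions made for the example, not a description of how the Agent performs the correlation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Change:
    ref: str               # e.g. a commit SHA or deployment ID (illustrative)
    service: str
    deployed_at: datetime


def suspect_changes(
    changes: List[Change],
    incident_start: datetime,
    service: str,
    lookback: timedelta = timedelta(hours=2),
) -> List[Change]:
    """Return changes to `service` deployed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [
        c for c in changes
        if c.service == service and window_start <= c.deployed_at <= incident_start
    ]
    # The most recent change is usually the first one worth inspecting.
    return sorted(hits, key=lambda c: c.deployed_at, reverse=True)
```

Changes deployed after the incident started, or to unrelated services, are excluded, which keeps the candidate list short enough for a human to verify quickly.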
Best Practices & Recommendations
Clean up stale alerts, and use sustained‑threshold conditions (e.g., CPU > 80% for 5 min) or composite alarms to reduce noise.
Include rich context in alert descriptions (service, region, metric).
Use investigation guidance via chat to steer the Agent when automatic analysis is insufficient.
Schedule bi‑weekly review meetings to prioritize and implement protection suggestions.
Separate Agent Spaces per game project and per environment (prod vs non‑prod) for isolation.
Apply fine‑grained IAM roles: read‑only for developers, full access for ops.
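The noise‑reduction advice above relies on alerting only when a threshold is breached continuously, not on a single spike. A minimal sketch of that check, assuming one metric sample per minute; the function name and the sampling interval are illustrative, and in practice the evaluation would be configured in CloudWatch or Grafana instead of hand‑written:

```python
from typing import Sequence


def breaches_for_duration(samples: Sequence[float], threshold: float, required: int) -> bool:
    """Return True if the most recent `required` samples all exceed `threshold`."""
    if required <= 0:
        raise ValueError("required must be a positive number of samples")
    if len(samples) < required:
        # Not enough history yet to claim a sustained breach.
        return False
    return all(value > threshold for value in samples[-required:])


# One sample per minute: a single spike does not fire, five hot minutes do.
spike = [50, 95, 60, 55, 50]
sustained = [70, 85, 90, 88, 92, 81]
print(breaches_for_duration(spike, threshold=80, required=5))      # False
print(breaches_for_duration(sustained, threshold=80, required=5))  # True
```

Requiring several consecutive breaches trades a few minutes of detection latency for far fewer pages on transient spikes, which is usually the right trade for a small on‑call team.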
Conclusion & Outlook
By deeply integrating Amazon DevOps Agent, Habby transformed its ops model from manual firefighting to AI‑driven intelligent operations, achieving an 80% MTTR reduction, a 75% drop in alert fatigue, and noticeable availability gains. Future work includes moving from recommendation to automated remediation, configurable investigation windows, richer long‑session handling, and smoother integration experiences.
Amazon Cloud Developers
Official technical community of Amazon Cloud. Shares practical AI/ML, big data, database, modern app development, IoT content, offers comprehensive learning resources, hosts regular developer events, and continuously empowers developers.
