Detecting Agent Silent Killers: Early Alerts for Latency Spikes, Token Explosions, and Infinite Loops
The article presents a three‑layer monitoring system—LangSmith tracing, Prometheus metrics, and Alertmanager alerts—together with concrete metric definitions, alert rules, and code examples to proactively detect latency spikes, token overuse, and dead‑loop cycles in production LLM agents, while also outlining common pitfalls and best‑practice recommendations.
01 Silent failure modes in LLM agents
Latency spikes (average 2 s → 45 s), token explosions (hundreds of thousands per request), and dead loops (graph iterates indefinitely until rate limit) do not raise exceptions, so they require dedicated monitoring.
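To make the point concrete, here is a minimal sketch of a post-run check that flags runs which finished without raising but still breached a budget. The `RunStats` fields and the three threshold values are illustrative assumptions, not part of LangGraph or LangSmith.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical summary of one finished agent run."""
    duration_s: float
    total_tokens: int
    node_iterations: int


# Illustrative budgets (assumptions); derive real ones from observed traffic.
MAX_DURATION_S = 30.0
MAX_TOKENS = 100_000
MAX_ITERATIONS = 25


def silent_failures(stats: RunStats) -> list[str]:
    """Return the budgets a run breached even though it raised no exception."""
    breached = []
    if stats.duration_s > MAX_DURATION_S:
        breached.append("latency_spike")
    if stats.total_tokens > MAX_TOKENS:
        breached.append("token_explosion")
    if stats.node_iterations > MAX_ITERATIONS:
        breached.append("dead_loop")
    return breached
```

A run that took 45 s and consumed 300,000 tokens would return `["latency_spike", "token_explosion"]` here, although the caller only ever saw a normal response.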
02 Three‑layer observability stack
Tracing layer (LangSmith): records node execution time, token usage, and errors as OpenTelemetry spans.
Metrics layer (Prometheus + Grafana): aggregates spans into time‑series metrics.
Alert layer (Alertmanager): receives alerts fired by Prometheus rules and routes them to PagerDuty, WeChat, or Feishu.
03 Instrumentation via LangGraph callbacks
LangGraph provides entry/exit callbacks for each node. The following TypeScript example registers four Prometheus metrics and implements a callback handler.
import { BaseCallbackHandler } from "@langchain/core/callbacks/base";
import { Counter, Histogram, Gauge } from "prom-client";

const nodeExecutionDuration = new Histogram({
  name: "agent_node_execution_seconds",
  help: "Execution time of each LangGraph node",
  labelNames: ["graph_name", "node_name"],
  buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60],
});
const tokenUsageCounter = new Counter({
  name: "agent_token_total",
  help: "Total token consumption per request",
  labelNames: ["graph_name", "model", "type"],
});
const nodeIterationCounter = new Counter({
  name: "agent_node_iterations_total",
  help: "Node execution count (used for dead-loop detection)",
  labelNames: ["graph_name", "node_name"],
});
const activeRunsGauge = new Gauge({
  name: "agent_active_runs",
  help: "Number of currently running agents",
  labelNames: ["graph_name"],
});

export class PrometheusCallbackHandler extends BaseCallbackHandler {
  name = "PrometheusCallbackHandler";
  // runId -> start time and node label, so the end callback can label its observation
  private runTimers = new Map<string, { start: number; nodeName: string }>();
  private graphName: string;

  constructor(graphName: string) {
    super();
    this.graphName = graphName;
  }

  async handleChainStart(
    _chain: any,
    _inputs: any,
    runId: string,
    parentRunId?: string,
    _tags?: string[],
    metadata?: Record<string, unknown>
  ) {
    // Assumption: LangGraph sets metadata.langgraph_node on runs inside a node
    // (nested runnables inherit it); all other runs fall back to "chain".
    const node = metadata?.langgraph_node;
    const nodeName = typeof node === "string" ? node : "chain";
    this.runTimers.set(runId, { start: Date.now(), nodeName });
    // Without this increment the dead-loop alert would never have data.
    nodeIterationCounter.inc({ graph_name: this.graphName, node_name: nodeName });
    // Only top-level runs (no parent) count as active agent runs.
    if (!parentRunId) activeRunsGauge.inc({ graph_name: this.graphName });
  }

  async handleChainEnd(_outputs: any, runId: string, parentRunId?: string) {
    const timer = this.runTimers.get(runId);
    if (timer !== undefined) {
      const duration = (Date.now() - timer.start) / 1000;
      nodeExecutionDuration.observe({ graph_name: this.graphName, node_name: timer.nodeName }, duration);
      this.runTimers.delete(runId);
    }
    if (!parentRunId) activeRunsGauge.dec({ graph_name: this.graphName });
  }

  async handleLLMEnd(output: any, _runId: string) {
    // Field names are provider-dependent; missing usage is skipped rather than guessed.
    const usage = output?.llmOutput?.tokenUsage;
    if (usage) {
      const model = output?.generations?.[0]?.[0]?.generationInfo?.model ?? "unknown";
      tokenUsageCounter.inc({ graph_name: this.graphName, model, type: "prompt" }, usage.promptTokens ?? 0);
      tokenUsageCounter.inc({ graph_name: this.graphName, model, type: "completion" }, usage.completionTokens ?? 0);
    }
  }

  async handleChainError(err: Error, runId: string, parentRunId?: string) {
    this.runTimers.delete(runId);
    if (!parentRunId) activeRunsGauge.dec({ graph_name: this.graphName });
    console.error(`[Agent Error] graph=${this.graphName} runId=${runId}`, err);
  }
}
Inject the handler when invoking the graph:
const metricsCallback = new PrometheusCallbackHandler("customer-service");
await graph.invoke(input, { configurable: { thread_id: threadId }, callbacks: [metricsCallback] });
04 Dead‑loop detection
Three defensive layers:
Pass recursionLimit in the invoke config (default 25) as a hard stop; LangGraph raises a GraphRecursionError once the limit is reached.
Track iterationCount in the agent state and abort when it exceeds maxIterations, emitting a critical alert.
Define a Prometheus rule that triggers when agent_node_iterations_total exceeds a threshold (e.g., >50 executions in 5 minutes).
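The second layer, an iteration counter carried in the agent state, can be sketched as follows. The `AgentState` shape, `MAX_ITERATIONS`, and the `emit_critical_alert` hook are hypothetical names chosen for illustration; only the counting-and-abort logic is the point.

```python
from typing import TypedDict


class AgentState(TypedDict):
    """Hypothetical agent state; only the loop-guard fields are shown."""
    iteration_count: int
    aborted: bool


MAX_ITERATIONS = 10  # assumption: tune per graph


def emit_critical_alert(message: str) -> None:
    """Placeholder alert hook; a real system would page or post to a webhook."""
    print(f"[CRITICAL] {message}")


def guard_iterations(state: AgentState) -> AgentState:
    """Increment the counter and mark the run aborted once the limit is exceeded."""
    count = state["iteration_count"] + 1
    if count > MAX_ITERATIONS:
        emit_critical_alert(f"iteration limit exceeded: {count} > {MAX_ITERATIONS}")
        return {"iteration_count": count, "aborted": True}
    return {"iteration_count": count, "aborted": False}


def should_continue(state: AgentState) -> str:
    """Routing decision: stop looping as soon as the guard has tripped."""
    return "end" if state["aborted"] else "continue"
```

In a LangGraph graph, `guard_iterations` could run at the top of the looping node and `should_continue` could serve as the conditional-edge function, so the run ends cleanly well before recursionLimit is hit.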
05 Alert rule design
Tiered alerts to avoid fatigue:
P0 (critical): dead loops and token explosions.
P1 (warning): high latency or error rate during working hours.
P2 (info): cost‑budget warnings.
groups:
  - name: agent_alerts
    rules:
      - alert: AgentDeadLoop
        # increase() counts executions inside the window; rate() would be per second
        expr: |
          increase(agent_node_iterations_total[5m]) > 50
        for: 1m
        labels:
          severity: critical
          pagerduty: "true"
        annotations:
          summary: "Agent suspected dead loop"
          description: "Graph {{ $labels.graph_name }} node {{ $labels.node_name }} executed >50 times in 5 min"
      - alert: AgentTokenExplosion
        expr: |
          sum by (graph_name) (increase(agent_token_total[10m])) > 100000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Token consumption surge"
          description: "Graph {{ $labels.graph_name }} used >100k tokens in 10 min"
      - alert: AgentHighLatency
        # Aggregate buckets per graph (keeping le) before taking the quantile
        expr: |
          histogram_quantile(0.95, sum by (le, graph_name) (rate(agent_node_execution_seconds_bucket[5m]))) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent P95 latency >30 s"
          description: "Graph {{ $labels.graph_name }} P95 response time {{ $value }} s"
      - alert: AgentCostBudgetWarning
        # Assumes agent_token_cost_per_unit is exported separately, one series per model;
        # increase(...[1d]) limits the sum to the last 24 hours
        expr: |
          sum(increase(agent_token_total[1d]) * on(model) group_left() agent_token_cost_per_unit) > 50
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Daily cost exceeds $50"
06 Token‑based cost tracking
Python implementation (langchain ≥ 0.3, langgraph ≥ 0.2, prometheus_client ≥ 0.20):
import time

from langchain_core.callbacks import BaseCallbackHandler
from prometheus_client import Counter, Gauge, Histogram

node_execution_duration = Histogram(
    "agent_node_execution_seconds",
    "Execution time of each LangGraph node",
    ["graph_name", "node_name"],
    buckets=[0.1, 0.5, 1, 2, 5, 10, 30, 60],
)
token_usage_counter = Counter(
    "agent_token_total",
    "Agent token consumption",
    ["graph_name", "model", "type"],
)
node_iteration_counter = Counter(
    "agent_node_iterations_total",
    "Node execution count (dead-loop detection)",
    ["graph_name", "node_name"],
)
active_runs_gauge = Gauge(
    "agent_active_runs",
    "Number of currently running agents",
    ["graph_name"],
)


class PrometheusCallbackHandler(BaseCallbackHandler):
    def __init__(self, graph_name: str):
        super().__init__()
        self.graph_name = graph_name
        # run_id -> (start time, node label)
        self._run_timers: dict[str, tuple[float, str]] = {}

    def on_chain_start(self, serialized, inputs, *, run_id, parent_run_id=None, metadata=None, **kwargs):
        # Assumption: LangGraph sets metadata["langgraph_node"] on runs inside a node
        # (nested runnables inherit it); all other runs fall back to "chain".
        node_name = (metadata or {}).get("langgraph_node", "chain")
        self._run_timers[str(run_id)] = (time.monotonic(), node_name)
        # Without this increment the dead-loop alert would never have data.
        node_iteration_counter.labels(graph_name=self.graph_name, node_name=node_name).inc()
        # Only top-level runs (no parent) count as active agent runs.
        if parent_run_id is None:
            active_runs_gauge.labels(graph_name=self.graph_name).inc()

    def on_chain_end(self, outputs, *, run_id, parent_run_id=None, **kwargs):
        timer = self._run_timers.pop(str(run_id), None)
        if timer is not None:
            start, node_name = timer
            duration = time.monotonic() - start
            node_execution_duration.labels(graph_name=self.graph_name, node_name=node_name).observe(duration)
        if parent_run_id is None:
            active_runs_gauge.labels(graph_name=self.graph_name).dec()

    def on_llm_end(self, response, *, run_id, **kwargs):
        # Field names are provider-dependent; missing values count as zero.
        llm_output = getattr(response, "llm_output", None) or {}
        token_usage = llm_output.get("token_usage") or {}
        generation_info = {}
        if response.generations and response.generations[0]:
            # generation_info may be None, so guard before calling .get()
            generation_info = response.generations[0][0].generation_info or {}
        model = generation_info.get("model", "unknown")
        token_usage_counter.labels(graph_name=self.graph_name, model=model, type="prompt").inc(
            token_usage.get("prompt_tokens") or 0
        )
        token_usage_counter.labels(graph_name=self.graph_name, model=model, type="completion").inc(
            token_usage.get("completion_tokens") or 0
        )

    def on_chain_error(self, error, *, run_id, parent_run_id=None, **kwargs):
        self._run_timers.pop(str(run_id), None)
        if parent_run_id is None:
            active_runs_gauge.labels(graph_name=self.graph_name).dec()
        print(f"[Agent Error] graph={self.graph_name} run_id={run_id}: {error}")
Cost tracker (Python):
from dataclasses import dataclass
from datetime import date

# USD per 1,000 tokens; illustrative values, check current provider pricing
TOKEN_COST = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
    "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
}


@dataclass
class UsageRecord:
    graph_name: str
    model: str
    prompt_tokens: int
    completion_tokens: int


class CostTracker:
    def __init__(self):
        # (day, graph_name) -> accumulated USD, so totals start fresh each calendar day
        self.daily_usage: dict[tuple[date, str], float] = {}
        self.alert_thresholds = {"per_request": 0.5, "daily": 50}

    def record_usage(self, record: UsageRecord) -> None:
        cost = self._calculate_cost(record)
        key = (date.today(), record.graph_name)
        self.daily_usage[key] = self.daily_usage.get(key, 0.0) + cost
        if cost > self.alert_thresholds["per_request"]:
            self._trigger_alert({"level": "warning", "type": "per_request_cost",
                                 "graph_name": record.graph_name, "cost": cost})
        if self.daily_usage[key] > self.alert_thresholds["daily"]:
            self._trigger_alert({"level": "critical", "type": "daily_budget_exceeded",
                                 "graph_name": record.graph_name, "cost": self.daily_usage[key]})

    def _calculate_cost(self, record: UsageRecord) -> float:
        pricing = TOKEN_COST.get(record.model)
        if not pricing:
            # Unknown models are costed at zero; add them to TOKEN_COST to track them
            return 0.0
        return (record.prompt_tokens / 1000) * pricing["input"] + (record.completion_tokens / 1000) * pricing["output"]

    def _trigger_alert(self, alert: dict) -> None:
        print(f"[Cost Alert] {alert}")
The TypeScript cost tracker mirrors the same thresholds and uses the same TOKEN_COST map.
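As a worked example of the per-request threshold, the sketch below prices one typical request and one hypothetical runaway request at the gpt-4o rates from the TOKEN_COST map. Reading those rates as USD per 1,000 tokens is an assumption implied by the division by 1000 in the tracker, and the token counts are made up for illustration.

```python
# Rates copied from the article's TOKEN_COST entry for gpt-4o,
# read as USD per 1,000 tokens (assumption implied by the /1000 in the tracker).
GPT_4O_INPUT_PER_1K = 0.005
GPT_4O_OUTPUT_PER_1K = 0.015
PER_REQUEST_THRESHOLD_USD = 0.5


def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD of one request at the gpt-4o rates above."""
    prompt_cost = (prompt_tokens / 1000) * GPT_4O_INPUT_PER_1K
    completion_cost = (completion_tokens / 1000) * GPT_4O_OUTPUT_PER_1K
    return prompt_cost + completion_cost


normal = request_cost(3_000, 500)         # hypothetical typical request
explosion = request_cost(200_000, 4_000)  # hypothetical runaway context

print(f"normal:    ${normal:.4f} over threshold: {normal > PER_REQUEST_THRESHOLD_USD}")
print(f"explosion: ${explosion:.4f} over threshold: {explosion > PER_REQUEST_THRESHOLD_USD}")
```

At these rates the typical request costs about two cents, while the 200,000-token prompt costs roughly $1.06, about twice the per-request threshold.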
07 Production pitfalls
Pitfall 1: HTTP 200 does not guarantee success; inspect response.body.error or LangSmith run.error.
Pitfall 2: In streaming mode handleLLMEnd fires only after the stream ends; use handleLLMNewToken for incremental token counting.
Pitfall 3: Do not use volatile identifiers such as runId as Prometheus labels; keep labels stable (graph_name, node_name, model).
Pitfall 4: Set alert thresholds based on observed traffic; overly low P95 latency thresholds cause alert fatigue.
Pitfall 5: Monitor the monitoring stack itself (e.g., Prometheus up metric) to detect watchdog failures.
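For Pitfall 2, incremental counting during streaming could look like the sketch below. The method names and keyword arguments mirror LangChain's Python callbacks `on_llm_new_token` and `on_llm_end`; the class itself, its chunk budget, and the "one chunk is roughly one token" approximation are assumptions. In real use it would subclass `BaseCallbackHandler`, which is left out here to keep the example dependency-free.

```python
from collections import defaultdict


class StreamingTokenBudget:
    """Counts streamed chunks per run and flags runs that exceed a budget."""

    def __init__(self, max_chunks_per_run: int):
        self.max_chunks_per_run = max_chunks_per_run
        self.chunks = defaultdict(int)  # run_id -> chunks seen so far
        self.over_budget = set()        # run_ids that crossed the budget

    def on_llm_new_token(self, token: str, *, run_id, **kwargs) -> None:
        key = str(run_id)
        # Approximation: providers may pack more than one token into a chunk.
        self.chunks[key] += 1
        if self.chunks[key] > self.max_chunks_per_run:
            self.over_budget.add(key)

    def on_llm_end(self, response, *, run_id, **kwargs) -> None:
        # Drop per-run state once the stream finishes to avoid unbounded growth.
        self.chunks.pop(str(run_id), None)
```

Because the budget is checked on every chunk, a runaway generation can be flagged while it is still streaming instead of after the final usage report arrives.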
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.
