
Detecting Agent Silent Killers: Early Alerts for Latency Spikes, Token Explosions, and Infinite Loops

The article presents a three‑layer monitoring system—LangSmith tracing, Prometheus metrics, and Alertmanager alerts—together with concrete metric definitions, alert rules, and code examples to proactively detect latency spikes, token overuse, and dead‑loop cycles in production LLM agents, while also outlining common pitfalls and best‑practice recommendations.

James' Growth Diary

01 Silent failure modes in LLM agents

Latency spikes (average 2 s → 45 s), token explosions (hundreds of thousands of tokens per request), and dead loops (the graph keeps iterating until it hits a rate limit) do not raise exceptions, so they require dedicated monitoring.

02 Three‑layer observability stack

Tracing layer (LangSmith): records node execution time, token usage, and errors as OpenTelemetry spans.

Metrics layer (Prometheus + Grafana): aggregates spans into time-series metrics.

Alert layer (Alertmanager): fires alerts based on Prometheus rules and routes them to PagerDuty/WeChat/Feishu.

03 Instrumentation via LangGraph callbacks

LangGraph uses LangChain's callback system, which fires start and end callbacks for every run, including each node. The following TypeScript example registers four Prometheus metrics and implements a callback handler that updates them.

import { BaseCallbackHandler } from "@langchain/core/callbacks/base";
import { Counter, Histogram, Gauge } from "prom-client";

const nodeExecutionDuration = new Histogram({
  name: "agent_node_execution_seconds",
  help: "Execution time of each LangGraph node",
  labelNames: ["graph_name", "node_name"],
  buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60]
});
const tokenUsageCounter = new Counter({
  name: "agent_token_total",
  help: "Total token consumption per request",
  labelNames: ["graph_name", "model", "type"]
});
const nodeIterationCounter = new Counter({
  name: "agent_node_iterations_total",
  help: "Node execution count (used for dead‑loop detection)",
  labelNames: ["graph_name", "node_name"]
});
const activeRunsGauge = new Gauge({
  name: "agent_active_runs",
  help: "Number of currently running agents",
  labelNames: ["graph_name"]
});

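export class PrometheusCallbackHandler extends BaseCallbackHandler {
  name = "PrometheusCallbackHandler";
  // Per-run bookkeeping so the end/error callbacks can close what start opened.
  private runs = new Map<string, { start: number; nodeName: string; isRoot: boolean }>();
  private graphName: string;

  constructor(graphName: string) {
    super();
    this.graphName = graphName;
  }

  async handleChainStart(
    _chain: any, _inputs: any, runId: string, parentRunId?: string,
    _tags?: string[], metadata?: Record<string, unknown>, _runType?: string, runName?: string
  ) {
    // Assumption: LangGraph tags runs with metadata.langgraph_node, and the
    // node-level run is the one whose run name equals that value.
    const graphNode = metadata?.langgraph_node;
    const nodeName = typeof graphNode === "string" && graphNode === runName ? graphNode : undefined;
    const isRoot = parentRunId === undefined;
    this.runs.set(runId, { start: Date.now(), nodeName: nodeName ?? "chain", isRoot });
    // Count node executions; the dead-loop alert is built on this counter.
    if (nodeName) nodeIterationCounter.inc({ graph_name: this.graphName, node_name: nodeName });
    // Only top-level runs count as an active agent; nested runs would inflate the gauge.
    if (isRoot) activeRunsGauge.inc({ graph_name: this.graphName });
  }

  async handleChainEnd(_outputs: any, runId: string) {
    this.finishRun(runId, true);
  }

  async handleLLMEnd(output: any, _runId: string) {
    const usage = output?.llmOutput?.tokenUsage;
    if (usage) {
      const model = output?.generations?.[0]?.[0]?.generationInfo?.model ?? "unknown";
      tokenUsageCounter.inc({ graph_name: this.graphName, model, type: "prompt" }, usage.promptTokens ?? 0);
      tokenUsageCounter.inc({ graph_name: this.graphName, model, type: "completion" }, usage.completionTokens ?? 0);
    }
  }

  async handleChainError(err: Error, runId: string) {
    this.finishRun(runId, false);
    console.error(`[Agent Error] graph=${this.graphName} runId=${runId}`, err);
  }

  // Closes a tracked run exactly once, so the gauge cannot drift negative.
  private finishRun(runId: string, observeDuration: boolean) {
    const run = this.runs.get(runId);
    if (!run) return;
    this.runs.delete(runId);
    if (observeDuration) {
      nodeExecutionDuration.observe(
        { graph_name: this.graphName, node_name: run.nodeName },
        (Date.now() - run.start) / 1000
      );
    }
    if (run.isRoot) activeRunsGauge.dec({ graph_name: this.graphName });
  }
}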

Inject the handler when invoking the graph:

const metricsCallback = new PrometheusCallbackHandler("customer-service");
await graph.invoke(input, { configurable: { thread_id: threadId }, callbacks: [metricsCallback] });

04 Dead‑loop detection

Three defensive layers:

Pass recursionLimit (recursion_limit in Python, default 25) in the run config as a hard stop; LangGraph raises GraphRecursionError when the limit is reached.

Track iterationCount in the agent state and abort when it exceeds maxIterations, emitting a critical alert.

Define a Prometheus rule that fires when the growth of agent_node_iterations_total within a time window exceeds a threshold (e.g., more than 50 executions in 5 minutes).
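
The second layer, an iteration counter carried in the agent state, might look roughly like the sketch below. It assumes the state is a plain dict; `MAX_ITERATIONS`, `DeadLoopError`, and `guard_iterations` are hypothetical names for illustration, not LangGraph APIs. In a real graph the guard would run at the top of any node that can route back to itself, and the raised error would be what feeds the critical alert.

```python
MAX_ITERATIONS = 10  # hypothetical budget; tune per graph


class DeadLoopError(RuntimeError):
    """Raised when a run exceeds its iteration budget."""


def guard_iterations(state: dict, max_iterations: int = MAX_ITERATIONS) -> dict:
    """Return a copy of the state with iteration_count incremented.

    Raises DeadLoopError once the count exceeds max_iterations, so the
    run fails loudly instead of spinning until the rate limit.
    """
    count = state.get("iteration_count", 0) + 1
    if count > max_iterations:
        raise DeadLoopError(
            f"iteration_count={count} exceeded max_iterations={max_iterations}"
        )
    return {**state, "iteration_count": count}


# Simulate a node that keeps routing back to itself.
state = {"messages": []}
aborted_at = None
try:
    for _ in range(100):
        state = guard_iterations(state, max_iterations=3)
except DeadLoopError:
    # The state still holds the last count that was within budget.
    aborted_at = state["iteration_count"]
```

Keeping the guard separate from recursionLimit means the application decides what "too many" means per graph, while the framework limit remains the backstop.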

05 Alert rule design

Tiered alerts to avoid fatigue:

P0 (critical): dead loops and token explosions.

P1 (warning): high latency or error rate during working hours.

P2 (info): cost-budget warnings.

groups:
  - name: agent_alerts
    rules:
      - alert: AgentDeadLoop
        # increase() counts executions inside the window; rate() would be a per-second value.
        expr: |
          increase(agent_node_iterations_total[5m]) > 50
        for: 1m
        labels:
          severity: critical
          pagerduty: "true"
        annotations:
          summary: "Agent suspected dead loop"
          description: "Graph {{ $labels.graph_name }} node {{ $labels.node_name }} executed >50 times in 5 min"

      - alert: AgentTokenExplosion
        # increase() gives tokens consumed inside the window; rate() would be tokens per second.
        expr: |
          sum by (graph_name) (increase(agent_token_total[10m])) > 100000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Token consumption surge"
          description: "Graph {{ $labels.graph_name }} used >100k tokens in 10 min"

      - alert: AgentHighLatency
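        # Aggregate buckets per graph (keeping le) so the quantile is a graph-level P95.
        expr: |
          histogram_quantile(0.95, sum by (le, graph_name) (rate(agent_node_execution_seconds_bucket[5m]))) > 30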
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent P95 latency >30 s"
          description: "Graph {{ $labels.graph_name }} P95 response time {{ $value }} s"

      - alert: AgentCostBudgetWarning
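        # Assumes a separately exported gauge agent_token_cost_per_unit (USD per token)
        # with one series per model and type; the handlers in this article do not define it.
        # increase(...[1d]) limits the sum to the last 24 hours instead of the lifetime total.
        expr: |
          sum(increase(agent_token_total[1d]) * on(model, type) group_left() agent_token_cost_per_unit) > 50
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Cost over the last 24 h exceeds $50"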
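
One detail that is easy to get wrong when writing thresholds like these: PromQL `rate()` returns a per-second average over the window, while `increase()` returns the total growth over the window. The arithmetic below mimics both on two raw counter samples. It is only an illustration (it ignores Prometheus's extrapolation and counter-reset handling), and the function names are made up for this sketch.

```python
def increase(first: float, last: float) -> float:
    """Total counter growth between two samples (no reset handling)."""
    return last - first


def rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second growth over the window."""
    return increase(first, last) / window_seconds


# A node counter that went from 120 to 180 during a 5-minute window.
window = 5 * 60
grew_by = increase(120, 180)         # executions inside the window
per_second = rate(120, 180, window)  # executions per second

# "More than 50 executions in 5 minutes" holds for the increase ...
fires_on_increase = grew_by > 50
# ... but the same number compared against the per-second rate does not.
fires_on_rate = per_second > 50
```

A threshold phrased as "N events in M minutes" therefore maps naturally to `increase()`, whereas a threshold on `rate()` has to be expressed per second.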

06 Token‑based cost tracking

Python implementation (langchain ≥ 0.3, langgraph ≥ 0.2, prometheus_client ≥ 0.20):

from prometheus_client import Counter, Histogram, Gauge
from langchain_core.callbacks import BaseCallbackHandler
import time

node_execution_duration = Histogram(
    "agent_node_execution_seconds",
    "Execution time of each LangGraph node",
    ["graph_name", "node_name"],
    buckets=[0.1, 0.5, 1, 2, 5, 10, 30, 60],
)
token_usage_counter = Counter(
    "agent_token_total",
    "Agent token consumption",
    ["graph_name", "model", "type"],
)
node_iteration_counter = Counter(
    "agent_node_iterations_total",
    "Node execution count (dead‑loop detection)",
    ["graph_name", "node_name"],
)
active_runs_gauge = Gauge(
    "agent_active_runs",
    "Number of currently running agents",
    ["graph_name"],
)

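class PrometheusCallbackHandler(BaseCallbackHandler):
    def __init__(self, graph_name: str):
        super().__init__()
        self.graph_name = graph_name
        # run_id -> (start time, node label, is top-level run)
        self._runs: dict[str, tuple[float, str, bool]] = {}

    def on_chain_start(self, serialized, inputs, *, run_id, parent_run_id=None, metadata=None, **kwargs):
        # Assumption: LangGraph sets metadata["langgraph_node"], and the node-level
        # run is the one whose run name (kwargs["name"]) equals that value.
        graph_node = (metadata or {}).get("langgraph_node")
        is_node = graph_node is not None and graph_node == kwargs.get("name")
        is_root = parent_run_id is None
        self._runs[str(run_id)] = (time.monotonic(), graph_node if is_node else "chain", is_root)
        if is_node:
            # Count node executions; the dead-loop alert is built on this counter.
            node_iteration_counter.labels(graph_name=self.graph_name, node_name=graph_node).inc()
        if is_root:
            # Only top-level runs count as an active agent; nested runs would inflate the gauge.
            active_runs_gauge.labels(graph_name=self.graph_name).inc()

    def _finish_run(self, run_id, observe_duration: bool):
        # Close a tracked run exactly once, so the gauge cannot drift negative.
        run = self._runs.pop(str(run_id), None)
        if run is None:
            return
        start, node_name, is_root = run
        if observe_duration:
            node_execution_duration.labels(graph_name=self.graph_name, node_name=node_name).observe(
                time.monotonic() - start
            )
        if is_root:
            active_runs_gauge.labels(graph_name=self.graph_name).dec()

    def on_chain_end(self, outputs, *, run_id, **kwargs):
        self._finish_run(run_id, observe_duration=True)

    def on_llm_end(self, response, *, run_id, **kwargs):
        token_usage = (getattr(response, "llm_output", None) or {}).get("token_usage") or {}
        # generation_info can be None, so guard before reading the model name.
        model = "unknown"
        if response.generations and response.generations[0]:
            info = response.generations[0][0].generation_info or {}
            model = info.get("model", "unknown")
        token_usage_counter.labels(graph_name=self.graph_name, model=model, type="prompt").inc(
            token_usage.get("prompt_tokens", 0) or 0
        )
        token_usage_counter.labels(graph_name=self.graph_name, model=model, type="completion").inc(
            token_usage.get("completion_tokens", 0) or 0
        )

    def on_chain_error(self, error, *, run_id, **kwargs):
        self._finish_run(run_id, observe_duration=False)
        print(f"[Agent Error] graph={self.graph_name} run_id={run_id}: {error}")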

Cost tracker (Python):

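from dataclasses import dataclass
from datetime import date

# Prices in USD per 1,000 tokens (the calculation below divides by 1000).
# Treat the values as illustrative and check current provider pricing.
TOKEN_COST = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
    "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
}


@dataclass
class UsageRecord:
    # Assumed shape of one usage record; the fields match what the tracker reads.
    graph_name: str
    model: str
    prompt_tokens: int
    completion_tokens: int


class CostTracker:
    def __init__(self):
        # (day, graph_name) -> accumulated USD, so totals start over each calendar day.
        self.daily_usage: dict[tuple[date, str], float] = {}
        self.alert_thresholds = {"per_request": 0.5, "daily": 50}

    def record_usage(self, record: UsageRecord) -> float:
        cost = self._calculate_cost(record)
        key = (date.today(), record.graph_name)
        self.daily_usage[key] = self.daily_usage.get(key, 0.0) + cost
        if cost > self.alert_thresholds["per_request"]:
            self._trigger_alert({"level": "warning", "type": "per_request_cost",
                                 "graph_name": record.graph_name, "cost": cost})
        if self.daily_usage[key] > self.alert_thresholds["daily"]:
            self._trigger_alert({"level": "critical", "type": "daily_budget_exceeded",
                                 "graph_name": record.graph_name, "cost": self.daily_usage[key]})
        return cost

    def _calculate_cost(self, record: UsageRecord) -> float:
        pricing = TOKEN_COST.get(record.model)
        if not pricing:
            # Unknown models are costed at 0 and never alert; log them in production.
            return 0.0
        return (record.prompt_tokens / 1000) * pricing["input"] + (record.completion_tokens / 1000) * pricing["output"]

    def _trigger_alert(self, alert: dict) -> None:
        # Placeholder sink; replace with a real notification channel.
        print(f"[Cost Alert] {alert}")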

TypeScript cost tracker mirrors the same thresholds and uses the same TOKEN_COST map.
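
To see how the per-request threshold reacts to a token explosion, here is a small worked calculation. It assumes, as the tracker's division by 1000 implies, that prices are USD per 1,000 tokens; the request sizes are made-up numbers, and `request_cost` is a hypothetical helper rather than part of the tracker.

```python
# Prices in USD per 1,000 tokens (assumption; copied from the gpt-4o entry).
PRICE_PER_1K = {"input": 0.005, "output": 0.015}
PER_REQUEST_THRESHOLD = 0.5  # USD, same value as alert_thresholds["per_request"]


def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one request in USD."""
    return (
        prompt_tokens / 1000 * PRICE_PER_1K["input"]
        + completion_tokens / 1000 * PRICE_PER_1K["output"]
    )


# A typical request versus one whose context has ballooned.
normal = request_cost(prompt_tokens=2_000, completion_tokens=500)
runaway = request_cost(prompt_tokens=120_000, completion_tokens=2_000)

normal_alerts = normal > PER_REQUEST_THRESHOLD
runaway_alerts = runaway > PER_REQUEST_THRESHOLD
```

Under these assumed prices the typical request costs under two cents, while the 120k-token prompt alone costs $0.60 and crosses the $0.50 per-request threshold on its own.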

07 Production pitfalls

Pitfall 1: HTTP 200 does not guarantee success; inspect response.body.error or LangSmith run.error.

Pitfall 2: In streaming mode handleLLMEnd fires only after the stream ends; use handleLLMNewToken for incremental token counting.

Pitfall 3: Do not use volatile identifiers such as runId as Prometheus labels; keep labels stable (graph_name, node_name, model).

Pitfall 4: Set alert thresholds based on observed traffic; overly low P95 latency thresholds cause alert fatigue.

Pitfall 5: Monitor the monitoring stack itself (e.g., Prometheus up metric) to detect watchdog failures.
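
Pitfall 2 can be made concrete with a small budget guard that counts streamed chunks as they arrive instead of waiting for the end-of-stream usage report. This is a sketch under assumptions: `StreamBudget` and `TokenBudgetExceeded` are hypothetical names, and one streamed chunk is treated as roughly one token, which is only an approximation. In LangChain, `on_token` would be called from the handler's new-token callback (handleLLMNewToken in JS, on_llm_new_token in Python).

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a streaming run goes over its chunk budget."""


class StreamBudget:
    """Counts streamed chunks per run and aborts when a budget is exceeded."""

    def __init__(self, max_chunks: int):
        self.max_chunks = max_chunks
        self.counts: dict[str, int] = {}

    def on_token(self, run_id: str, token: str) -> None:
        # Approximation: one chunk ~ one token. Exact usage still comes
        # from the provider's final usage report, if it sends one.
        self.counts[run_id] = self.counts.get(run_id, 0) + 1
        if self.counts[run_id] > self.max_chunks:
            raise TokenBudgetExceeded(
                f"run {run_id} streamed more than {self.max_chunks} chunks"
            )

    def finish(self, run_id: str) -> int:
        """Forget the run and return how many chunks it streamed."""
        return self.counts.pop(run_id, 0)


budget = StreamBudget(max_chunks=5)
tripped = False
try:
    for chunk in ["tok"] * 10:  # stand-in for a runaway stream
        budget.on_token("run-1", chunk)
except TokenBudgetExceeded:
    tripped = True
streamed = budget.finish("run-1")
```

The guard trips on the first chunk past the budget, so a runaway generation is cut off mid-stream rather than discovered after the bill has already grown.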
