How to Detect and Stop AI Agent Loops in Production
AI agent loops are the most common production failure mode. The agent calls the same tool with the same arguments, over and over, burning tokens and producing nothing. Here's how to detect and stop them automatically.
The $200 loop you didn't see coming
A support agent at a fintech company was deployed to handle billing queries. It worked perfectly in staging. On day three of production, a single session ran for 47 minutes, called crm-mcp/lookup_customer 89 times with identical arguments, and cost $214 before someone manually killed it.
The root cause: the CRM server returned a slightly malformed response. The agent decided it needed more data, called the same tool again, got the same malformed response, and repeated. No circuit breaker. No loop detection. No budget limit. The agent was doing exactly what it was programmed to do — retry until it got a good response.
This is not an edge case. It is the most common failure mode in production AI agent systems.
What is an AI agent loop?
An agent loop occurs when an agent calls the same tool (or sequence of tools) repeatedly without making meaningful progress. There are three distinct patterns:
1. Direct repetition
The simplest form: the same tool called with identical arguments multiple times in a row.
# Direct repetition pattern
postgres-mcp/query("SELECT * FROM orders WHERE id = 1234")
postgres-mcp/query("SELECT * FROM orders WHERE id = 1234")
postgres-mcp/query("SELECT * FROM orders WHERE id = 1234")
# ... 44 more times

This happens when the tool returns an error or unexpected result and the LLM's retry logic doesn't distinguish between "transient failure, retry" and "structural failure, give up."
2. Ping-pong between tools
Two tools are called alternately without state change between calls. The agent calls tool A, gets a result, calls tool B, gets a result, then calls tool A again with the same arguments.
# Ping-pong pattern
crm-mcp/get_customer(id=456)            # returns customer
billing-mcp/get_invoices(customer=456)  # returns invoices
crm-mcp/get_customer(id=456)            # same call again
billing-mcp/get_invoices(customer=456)  # same call again
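A ping-pong loop trips a same-call counter more slowly than direct repetition, but it shows up clearly as a repeating cycle in the call history. A minimal sketch of cycle detection (class and parameter names here are illustrative, not from any particular framework): hash each call, then check whether the tail of the history is one short cycle repeated.

```python
import hashlib
import json
from collections import deque

def call_hash(tool_name: str, args: dict) -> str:
    """Stable hash of a tool call (name + sorted args)."""
    payload = f"{tool_name}:{json.dumps(args, sort_keys=True)}"
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

class CycleDetector:
    """Detects A-B-A-B ping-pong by checking whether the last few
    calls are one cycle of length `cycle_len` repeated `repeats`
    extra times."""

    def __init__(self, cycle_len: int = 2, repeats: int = 2, history: int = 20):
        self.cycle_len = cycle_len
        self.repeats = repeats
        self.calls: deque = deque(maxlen=history)

    def record_call(self, tool_name: str, args: dict) -> bool:
        self.calls.append(call_hash(tool_name, args))
        n = self.cycle_len * (self.repeats + 1)  # original run + repeats
        if len(self.calls) < n:
            return False
        tail = list(self.calls)[-n:]
        cycle = tail[: self.cycle_len]
        # True only if every position in the tail matches the cycle
        return all(tail[i] == cycle[i % self.cycle_len] for i in range(n))
```

With the defaults, the detector fires once an A-B pair has repeated three times in a row; direct repetition (A-A-A-...) is caught as well, since it is a degenerate cycle.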
3. Retry-without-progress
The tool call succeeds (no error) but the response doesn't satisfy the agent's internal goal. The agent keeps calling with minor variations in arguments, never converging on a solution.
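One way to catch this pattern is to hash the result instead of the arguments: if the agent keeps receiving an identical payload no matter how it varies its inputs, it isn't making progress. A hedged sketch, assuming JSON-serializable tool results (the `ProgressDetector` name is hypothetical):

```python
import hashlib
import json
from collections import Counter

class ProgressDetector:
    """Flags retry-without-progress: the agent varies its arguments
    but keeps receiving the same result payload."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.result_counts: Counter = Counter()

    def record_result(self, tool_name: str, result) -> bool:
        """Returns True once the same (tool, result) pair has been
        seen `threshold` times, regardless of the arguments used."""
        payload = f"{tool_name}:{json.dumps(result, sort_keys=True)}"
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.result_counts[digest] += 1
        return self.result_counts[digest] >= self.threshold
```

This complements argument hashing: the arguments differ on every call, so only the unchanging result betrays the stalled session.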
Why standard retry logic makes this worse
Most agent frameworks include retry logic for transient failures — network timeouts, rate limits, temporary server errors. This is correct behavior. But the same retry logic, applied naively, turns a recoverable tool failure into an infinite loop.
The problem is that retry logic operates at the tool call level: "this call failed, retry it." Loop detection needs to operate at the session level: "this tool has been called N times with the same arguments, the session is stuck."
These are different problems requiring different solutions.
How to detect agent loops: three approaches
Approach 1: Argument hash comparison
The most reliable detection method. For each tool call, compute a normalized hash of the tool name and input arguments. If the same hash appears N times within a session window, the agent is looping.
import hashlib
import json
from collections import Counter
def normalize_args(args: dict) -> str:
    """Sort keys for consistent hashing."""
    return json.dumps(args, sort_keys=True)

def compute_call_hash(tool_name: str, args: dict) -> str:
    payload = f"{tool_name}:{normalize_args(args)}"
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

class LoopDetector:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.call_counts: Counter = Counter()

    def record_call(self, tool_name: str, args: dict) -> bool:
        """Returns True if a loop is detected."""
        call_hash = compute_call_hash(tool_name, args)
        self.call_counts[call_hash] += 1
        return self.call_counts[call_hash] >= self.threshold

Approach 2: Sliding window rate detection
Argument hashing catches exact repetition. Sliding window detection catches high-frequency calls regardless of argument variation. If a tool is called more than N times in M seconds, something is wrong.
from collections import deque
from datetime import datetime, timedelta, timezone

class RateLoopDetector:
    def __init__(self, max_calls: int = 10, window_seconds: int = 60):
        self.max_calls = max_calls
        self.window = timedelta(seconds=window_seconds)
        self.call_times: dict[str, deque] = {}

    def record_call(self, tool_name: str) -> bool:
        # Timezone-aware now; datetime.utcnow() is deprecated
        now = datetime.now(timezone.utc)
        if tool_name not in self.call_times:
            self.call_times[tool_name] = deque()
        times = self.call_times[tool_name]
        # Remove calls outside the window
        while times and now - times[0] > self.window:
            times.popleft()
        times.append(now)
        return len(times) >= self.max_calls

Approach 3: LLM output similarity
The most sophisticated approach: detect when the LLM is generating the same reasoning steps repeatedly. Compare semantic similarity between consecutive reasoning outputs. High similarity (> 0.95 cosine) across multiple steps indicates the agent is reasoning in circles.
This is computationally expensive and usually overkill. Approaches 1 and 2 catch > 90% of real-world loops.
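In practice, the two cheap approaches run together on every call: either one tripping marks the session as stuck. A self-contained sketch of that combination (condensed reimplementations of the two detectors above; the `SessionGuard` name is illustrative):

```python
import hashlib
import json
from collections import Counter, deque
from datetime import datetime, timedelta, timezone

class ExactLoopDetector:
    """Approach 1: same tool + same args, `threshold` times."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.counts: Counter = Counter()

    def record(self, tool: str, args: dict) -> bool:
        payload = f"{tool}:{json.dumps(args, sort_keys=True)}"
        h = hashlib.sha256(payload.encode()).hexdigest()
        self.counts[h] += 1
        return self.counts[h] >= self.threshold

class RateDetector:
    """Approach 2: more than `max_calls` to one tool per window."""
    def __init__(self, max_calls: int = 10, window_seconds: int = 60):
        self.max_calls = max_calls
        self.window = timedelta(seconds=window_seconds)
        self.times: dict[str, deque] = {}

    def record(self, tool: str) -> bool:
        now = datetime.now(timezone.utc)
        times = self.times.setdefault(tool, deque())
        while times and now - times[0] > self.window:
            times.popleft()
        times.append(now)
        return len(times) >= self.max_calls

class SessionGuard:
    """Runs both detectors on every call; either one trips it."""
    def __init__(self):
        self.exact = ExactLoopDetector(threshold=3)
        self.rate = RateDetector(max_calls=10, window_seconds=60)

    def check(self, tool: str, args: dict) -> bool:
        looping = self.exact.record(tool, args)
        too_fast = self.rate.record(tool)
        return looping or too_fast
```

Call `check()` before dispatching each tool call; a `True` return is the signal to apply one of the response options below.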
What to do when a loop is detected
You have three options, configured per agent:
Option 1: Warn and continue
Log the detection, fire an alert, but let the agent keep running. Use this when you want visibility without disruption — good for early monitoring before you're confident in your detection thresholds.
Option 2: Terminate the session
Hard stop. The session is marked loop_detected, the agent receives a termination signal, and a structured error is returned to the caller. This is the right default for production.
Option 3: Inject a recovery message
Instead of terminating, inject a system message telling the agent it is stuck: "You have called [tool] with the same arguments 3 times without progress. Stop and summarize what you know so far, then decide on a different approach." This gives the agent a chance to self-recover before forcing termination.
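One way to implement this escalation, sketched with hypothetical names: inject the recovery message on the first detection, and hard-terminate if the agent loops again afterwards.

```python
class RecoveryPolicy:
    """Option 3 sketch: inject a recovery message on the first
    detection, terminate if the loop recurs."""

    def __init__(self, max_recoveries: int = 1):
        self.max_recoveries = max_recoveries
        self.recoveries = 0

    def on_loop_detected(self, tool_name: str, count: int) -> dict:
        if self.recoveries < self.max_recoveries:
            self.recoveries += 1
            return {
                "action": "inject",
                "message": {
                    "role": "system",
                    "content": (
                        f"You have called {tool_name} with the same "
                        f"arguments {count} times without progress. "
                        "Stop and summarize what you know so far, "
                        "then decide on a different approach."
                    ),
                },
            }
        # Recovery budget exhausted: force termination
        return {"action": "terminate", "reason": "loop_detected"}
```

Bounding the number of recoveries matters: an agent that ignores the injected message once will usually ignore it again, and each extra attempt costs tokens.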
Integrating loop detection with LangSight
LangSight's SDK handles all three detection approaches automatically. Two lines of code:
from langsight.sdk import LangSightClient
client = LangSightClient(
    url="http://localhost:8000",
    api_key="ls_...",
    loop_detection=True,
    loop_threshold=3,         # same tool+args 3x = loop
    loop_action="terminate",  # or "warn" or "inject"
)

# Wrap your MCP session
traced = client.wrap(mcp_session, server_name="crm-mcp", agent_name="support-agent")

# Every tool call is now monitored
result = await traced.call_tool("lookup_customer", {"id": 456})

When a loop is detected, LangSight fires an alert with full context: which tool was looping, how many times it was called, the session ID, and the repeated arguments. The session is tagged loop_detected in the dashboard and filterable in the sessions view.
Budget guardrails: the backstop
Loop detection stops infinite repetition of identical calls. Budget guardrails are the backstop for everything else — they stop any session that exceeds a cost or step limit, regardless of the cause.
client = LangSightClient(
    url="http://localhost:8000",
    max_cost_usd=1.00,       # hard stop at $1
    max_steps=25,            # hard stop at 25 tool calls
    max_wall_time_s=120,     # hard stop at 2 minutes
    budget_soft_alert=0.80,  # alert at 80% of budget
)

Use both together: loop detection for the known failure pattern, budget guardrails for unknown failure patterns you haven't anticipated yet.
Setting the right thresholds
The default threshold of 3 identical calls works for most agents. But some legitimate patterns require adjustment:
- Polling agents — agents that legitimately poll a status endpoint should use time-based windows, not count-based detection. Set a longer window or whitelist the polling tool.
- Retry-heavy workflows — if your agent is designed to retry operations, increase the threshold to 5 or 7 to avoid false positives on transient failures.
- Sub-agents — each sub-agent in a multi-agent system should have its own loop detector. A parent agent calling the same sub-agent multiple times is not necessarily a loop; a sub-agent calling the same tool with the same arguments repeatedly is.
Key takeaways
- Agent loops are the most common production failure. Every production agent system will experience at least one.
- Standard retry logic makes loops worse. Detection needs to operate at the session level, not the call level.
- Argument hash comparison catches > 90% of real loops with zero false positives at threshold 3.
- Always combine loop detection with budget guardrails. Loop detection catches the known pattern; budget guardrails catch everything else.
- Start with loop_action="warn", monitor for a week, then switch to loop_action="terminate" once you're confident in your thresholds.
Stop agent loops in production
LangSight adds loop detection, budget guardrails, and circuit breakers to any agent in two lines of code. Self-host free.
Get started →