
Circuit Breakers for AI Agents: Preventing Cascading Failures

Your postgres-mcp goes down at 3 AM. Three agents depend on it. Without circuit breakers, every session that touches those agents burns tokens trying to call a tool that will never respond, times out after 30 seconds, and returns an unhelpful error. With circuit breakers, the agent knows the tool is down before it tries, fails fast, and can route to a fallback.

April 2, 2026·9 min read·LangSight Engineering

The cascading failure problem

In a traditional microservices architecture, a circuit breaker prevents a failed downstream service from taking down the entire system. When Service A calls Service B and B is down, the circuit breaker in A detects the failure pattern and stops sending requests to B — preventing thread pool exhaustion, timeout cascades, and resource waste.

AI agents have the same problem, but worse. When an agent calls a tool on a downed MCP server, it does not just waste a network request. It wastes an entire LLM reasoning step. The agent decides to call the tool (LLM cost), constructs the arguments (LLM cost), waits for the timeout (wall clock time), processes the error response (LLM cost), decides to retry (LLM cost), and repeats. A single failed tool call can cost 3-4x what a successful call costs because the agent's retry and error-handling reasoning is expensive.
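A back-of-envelope calculation makes the multiplier concrete. All token counts and the per-token price below are illustrative assumptions, not measured values:

```python
# Back-of-envelope cost of a failed vs. successful tool call.
# Token counts and the price are made-up assumptions for illustration.
PRICE_PER_1K_TOKENS = 0.01  # hypothetical blended price, $/1K tokens

def llm_cost(tokens: int) -> float:
    """Dollar cost of a reasoning step of the given token size."""
    return tokens / 1000 * PRICE_PER_1K_TOKENS

# Successful call: decide to call the tool (800), construct the
# arguments (400), process the result (600).
success = llm_cost(800 + 400 + 600)

# Failed call: same setup, plus processing the error (500) and a
# retry decision (300) — and the agent repeats that loop twice more
# before giving up.
one_attempt = llm_cost(800 + 400 + 500 + 300)
failure = 3 * one_attempt

print(f"success ≈ ${success:.4f}, failure ≈ ${failure:.4f} "
      f"({failure / success:.1f}x)")
```

With these assumed numbers the failed call comes out around 3.3x the successful one, which is where the 3-4x range comes from: the waste is not the network request, it is the repeated reasoning around it.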

In multi-agent systems, the cascade is even worse. Agent A calls tool X (down). Agent A fails. Agent B, which depends on Agent A's output via a handoff, also fails. Agent C, which depends on Agent B, also fails. A single MCP server outage propagates through the agent graph, failing every session that touches any agent in the dependency chain.

The circuit breaker pattern

The circuit breaker pattern, borrowed from electrical engineering and popularized in software by Netflix's Hystrix library, has three states:

Closed (normal operation)

The circuit is closed — tool calls pass through normally. The circuit breaker tracks the success and failure rates. As long as the failure rate stays below the threshold, the circuit stays closed.

Open (failing, stop calling)

When the failure count exceeds the threshold (for example, 5 consecutive failures), the circuit opens. All subsequent tool calls are immediately rejected without contacting the MCP server. The agent receives an immediate error: "Circuit open: postgres-mcp is currently unavailable."

This is the key insight: instead of waiting 30 seconds for a timeout on every call, the agent gets an immediate failure. The LLM can then decide what to do — use a cached result, skip that step, or inform the user — without burning tokens on retry attempts that will never succeed.

Half-open (testing recovery)

After a cooldown period (for example, 60 seconds), the circuit transitions to half-open. One test call is allowed through to the MCP server. If it succeeds, the circuit closes (normal operation resumes). If it fails, the circuit reopens for another cooldown period.

# Circuit breaker state machine
#
#   ┌──────────┐  N failures   ┌──────────┐
#   │  CLOSED  │──────────────▶│   OPEN   │
#   │ (normal) │               │ (reject) │
#   └──────────┘               └──────────┘
#        ▲                          │
#        │  success                 │ cooldown expires
#        │                          ▼
#        │                    ┌───────────┐
#        └────────────────────│ HALF-OPEN │
#                             │  (probe)  │
#                   failure → │   reopen  │
#                             └───────────┘
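A minimal implementation of the state machine above might look like the following sketch. The class and method names are illustrative, not LangSight's actual internals:

```python
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

class CircuitBreaker:
    """Minimal closed → open → half-open state machine (sketch)."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 60):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.state = CLOSED
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Check before calling the tool. False means reject immediately."""
        if self.state == OPEN:
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = HALF_OPEN  # cooldown expired: allow one probe
                return True
            return False  # fail fast — no network call, no timeout
        return True  # closed or half-open: call may proceed

    def record_success(self) -> None:
        self.failures = 0
        self.state = CLOSED  # a successful probe closes the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = OPEN  # a failed probe (or Nth failure) opens it
            self.opened_at = time.monotonic()
```

A caller checks `allow_request()` before invoking the tool, then reports the outcome with `record_success()` or `record_failure()`. A production version would also cap the number of concurrent half-open probes (the `half_open_max_calls` knob shown later) and make the transitions thread-safe.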

Why agents need circuit breakers specifically

Traditional circuit breakers protect services from wasting resources on failed calls. Agent circuit breakers protect against two additional costs:

  • Token waste: Every failed tool call triggers LLM reasoning about the failure, retry decisions, and error handling. With gpt-4o or Claude 3.5 Sonnet, this reasoning costs real money. A circuit breaker prevents the agent from ever reaching the "should I retry?" decision for a known-down tool.
  • Session quality: An agent that spends 3 of its 5 reasoning steps dealing with a failed tool produces a worse final answer than an agent that immediately knows the tool is down and adjusts its strategy. Fast failure enables graceful degradation.

Configuring circuit breakers in LangSight

from langsight.sdk import LangSightClient

client = LangSightClient(
    url="http://localhost:8000",
    api_key="ls_...",

    # Circuit breaker configuration
    circuit_breaker=True,
    failure_threshold=5,     # 5 consecutive failures → open circuit
    cooldown_seconds=60,     # wait 60s before probing recovery
    half_open_max_calls=1,   # allow 1 test call in half-open state
)

traced = client.wrap(mcp_session, agent_name="support-agent")

# When a tool's circuit is open, the call returns immediately:
# ToolCircuitOpenError("postgres-mcp/query: circuit open since 03:14:22 UTC")
# The agent can handle this gracefully instead of waiting for a timeout

Configuration can also be set per tool or per MCP server:

# .langsight.yaml — per-server circuit breaker config
servers:
  - name: postgres-mcp
    transport: stdio
    command: "python server.py"
    circuit_breaker:
      enabled: true
      failure_threshold: 3      # critical tool — trip faster
      cooldown_seconds: 30      # recover faster

  - name: analytics-mcp
    transport: sse
    url: "https://mcp.internal/analytics"
    circuit_breaker:
      enabled: true
      failure_threshold: 10     # non-critical — more tolerance
      cooldown_seconds: 120     # slower recovery probe

Blast radius: which agents are affected?

When a circuit opens, the critical question is: which agents are affected? LangSight tracks the agent-to-tool dependency graph, so when a circuit opens, the alert includes:

  • Which agents use the affected tool
  • How many active sessions are currently running through those agents
  • Which handoff chains are broken (multi-agent dependencies)
  • Estimated session failure rate during the outage
# Circuit open alert (Slack)
⚠️ Circuit OPEN: postgres-mcp/query
  Since: 2026-04-02 03:14:22 UTC
  Cause: 5 consecutive timeouts (avg 31.2s)

  Affected agents:
    support-agent      — 12 active sessions
    billing-agent      — 3 active sessions
    onboarding-agent   — 7 active sessions

  Handoff chains broken:
    triage-agent → support-agent → escalation-agent

  Estimated impact: ~120 sessions/hour
  Recovery probe: next attempt in 30s

Combining circuit breakers with health monitoring

Circuit breakers react to failures detected during real agent sessions. Health monitoring proactively detects failures before any agent is affected. The two systems work together:

  • Health monitoring detects the outage first. A proactive health probe at 03:14:00 detects that postgres-mcp is not responding. An alert fires.
  • Circuit breaker prevents waste. At 03:14:22, the first agent session tries to call postgres-mcp. The health monitor has already flagged it as DOWN, so the circuit pre-opens. The agent gets an immediate failure instead of a 30-second timeout.
  • Health monitoring detects recovery. At 03:18:00, the health probe succeeds. The circuit transitions to half-open. The next agent session's tool call is allowed through. It succeeds. The circuit closes. A recovery alert fires.

LangSight integrates health monitoring and circuit breakers into a single system. The health checker's DOWN detection automatically opens circuits. The health checker's UP detection automatically transitions circuits to half-open. This eliminates the recovery delay that would occur if the circuit breaker had to wait for a real agent session to probe recovery.
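One way to wire the two systems together is to let health-probe transitions drive the breaker directly. The sketch below uses hypothetical names, not LangSight's actual API:

```python
# Sketch: proactive health-probe results drive circuit state, so a
# circuit can pre-open before any agent session hits the dead server.
# The Circuit/on_probe_result wiring here is an illustrative assumption.

class Circuit:
    def __init__(self) -> None:
        self.state = "closed"
        self.open_reason = ""

    def force_open(self, reason: str) -> None:
        self.state = "open"
        self.open_reason = reason

    def half_open(self) -> None:
        if self.state == "open":
            self.state = "half-open"  # next real call acts as the probe

def on_probe_result(circuit: Circuit, server: str, healthy: bool) -> None:
    """Called by the health checker after each proactive probe."""
    if not healthy and circuit.state == "closed":
        # DOWN detected before any session failed: pre-open the circuit.
        circuit.force_open(f"{server} failed health probe")
    elif healthy and circuit.state == "open":
        # UP detected: skip the passive cooldown, go straight to half-open.
        circuit.half_open()
```

The key property is in the second branch: recovery is detected by the health probe, not by waiting for a real agent session to happen to arrive after the cooldown, which is what eliminates the recovery delay.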

Graceful degradation patterns

A circuit breaker that rejects a tool call is only useful if the agent can handle the rejection gracefully. There are several patterns:

Inform and skip

The agent tells the user that a specific capability is temporarily unavailable. "I cannot access the database right now, but I can answer from the information I already have."

Fallback tool

Configure a fallback MCP server for critical tools. If postgres-mcp is down, route queries to a read replica MCP server. LangSight supports fallback routing in the SDK configuration.

Cached response

For tools that return relatively stable data (customer records, configuration values), cache the last successful response and return it when the circuit is open. The cached data may be stale, but stale data is often better than no data.
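The three patterns compose naturally into a single handler around the guarded tool call. The sketch below reuses the `ToolCircuitOpenError` shown earlier; the cache and fallback wiring are illustrative assumptions, not a LangSight API:

```python
# Sketch: graceful degradation around a circuit-protected tool call.
# Tries the live tool, then a fallback, then a cached response, and
# finally falls back to inform-and-skip.

class ToolCircuitOpenError(Exception):
    """Raised immediately when the tool's circuit is open."""

_cache: dict[str, str] = {}  # last successful response per tool

def call_with_degradation(tool: str, call, fallback=None) -> str:
    try:
        result = call()
        _cache[tool] = result  # remember for future outages
        return result
    except ToolCircuitOpenError:
        if fallback is not None:   # pattern 2: fallback tool (e.g. replica)
            return fallback()
        if tool in _cache:         # pattern 3: stale-but-useful cache
            return _cache[tool]
        # pattern 1: inform and skip
        return f"{tool} is temporarily unavailable; answering without it."
```

The ordering is a design choice: a fallback tool returns fresh data, a cache returns stale data, and inform-and-skip returns none, so the handler prefers them in that order.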

Key takeaways

  • Agents waste tokens on failed tools. Without circuit breakers, every failed tool call triggers expensive LLM reasoning about retries and error handling. Circuit breakers prevent the agent from ever reaching that point.
  • Three states: closed, open, half-open. Closed is normal. Open rejects immediately. Half-open probes for recovery. This is the same pattern used in microservices, adapted for the AI agent context.
  • Combine with health monitoring. Proactive health probes can pre-open circuits before any agent session is affected. Health-detected recovery can transition circuits to half-open faster than passive detection.
  • Blast radius awareness is critical. When a circuit opens, know which agents, sessions, and handoff chains are affected. This context in the alert enables faster incident response.
  • Design for graceful degradation. A circuit breaker that rejects calls is only half the solution. The agent must handle the rejection — inform the user, use a fallback, or return cached data.

Add circuit breakers to your agents

LangSight adds circuit breakers, health monitoring, and blast radius analysis to any agent system. Prevent cascading failures before they reach your users. Self-host free, Apache 2.0.
