Blast Radius Mapping: Understanding AI Agent Dependencies

Name: LangSight
Author: LangSight

Your slack-mcp goes down. How many agents are affected? Which sessions will fail? How many users are impacted? Without dependency mapping, the answer to all three is "we do not know." Blast radius mapping gives you the dependency graph to answer these questions before incidents happen.

April 2, 2026 · 8 min read · LangSight Engineering

What is blast radius?

In cloud infrastructure, "blast radius" describes the scope of impact when a component fails. AWS's Fault Injection Simulator uses the term to describe how a failure in one service propagates to dependent services. A database failure with a blast radius of 3 services is contained. A DNS failure with a blast radius of every service in the VPC is catastrophic.

In AI agent systems, blast radius answers: "If this MCP server goes down, what breaks?" The answer is never just "the one tool call that failed." It includes every agent that uses a tool on that server, every session running through those agents, every multi-agent handoff chain that touches those agents, and every end user waiting for a response.

Hidden dependencies in multi-agent systems

Simple agent deployments have straightforward dependencies: Agent A uses tools X, Y, Z on MCP server M. If M goes down, Agent A's sessions fail. The blast radius is one agent.

Multi-agent systems have hidden dependencies that are not visible from any single agent's configuration:

# Visible dependency: support-agent uses postgres-mcp
support-agent → postgres-mcp/query
support-agent → postgres-mcp/get_customer
support-agent → slack-mcp/send_message

# Hidden dependency: triage-agent hands off to support-agent
triage-agent → [handoff] → support-agent

# Deeper hidden dependency: escalation-agent depends on triage
escalation-agent → [handoff] → triage-agent → [handoff] → support-agent

# If postgres-mcp goes down:
# - support-agent fails directly (uses postgres-mcp)
# - triage-agent fails indirectly (hands off to support-agent)
# - escalation-agent fails transitively (depends on triage-agent)
# Blast radius: 3 agents, not 1

The transitive dependency is the dangerous one. The escalation-agent does not directly call postgres-mcp. Its configuration does not mention postgres-mcp. But when postgres-mcp goes down, escalation-agent's sessions fail because the handoff chain is broken.

Without blast radius mapping, the engineer investigating the escalation-agent failure has no idea that the root cause is a database MCP server three layers away in the dependency chain.
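The transitive case above can be sketched as a reverse walk over the handoff edges. This is a minimal illustration, not LangSight's implementation: the dictionaries `TOOL_DEPS` and `HANDOFFS` and the function `blast_radius` are hypothetical names that mirror the example agents.

```python
from collections import deque

# Hypothetical edges mirroring the example above (not a LangSight data structure).
TOOL_DEPS = {"support-agent": {"postgres-mcp", "slack-mcp"}}
HANDOFFS = {  # parent agent -> agents it hands off to
    "triage-agent": {"support-agent"},
    "escalation-agent": {"triage-agent"},
}


def blast_radius(server: str) -> set[str]:
    """Return every agent that fails directly or transitively when `server` is down."""
    # Direct dependents: agents that call a tool on the failed server.
    affected = {agent for agent, servers in TOOL_DEPS.items() if server in servers}
    queue = deque(affected)
    # Transitive dependents: walk handoff edges backwards from each affected agent.
    while queue:
        child = queue.popleft()
        for parent, children in HANDOFFS.items():
            if child in children and parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(sorted(blast_radius("postgres-mcp")))
# ['escalation-agent', 'support-agent', 'triage-agent']
```

The walk starts from the agents that call the server and moves up through parents, so escalation-agent is found even though its own configuration never mentions postgres-mcp.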

Building the dependency graph

LangSight builds the agent-to-tool dependency graph automatically by observing real agent sessions. Every time an agent calls a tool, LangSight records the (agent, tool, server) tuple. Every time an agent hands off to another agent, LangSight records the (parent_agent, child_agent) relationship.

Over time, the dependency graph emerges from real usage patterns — not from static configuration that may be incomplete or outdated.

$ langsight investigate --topology

Agent Dependency Graph
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  escalation-agent
    └─▶ triage-agent (handoff)
         └─▶ support-agent (handoff)
              ├── postgres-mcp  [query, get_customer, update_ticket]
              ├── slack-mcp     [send_message, get_channel]
              └── jira-mcp      [create_issue, update_issue]

  billing-agent
    ├── postgres-mcp  [query, get_invoice]
    └── stripe-mcp    [get_payment, create_refund]

  onboarding-agent
    ├── postgres-mcp  [get_customer, create_account]
    ├── email-mcp     [send_welcome, send_verification]
    └── slack-mcp     [send_message]

Shared dependencies:
  postgres-mcp  → 3 agents (support, billing, onboarding)
  slack-mcp     → 2 agents (support, onboarding)
  jira-mcp      → 1 agent  (support)
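To make the observation-based approach concrete, here is a rough sketch of how recorded events could accumulate into a graph like the one shown. The class `DependencyGraph` and its methods are assumptions made for illustration; they are not LangSight's API.

```python
from collections import defaultdict


class DependencyGraph:
    """Hypothetical in-memory graph built from observed session events."""

    def __init__(self) -> None:
        # agent -> server -> set of tool names seen in real calls
        self.tools = defaultdict(lambda: defaultdict(set))
        # parent agent -> set of child agents it handed off to
        self.handoffs = defaultdict(set)

    def record_tool_call(self, agent: str, tool: str, server: str) -> None:
        """Store one observed (agent, tool, server) tuple."""
        self.tools[agent][server].add(tool)

    def record_handoff(self, parent: str, child: str) -> None:
        """Store one observed (parent_agent, child_agent) relationship."""
        self.handoffs[parent].add(child)

    def shared_servers(self) -> dict[str, list[str]]:
        """Map each server to the sorted list of agents that call it directly."""
        users = defaultdict(set)
        for agent, servers in self.tools.items():
            for server in servers:
                users[server].add(agent)
        return {server: sorted(agents) for server, agents in users.items()}


graph = DependencyGraph()
graph.record_tool_call("support-agent", "query", "postgres-mcp")
graph.record_tool_call("billing-agent", "get_invoice", "postgres-mcp")
graph.record_tool_call("support-agent", "send_message", "slack-mcp")
graph.record_handoff("triage-agent", "support-agent")
print(graph.shared_servers())
```

Because sets are used throughout, recording the same call many times does not change the graph; only new edges extend it.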

The blast radius panel

When a tool or MCP server's health status changes (UP to DOWN, UP to DEGRADED), LangSight calculates the blast radius in real time and includes it in the alert:

# Alert: postgres-mcp DOWN
# Blast radius calculation:

Direct dependents:
  support-agent      — uses 3 tools on postgres-mcp
  billing-agent      — uses 2 tools on postgres-mcp
  onboarding-agent   — uses 2 tools on postgres-mcp

Transitive dependents (via handoff chains):
  triage-agent       — hands off to support-agent
  escalation-agent   — hands off to triage-agent

Impact estimate (based on last 24h traffic):
  Sessions affected:  ~180/hour
  Users impacted:     ~120/hour
  Handoff chains broken: 2
    escalation-agent → triage-agent → support-agent
    triage-agent → support-agent

Circuit breakers opened:
  support-agent/postgres-mcp/query       — OPEN since 03:14 UTC
  billing-agent/postgres-mcp/get_invoice — OPEN since 03:14 UTC
  onboarding-agent/postgres-mcp/get_customer — OPEN since 03:14 UTC

The alert does not just say "postgres-mcp is down." It says "postgres-mcp is down, 5 agents are affected, 180 sessions per hour will fail, and 2 handoff chains are broken." This context enables faster incident response because the on-call engineer immediately understands the scope and can prioritize accordingly.
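The "handoff chains broken" figure in the alert can be derived by enumerating chains that end at a direct dependent. The sketch below assumes a hypothetical `HANDOFFS` mapping and a helper named `broken_chains`; it illustrates the idea and is not how LangSight necessarily computes it.

```python
# Hypothetical handoff edges from the example topology: parent -> children.
HANDOFFS = {
    "escalation-agent": ["triage-agent"],
    "triage-agent": ["support-agent"],
}


def broken_chains(direct_dependents: set[str]) -> list[list[str]]:
    """List every handoff chain that ends at an agent calling the failed server."""
    chains: list[list[str]] = []

    def walk(path: list[str]) -> None:
        for child in HANDOFFS.get(path[-1], []):
            if child in path:  # guard against handoff cycles
                continue
            new_path = path + [child]
            if child in direct_dependents:
                chains.append(new_path)
            walk(new_path)

    for start in HANDOFFS:
        walk([start])
    return chains


direct = {"support-agent", "billing-agent", "onboarding-agent"}
for chain in broken_chains(direct):
    print(" → ".join(chain))
# escalation-agent → triage-agent → support-agent
# triage-agent → support-agent
```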

Using blast radius for capacity planning

Beyond incident response, the dependency graph is valuable for capacity planning and architecture decisions.

Identify single points of failure. In the example graph, postgres-mcp is called directly by three agents and reaches all five through handoff chains, which makes it a single point of failure. Consider adding a read replica MCP server, implementing connection pooling, or adding caching to reduce the dependency.

Plan maintenance windows. Before taking an MCP server offline for maintenance, check its blast radius. An MCP server used by one non-critical agent can be taken down during business hours. An MCP server used by five agents including the customer-facing support agent should be maintained during off-peak hours with a failover in place.

Right-size circuit breaker thresholds. An MCP server with a large blast radius (5+ agents) should have aggressive circuit breaker settings (fail fast, shorter cooldowns). An MCP server used by one non-critical agent can have more lenient settings.
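For planning purposes, one way to use the graph is to rank servers by how many agents they can take down. The following sketch assumes the example topology; `TOOL_DEPS`, `HANDOFF_PARENTS`, `affected_agents`, and `rank_servers` are hypothetical names chosen for illustration.

```python
# Hypothetical data copied from the example topology.
TOOL_DEPS = {
    "support-agent": {"postgres-mcp", "slack-mcp", "jira-mcp"},
    "billing-agent": {"postgres-mcp", "stripe-mcp"},
    "onboarding-agent": {"postgres-mcp", "email-mcp", "slack-mcp"},
}
HANDOFF_PARENTS = {  # agent -> agents that hand off to it
    "support-agent": {"triage-agent"},
    "triage-agent": {"escalation-agent"},
}


def affected_agents(server: str) -> set[str]:
    """Direct plus transitive dependents of a server."""
    affected = {agent for agent, servers in TOOL_DEPS.items() if server in servers}
    stack = list(affected)
    while stack:
        for parent in HANDOFF_PARENTS.get(stack.pop(), ()):
            if parent not in affected:
                affected.add(parent)
                stack.append(parent)
    return affected


def rank_servers() -> list[tuple[str, int]]:
    """Sort servers by blast radius, largest first, ties broken by name."""
    servers = set().union(*TOOL_DEPS.values())
    sized = [(server, len(affected_agents(server))) for server in servers]
    return sorted(sized, key=lambda item: (-item[1], item[0]))


for server, size in rank_servers():
    print(f"{server:14} {size} agents")
```

Servers at the top of this list are candidates for redundancy, off-peak maintenance, and stricter circuit breaker settings; servers at the bottom can tolerate looser handling.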

Impact-aware alerting

Traditional alerting treats all failures equally: any server DOWN triggers the same alert. Impact-aware alerting uses the blast radius to set alert severity dynamically.

# .langsight.yaml — impact-aware alert configuration
alerts:
  impact_aware: true
  rules:
    - condition: "server.status == DOWN"
      severity_override:
        # Blast radius >= 5 agents → page immediately
        blast_radius_agents >= 5: critical
        # Blast radius >= 3 agents → urgent alert
        blast_radius_agents >= 3: high
        # Blast radius == 1 agent → standard alert
        blast_radius_agents == 1: medium

When postgres-mcp (blast radius: 5 agents) goes down, it pages the on-call engineer immediately. When analytics-mcp (blast radius: 1 non-critical agent) goes down, it sends a standard Slack notification. The alert severity matches the actual impact, reducing alert fatigue while ensuring critical failures get immediate attention.
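The severity rules can be read as a simple threshold function. The sketch below is an illustration in Python, not LangSight's rule engine, and `severity_for` is a hypothetical name. The configuration above does not say what happens at a blast radius of exactly 2 agents; this sketch assumes it falls through to the standard level.

```python
def severity_for(blast_radius_agents: int) -> str:
    """Map a blast radius (number of affected agents) to an alert severity."""
    if blast_radius_agents >= 5:
        return "critical"  # page the on-call engineer immediately
    if blast_radius_agents >= 3:
        return "high"  # urgent alert
    # Assumption: anything below 3 agents, including 2, is a standard alert.
    return "medium"


print(severity_for(5))  # critical
print(severity_for(1))  # medium
```

Checking the largest threshold first keeps the rules unambiguous, since a blast radius of 5 also satisfies the `>= 3` condition.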

Lineage tracking for debugging

During incident investigation, the dependency graph helps trace failures back to their root cause. When a user reports that the escalation-agent is not working:

1. Check escalation-agent's health → healthy
2. Check escalation-agent's dependencies → depends on triage-agent (handoff)
3. Check triage-agent's health → healthy, but sessions failing
4. Check triage-agent's dependencies → depends on support-agent (handoff)
5. Check support-agent's health → sessions failing
6. Check support-agent's tool dependencies → postgres-mcp is DOWN

The dependency graph turns a "the agent does not work" report into a "postgres-mcp is the root cause" diagnosis in seconds, following the dependency chain automatically instead of requiring manual investigation at each layer.

Key takeaways

Multi-agent systems have hidden transitive dependencies. An agent three handoff layers away from a failed MCP server still fails. Without blast radius mapping, the root cause is invisible.
Build the graph from real usage, not configuration. LangSight observes actual agent sessions to build the dependency graph. Static configuration is often incomplete or outdated.
Include blast radius in every DOWN alert. "postgres-mcp is down, 5 agents affected, 180 sessions/hour impacted" enables faster triage than "postgres-mcp is down."
Use blast radius for capacity planning. Identify single points of failure, plan maintenance windows, and right-size circuit breaker thresholds based on dependency data.
Impact-aware alerting reduces noise. Page for high-blast-radius failures, notify for low-blast-radius failures. Match alert severity to actual impact.

Circuit Breakers for AI Agents — When the blast radius is known, circuit breakers prevent failures from propagating through the dependency graph.
How to Monitor MCP Servers in Production — Proactive health monitoring detects outages. Blast radius mapping tells you the impact.
Setting SLOs for AI Agents — Impact-aware SLOs use blast radius data to set appropriate reliability targets per agent.
Schema Drift in MCP — Schema drift in a high-blast-radius MCP server can silently break dozens of agent sessions.

Map your agent dependencies

LangSight builds the agent-to-tool dependency graph automatically. See blast radius on every alert, plan maintenance with confidence, and trace failures to their root cause. Self-host free, Apache 2.0.
