Blast Radius Mapping: Understanding AI Agent Dependencies
Your slack-mcp goes down. How many agents are affected? Which sessions will fail? How many users are impacted? Without dependency mapping, the answer to all three is "we do not know." Blast radius mapping gives you the dependency graph to answer these questions before incidents happen.

What is blast radius?
In cloud infrastructure, "blast radius" describes the scope of impact when a component fails. AWS's Fault Injection Simulator uses the term to describe how a failure in one service propagates to dependent services. A database failure with a blast radius of 3 services is contained. A DNS failure with a blast radius of every service in the VPC is catastrophic.
In AI agent systems, blast radius answers: "If this MCP server goes down, what breaks?" The answer is never just "the tool that called it." The answer includes every agent that uses that tool, every session running through those agents, every multi-agent handoff chain that touches those agents, and every end user waiting for a response.
Hidden dependencies in multi-agent systems
Simple agent deployments have straightforward dependencies: Agent A uses tools X, Y, Z on MCP server M. If M goes down, Agent A's sessions fail. The blast radius is one agent.
Multi-agent systems have hidden dependencies that are not visible from any single agent's configuration:
# Visible dependency: support-agent uses postgres-mcp
support-agent → postgres-mcp/query
support-agent → postgres-mcp/get_customer
support-agent → slack-mcp/send_message
# Hidden dependency: triage-agent hands off to support-agent
triage-agent → [handoff] → support-agent
# Deeper hidden dependency: escalation-agent depends on triage
escalation-agent → [handoff] → triage-agent → [handoff] → support-agent
# If postgres-mcp goes down:
# - support-agent fails directly (uses postgres-mcp)
# - triage-agent fails indirectly (hands off to support-agent)
# - escalation-agent fails transitively (depends on triage-agent)
# Blast radius: 3 agents, not 1
The transitive dependency is the dangerous one. The escalation-agent does not directly call postgres-mcp. Its configuration does not mention postgres-mcp. But when postgres-mcp goes down, escalation-agent's sessions fail because the handoff chain is broken.
Without blast radius mapping, the engineer investigating the escalation-agent failure has no idea that the root cause is a database MCP server three layers away in the dependency chain.
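The transitive failure described above is a reachability question on a reversed dependency graph. The following is a minimal sketch, not LangSight's implementation: the `EDGES` list and the `blast_radius` function are hypothetical names, and handoffs are treated as ordinary dependency edges.

```python
from collections import defaultdict, deque

# Hypothetical edge list mirroring the example above: (dependent, dependency).
EDGES = [
    ("support-agent", "postgres-mcp"),
    ("support-agent", "slack-mcp"),
    ("triage-agent", "support-agent"),     # handoff
    ("escalation-agent", "triage-agent"),  # handoff
]

def blast_radius(failed, edges):
    """Return every node that directly or transitively depends on `failed`."""
    # Reverse the edges so we can walk from a dependency to its dependents.
    dependents = defaultdict(set)
    for dependent, dependency in edges:
        dependents[dependency].add(dependent)

    affected, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents[node]:
            if parent not in affected:  # also guards against cycles
                affected.add(parent)
                queue.append(parent)
    return affected

print(sorted(blast_radius("postgres-mcp", EDGES)))
# ['escalation-agent', 'support-agent', 'triage-agent']
```

A breadth-first walk is used here only because it is easy to read; any traversal that visits each dependent once gives the same set.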
Building the dependency graph
LangSight builds the agent-to-tool dependency graph automatically by observing real agent sessions. Every time an agent calls a tool, LangSight records the (agent, tool, server) tuple. Every time an agent hands off to another agent, LangSight records the (parent_agent, child_agent) relationship.
Over time, the dependency graph emerges from real usage patterns — not from static configuration that may be incomplete or outdated.
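The recording step can be sketched as follows. This is an illustration under stated assumptions, not LangSight's actual data model: the `DependencyGraph` class and its method names are hypothetical, and the store is a plain in-memory structure.

```python
from collections import defaultdict

class DependencyGraph:
    """Hypothetical in-memory store built from observed sessions."""

    def __init__(self):
        # agent -> server -> set of tool names seen on that server
        self.tools = defaultdict(lambda: defaultdict(set))
        # parent_agent -> set of child agents it has handed off to
        self.handoffs = defaultdict(set)

    def record_tool_call(self, agent, tool, server):
        # Sets make repeated observations of the same tuple idempotent.
        self.tools[agent][server].add(tool)

    def record_handoff(self, parent_agent, child_agent):
        self.handoffs[parent_agent].add(child_agent)

    def shared_servers(self):
        """Map each server to the sorted list of agents that call it directly."""
        users = defaultdict(set)
        for agent, servers in self.tools.items():
            for server in servers:
                users[server].add(agent)
        return {server: sorted(agents) for server, agents in users.items()}

graph = DependencyGraph()
graph.record_tool_call("support-agent", "query", "postgres-mcp")
graph.record_tool_call("support-agent", "query", "postgres-mcp")  # duplicate observation
graph.record_tool_call("billing-agent", "get_invoice", "postgres-mcp")
graph.record_tool_call("support-agent", "send_message", "slack-mcp")
graph.record_handoff("triage-agent", "support-agent")
print(graph.shared_servers())
```

Because every edge comes from an observed call, a dependency that exists only in a configuration file never appears, and one that is missing from configuration still does.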
$ langsight investigate --topology
Agent Dependency Graph
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
escalation-agent
└─▶ triage-agent (handoff)
    └─▶ support-agent (handoff)
        ├── postgres-mcp [query, get_customer, update_ticket]
        ├── slack-mcp [send_message, get_channel]
        └── jira-mcp [create_issue, update_issue]
billing-agent
├── postgres-mcp [query, get_invoice]
└── stripe-mcp [get_payment, create_refund]
onboarding-agent
├── postgres-mcp [get_customer, create_account]
├── email-mcp [send_welcome, send_verification]
└── slack-mcp [send_message]
Shared dependencies:
  postgres-mcp → 3 agents (support, billing, onboarding)
  slack-mcp → 2 agents (support, onboarding)
  jira-mcp → 1 agent (support)
The blast radius panel
When the health status of a tool or MCP server changes (for example, from UP to DOWN or from UP to DEGRADED), LangSight calculates the blast radius in real time and includes it in the alert:
# Alert: postgres-mcp DOWN
# Blast radius calculation:
Direct dependents:
  support-agent — uses 3 tools on postgres-mcp
  billing-agent — uses 2 tools on postgres-mcp
  onboarding-agent — uses 2 tools on postgres-mcp
Transitive dependents (via handoff chains):
  triage-agent — hands off to support-agent
  escalation-agent — hands off to triage-agent
Impact estimate (based on last 24h traffic):
  Sessions affected: ~180/hour
  Users impacted: ~120/hour
  Handoff chains broken: 2
    escalation-agent → triage-agent → support-agent
    triage-agent → support-agent
Circuit breakers opened:
  support-agent/postgres-mcp/query — OPEN since 03:14 UTC
  billing-agent/postgres-mcp/get_invoice — OPEN since 03:14 UTC
  onboarding-agent/postgres-mcp/get_customer — OPEN since 03:14 UTC
The alert does not just say "postgres-mcp is down." It says "postgres-mcp is down, 5 agents are affected, 180 sessions per hour will fail, and 2 handoff chains are broken." This context enables faster incident response because the on-call engineer immediately understands the scope and can prioritize accordingly.
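One way to arrive at per-hour figures like these is to average recent traffic for the affected agents. The sketch below is an assumption about how such an estimate could be computed, not a description of LangSight's calculation: the `Session` record, the hourly bucketing, and the choice to count a user once per hour are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Session:
    agent: str
    user_id: str
    hour: int  # hour bucket inside the lookback window, 0..window_hours-1

def impact_estimate(sessions, affected_agents, window_hours=24):
    """Average sessions and users per hour for the affected agents."""
    hit = [s for s in sessions if s.agent in affected_agents]
    # Assumption: a user counts once per hour bucket, however many sessions they ran.
    user_hours = {(s.user_id, s.hour) for s in hit}
    return {
        "sessions_per_hour": len(hit) / window_hours,
        "users_per_hour": len(user_hours) / window_hours,
    }

# Tiny hypothetical sample over a 2-hour window.
sample = [
    Session("support-agent", "u1", 0),
    Session("support-agent", "u1", 0),
    Session("billing-agent", "u2", 1),
    Session("analytics-agent", "u3", 1),  # not affected, so ignored
]
print(impact_estimate(sample, {"support-agent", "billing-agent"}, window_hours=2))
# {'sessions_per_hour': 1.5, 'users_per_hour': 1.0}
```

A flat average hides daily peaks; weighting by the same hour of day on previous days would be a reasonable refinement for an outage that starts at 03:14 UTC.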
Using blast radius for capacity planning
Beyond incident response, the dependency graph is valuable for capacity planning and architecture decisions.
Identify single points of failure. If postgres-mcp is used by 5 out of 7 agents, it is a single point of failure. Consider adding a read replica MCP server, implementing connection pooling, or adding caching to reduce the dependency.
Plan maintenance windows. Before taking an MCP server offline for maintenance, check its blast radius. An MCP server used by one non-critical agent can be taken down during business hours. An MCP server used by five agents including the customer-facing support agent should be maintained during off-peak hours with a failover in place.
Right-size circuit breaker thresholds. An MCP server with a large blast radius (5+ agents) should have aggressive circuit breaker settings (fail fast, shorter cooldowns). An MCP server used by one non-critical agent can have more lenient settings.
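These planning rules can be expressed directly against the dependency data. The following is a minimal sketch: the `BreakerSettings` fields, the numeric values, the 50% share cutoff, and the agent names are illustrative assumptions, and only the "5+ agents" boundary comes from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BreakerSettings:
    failure_threshold: int  # consecutive failures before the breaker opens
    cooldown_seconds: int   # wait before a trial request is allowed

# Illustrative values only; tune them against your own traffic.
AGGRESSIVE = BreakerSettings(failure_threshold=2, cooldown_seconds=15)
LENIENT = BreakerSettings(failure_threshold=10, cooldown_seconds=60)

def breaker_for(blast_radius_agents, large_radius=5):
    """Pick fail-fast settings for servers with a large blast radius."""
    return AGGRESSIVE if blast_radius_agents >= large_radius else LENIENT

def single_points_of_failure(server_to_agents, total_agents, share=0.5):
    """Servers that more than `share` of all agents depend on, largest first."""
    flagged = [
        (server, len(agents))
        for server, agents in server_to_agents.items()
        if len(agents) / total_agents > share
    ]
    return sorted(flagged, key=lambda item: -item[1])

# Hypothetical fleet of 7 agents, 5 of which use postgres-mcp.
usage = {
    "postgres-mcp": {"agent-1", "agent-2", "agent-3", "agent-4", "agent-5"},
    "slack-mcp": {"agent-1", "agent-6"},
    "analytics-mcp": {"agent-7"},
}
print(single_points_of_failure(usage, total_agents=7))  # [('postgres-mcp', 5)]
print(breaker_for(5))
print(breaker_for(1))
```

The same `usage` mapping answers the maintenance-window question: look up the server, and if its agent set includes a customer-facing agent, schedule the work off-peak.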
Impact-aware alerting
Traditional alerting treats all failures equally: any server DOWN triggers the same alert. Impact-aware alerting uses the blast radius to set alert severity dynamically.
# .langsight.yaml — impact-aware alert configuration
alerts:
  impact_aware: true
  rules:
    - condition: "server.status == DOWN"
      severity_override:
        # Blast radius >= 5 agents → page immediately
        blast_radius_agents >= 5: critical
        # Blast radius >= 3 agents → urgent alert
        blast_radius_agents >= 3: high
        # Blast radius == 1 agent → standard alert
        blast_radius_agents == 1: medium
When postgres-mcp (blast radius: 5 agents) goes down, the alert pages the on-call engineer immediately. When analytics-mcp (blast radius: 1 non-critical agent) goes down, the alert arrives as a standard Slack notification. The alert severity matches the actual impact, reducing alert fatigue while ensuring critical failures get immediate attention.
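The override logic in that configuration amounts to an ordered rule table. The sketch below shows one way to evaluate it and is not LangSight's rule engine: `RULES`, `severity_for`, first-match-wins ordering, and the fallback value are assumptions. Note that a blast radius of 2 matches none of the three rules shown, so the sketch returns its hypothetical default in that case.

```python
import operator

# (comparison, agent count, severity), checked top to bottom; first match wins.
RULES = [
    (operator.ge, 5, "critical"),
    (operator.ge, 3, "high"),
    (operator.eq, 1, "medium"),
]

def severity_for(blast_radius_agents, rules=RULES, default="medium"):
    """Return the severity of the first rule the blast radius satisfies."""
    for compare, threshold, severity in rules:
        if compare(blast_radius_agents, threshold):
            return severity
    # Assumption: counts that match no rule fall back to a default severity.
    return default

for agents in (5, 4, 2, 1):
    print(agents, severity_for(agents))
```

Ordering matters: if the `>= 3` rule were listed first, a five-agent outage would be classified as high instead of critical.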
Lineage tracking for debugging
During incident investigation, the dependency graph helps trace failures back to their root cause. When a user reports that the escalation-agent is not working:
- Check escalation-agent's health → healthy
- Check escalation-agent's dependencies → depends on triage-agent (handoff)
- Check triage-agent's health → healthy, but sessions failing
- Check triage-agent's dependencies → depends on support-agent (handoff)
- Check support-agent's health → sessions failing
- Check support-agent's tool dependencies → postgres-mcp is DOWN
The dependency graph turns a "the agent does not work" report into a "postgres-mcp is the root cause" diagnosis in seconds, following the dependency chain automatically instead of requiring manual investigation at each layer.
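The manual checklist above is a walk down the dependency chain until an unhealthy server is found. Here is a minimal sketch of that walk under stated assumptions: `DEPENDS_ON`, `SERVER_STATUS`, and `trace_root_causes` are hypothetical names, and any node without a recorded status is treated as UP.

```python
# Hypothetical snapshot: what each agent depends on, and current server health.
DEPENDS_ON = {
    "escalation-agent": ["triage-agent"],
    "triage-agent": ["support-agent"],
    "support-agent": ["postgres-mcp", "slack-mcp", "jira-mcp"],
}
SERVER_STATUS = {"postgres-mcp": "DOWN", "slack-mcp": "UP", "jira-mcp": "UP"}

def trace_root_causes(start, depends_on, server_status):
    """Depth-first walk; return one path per unhealthy server reachable from `start`."""
    paths, stack, seen = [], [[start]], set()
    while stack:
        path = stack.pop()
        node = path[-1]
        if node in seen:  # visit each node once, which also guards against cycles
            continue
        seen.add(node)
        if server_status.get(node, "UP") != "UP":
            paths.append(path)
        for dependency in depends_on.get(node, []):
            stack.append(path + [dependency])
    return paths

for path in trace_root_causes("escalation-agent", DEPENDS_ON, SERVER_STATUS):
    print(" → ".join(path))
# escalation-agent → triage-agent → support-agent → postgres-mcp
```

Returning the full path, not just the failed server, gives the on-call engineer the same chain the checklist reconstructs by hand.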
Key takeaways
- Multi-agent systems have hidden transitive dependencies. An agent three layers away from a failed MCP server in the dependency chain still fails. Without blast radius mapping, the root cause is invisible.
- Build the graph from real usage, not configuration. LangSight observes actual agent sessions to build the dependency graph. Static configuration may be incomplete or outdated.
- Include blast radius in every DOWN alert. "postgres-mcp is down, 5 agents affected, 180 sessions/hour impacted" enables faster triage than "postgres-mcp is down."
- Use blast radius for capacity planning. Identify single points of failure, plan maintenance windows, and right-size circuit breaker thresholds based on dependency data.
- Impact-aware alerting reduces noise. Page for high-blast-radius failures, notify for low-blast-radius failures. Match alert severity to actual impact.
Related articles
- Circuit Breakers for AI Agents — When the blast radius is known, circuit breakers prevent failures from propagating through the dependency graph.
- How to Monitor MCP Servers in Production — Proactive health monitoring detects outages. Blast radius mapping tells you the impact.
- Setting SLOs for AI Agents — Impact-aware SLOs use blast radius data to set appropriate reliability targets per agent.
- Schema Drift in MCP — Schema drift in a high-blast-radius MCP server can silently break dozens of agent sessions.
Map your agent dependencies
LangSight builds the agent-to-tool dependency graph automatically. See blast radius on every alert, plan maintenance with confidence, and trace failures to their root cause. Self-host free, Apache 2.0.