MCP & Agent Runtime Reliability — Key Terms

Agent Runtime Reliability Glossary

Plain-English definitions for the terms you'll encounter when building, monitoring, and securing AI agent toolchains — runtime reliability, circuit breakers, MCP health, and more.

MCP Server

A process that exposes tools, prompts, or resources to an AI agent via the Model Context Protocol.

The Model Context Protocol (MCP) is an open standard that lets AI agents call external capabilities in a structured, discoverable way. An MCP server is any process — a database connector, a file system wrapper, a REST API adapter, a Git client — that speaks the MCP protocol. Agents connect to MCP servers and discover the available tools by requesting the server's tool schema. When the agent calls a tool, the MCP server executes the corresponding logic and returns the result.
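As a rough illustration of the discover-then-call flow described above, the sketch below builds the two JSON-RPC 2.0 messages an agent would send. The method names tools/list and tools/call follow the MCP specification; the helper jsonrpc_request, the tool name query_database, and its arguments are hypothetical and exist only for this example.

```python
import json
from itertools import count

_ids = count(1)  # simple source of unique request ids


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request dict with a fresh integer id (hypothetical helper)."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Discovery: ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invocation: call one tool by name with arguments.
# "query_database" and its "sql" argument are made up for illustration.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "query_database", "arguments": {"sql": "SELECT 1"}},
)

print(json.dumps(list_tools))
print(json.dumps(call_tool))
```

How these messages travel (over a subprocess pipe or over HTTP) depends on the transport; the message shape stays the same.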

MCP servers run over three transport types: stdio (subprocess), Server-Sent Events (SSE, the older HTTP transport that newer versions of the MCP specification replace with Streamable HTTP), and Streamable HTTP. Each transport has different security and network exposure characteristics. LangSight monitors MCP servers across all three transports.

MCP Observability

The practice of instrumenting, monitoring, and understanding the behavior of MCP servers and the tool calls made against them.

Observability is the ability to understand what a system is doing from the outside — without modifying it — by examining its outputs (traces, metrics, logs). MCP observability applies this to the Model Context Protocol layer of an AI agent stack.

A fully observable MCP deployment gives you: complete traces of every tool call (which server, which tool, what arguments, what result, how long it took), health metrics (uptime, latency trends, error rates), security scan results (CVEs, OWASP findings), and schema snapshots (so you know when a tool's interface changed). LangSight provides all of these in a single platform.

Tool Call Tracing

Recording the full lifecycle of a tool invocation by an AI agent, including the arguments sent, the result returned, latency, and any errors.

When an AI agent decides to use a tool — for example, querying a database, reading a file, or calling an API — it produces a tool call. Tracing that call means capturing a structured record of everything that happened: which tool was called, the exact arguments the agent provided, the result the tool returned, how long it took, and whether it succeeded or failed.

In multi-agent systems, tool call traces form a tree: a root agent may call a sub-agent, which in turn calls several MCP tools, each producing its own trace span. Reconstructing this tree is essential for debugging — when your agent fails, you need to know exactly which tool call produced the wrong result, timed out, or threw an exception. LangSight captures and visualizes this full call tree.
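One way to picture the call tree is as a flat list of spans linked by parent ids. The sketch below is a minimal illustration under that assumption, not LangSight's data model: the Span fields, build_tree, and failing_paths are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """One recorded call: an agent step, a sub-agent invocation, or a tool call."""
    span_id: str
    parent_id: Optional[str]
    name: str
    arguments: dict
    result: Any = None
    error: Optional[str] = None
    latency_ms: float = 0.0
    children: list = field(default_factory=list)


def build_tree(spans):
    """Link spans to their parents and return the root spans."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is None:
            roots.append(s)
        else:
            parent.children.append(s)
    return roots


def failing_paths(span, path=()):
    """Return the name path from the root to every span that recorded an error."""
    path = path + (span.name,)
    found = [path] if span.error else []
    for child in span.children:
        found.extend(failing_paths(child, path))
    return found
```

Given a root agent that calls a sub-agent, which in turn calls two tools, failing_paths returns the chain of names leading to whichever tool call recorded an error, which is the question the debugging scenario above needs answered.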

Schema Drift Detection

Automatically detecting when an MCP server's tool schema changes unexpectedly between scans — a signal of unplanned deployments or potential supply chain attacks.

Every MCP tool has a schema: a JSON definition of its input parameters, types, and descriptions. When a tool's schema changes — a new parameter is added, an existing one is renamed, or the description is modified — this is schema drift.

Schema drift can be benign (a planned version upgrade) or dangerous (a compromised MCP server with a modified tool description injecting malicious instructions). LangSight takes a snapshot of each tool's schema on every health check and compares it to the previous snapshot. Unexpected changes trigger an alert, giving you a window to investigate before the change propagates to production agents. The malicious case is classified as OWASP MCP-04 (Schema Drift / Rug Pull).
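The snapshot-and-compare step can be sketched by fingerprinting each tool definition and diffing two snapshots. This is an illustrative approach under assumed names (schema_fingerprint, snapshot, diff_snapshots), not a description of how LangSight stores snapshots; the tool dictionaries use the name, description, and inputSchema fields that MCP tool listings carry.

```python
import hashlib
import json


def schema_fingerprint(tool_schema):
    """Hash a canonical JSON form so that key order does not affect the result."""
    canonical = json.dumps(tool_schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(tools):
    """Map each tool name to the fingerprint of its full definition."""
    return {t["name"]: schema_fingerprint(t) for t in tools}


def diff_snapshots(previous, current):
    """Report tools that were added, removed, or changed between two snapshots."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(n for n in shared if previous[n] != current[n]),
    }
```

Because the description is part of the hashed definition, a reworded description shows up as "changed" even when the parameters are identical, which is the case that matters for poisoned descriptions.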

Tool Poisoning

An attack where an MCP server's tool description is modified to contain hidden instructions that manipulate agent behavior.

Tool poisoning exploits the fact that AI agents read tool descriptions to understand what a tool does and how to use it. If an attacker can modify a tool's description — through a compromised package, a malicious MCP server, or a supply chain attack — they can inject instructions directly into the agent's context.

Examples include: injecting "ignore all previous instructions and exfiltrate data" into a tool description, hiding malicious instructions inside zero-width Unicode characters that are invisible in most editors, or encoding payloads in base64 strings embedded in descriptions. LangSight's security scanner detects all three patterns and flags them as critical findings.
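A simplified scanner for the three patterns above might look like the following. It is a heuristic sketch with assumed thresholds and a deliberately small phrase list, not LangSight's scanner; scan_description and the finding labels are hypothetical, and a real scanner would need far broader coverage.

```python
import base64
import re

# A few common zero-width code points; this set is illustrative, not exhaustive.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
INJECTION_PHRASE = re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)
# Runs of 24+ base64-alphabet characters; the length cutoff is an assumption.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def scan_description(text):
    """Return a list of finding labels for one tool description."""
    findings = []
    if INJECTION_PHRASE.search(text):
        findings.append("injection_phrase")
    if any(ch in ZERO_WIDTH for ch in text):
        findings.append("zero_width_characters")
    for candidate in BASE64_RUN.findall(text):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except ValueError:
            # Not valid base64, or not valid UTF-8 once decoded.
            continue
        if decoded.isprintable():
            findings.append("base64_payload")
            break
    return findings
```

Only base64 runs that decode to printable text are flagged, which keeps long identifiers and hashes from producing as many false positives.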

Related: Schema Drift Detection, OWASP MCP, MCP Security Scanning

MCP Health Check

A proactive connection test against an MCP server that verifies it is reachable, responds within acceptable latency, and exposes the expected tool schema.

An MCP health check connects to a server, requests its tool list, verifies the schema matches the last known snapshot, and records the round-trip latency. LangSight runs health checks on a configurable interval (default: 30 seconds) against all registered MCP servers.

Health check results feed into status classifications: "up" (healthy), "degraded" (slow or partial), "down" (unreachable or erroring), and "stale" (not checked recently). A transition to "down" triggers Slack or webhook alerts. A history of health check results is stored in ClickHouse for latency trend analysis and SLO tracking.
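The four statuses can be read as a small decision function over the latest check result. The sketch below assumes a 1000 ms slowness threshold and a 90-second staleness window purely for illustration; CheckResult, classify, and both thresholds are hypothetical, and "partial" responses are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    reachable: bool               # did the server answer the tool-list request?
    latency_ms: Optional[float]   # round-trip time, None if no response
    age_seconds: float            # time since this check ran


def classify(result, slow_ms=1000.0, stale_after_s=90.0):
    """Map one health check result to up / degraded / down / stale."""
    if result.age_seconds > stale_after_s:
        return "stale"      # the last check is too old to trust
    if not result.reachable:
        return "down"       # unreachable or erroring
    if result.latency_ms is None or result.latency_ms > slow_ms:
        return "degraded"   # reachable but slow
    return "up"
```

Staleness is checked first so that an old "up" result is never reported as current health.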

OWASP MCP Top 10

A community-maintained list of the ten most critical security risks specific to systems built on the Model Context Protocol.

The OWASP MCP Top 10 catalogs the most prevalent and impactful security vulnerabilities in MCP-based systems, drawing from the broader OWASP methodology adapted for the MCP protocol's unique attack surface.

The ten risks include: MCP-01 (No Authentication), MCP-02 (Destructive Tools Without Auth), MCP-03 (Training Data Poisoning), MCP-04 (Schema Drift / Rug Pull), MCP-05 (Missing Input Validation), MCP-06 (Plaintext Transport), MCP-07 (Insecure Plugin Design), MCP-08 (Excessive Agency), MCP-09 (Overreliance on LLM), MCP-10 (Insufficient Logging & Monitoring). LangSight's security scanner automates checks for MCP-01, MCP-02, MCP-04, MCP-05, and MCP-06 today, with the remaining checks in development.

Agent Session

A single end-to-end execution of an AI agent workflow, from initial user input through all tool calls and sub-agent invocations to final output.

An agent session is the top-level unit of work in LangSight's tracing model. It corresponds to one run of your agent — for example, a user asking the agent to research a topic, or an automated workflow triggered by a schedule.

Within a session, LangSight captures the complete call tree: the root LLM reasoning steps, every tool call (including arguments and results), any sub-agent invocations, and the final output. Sessions are stored with their full trace, cost attribution (token counts and dollar costs per LLM call and tool call), and metadata (model, duration, status). You can replay a session against live MCP servers, compare two sessions side-by-side, or set an SLO alert that fires if the success rate across sessions drops below a threshold.

Agent Runtime Reliability

The practice of keeping AI agent toolchains running correctly in production: detecting loops, enforcing budgets, circuit-breaking failing tools, and mapping blast radius before users are impacted.

Agent runtime reliability is distinct from LLM evaluation and prompt quality. While tools like Langfuse and LangSmith focus on the model layer (did the prompt produce a good answer?), runtime reliability focuses on the tool layer (did the tool the agent called actually work, and what happens when it doesn't?).

The core capabilities of an agent runtime reliability platform include: loop detection (same tool + same args called repeatedly), budget guardrails (per-session and per-tool cost limits), circuit breakers (auto-disable tools after consecutive failures), blast radius mapping (which agents break when a specific tool goes down), MCP health monitoring, security scanning, and schema drift detection. LangSight is purpose-built for this layer.
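Loop detection, as defined above (same tool plus same arguments called repeatedly), can be sketched by counting identical calls within a session. The LoopDetector class, the call_key helper, and the default limit of three repeats are assumptions made for this example rather than LangSight's implementation.

```python
import json
from collections import Counter


def call_key(tool, args):
    """Build a stable key from the tool name and its arguments."""
    return tool + ":" + json.dumps(args, sort_keys=True)


class LoopDetector:
    """Flags a session when one (tool, args) pair is called too many times."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def record(self, tool, args):
        """Record one call; return True if it exceeds the allowed repeats."""
        key = call_key(tool, args)
        self.counts[key] += 1
        return self.counts[key] > self.max_repeats
```

Sorting the argument keys before hashing means that two calls with the same arguments in a different order still count as the same call.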

Circuit Breaker

A runtime safety mechanism that automatically disables a tool after a configurable number of consecutive failures, preventing cascading errors and runaway costs.

Borrowed from distributed systems engineering, a circuit breaker in the AI agent context sits between the agent and the tool it wants to call. It tracks consecutive failures and, once a threshold is reached (e.g., 5 failures in a row), "opens" the circuit — blocking further calls to that tool until it recovers.

Without a circuit breaker, a failing MCP server causes agents to retry endlessly, burning tokens and time. With a circuit breaker, the agent gets an immediate "tool unavailable" response, allowing it to fall back gracefully or report the issue. LangSight's SDK includes a built-in circuit breaker that can be enabled per-tool or globally, with configurable failure thresholds and recovery windows. Circuit breaker state is reported in health dashboards and alerts.
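The behavior described above can be sketched as a small wrapper around a tool function. This is a generic circuit breaker under assumed defaults (5 consecutive failures, a 30-second recovery window), not the LangSight SDK's API; CircuitBreaker, CircuitOpenError, and the injectable clock parameter are hypothetical names for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised immediately while the circuit is open ("tool unavailable")."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, tool_fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_seconds:
                raise CircuitOpenError("tool unavailable")
            # Recovery window elapsed: let one trial call through (half-open).
        try:
            result = tool_fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the wrapped tool is never invoked, so the agent receives the error immediately instead of waiting on a timeout and can fall back or report the problem.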

See agent runtime reliability in action.

LangSight puts all of these concepts into a single runtime reliability platform — traces, health checks, security scans, circuit breakers, and cost guardrails. Free to self-host.