Your agent failed.
Which tool broke, and how do we stop it next time?
Detect loops. Enforce budgets. Break failing tools. Map blast radius. For MCP servers: health checks, security scanning, and schema drift detection.
Where LangSight fits
What question are you trying to answer?
Langfuse watches the brain. LangSight watches the hands. Use them together — they never overlap.
| Question | Best tool |
|---|---|
| Did the prompt/model perform well? | LangWatch / Langfuse / LangSmith |
| Should I change prompts or eval policy? | LangWatch / Langfuse / LangSmith |
| Is my server CPU/memory healthy? | Datadog / New Relic |
| → Which tool call failed in production? | LangSight |
| → Is my agent stuck in a loop? | LangSight |
| → Is an MCP server unhealthy or drifting? | LangSight |
| → Is an MCP server exposed or risky? | LangSight |
| → Why did this session cost $47 instead of $3? | LangSight |
| → If this tool goes down, which agents break? | LangSight |
The problem
LLM quality is only half the problem.
Teams already have ways to inspect prompts and eval scores. What they still cannot answer fast enough:
Agent stuck in a loop
Your agent retries the same tool with the same args 47 times. Burns $200. Produces nothing. Nobody detects it until the invoice arrives.
Tool failure cascades across agents
postgres-mcp goes down. 3 agents depend on it. All sessions fail. You don't know which agents are affected or how many users are impacted.
Cost explosion with no guardrails
A sub-agent retries geocoding-mcp endlessly. At $0.005/call, that's $1,800/week. No budget limit existed to stop it. You need tool-level cost control.
MCP server changed and nobody noticed
Schema drifted. A field was renamed. Auth expired. The agent keeps calling, gets corrupted data, and hallucinates downstream. Silent until users complain.
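To put the cost explosion scenario in perspective, the arithmetic below back-computes the call volume implied by the two figures quoted above ($0.005 per call, $1,800 per week). Nothing here is LangSight-specific; it is only a worked calculation.

```python
# Figures taken from the cost explosion scenario above.
COST_PER_CALL_USD = 0.005
WEEKLY_SPEND_USD = 1_800

# How many calls does that weekly spend correspond to?
calls_per_week = round(WEEKLY_SPEND_USD / COST_PER_CALL_USD)

# Spread evenly over a week (7 days * 24 h * 3600 s).
calls_per_second = calls_per_week / (7 * 24 * 60 * 60)

print(f"{calls_per_week:,} calls/week")        # 360,000 calls/week
print(f"{calls_per_second:.2f} calls/second")  # 0.60 calls/second
```

That is roughly one call every 1.7 seconds, a rate low enough to go unnoticed in ordinary traffic graphs unless a per-tool budget is in place.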
The solution
Prevent. Detect. Monitor. Map.
Prevent
Stop loops, enforce budgets, break failing tools — before users notice. Configure thresholds per-agent from the dashboard. No code change needed after initial SDK setup.
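As a rough illustration of what "break failing tools" means, here is a minimal circuit breaker sketch. It is not LangSight's implementation: the `CircuitBreaker` class is hypothetical, and treating the five failures as consecutive is an assumption. The SDK configuration that follows is how the real guardrails are enabled.

```python
class CircuitBreaker:
    """Hypothetical breaker that opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 5) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        # Open means: stop sending calls to this tool.
        return self.consecutive_failures >= self.failure_threshold

    def record(self, ok: bool) -> None:
        # A success resets the streak; a failure extends it.
        # Assumption: only consecutive failures count toward the threshold.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


breaker = CircuitBreaker(failure_threshold=5)
for _ in range(5):
    breaker.record(ok=False)
print(breaker.is_open)  # True
```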
from langsight.sdk import LangSightClient

client = LangSightClient(
    url="http://localhost:8000",
    loop_detection=True,    # same tool+args 3x → stop
    max_cost_usd=1.00,      # budget limit per session
    max_steps=25,           # step limit
    circuit_breaker=True,   # auto-disable after 5 failures
)

# Override thresholds per-agent from the dashboard:
# Settings → Prevention → Add agent override
# No code change needed.

Detect
See exactly which tool failed, when, and why. Every session gets a health tag: success, loop_detected, budget_exceeded, tool_failure. Filter and investigate instantly.
$ langsight sessions --id sess-f2a9b1
sess-f2a9b1 (support-agent) [LOOP_DETECTED]
├── jira-mcp/get_issue      89ms ✓
├── postgres-mcp/query      42ms ✓
├── → billing-agent handoff
│   ├── crm-mcp/update     120ms ✓
│   └── slack-mcp/notify      — ✗ timeout
Root cause: slack-mcp timed out at 14:32
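One way to picture how a session could end up with one of the four health tags is the sketch below. It is an illustration under stated assumptions, not LangSight's logic: the `ToolCall` record and `health_tag` function are hypothetical, a loop is counted as any identical tool+args pair seen three times in the session, and the order in which the checks run is a guess.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool call in a session."""
    tool: str
    args: dict
    ok: bool
    cost_usd: float = 0.0


def health_tag(calls, max_repeats=3, max_cost_usd=1.00):
    """Return one of: loop_detected, budget_exceeded, tool_failure, success."""
    # Identical (tool, args) pairs are grouped; sorted-key JSON gives a stable key.
    repeats = Counter((c.tool, json.dumps(c.args, sort_keys=True)) for c in calls)
    if any(n >= max_repeats for n in repeats.values()):
        return "loop_detected"
    if sum(c.cost_usd for c in calls) > max_cost_usd:
        return "budget_exceeded"
    if any(not c.ok for c in calls):
        return "tool_failure"
    return "success"


# Hypothetical tool name; three identical calls trip the loop check.
calls = [ToolCall("geocoding-mcp/lookup", {"q": "Berlin"}, ok=True, cost_usd=0.005)] * 3
print(health_tag(calls))  # loop_detected
```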
Monitor
MCP health checks, security scanning, schema drift detection. Proactive — catches problems before agents start failing. Alerts via Slack, OpsGenie, PagerDuty.
$ langsight mcp-health
Server          Status    Latency   Schema    Circuit
snowflake-mcp   ✅ UP     142ms     Stable    closed
slack-mcp       ⚠️ DEG    1,240ms   Stable    closed
jira-mcp        ❌ DOWN   —         —         open
postgres-mcp    ✅ UP     31ms      Changed   closed
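The "Changed" value in the Schema column suggests some comparison against a stored baseline. A common way to do this, shown here as a generic sketch and not as LangSight's method, is to fingerprint each server's tool schemas and compare hashes. The function names and the example schemas are hypothetical.

```python
import hashlib
import json


def schema_fingerprint(tool_schemas: dict) -> str:
    """Hash a server's tool schemas so that key order does not matter."""
    canonical = json.dumps(tool_schemas, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drifted(baseline: dict, current: dict) -> bool:
    """True if the current schemas differ from the stored baseline."""
    return schema_fingerprint(baseline) != schema_fingerprint(current)


# Hypothetical schemas: the `sql` field is renamed to `statement`.
baseline = {"query": {"type": "object", "properties": {"sql": {"type": "string"}}}}
renamed = {"query": {"type": "object", "properties": {"statement": {"type": "string"}}}}

print(has_drifted(baseline, baseline))  # False
print(has_drifted(baseline, renamed))   # True
```

A hash only says that something changed. Reporting which field was renamed would need a structural diff on top of it.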
Map
Lineage shows which agents call which tools. Blast radius shows what breaks when a tool goes down. Impact alerts include affected agents and session counts.
postgres-mcp ❌ DOWN
Blast radius:
  support-agent   200 sessions/day   HIGH
  billing-agent    50 sessions/day   MEDIUM
  data-agent       10 sessions/day   LOW
Total: ~260 sessions/day affected
Circuit breaker: active
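Conceptually, blast radius is a reverse lookup over lineage data: given a tool, find every agent that calls it. The sketch below assumes a simple in-memory mapping. The `LINEAGE` structure, the `blast_radius` function, the extra tools per agent, and `docs-agent` are all hypothetical; only the three affected agents and their session counts come from the example above.

```python
# Hypothetical lineage data: agent -> (tools it calls, sessions per day).
LINEAGE = {
    "support-agent": ({"postgres-mcp", "jira-mcp", "slack-mcp"}, 200),
    "billing-agent": ({"postgres-mcp", "crm-mcp"}, 50),
    "data-agent": ({"postgres-mcp", "snowflake-mcp"}, 10),
    "docs-agent": ({"slack-mcp"}, 30),  # does not use postgres-mcp
}


def blast_radius(tool: str, lineage: dict) -> dict:
    """Map each agent that depends on `tool` to its daily session count."""
    return {
        agent: sessions
        for agent, (tools, sessions) in lineage.items()
        if tool in tools
    }


affected = blast_radius("postgres-mcp", LINEAGE)
print(sorted(affected))        # ['billing-agent', 'data-agent', 'support-agent']
print(sum(affected.values()))  # 260
```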
Get started
Zero to traced in 5 minutes.
Install & discover
30 seconds
pip install langsight
langsight init   # Auto-discovered 4 MCP servers
Instrument your agent
2 lines of code
from langsight.sdk import LangSightClient
client = LangSightClient(url="...")
traced = client.wrap(mcp, server_name="pg")
See everything
real-time
langsight sessions
langsight mcp-health
langsight security-scan
langsight costs --hours 24
And more
Built for production.
Prevention Guardrails (v0.3)
Loop detection, budget limits, and circuit breakers. Configure thresholds per-agent from the dashboard; no code change needed after initial SDK setup.
Multi-Agent Call Trees (Core)
parent_span_id links sub-agent calls across any depth. See the path from orchestrator to leaf tool.
Session Replay (v0.2)
Re-execute any session against live MCP servers. Compare two runs side-by-side to see what changed.
Anomaly Detection (v0.2)
Z-score analysis against 7-day baseline. Warning at |z| >= 2, critical at |z| >= 3. No manual thresholds.
Agent SLO Tracking (v0.2)
Define success_rate and latency_p99 targets per agent. Get alerted before you breach availability.
AI Root Cause Analysis (4 LLMs)
langsight investigate sends evidence to Claude, GPT-4o, Gemini, or Ollama and returns remediation steps.
Prometheus Metrics (v0.2)
Native /metrics endpoint. Plug into your existing Grafana stack. Request counts, latencies, SSE connections.
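The anomaly detection rule listed above (z-score against a baseline, warning at |z| >= 2, critical at |z| >= 3) can be sketched in a few lines. This is a generic illustration, not LangSight's code: the `classify` function, the use of the sample standard deviation, the handling of a zero-variance baseline, and the latency values are all assumptions.

```python
from statistics import mean, stdev


def classify(value: float, baseline: list[float]) -> str:
    """Return 'ok', 'warning' (|z| >= 2) or 'critical' (|z| >= 3)."""
    sigma = stdev(baseline)  # sample standard deviation (assumption)
    if sigma == 0:
        # Assumption: with a perfectly flat baseline, any deviation is critical.
        return "ok" if value == baseline[0] else "critical"
    z = (value - mean(baseline)) / sigma
    if abs(z) >= 3:
        return "critical"
    if abs(z) >= 2:
        return "warning"
    return "ok"


# Hypothetical daily p99 latencies (ms) over a 7-day baseline.
baseline = [100, 110, 90, 105, 95, 100, 100]

print(classify(104, baseline))  # ok
print(classify(115, baseline))  # warning
print(classify(142, baseline))  # critical
```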
Integrations
Drop into any framework.
Langfuse watches the brain. LangSight watches the hands.
Use alongside Langfuse, LangWatch, or LangSmith. They trace model reasoning. LangSight guards the tool layer — loops, budgets, health, security, blast radius.
Your data. Your infra. No vendor dependency.
Self-host on your own infrastructure. No data ever leaves your network. No paid tiers. No gated features. No usage limits.
Your data stays yours
PostgreSQL + ClickHouse via docker compose up. Both fully under your control. No telemetry phoning home.
No vendor lock-in
Apache 2.0 — fork it, modify it, embed it, sell it. No restrictions.
5-minute setup
One script generates secrets, starts 5 containers, seeds demo data. You're looking at traces before your coffee is ready.
Own the runtime layer of your agent systems.
If your agents depend on tools, LangSight keeps them reliable, safe, and within budget.
Prevent loops. Enforce budgets. Break failing tools. Map blast radius.