Setting SLOs for AI Agents: A Practical Guide
Your VP asks: "What is the reliability of our AI products?" You have no number to give: no success rate, no latency target, no error budget. Traditional SRE practice assumes deterministic services; AI agents are non-deterministic. Here is how to adapt SLOs for a world where the same input can produce different outputs.

Why AI agents need SLOs
Service Level Objectives (SLOs) are the foundation of reliability engineering. They define what "reliable enough" means for a service, expressed as a measurable target over a time window. Google's SRE book popularized the concept for traditional services: "99.9% of requests complete within 300ms over a rolling 30-day window."
AI agents need SLOs for the same reason traditional services do: without a measurable target, reliability is a subjective judgment. "The agent seems to be working fine" is not an engineering statement. "The agent has a 96.3% success rate with p99 latency of 12 seconds over the past 7 days" is.
But AI agents are fundamentally different from traditional services. They are non-deterministic — the same input can produce different outputs. They have variable execution paths — one session might take 2 tool calls, another might take 15. They have failure modes that traditional services do not — loops, hallucinations, budget overruns. Standard SLOs (uptime, latency, error rate) are necessary but not sufficient.
Four SLO metrics for AI agents
In production agent systems, four metrics capture the dimensions of reliability that matter most:
1. Success rate
Definition: The percentage of sessions that complete without failure. A session "succeeds" if the agent produces a final response without hitting a loop, budget limit, tool error, or timeout. A session "fails" if it is terminated by any of these guardrails.
Realistic targets: For well-tuned agents, 95-98% success rate is achievable. 99%+ is unrealistic for agents handling diverse, real-world inputs. Set the initial target conservatively (90%) and tighten as you gain confidence.
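Under this definition, evaluating the success rate reduces to classifying each session and counting. A minimal sketch; the `Session` shape and outcome labels are illustrative, not a LangSight API:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    # Hypothetical outcome labels matching the guardrails above:
    # "ok", "loop", "budget_exceeded", "tool_error", "timeout"
    outcome: str
    tags: tuple = field(default_factory=tuple)

def success_rate(sessions, exclude=("test", "internal")):
    """Fraction of non-excluded sessions that finished without a guardrail firing."""
    counted = [s for s in sessions if not any(t in exclude for t in s.tags)]
    if not counted:
        return None  # no data in the window
    ok = sum(1 for s in counted if s.outcome == "ok")
    return ok / len(counted)

sessions = [Session("ok"), Session("ok"), Session("loop"),
            Session("ok", tags=("test",))]  # test session is excluded
print(success_rate(sessions))  # 2 of 3 counted sessions succeed
```

The `exclude` filter matters in practice: test and internal traffic will otherwise inflate or deflate the rate your stakeholders see.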
```
# LangSight SLO definition
{
  "agent": "support-agent",
  "metric": "success_rate",
  "target": 0.95,                  # 95% of sessions succeed
  "window": "7d",                  # evaluated over rolling 7 days
  "exclude": ["test", "internal"]  # exclude test sessions
}
```

2. Latency (p99 end-to-end)
Definition: The 99th percentile end-to-end session duration — from the initial user request to the final agent response. This includes all LLM calls, all tool calls, and all processing time.
Realistic targets: Highly variable by use case. A simple FAQ agent should complete in under 5 seconds. A data analysis agent that makes 10 tool calls might take 30 seconds. Set the target based on your specific agent's expected behavior, not based on general benchmarks.
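The p99 figure itself can be computed with the nearest-rank method, which needs no interpolation; a small sketch:

```python
import math

def p99(durations_s):
    """Nearest-rank p99: the smallest duration >= 99% of samples."""
    ordered = sorted(durations_s)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]

# 100 sessions: 98 fast ones and a 2% slow tail.
durations = [2.0] * 98 + [20.0, 30.0]
print(p99(durations))  # the slow tail dominates the p99
```

Because a single slow session can move the p99 of a small sample, evaluate it over windows with enough traffic to be meaningful.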
```
{
  "agent": "support-agent",
  "metric": "latency_p99",
  "target_seconds": 15,  # 99% of sessions complete in < 15s
  "window": "7d"
}
```

3. Loop rate
Definition: The percentage of sessions that trigger loop detection. A high loop rate indicates that the agent is frequently getting stuck, which wastes tokens and produces poor user experiences even if the loop is detected and terminated cleanly.
Realistic targets: Below 2% for a well-tuned agent. If loop rate exceeds 5%, investigate the root causes — typically a specific tool that returns confusing responses or a prompt that does not handle edge cases well.
```
{
  "agent": "support-agent",
  "metric": "loop_rate",
  "target": 0.02,  # < 2% of sessions trigger loops
  "window": "7d"
}
```

4. Budget adherence
Definition: The percentage of sessions that complete within their configured cost budget. If the budget is $1 per session, budget adherence measures how many sessions actually cost less than $1.
Realistic targets: 98%+ is achievable with properly configured budgets. The 2% that exceed the budget should be outliers (complex queries that legitimately need more tool calls), not systemic issues (loops, wrong model selection).
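Once per-session costs are attributed, budget adherence is a straightforward count; a sketch with illustrative numbers:

```python
def budget_adherence(session_costs_usd, budget_usd=1.00):
    """Fraction of sessions whose cost stayed within the per-session budget."""
    within = sum(1 for c in session_costs_usd if c <= budget_usd)
    return within / len(session_costs_usd)

costs = [0.12, 0.35, 0.08, 1.40, 0.22]  # one over-budget outlier
print(budget_adherence(costs))  # 4 of 5 sessions within budget
```

The sessions that exceed the budget are worth inspecting individually: legitimate outliers look like long, productive tool-call chains, while systemic issues look like loops or a mistakenly expensive model.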
```
{
  "agent": "support-agent",
  "metric": "budget_adherence",
  "target": 0.98,     # 98% of sessions within budget
  "budget_usd": 1.00, # per-session budget
  "window": "7d"
}
```

Setting realistic targets
The most common mistake when setting SLOs for AI agents is setting targets that are too aggressive. Engineers coming from traditional SRE set 99.9% targets because that is what they set for APIs. But agents are non-deterministic. A 99.9% success rate for an AI agent means only 1 in 1,000 sessions fails — which requires near-perfect prompt engineering, tools that never fail, and users who never ask edge-case questions.
Start with these guidelines:
| Metric | Conservative | Standard | Aggressive |
|---|---|---|---|
| Success rate | 90% | 95% | 98% |
| Latency p99 | 30s | 15s | 8s |
| Loop rate | < 5% | < 2% | < 0.5% |
| Budget adherence | 95% | 98% | 99.5% |
Start with conservative targets, measure for 2-4 weeks, then tighten based on actual data. Do not start with aggressive targets — you will immediately be in SLO violation and the SLOs will lose credibility.
Evaluation windows
SLOs are evaluated over rolling time windows. The window length determines how sensitive the SLO is to recent events:
- 1-hour window — highly sensitive to recent issues. Useful for real-time dashboards and operational monitoring. A 30-minute outage immediately violates the SLO.
- 24-hour window — balances sensitivity with stability. Good for daily standup reporting. A brief outage affects the SLO but does not dominate it.
- 7-day window — the standard for SLO reporting. Smooths out transient issues. Shows the trend. This is the window your VP should see.
- 30-day window — used for error budget calculations and quarterly reviews. Shows the big picture.
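The effect of window length is easy to see by evaluating one metric over several windows at once; a sketch with synthetic timestamps:

```python
from datetime import datetime, timedelta

def windowed_success_rate(events, now, window):
    """events: list of (timestamp, succeeded) pairs; evaluate one rolling window."""
    recent = [ok for ts, ok in events if now - ts <= window]
    return sum(recent) / len(recent) if recent else None

now = datetime(2025, 1, 8)
# One session per hour for a week; every 20th session fails.
events = [(now - timedelta(hours=h), h % 20 != 0) for h in range(24 * 7)]

for label, window in [("1h", timedelta(hours=1)),
                      ("24h", timedelta(hours=24)),
                      ("7d", timedelta(days=7))]:
    # A failure in the last hour hurts the 1h window far more than the 7d one.
    print(label, windowed_success_rate(events, now, window))
```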
LangSight evaluates SLOs across all four windows simultaneously. The dashboard shows the current value for each window, so you can see both the real-time state (1h) and the trend (7d/30d).
Error budgets: what to do when SLOs breach
An error budget is the inverse of the SLO target: if success rate target is 95%, the error budget is 5% — you can tolerate 5% of sessions failing. When the error budget is exhausted (actual failure rate exceeds 5%), you need a response policy.
Borrowed from Google's SRE practices, error budget policies for agents should include:
- Alert escalation: When error budget drops below 50%, alert the team. Below 20%, page the on-call. At 0%, freeze deployments until the root cause is fixed.
- Deployment freeze: When error budget is exhausted, stop deploying new agent changes until the failure rate recovers. This prevents compounding issues.
- Postmortem trigger: When error budget is exhausted, automatically create a postmortem document with the relevant session data, failure categories, and timeline.
- Reliability sprint: If the error budget is consistently tight, allocate engineering time specifically to reliability improvements — better error handling, improved prompts, additional guardrails.
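The escalation ladder above can be expressed as a small policy function; the action names here are illustrative, not a real API:

```python
def policy_actions(budget_remaining):
    """Map remaining error-budget fraction to escalation actions."""
    actions = []
    if budget_remaining <= 0.50:
        actions.append("alert_team")     # below 50% remaining: alert the team
    if budget_remaining <= 0.20:
        actions.append("page_oncall")    # below 20%: page the on-call
    if budget_remaining <= 0.0:
        actions += ["freeze_deploys", "open_postmortem"]  # exhausted
    return actions

print(policy_actions(0.62))  # healthy: no action
print(policy_actions(0.15))  # alert and page
print(policy_actions(0.0))   # every action fires
```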
```yaml
# .langsight.yaml — error budget policy
slos:
  - agent: support-agent
    success_rate:
      target: 0.95
      window: 7d
    error_budget_policy:
      alert_at: [0.50, 0.20, 0.0]  # alert at 50%, 20%, 0% remaining
      freeze_deploys_at: 0.0       # freeze at budget exhaustion
      postmortem_at: 0.0           # auto-create postmortem
```

Reporting to stakeholders
The SLO dashboard is the answer to "what is the reliability of our AI products?" For each agent, it shows:
- Current success rate vs target (with trend arrow)
- Current latency p99 vs target
- Error budget remaining (as a percentage and as remaining failure count)
- Top failure categories (loop, tool error, budget exceeded, timeout)
- 7-day and 30-day trends
This gives stakeholders a quantitative, trustworthy answer. Not "the agent seems fine" but "the support agent has a 96.3% success rate against a 95% target, with 62% of its error budget remaining for the 30-day window."
Key takeaways
- AI agents need SLOs adapted for non-determinism. Traditional uptime is not enough. Track success rate, latency p99, loop rate, and budget adherence.
- Start conservative, tighten with data. Begin with 90% success rate, measure for 2-4 weeks, then raise the target based on actual performance. Do not start at 99%.
- Use multiple evaluation windows. 1-hour for operations, 7-day for team reporting, 30-day for error budgets. Each window serves a different audience.
- Error budget policies drive action. When the budget is exhausted, freeze deployments, trigger postmortems, and allocate reliability sprints. Without policies, SLOs are just numbers.
- SLOs answer the reliability question. When your VP asks "how reliable are our AI products?", the SLO dashboard provides a quantitative, defensible answer.
Related articles
- AI Agent Loop Detection — Loop rate is one of the four SLO metrics. Learn how to detect and prevent the most common agent failure mode.
- AI Agent Cost Attribution — Budget adherence is an SLO metric. Per-session cost tracking makes it measurable.
- How to Monitor MCP Servers in Production — MCP health data feeds into agent availability SLOs.
- Blast Radius Mapping — Understand dependencies to set appropriate per-agent SLO targets based on tool reliability.
Set and track SLOs for your agents
LangSight tracks success rate, latency, loop rate, and budget adherence across multiple time windows, with error budgets, automated alerting, and deployment freeze policies. Self-host free, Apache 2.0.
Get started →