Resilience & Testing in Agentic Systems
Agentic systems are inherently non-deterministic. How do you test, monitor, and trust something that gives different answers every time?
The Determinism Problem
Traditional software is deterministic:
f(x) = y (always the same result)
sort([3,1,2]) = [1,2,3] (every time, forever)
Agentic systems are non-deterministic:
agent("fix this bug") = result_1 (Monday)
agent("fix this bug") = result_2 (Tuesday, different approach)
agent("fix this bug") = result_3 (Wednesday, yet another way)
All three results might be correct, but they're different. This breaks traditional testing:
// Traditional test — works fine
expect(sort([3,1,2])).toEqual([1,2,3]);
// Agent test — this approach fails
expect(agent("fix the bug")).toEqual(exactExpectedCode);
// ❌ Agent produces valid but different code each time
The fundamental challenge: You can't test for specific outputs. You must test for correct behavior.
The Model Update Problem
What happens when the underlying model gets updated?
Scenario: You deploy an agent using Claude 3.5 Sonnet. It works perfectly. Then Anthropic releases a newer Sonnet model.
What can change:
- Response format might shift slightly
- Reasoning patterns may differ
- Tool call patterns could change
- Edge case handling may improve or regress
- Token usage (and costs) may change
What doesn't change:
- The API contract (input/output format)
- Tool definitions and capabilities
- Your system prompt and guardrails
Mitigation strategies:
- Pin model versions in production (e.g., claude-3-5-sonnet-20241022)
- Test new versions in staging before deploying
- Use semantic validation (does it work?) not syntactic validation (is it identical?)
- Maintain a test suite of representative scenarios
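Semantic validation can be sketched as a checklist of properties any correct answer must satisfy, rather than one exact expected string. The function and fixture names below are illustrative, not from a real framework:

```javascript
// Semantic validation: assert properties of the output, not its exact text.
function validateEndpointResult(result) {
  const checks = [
    // A correct result must create the route file...
    result.files.includes('routes/users.js'),
    // ...register a GET handler...
    /router\.get\(/.test(result.code),
    // ...and use ESM imports, not CommonJS require().
    !result.code.includes('require('),
  ];
  return checks.every(Boolean);
}

// Two syntactically different agent outputs can both pass:
const mondayRun = {
  files: ['routes/users.js'],
  code: "import { Router } from 'express';\nconst router = Router();\nrouter.get('/', list);",
};
const tuesdayRun = {
  files: ['routes/users.js', 'routes/index.js'],
  code: "import express from 'express';\nconst router = express.Router();\nrouter.get('/users', handler);",
};
console.log(validateEndpointResult(mondayRun));  // true
console.log(validateEndpointResult(tuesdayRun)); // true
```

A check like this survives a model update: as long as the new model's output still satisfies the properties, the test passes even if the code it writes looks different.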
Functional Testing for Agents
Test behaviors, not specific outputs:
Level 1 — Output validation:
// Does the agent produce valid output?
const result = await agent("Create a user API endpoint");
expect(result.files).toContain('routes/users.js');
expect(result.code).toContain('router.get');
expect(result.code).not.toContain('require('); // Should use ESM
Level 2 — Behavioral testing:
// Does the agent handle edge cases correctly?
const result = await agent("What's the status of order #999999");
// Order doesn't exist — agent should handle gracefully
expect(result.response).toContain("not found");
expect(result.actions).not.toContain("delete");
Level 3 — Integration testing:
// Does the full pipeline work end-to-end?
await agent("Add pagination to /api/users");
const response = await fetch('/api/users?page=2&limit=10');
expect(response.status).toBe(200);
const data = await response.json();
expect(data.items.length).toBeLessThanOrEqual(10);
The 100,000 Identical Requests Experiment
A thought experiment that reveals agent reliability:
What happens if you send the exact same request to an agent 100,000 times?
Expected results:
- 95,000 (95%) — correct, well-formatted response
- 3,000 (3%) — correct but unusual format
- 1,500 (1.5%) — partially correct, missing details
- 400 (0.4%) — wrong answer but plausible
- 100 (0.1%) — completely off-track or hallucinated
What this means for production:
- At 1,000 requests/day: ~1 bad response per day
- At 10,000 requests/day: ~10 bad responses per day
- At 100,000 requests/day: ~100 bad responses per day
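The arithmetic behind those numbers is just the off-track rate times daily volume; a one-line sketch:

```javascript
// Expected completely-bad responses per day, given the ~0.1%
// "completely off-track" rate from the experiment above.
const OFF_TRACK_RATE = 0.001; // 0.1%

function expectedBadResponses(requestsPerDay, rate = OFF_TRACK_RATE) {
  return requestsPerDay * rate;
}

console.log(expectedBadResponses(1_000));   // 1
console.log(expectedBadResponses(10_000));  // 10
console.log(expectedBadResponses(100_000)); // 100
```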
Mitigation:
- Output validation (catch structural issues)
- Confidence scoring (flag low-confidence responses)
- Human review queue (route uncertain responses to humans)
- Retry with different temperature/model (reduce randomness)
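The last two mitigations combine naturally: validate each response, retry at lower temperature, and escalate instead of returning junk. A minimal sketch, where `callAgent` and the escalation path stand in for your own client and infrastructure:

```javascript
// Retry-with-validation: re-ask the agent until the output passes
// validation, lowering temperature each attempt to reduce randomness.
async function callWithRetries(callAgent, validate, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await callAgent({ temperature: 0.7 / attempt });
    if (validate(result)) return { ok: true, result, attempt };
  }
  // All attempts failed validation: route to the human review queue
  // rather than returning an unvalidated response.
  return { ok: false, result: null, attempt: maxAttempts };
}
```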
Monitoring Agentic Systems
You can't debug what you can't see:
Key metrics to track:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Success rate | % of requests that produce valid output | < 95% |
| Latency (p95) | How long actions take | > 30 seconds |
| Cost per request | Budget health | > $0.50 |
| Tool call count | Efficiency (are agents looping?) | > 20 per request |
| Retry rate | How often agents self-correct | > 30% |
| Hallucination rate | Output quality | > 2% |
| Escalation rate | How often humans intervene | Trend up |
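The static thresholds in the table translate directly into an automated alert check (escalation rate is a trend, so it is left out of this sketch; field names are illustrative):

```javascript
// Turn the alert thresholds from the table into a single check
// that returns the list of metrics currently in violation.
function checkMetrics(m) {
  const alerts = [];
  if (m.successRate < 0.95) alerts.push('success_rate');
  if (m.latencyP95Ms > 30_000) alerts.push('latency_p95');
  if (m.costPerRequest > 0.5) alerts.push('cost_per_request');
  if (m.toolCallsPerRequest > 20) alerts.push('tool_call_count');
  if (m.retryRate > 0.3) alerts.push('retry_rate');
  if (m.hallucinationRate > 0.02) alerts.push('hallucination_rate');
  return alerts;
}
```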
Logging every agent step:
{
"request_id": "abc123",
"step": 3,
"action": "tool_call",
"tool": "read_file",
"args": { "path": "src/routes/users.js" },
"result": "success",
"tokens_used": 450,
"duration_ms": 230
}
This trace lets you reconstruct exactly what the agent did and where it went wrong.
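Emitting one record per step takes only a small helper. This sketch follows the field names in the example record above; the emit target is an assumption, so swap in your real logging pipeline:

```javascript
// One tracer per request; each call logs one agent step as JSON.
function makeTracer(requestId, emit = (line) => console.log(line)) {
  let step = 0;
  return function logStep(action, details) {
    const record = {
      request_id: requestId,
      step: ++step, // steps auto-increment within the request
      action,
      ...details,   // tool, args, result, tokens_used, duration_ms, ...
    };
    emit(JSON.stringify(record));
    return record;
  };
}

// Usage:
const trace = makeTracer('abc123');
trace('tool_call', {
  tool: 'read_file',
  args: { path: 'src/routes/users.js' },
  result: 'success',
});
```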
Building Resilience
Patterns for reliable agentic systems:
1. Circuit Breakers
If agent fails 3 times in 5 minutes:
→ Stop sending requests to that agent
→ Fallback to simpler (non-agentic) path
→ Alert on-call team
→ Retry after cooldown period
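The logic above fits in a small state machine. A minimal sketch, with the thresholds from the rule (3 failures in 5 minutes) as defaults and an injectable clock for testing:

```javascript
// Circuit breaker: trips after maxFailures within windowMs,
// blocks requests until cooldownMs has passed.
function makeCircuitBreaker({
  maxFailures = 3,
  windowMs = 5 * 60_000,
  cooldownMs = 60_000,
  now = () => Date.now(),
} = {}) {
  let failures = [];   // timestamps of recent failures
  let openedAt = null; // when the breaker tripped, or null if closed

  return {
    // Should the next request go to the agent, or the fallback path?
    allow() {
      if (openedAt === null) return true;
      if (now() - openedAt >= cooldownMs) {
        openedAt = null; // cooldown over: close the breaker and retry
        failures = [];
        return true;
      }
      return false; // breaker open: use the simpler, non-agentic path
    },
    recordFailure() {
      const t = now();
      failures = failures.filter((f) => t - f < windowMs);
      failures.push(t);
      if (failures.length >= maxFailures) openedAt = t; // trip
    },
    recordSuccess() {
      failures = []; // a success resets the failure window
    },
  };
}
```

Alerting the on-call team would hang off the trip in `recordFailure`; that side effect is omitted here.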
2. Guardrails
Before executing any tool call:
→ Is this tool allowed for this user/tenant?
→ Does this action exceed cost limits?
→ Is this a destructive action? (delete, overwrite)
→ If destructive: require confirmation or human approval
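Those checks can run as a single gate before every tool call. The policy shape (`allowedTools`, `costLimit`) and tool names below are hypothetical examples, not a standard API:

```javascript
// Tools that modify or destroy data require explicit human approval.
const DESTRUCTIVE_TOOLS = new Set(['delete_file', 'drop_table', 'overwrite_file']);

function checkToolCall(call, policy) {
  // Is this tool allowed for this user/tenant?
  if (!policy.allowedTools.includes(call.tool)) {
    return { allowed: false, reason: 'tool_not_allowed' };
  }
  // Does this action exceed cost limits?
  if (call.estimatedCost > policy.costLimit) {
    return { allowed: false, reason: 'cost_limit_exceeded' };
  }
  // Destructive actions need confirmation or human approval.
  if (DESTRUCTIVE_TOOLS.has(call.tool) && !call.humanApproved) {
    return { allowed: false, reason: 'needs_human_approval' };
  }
  return { allowed: true };
}
```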
3. Idempotency
Agent actions should be safe to retry:
→ Use upsert instead of insert
→ Check state before modifying
→ Log all actions for rollback
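A sketch of "check state before modifying" plus upsert, with an in-memory `Map` standing in for a real database, so retrying the same action is always safe:

```javascript
// Idempotent upsert: a retried action either no-ops or converges
// on the same state — it never creates duplicates.
function upsertUser(store, user) {
  const existing = store.get(user.id);
  if (existing && existing.email === user.email) {
    return { changed: false }; // already in the desired state: no-op
  }
  store.set(user.id, user);    // insert or update, never duplicate
  return { changed: true };
}
```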
4. Graceful degradation
If frontier model is down:
→ Fall back to a smaller model
→ If all models are down: return cached response or error
→ Never crash silently
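The degradation chain above can be sketched as an ordered list of model callers, a cache fallback, and an explicit error as the last resort (the model list and cache here are illustrative):

```javascript
// Try each model in order; fall back to cache; never crash silently.
async function respondWithFallback(prompt, models, cache) {
  for (const callModel of models) {
    try {
      return { source: 'model', text: await callModel(prompt) };
    } catch {
      // This model is down: try the next (smaller/cheaper) one.
    }
  }
  if (cache.has(prompt)) {
    return { source: 'cache', text: cache.get(prompt) };
  }
  // Explicit, typed error the caller can handle — not a silent crash.
  return { source: 'error', text: 'All models unavailable' };
}
```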
The Testing Pyramid for Agents
Adapted from traditional software testing:
            ╱╲
           ╱  ╲  Human review
          ╱ 5% ╲  (production, sampled)
         ╱──────╲
        ╱        ╲  End-to-end scenarios
       ╱   15%    ╲  (staging, full pipeline)
      ╱────────────╲
     ╱              ╲  Behavioral tests
    ╱       30%      ╲  (automated, key behaviors)
   ╱──────────────────╲
  ╱                    ╲  Unit tests on tools & prompts
 ╱          50%         ╲  (fast, comprehensive)
╱────────────────────────╲
Bottom layer (50%): Test individual tools, prompt templates, and parsing logic. These are deterministic and fast.
Middle layers (45%): Test agent behaviors against scenarios. "Given this input, the output should contain X and not contain Y."
Top layer (5%): Sample production traffic for human review. Catch issues that automated tests miss.
---quiz
question: Why can't you use traditional unit tests for agentic systems?
options:
- { text: "Because agents are too fast to test", correct: false }
- { text: "Because agents are non-deterministic — they produce valid but different outputs each time", correct: true }
- { text: "Because agents don't have functions to test", correct: false }
feedback: Agents produce different but valid outputs for the same input. Traditional tests compare against exact expected values, which fails for non-deterministic systems. Instead, test for correct behavior (does it work?) rather than specific output (is it identical?).
---quiz
question: What happens if you send 100,000 identical requests to a well-built agent?
options:
- { text: "All 100,000 responses will be identical", correct: false }
- { text: "About 95% will be correct, with ~0.1% completely off-track — requiring monitoring and validation", correct: true }
- { text: "Most will fail due to rate limiting", correct: false }
feedback: Even well-built agents have a small failure rate (~0.1% completely wrong, ~1.5% partially correct). At scale, this means you need output validation, confidence scoring, and human review queues to catch the inevitable bad responses.
---quiz
question: What is the most important monitoring metric for an agentic system?
options:
- { text: "Number of API calls made", correct: false }
- { text: "The combination of success rate, latency, cost, and tool call count", correct: true }
- { text: "The model version being used", correct: false }
feedback: No single metric tells the full story. Success rate catches quality issues, latency catches performance problems, cost catches budget overruns, and tool call count catches agents stuck in loops. Monitor all of them together.