Resilience & Testing in Agentic Systems
Agentic systems are inherently non-deterministic. How do you test, monitor, and trust something that gives different answers every time?
The Determinism Problem
Traditional software is deterministic:
f(x) = y (always the same result)
sort([3,1,2]) = [1,2,3] (every time, forever)
Agentic systems are non-deterministic:
agent("fix this bug") = result_1 (Monday)
agent("fix this bug") = result_2 (Tuesday, different approach)
agent("fix this bug") = result_3 (Wednesday, yet another way)
All three results might be correct, but they're different. This breaks traditional testing:
// Traditional test — works fine
expect(sort([3,1,2])).toEqual([1,2,3]);
// Agent test — this approach fails
expect(agent("fix the bug")).toEqual(exactExpectedCode);
// ❌ Agent produces valid but different code each time
The fundamental challenge: You can't test for specific outputs. You must test for correct behavior.
The Model Update Problem
What happens when the underlying model gets updated?
Scenario: You deploy an agent using Claude 3.5 Sonnet. It works perfectly. Then Anthropic releases a newer Sonnet model.
What can change:
- Response format might shift slightly
- Reasoning patterns may differ
- Tool call patterns could change
- Edge case handling may improve or regress
- Token usage (and costs) may change
What doesn't change:
- The API contract (input/output format)
- Tool definitions and capabilities
- Your system prompt and guardrails
Mitigation strategies:
- Pin model versions in production (e.g., claude-3-5-sonnet-20241022)
- Test new versions in staging before deploying
- Use semantic validation (does it work?) not syntactic validation (is it identical?)
- Maintain a test suite of representative scenarios
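Semantic validation can be sketched as a checklist of properties any correct answer must satisfy, rather than one exact expected string. The function and fixture names below are illustrative, not from a real framework:

```javascript
// Semantic validation: assert properties of the output, not its exact text.
function validateEndpointResult(result) {
  const checks = [
    // A correct result must create the route file...
    result.files.includes('routes/users.js'),
    // ...register a GET handler...
    /router\.get\(/.test(result.code),
    // ...and use ESM imports, not CommonJS require().
    !result.code.includes('require('),
  ];
  return checks.every(Boolean);
}

// Two syntactically different agent outputs can both pass:
const mondayRun = {
  files: ['routes/users.js'],
  code: "import { Router } from 'express';\nconst router = Router();\nrouter.get('/', list);",
};
const tuesdayRun = {
  files: ['routes/users.js', 'routes/index.js'],
  code: "import express from 'express';\nconst router = express.Router();\nrouter.get('/users', handler);",
};
console.log(validateEndpointResult(mondayRun));  // true
console.log(validateEndpointResult(tuesdayRun)); // true
```

A check like this survives a model update: as long as the new model's output still satisfies the properties, the test passes even if the code it writes looks different.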
Functional Testing for Agents
Test behaviors, not specific outputs:
Level 1 — Output validation:
// Does the agent produce valid output?
const result = await agent("Create a user API endpoint");
expect(result.files).toContain('routes/users.js');
expect(result.code).toContain('router.get');
expect(result.code).not.toContain('require('); // Should use ESM
Level 2 — Behavioral testing:
// Does the agent handle edge cases correctly?
const result = await agent("What's the status of order #999999");
// Order doesn't exist — agent should handle gracefully
expect(result.response).toContain("not found");
expect(result.actions).not.toContain("delete");
Level 3 — Integration testing:
// Does the full pipeline work end-to-end?
await agent("Add pagination to /api/users");
const response = await fetch('/api/users?page=2&limit=10');
expect(response.status).toBe(200);
const data = await response.json();
expect(data.items.length).toBeLessThanOrEqual(10);
The 100,000 Identical Requests Experiment
A thought experiment that reveals agent reliability:
What happens if you send the exact same request to an agent 100,000 times?
Expected results:
- 95,000 (95%) — correct, well-formatted response
- 3,000 (3%) — correct but unusual format
- 1,500 (1.5%) — partially correct, missing details
- 400 (0.4%) — wrong answer but plausible
- 100 (0.1%) — completely off-track or hallucinated
What this means for production:
- At 1,000 requests/day: ~1 bad response per day
- At 10,000 requests/day: ~10 bad responses per day
- At 100,000 requests/day: ~100 bad responses per day
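The arithmetic behind those numbers is just the off-track rate times daily volume; a one-line sketch:

```javascript
// Expected completely-bad responses per day, given the ~0.1%
// "completely off-track" rate from the experiment above.
const OFF_TRACK_RATE = 0.001; // 0.1%

function expectedBadResponses(requestsPerDay, rate = OFF_TRACK_RATE) {
  return requestsPerDay * rate;
}

console.log(expectedBadResponses(1_000));   // 1
console.log(expectedBadResponses(10_000));  // 10
console.log(expectedBadResponses(100_000)); // 100
```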
Mitigation:
- Output validation (catch structural issues)
- Confidence scoring (flag low-confidence responses)
- Human review queue (route uncertain responses to humans)
- Retry with different temperature/model (reduce randomness)
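The last two mitigations combine naturally: validate each response, retry at lower temperature, and escalate instead of returning junk. A minimal sketch, where `callAgent` and the escalation path stand in for your own client and infrastructure:

```javascript
// Retry-with-validation: re-ask the agent until the output passes
// validation, lowering temperature each attempt to reduce randomness.
async function callWithRetries(callAgent, validate, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await callAgent({ temperature: 0.7 / attempt });
    if (validate(result)) return { ok: true, result, attempt };
  }
  // All attempts failed validation: route to the human review queue
  // rather than returning an unvalidated response.
  return { ok: false, result: null, attempt: maxAttempts };
}
```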
Monitoring Agentic Systems
You can't debug what you can't see:
Key metrics to track:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Success rate | % of requests that produce valid output | < 95% |
| Latency (p95) | How long actions take | > 30 seconds |
| Cost per request | Budget health | > $0.50 |
| Tool call count | Efficiency (are agents looping?) | > 20 per request |
| Retry rate | How often agents self-correct | > 30% |
| Hallucination rate | Output quality | > 2% |
| Escalation rate | How often humans intervene | Trend up |
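The static thresholds in the table translate directly into an automated alert check (escalation rate is a trend, so it is left out of this sketch; field names are illustrative):

```javascript
// Turn the alert thresholds from the table into a single check
// that returns the list of metrics currently in violation.
function checkMetrics(m) {
  const alerts = [];
  if (m.successRate < 0.95) alerts.push('success_rate');
  if (m.latencyP95Ms > 30_000) alerts.push('latency_p95');
  if (m.costPerRequest > 0.5) alerts.push('cost_per_request');
  if (m.toolCallsPerRequest > 20) alerts.push('tool_call_count');
  if (m.retryRate > 0.3) alerts.push('retry_rate');
  if (m.hallucinationRate > 0.02) alerts.push('hallucination_rate');
  return alerts;
}
```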
Logging every agent step:
{
"request_id": "abc123",
"step": 3,
"action": "tool_call",
"tool": "read_file",
"args": { "path": "src/routes/users.js" },
"result": "success",
"tokens_used": 450,
"duration_ms": 230
}
This trace lets you reconstruct exactly what the agent did and where it went wrong.
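Emitting one record per step takes only a small helper. This sketch follows the field names in the example record above; the emit target is an assumption, so swap in your real logging pipeline:

```javascript
// One tracer per request; each call logs one agent step as JSON.
function makeTracer(requestId, emit = (line) => console.log(line)) {
  let step = 0;
  return function logStep(action, details) {
    const record = {
      request_id: requestId,
      step: ++step, // steps auto-increment within the request
      action,
      ...details,   // tool, args, result, tokens_used, duration_ms, ...
    };
    emit(JSON.stringify(record));
    return record;
  };
}

// Usage:
const trace = makeTracer('abc123');
trace('tool_call', {
  tool: 'read_file',
  args: { path: 'src/routes/users.js' },
  result: 'success',
});
```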
Building Resilience
Patterns for reliable agentic systems:
1. Circuit Breakers
If agent fails 3 times in 5 minutes:
→ Stop sending requests to that agent
→ Fallback to simpler (non-agentic) path
→ Alert on-call team
→ Retry after cooldown period
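The logic above fits in a small state machine. A minimal sketch, with the thresholds from the rule (3 failures in 5 minutes) as defaults and an injectable clock for testing:

```javascript
// Circuit breaker: trips after maxFailures within windowMs,
// blocks requests until cooldownMs has passed.
function makeCircuitBreaker({
  maxFailures = 3,
  windowMs = 5 * 60_000,
  cooldownMs = 60_000,
  now = () => Date.now(),
} = {}) {
  let failures = [];   // timestamps of recent failures
  let openedAt = null; // when the breaker tripped, or null if closed

  return {
    // Should the next request go to the agent, or the fallback path?
    allow() {
      if (openedAt === null) return true;
      if (now() - openedAt >= cooldownMs) {
        openedAt = null; // cooldown over: close the breaker and retry
        failures = [];
        return true;
      }
      return false; // breaker open: use the simpler, non-agentic path
    },
    recordFailure() {
      const t = now();
      failures = failures.filter((f) => t - f < windowMs);
      failures.push(t);
      if (failures.length >= maxFailures) openedAt = t; // trip
    },
    recordSuccess() {
      failures = []; // a success resets the failure window
    },
  };
}
```

Alerting the on-call team would hang off the trip in `recordFailure`; that side effect is omitted here.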
2. Guardrails
Before executing any tool call:
→ Is this tool allowed for this user/tenant?
→ Does this action exceed cost limits?
→ Is this a destructive action? (delete, overwrite)
→ If destructive: require confirmation or human approval
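Those checks can run as a single gate before every tool call. The policy shape (`allowedTools`, `costLimit`) and tool names below are hypothetical examples, not a standard API:

```javascript
// Tools that modify or destroy data require explicit human approval.
const DESTRUCTIVE_TOOLS = new Set(['delete_file', 'drop_table', 'overwrite_file']);

function checkToolCall(call, policy) {
  // Is this tool allowed for this user/tenant?
  if (!policy.allowedTools.includes(call.tool)) {
    return { allowed: false, reason: 'tool_not_allowed' };
  }
  // Does this action exceed cost limits?
  if (call.estimatedCost > policy.costLimit) {
    return { allowed: false, reason: 'cost_limit_exceeded' };
  }
  // Destructive actions need confirmation or human approval.
  if (DESTRUCTIVE_TOOLS.has(call.tool) && !call.humanApproved) {
    return { allowed: false, reason: 'needs_human_approval' };
  }
  return { allowed: true };
}
```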
3. Idempotency
Agent actions should be safe to retry:
→ Use upsert instead of insert
→ Check state before modifying
→ Log all actions for rollback
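A sketch of "check state before modifying" plus upsert, with an in-memory `Map` standing in for a real database, so retrying the same action is always safe:

```javascript
// Idempotent upsert: a retried action either no-ops or converges
// on the same state — it never creates duplicates.
function upsertUser(store, user) {
  const existing = store.get(user.id);
  if (existing && existing.email === user.email) {
    return { changed: false }; // already in the desired state: no-op
  }
  store.set(user.id, user);    // insert or update, never duplicate
  return { changed: true };
}
```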
4. Graceful degradation
If frontier model is down:
→ Fall back to a smaller model
→ If all models are down: return cached response or error
→ Never crash silently
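The degradation chain above can be sketched as an ordered list of model callers, a cache fallback, and an explicit error as the last resort (the model list and cache here are illustrative):

```javascript
// Try each model in order; fall back to cache; never crash silently.
async function respondWithFallback(prompt, models, cache) {
  for (const callModel of models) {
    try {
      return { source: 'model', text: await callModel(prompt) };
    } catch {
      // This model is down: try the next (smaller/cheaper) one.
    }
  }
  if (cache.has(prompt)) {
    return { source: 'cache', text: cache.get(prompt) };
  }
  // Explicit, typed error the caller can handle — not a silent crash.
  return { source: 'error', text: 'All models unavailable' };
}
```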
The Testing Pyramid for Agents
Adapted from traditional software testing:
            ╱╲
           ╱  ╲  Human review
          ╱ 5% ╲  (production, sampled)
         ╱──────╲
        ╱        ╲  End-to-end scenarios
       ╱   15%    ╲  (staging, full pipeline)
      ╱────────────╲
     ╱              ╲  Behavioral tests
    ╱       30%      ╲  (automated, key behaviors)
   ╱──────────────────╲
  ╱                    ╲  Unit tests on tools & prompts
 ╱          50%         ╲  (fast, comprehensive)
╱────────────────────────╲
Bottom layer (50%): Test individual tools, prompt templates, and parsing logic. These are deterministic and fast.
Middle layers (45%): Test agent behaviors against scenarios. "Given this input, the output should contain X and not contain Y."
Top layer (5%): Sample production traffic for human review. Catch issues that automated tests miss.
---quiz
question: Why can't you use traditional unit tests for agentic systems?
options:
- { text: "Because agents are too fast to test", correct: false }
- { text: "Because agents are non-deterministic — they produce valid but different outputs each time", correct: true }
- { text: "Because agents don't have functions to test", correct: false }
feedback: Agents produce different but valid outputs for the same input. Traditional tests compare against exact expected values, which fails for non-deterministic systems. Instead, test for correct behavior (does it work?) rather than specific output (is it identical?).
---quiz
question: What happens if you send 100,000 identical requests to a well-built agent?
options:
- { text: "All 100,000 responses will be identical", correct: false }
- { text: "About 95% will be correct, with ~0.1% completely off-track — requiring monitoring and validation", correct: true }
- { text: "Most will fail due to rate limiting", correct: false }
feedback: Even well-built agents have a small failure rate (~0.1% completely wrong, ~1.5% partially correct). At scale, this means you need output validation, confidence scoring, and human review queues to catch the inevitable bad responses.
---quiz
question: What is the most important monitoring metric for an agentic system?
options:
- { text: "Number of API calls made", correct: false }
- { text: "The combination of success rate, latency, cost, and tool call count", correct: true }
- { text: "The model version being used", correct: false }
feedback: No single metric tells the full story. Success rate catches quality issues, latency catches performance problems, cost catches budget overruns, and tool call count catches agents stuck in loops. Monitor all of them together.