Unit Testing
Testing the deterministic pieces around the model in isolation — tool functions, schema parsing, prompt builders, output validators — with classic fast, exact-match assertions.
Why it matters
Most of an agent is ordinary code, and it should be tested like ordinary code. The model is nondeterministic, but the parse_tool_args, the calculator tool, and the JSON-schema validator are not — they have one right answer and belong in a fast unit suite. Pushing every check into slow, flaky end-to-end flow tests is a classic anti-pattern; unit tests catch the boring 80% of bugs in milliseconds.
How it works
Identify the deterministic surface, isolate it, and assert exact behavior. Mock the LLM so no network call happens.
| Test target | Assertion |
|---|---|
| Tool function | add(2,3) == 5, errors raise |
| Arg parsing | bad JSON → handled, not crash |
| Output validator | rejects malformed schema |
| Prompt builder | renders expected string |
| Router (mock LLM) | tool name → correct dispatch |
- Mock the model. Replace the LLM client with a stub returning a fixed
tool_use/JSON, so you test your dispatch and parsing logic deterministically — no tokens, no flakiness. - Test the tools directly. A DB or API tool is just a function; unit-test its happy path, error path, and validation independently of any agent.
- Validate structured output. Assert the parser accepts good payloads and rejects malformed ones — the guard that stops a bad model response crashing the loop.
- Fast and hermetic. No network, runs in CI on every commit; this is where the test suite gives instant feedback.
Example
Unit-testing the dispatch layer with a stubbed model:
def test_router_dispatches_to_calculator():
fake_llm = Stub(returns={"tool":"calculator","args":{"expr":"2+2"}})
agent = Agent(llm=fake_llm, tools={"calculator": calc})
out = agent.step("what is 2+2")
assert out.tool_called == "calculator"
assert out.result == 4 # deterministic, no real LLM
def test_parse_rejects_bad_json():
with pytest.raises(ToolArgError):
parse_tool_args('{"expr": ') # malformed → handled errorThese run in milliseconds and never touch a provider, unlike a flow test.
Pitfalls
- Trying to unit-test the LLM’s words. Output text is nondeterministic — don’t assert exact phrasings here; use metric-based eval instead.
- Real API calls in “unit” tests. That makes them integration tests: slow, flaky, costly — mock the client.
- Skipping error paths. Tools fail (timeouts, bad input); test the failure branch, since that’s what the agent must recover from.
- Over-mocking. Mocking so much that the test only verifies the mock proves nothing; test real logic, stub only the model and network.