Unit Testing

Testing the deterministic pieces around the model in isolation — tool functions, schema parsing, prompt builders, output validators — with classic fast, exact-match assertions.

Why it matters

Most of an agent is ordinary code, and it should be tested like ordinary code. The model is nondeterministic, but the parse_tool_args, the calculator tool, and the JSON-schema validator are not — they have one right answer and belong in a fast unit suite. Pushing every check into slow, flaky end-to-end flow tests is a classic anti-pattern; unit tests catch the boring 80% of bugs in milliseconds.

How it works

Identify the deterministic surface, isolate it, and assert exact behavior. Mock the LLM so no network call happens.

Test targetAssertion
Tool functionadd(2,3) == 5, errors raise
Arg parsingbad JSON → handled, not crash
Output validatorrejects malformed schema
Prompt builderrenders expected string
Router (mock LLM)tool name → correct dispatch
  • Mock the model. Replace the LLM client with a stub returning a fixed tool_use/JSON, so you test your dispatch and parsing logic deterministically — no tokens, no flakiness.
  • Test the tools directly. A DB or API tool is just a function; unit-test its happy path, error path, and validation independently of any agent.
  • Validate structured output. Assert the parser accepts good payloads and rejects malformed ones — the guard that stops a bad model response crashing the loop.
  • Fast and hermetic. No network, runs in CI on every commit; this is where the test suite gives instant feedback.

Example

Unit-testing the dispatch layer with a stubbed model:

def test_router_dispatches_to_calculator():
    fake_llm = Stub(returns={"tool":"calculator","args":{"expr":"2+2"}})
    agent = Agent(llm=fake_llm, tools={"calculator": calc})
 
    out = agent.step("what is 2+2")
    assert out.tool_called == "calculator"
    assert out.result == 4               # deterministic, no real LLM
 
def test_parse_rejects_bad_json():
    with pytest.raises(ToolArgError):
        parse_tool_args('{"expr": ')      # malformed → handled error

These run in milliseconds and never touch a provider, unlike a flow test.

Pitfalls

  • Trying to unit-test the LLM’s words. Output text is nondeterministic — don’t assert exact phrasings here; use metric-based eval instead.
  • Real API calls in “unit” tests. That makes them integration tests: slow, flaky, costly — mock the client.
  • Skipping error paths. Tools fail (timeouts, bad input); test the failure branch, since that’s what the agent must recover from.
  • Over-mocking. Mocking so much that the test only verifies the mock proves nothing; test real logic, stub only the model and network.

See also