Testing your agent
Agents at Work — CC BY 4.0
Guardrails are a claim: this agent behaves. Testing is how you find out whether the claim is true — before the agent is loose on real work, and then again as the tools underneath it change. It’s Anchor 2, continuous improvement, in its most concrete form: you don’t trust the agent because you built it carefully; you trust it because you checked, and you keep checking.
There are two kinds of test, and which you need depends on what the agent touches.
Test one — accuracy spot-checks (for agents that handle your work)
For a Bookkeeper reconciling accounts, a Competitive Analyst pulling prices, a Market Analyst summarising trends, the failure is a wrong figure or a laundered guess. The test is proportionate checking: take a sample of the agent’s output and verify it against the source yourself.
- Sample deliberately. A handful every run, plus the edge cases — the biggest numbers, the odd-looking ones, the ones a mistake would cost most.
- Check against the source, not the agent’s own summary. The point is to catch the agent being confidently wrong, and the agent can’t do that for itself.
- Check in proportion to the cost of a mistake. A misfiled internal note needs a glance; a figure going to a customer or the tax department needs real checking.
None of this is exotic. It’s the verification habit from your guardrails, applied to a worker who never gets tired and never tells you when it’s unsure unless you built it to.
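One way to make the sampling habit repeatable is a small helper that picks which records to check by hand. This is a minimal sketch, not a prescribed method: the function name `pick_spot_checks`, the `amount` field, and the default sample sizes are all assumptions made for illustration. It always includes the largest figures (the edge cases where a mistake costs most) and adds a random handful from the rest.

```python
import random


def pick_spot_checks(records, amount_key="amount", n_random=5, n_largest=3, seed=None):
    """Return records to verify by hand against the source.

    Hypothetical helper: `records` is a list of dicts produced by the agent,
    and `amount_key` names the numeric field whose size signals risk.
    """
    rng = random.Random(seed)
    # Rank record positions by absolute size, biggest first.
    order = sorted(
        range(len(records)),
        key=lambda i: abs(records[i][amount_key]),
        reverse=True,
    )
    edge_cases = order[:n_largest]   # the biggest numbers, always checked
    remainder = order[n_largest:]    # everything else is sampled at random
    sampled = rng.sample(remainder, min(n_random, len(remainder)))
    return [records[i] for i in edge_cases + sorted(sampled)]
```

The helper only chooses what to look at. The checking itself still means opening the source document and comparing the figure yourself.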
Test two — adverse-impact testing (for agents that affect people)
Here’s the test that matters most and gets done least. When an agent judges people — the Recruiter, or anything that sorts, scores, or filters humans — you cannot tell whether it’s fair by looking at it. Tier 2 showed why: bias rides in on proxies you redacted around, and 60% of people miss a 10% skew sitting right in front of them. Eyeballing doesn’t work. Measuring does.
The name-swap probe — run it on your own agent. Take one application. Run it through. Now change only the name — swap a male name for a female one, an obviously Pākehā name for an obviously Māori or Pasifika or Asian one — change nothing else, and run it again. Does the score move? Do it across a batch. If identity you thought you’d stripped still moves the outcome, you’ve just watched the proxy leakage from Tier 2 happen in your own build. Document what you find.
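The probe above can be sketched as a short loop. This is an illustration under stated assumptions: `score` stands in for however you call your own agent and get a number back, each application is assumed to be a dict with a `name` field, and `name_pairs` is a list of name pairs you choose yourself. None of these names come from a real library.

```python
def name_swap_probe(applications, name_pairs, score, tolerance=0.0):
    """Score each application under paired names and report where the score moves.

    Hypothetical sketch: `score` is a callable you supply that runs one
    application (a dict) through your agent and returns a number.
    """
    findings = []
    for index, application in enumerate(applications):
        for name_a, name_b in name_pairs:
            # Change only the name; every other field stays identical.
            score_a = score({**application, "name": name_a})
            score_b = score({**application, "name": name_b})
            delta = score_a - score_b
            if abs(delta) > tolerance:
                findings.append({
                    "application": index,
                    "names": (name_a, name_b),
                    "scores": (score_a, score_b),
                    "delta": delta,
                })
    return findings
```

An empty result means no pair moved the score by more than `tolerance`. If your agent gives slightly different scores on identical input, it is probably worth running each pair several times and comparing averages, so that ordinary run-to-run noise is not mistaken for leakage. Either way, keep the findings list as your documentation.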
Adverse-impact testing — measure the outcomes across groups. Don’t test the agent’s intentions; test its results. Look at who it advances and who it filters out, broken down by group, over a real batch. If one group is selected at a much lower rate than another, you have adverse impact — regardless of whether anyone intended it, and regardless of how fair the criteria looked on paper.
A widely used rule of thumb for “much lower” is the four-fifths (80%) rule: if a group’s selection rate is under 80% of the highest group’s, that’s the established flag for adverse impact. Be clear about what this is: a diagnostic from US employment-law practice, useful as a measuring stick but not New Zealand law. In New Zealand the legal frame is indirect discrimination under the Human Rights Act 1993 (HRA): a practice that’s neutral on its face but falls disproportionately on a protected group can be unlawful even with no intent to discriminate. The four-fifths rule is a handy way to notice the problem; the HRA is the reason it matters here. (General education, not legal advice.)
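The arithmetic is simple enough to script. As a worked case: if the agent advances 50% of one group and 35% of another, the ratio is 0.35 / 0.50 = 0.70, which is under 0.8 and so gets flagged. The sketch below assumes you can export the agent's decisions as (group, selected) pairs; the function names and that data shape are illustrative assumptions, not part of any standard tool.

```python
from collections import Counter


def selection_rates(outcomes):
    """Turn an iterable of (group, was_selected) pairs into {group: selection rate}."""
    totals, selected = Counter(), Counter()
    for group, was_selected in outcomes:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}


def four_fifths_flags(rates, threshold=0.8):
    """Return {group: impact ratio} for each group under the threshold.

    The impact ratio is a group's selection rate divided by the highest
    group's selection rate.
    """
    highest = max(rates.values())
    if highest == 0:
        return {}  # nobody was selected, so there is nothing to compare
    return {
        group: rate / highest
        for group, rate in rates.items()
        if rate / highest < threshold
    }
```

Treat a flag as a prompt to investigate, not a verdict. With only a handful of applicants per group, the rates swing widely on one or two decisions, so the measurement means more over a real batch.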
What the test is for
Testing isn’t a gate you pass once and forget. It’s the thing that:
- Catches proxy leakage design can’t — the only way to see the bias that survived your redaction.
- Makes the human gate real — a reviewer backed by “we measured, and it’s skewing against this group” can actually resist automation bias. A reviewer with only a glance can’t.
- Keeps up with a moving field — the models under your agent change. A test you can re-run is how you know last month’s “fine” is still fine.
And sometimes the test is the thing that tells you to stop. If you measure, and the skew won’t come out no matter what you adjust, that’s not a failed build — that’s the build teaching you the honest answer the Recruiter is about: some decisions about people shouldn’t be automated at all.
Take a people-affecting agent you might build. Could you actually run the name-swap probe on it — do you have the data, and would you look at the result honestly if it came back skewed? If not, that’s worth knowing before you build it, not after.
Next
Enough theory — we build. Two agents with Claude Code: one you build to work, and one you build to watch fail.