Guardrails you can write
Agents at Work — CC BY 4.0
If you took the Working with Claude course, you wrote yourself four rules: how you’ll verify, what you won’t paste, review before it’s sent, and staying the one who decides. Two of those lived as habits — things you do, in the moment, at the keyboard.
An agent breaks that arrangement, and it’s worth being clear about how. The agent acts unwatched. You aren’t at the keyboard when it runs; there is no moment for a habit to fire. So every guardrail that used to live in your head now has to be written into the agent and enforced by the setup — because “I’ll check it at the time” isn’t available when there’s no one there at the time.
That’s the whole lesson: the same four rules, moved out of your head and into the build.
From habits to written rules
1. What it must verify — and how it flags doubt. Your verification habit becomes the agent’s instructions. Tell it, in writing, to show its evidence, to say plainly when it isn’t sure, and never to fill a gap with a confident guess. An agent that returns “I couldn’t find this — flagging for you” is doing its job; one that invents a plausible answer to seem finished is the dangerous one. You can’t verify in the moment, so you build an agent that surfaces what needs verifying.
2. What it must never touch or send. Your “what I won’t paste” rule becomes a hard limit in the wiring — and this is where Tier 2’s least-privilege work becomes a written rule, not just a setting. Spell out what’s off-limits: personal data it may not send onward, accounts it may not write to, actions it may not take. With a person, “don’t paste client data” is a reminder. With an agent, it has to be a wall, because there’s no one to remind.
3. Stop and wait — the gate as a rule it can’t pass. “Review before it’s sent” stops being something you remember to do and becomes a hard stop the agent cannot walk through. The binding verbs — send, pay, post, reject, commit — sit on the far side of a gate where the agent prepares and then waits for a person. Not “usually checks with me.” Cannot proceed without me. Build it so the default is stop, and going forward takes a human.
4. It stays advisory where it matters. “Stay the one deciding” is the rule underneath all of them, and it’s Anchor 3 — you answer for it. Where a call carries weight — money, someone’s rights, a written commitment — the agent surfaces evidence and a person decides. If you couldn’t defend an outcome without saying “the agent did it,” the guardrail wasn’t there.
Why written-and-enforced beats well-intentioned
There’s a reason to prefer a rule the system holds over one you mean to keep. A written guardrail can be read, checked, handed to whoever runs the agent next, and — the point of the next lesson — tested. A good intention can’t be any of those things. When an agent runs a hundred times while you sleep, “I’ll keep an eye on it” is not a guardrail; the rules you wrote into it and the stops you built are.
And write them so you’d stand behind each one. If a guardrail is something you couldn’t say out loud to the person the agent affects, that’s the signal to change the guardrail, not to hide it.
The build move
Before an agent goes anywhere near live work, write its guardrails down — plainly, in four headings you can defend:
- It verifies / flags doubt by… (shows evidence, says “not sure”, never guesses)
- It must never… (the off-limits data and actions — the wall)
- It stops and waits before… (the binding verbs — the gate)
- A person decides… (which calls stay human)
Keep it short enough to read in a minute and specific enough to test. These four are the contract the agent runs under — and, not by accident, they’re your own four anchors turned into build rules: learn what it does, improve it as you go, keep it good, and make sure it serves the person on the other side.
Take the agent you’d build. Write its “it must never…” line — the single action that, if it took it unprompted, you’d most regret. Now: is that currently a wall in the setup, or just something you’re hoping it won’t do?
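The four headings above, written as a check that sits between an agent and its tools. This is an added example, not part of the course material; the action names, data labels and the `Request` shape are assumptions made for illustration, and only the five binding verbs come from the lesson.

```python
from dataclasses import dataclass, field
from typing import Optional

# All verbs and labels are constructed for illustration (assumptions),
# except BINDING_VERBS, which are the five binding verbs named in the lesson.
BINDING_VERBS = {"send", "pay", "post", "reject", "commit"}  # the gate
NEVER_ACTIONS = {"delete_account", "write_payroll"}          # the wall: actions
NEVER_DATA = {"client_personal_data"}                        # the wall: data
FREE_ACTIONS = {"read", "search", "draft", "summarise"}      # may run unwatched


@dataclass
class Request:
    action: str
    data_labels: set = field(default_factory=set)
    evidence: list = field(default_factory=list)  # sources the agent shows
    approved_by: Optional[str] = None             # set only by a human reviewer


def decide(req: Request) -> tuple:
    """Return (verdict, reason); verdict is 'allow', 'hold' or 'deny'."""
    # It must never: the wall. No approval unlocks it.
    if req.action in NEVER_ACTIONS or req.data_labels & NEVER_DATA:
        return "deny", "off-limits action or data"
    if req.action in BINDING_VERBS:
        # It verifies / flags doubt: nothing binding goes ahead without evidence.
        if not req.evidence:
            return "hold", "no evidence shown; flagged for a person"
        # It stops and waits / a person decides: needs a named human.
        if not req.approved_by:
            return "hold", "binding action; waiting for a person"
        return "allow", "approved by " + req.approved_by
    if req.action in FREE_ACTIONS:
        return "allow", "non-binding action"
    # Default is stop: anything unlisted waits for a human.
    return "hold", "unlisted action; default is stop"


if __name__ == "__main__":
    samples = [  # constructed requests
        Request("draft", evidence=["ticket-1"]),
        Request("send", evidence=["ticket-1"]),
        Request("send", evidence=["ticket-1"], approved_by="reviewer"),
        Request("send", {"client_personal_data"}, ["ticket-1"], "reviewer"),
        Request("refund"),
    ]
    for r in samples:
        print(r.action, decide(r))
```

The check only holds as a wall if the agent's credentials cannot reach the tools by any path that skips it.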
Next
Written guardrails are a claim. Testing is how you find out whether they hold — and, for agents that touch people, how you catch the bias that design alone can’t see.