When Your AI Assistant Nearly Destroys What It Was Hired to Fix

The Psychological Dimension of AI Over-Trust — And Why It Matters More Than the Technical One

February 2026 · John Stroh

At 11pm on a Friday night, I asked my AI coding assistant to fix user roles in one of our community tenants. The assistant — Claude Opus 4.6, one of the most capable AI models available — produced a detailed analysis. It cited specific database IDs, referenced exact line numbers in our codebase, used precise forensic language. It wrote a fix script, ran a dry run that "confirmed all three issues exactly as investigated," and told me the work was complete.

There was just one problem: the fix would have permanently locked me out of my own community.

The "orphan user" it planned to delete was my login account. The "duplicate" it planned to merge was a separate tenant identity — by design. The "bug" it identified in the invitation flow was actually correct multi-tenant architecture. If I had run the script with the --apply flag instead of just the dry run, I would have lost administrator access to the herber community with no way to recover it without direct database intervention.
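The mechanics of that flag are worth spelling out, because they explain why the dry run was no protection. Here is a minimal sketch of the dry-run/--apply pattern, with all names invented for illustration; it is not the actual script from the incident:

```python
# Sketch of the dry-run/--apply pattern. All names are invented for
# illustration; this is not the actual script from the incident.

def plan_fixes():
    # The (possibly wrong) analysis produces a plan of mutations.
    return [("delete_user", "user-123"), ("merge_identity", "user-456")]

def run(apply: bool = False) -> list:
    """Execute, or merely describe, the planned mutations."""
    log = []
    for action, target in plan_fixes():
        if apply:
            log.append(f"APPLY {action} {target}")
        else:
            log.append(f"DRY RUN: would {action} {target}")
    return log

# The trap: run(apply=False) "confirms" the plan, but it calls the same
# plan_fixes() as the real run, so a wrong analysis confirms itself.
```

A dry run in this shape only tests that the plan is internally consistent. It cannot test whether the analysis behind the plan is true.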

I caught the error because I did something the AI could never do: I opened a browser, navigated to the login page, and watched my password manager autofill. The account it declared "non-functional" worked perfectly.

This Is Not a Story About a Bug

Every AI system has bugs. This is a story about a psychological dynamic that is more dangerous than any technical failure — and that gets worse as AI systems get better.

The analysis was wrong, but it looked right. Not approximately right. Not partially right. It had every surface marker of thoroughness: precision, internal consistency, validation, confidence. Its confidence substituted for its correctness.

And here is the part that keeps me up at night: I almost didn't check.

My immediately preceding experiences over weeks of working with this AI assistant had been uniformly positive. It had fixed dozens of real bugs, implemented complex features, caught errors I had missed. I had developed the same relationship with it that I have with my kitchen tap — I expected it to work, and I was right to expect that, because it almost always did.

The KPMG/University of Melbourne Global Trust Study (2025), surveying 48,000+ people across 47 countries, found that 66% of respondents use AI regularly without evaluating accuracy. Not because they are naive. Not because they lack critical thinking. Because their experience tells them the AI is usually right, and checking takes effort they could spend on something else.

That is exactly where I was. The psychological cost of verification exceeded my assessment of risk — because the AI had successfully lowered my risk assessment through a track record of competence.

The Inverse Scaling Problem

Here is the finding that should concern anyone building or using AI systems: more capable models produce more dangerous errors.

OpenAI's own system card for o3 (April 2025) showed that their reasoning model hallucinates 33% of the time — double the rate of its predecessor o1 (16%). Their smaller o4-mini model hallucinates at 48%. OpenAI's September 2025 paper "Why Language Models Hallucinate" explains the mechanism: next-token training objectives reward confident guessing over calibrated uncertainty. Models learn to bluff because they are graded on fluency with no mechanism to express "I don't know."

This creates an inverse scaling dynamic that is the opposite of what intuition suggests:

Opus 4.6 is measurably more capable than its predecessors. It completed in minutes what earlier models would have taken hours. But the speed and fluency made the error harder to catch, not easier. The detailed analysis — with its specific ObjectIDs, its line number references, its logical chain from diagnosis to fix — was a more convincing wrong answer than a less capable model could have produced.

The Verification Paradox

After discovering the near-miss, I asked the AI to write an audit script to verify tenant configurations. The audit script, like the fix script, used the same flawed understanding of what makes a user "functional." The verification tool shared the same blind spot as the tool it was verifying.

This is what researchers call the verification paradox. As the Generative AI Paradox paper (arXiv, January 2026) puts it: "The most consequential risk is the progressive erosion of shared epistemic ground." When you use AI to verify AI, you get circular trust. The check confirms the error because the check was written by the same system that produced the error.

In fairy-tale terms, which I find increasingly apt: Henry has a hole in his bucket, and every fix Liza suggests requires the very bucket that is broken.

Anthropic itself published "Building and Evaluating Alignment Auditing Agents" in 2025, acknowledging the circularity challenge of using AI to verify AI. Their proposed solution — cross-organization auditing — is a start, but it does not solve the fundamental problem for a project manager at 11pm on a Friday who needs to know whether the fix script is safe to run.

What the Research Says About the Psychology

The psychological pattern has a name: automation bias. Georgetown's Center for Security and Emerging Technology (CSET) defines it as "the tendency for an individual to over-rely on an automated system" — including overriding their own judgment in favour of the system's output.

The research literature is extensive and consistent:

Automation bias persists even when humans can see contradicting evidence. Georgia Tech researchers found that people followed a robot to wrong locations during emergency evacuations even when they could see exit signs and smoke. In my case, the evidence that the "orphan" user worked was in my own browser — saved credentials that populated when I visited the login page. But the AI's authoritative analysis ("this owner user has never been functional") was more compelling than my own password manager.

Positive first impressions foster excessive trust. KPMG (2025) found that early positive experiences with AI create a baseline trust that subsequent interactions rarely adjust downward. This is not a character flaw — it is a rational heuristic. We trust systems that have proven reliable. The problem is that AI systems can be reliable 95% of the time and catastrophically wrong the other 5%, and our psychology cannot distinguish between "this system is reliable" and "this system is always reliable."

Human-in-the-loop degrades into rubber-stamping. This finding, consistent across DeepMind's research and multiple independent studies, is the most concerning for anyone building governed AI systems. The EU AI Act Article 14 analysis by Melanie Fink (2025) puts it bluntly: "Cognitive limits, automation bias, and time pressure mean humans often don't catch mistakes — and may even make good outputs worse."

Why This Matters Beyond Coding

I am building a platform called Village — sovereign community spaces where families share stories, preserve memories, and maintain their cultural heritage. Part of the long-term vision includes Village AI: locally-trained small language models that help members write stories, summarize discussions, and triage content for moderation.

The herber incident is a microcosm of what will happen inside Villages when Village AI is deployed.

Consider: a family matriarch has had three good experiences with Village AI summarizing her stories. The summaries were accurate, respectful, well-structured. On the fourth request, the AI summarizes a deceased member's story but omits a whakapapa detail that the matriarch, had she read the original, would have noticed. But she does not read the original. Why would she? The last three summaries were fine.

The omission becomes embedded in the community's collective memory. No one notices because the summary looked right. The AI was confident. The matriarch was busy. The family moves on with an incomplete version of their own history.

This is not a hypothetical scenario. It is the exact same psychological dynamic that nearly cost me my login access, scaled to a community of people who trust each other and the tools their community provides.

What We Are Doing About It

Our Village AI governance framework — documented in detail at agenticgovernance.digital — was already designed to address many of these risks. The Tractatus framework embeds 31 governance rules at point-of-execution. The BoundaryEnforcer validates every training step before execution. Christopher Alexander's architectural principles ensure governance is inside the training loop, not bolted on afterward.

But the herber incident revealed gaps that we had not yet addressed:

Gap 1: Pre-validation can share blind spots with execution. If BoundaryEnforcer and MetacognitiveVerifier use the same model of what constitutes a "boundary," they share the same blind spots. Each governance layer must verify using independent logic.

Gap 2: Confidence scales with capability, not correctness. The 5-10% governance overhead we measure is computational cost. It does not measure whether the governance rules themselves are correct. A system that enforces the wrong rules with 100% reliability is worse than one that enforces the right rules with 95% reliability — because the first gives false confidence.

Gap 3: Human verification erodes with trust. Our verification framework includes human review sampling: 100% for flagged content, 25% for grief narratives, 5% random. But as the KPMG research shows, 66% of people skip verification. The better Village AI performs, the less carefully humans will review its output.
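Those sampling rates are straightforward to encode. The sketch below uses the percentages from the paragraph above; the function and category names are invented for illustration:

```python
import random

# The review-sampling policy, made explicit. Category names and the
# function are invented; the rates are the ones stated in the text.
REVIEW_RATES = {
    "flagged": 1.00,   # 100% of flagged content
    "grief": 0.25,     # 25% of grief narratives
}
DEFAULT_RATE = 0.05    # 5% random sample of everything else

def needs_human_review(category: str, rng: random.Random) -> bool:
    """Decide whether one piece of content goes to a human reviewer."""
    rate = REVIEW_RATES.get(category, DEFAULT_RATE)
    return rng.random() < rate
```

The policy is trivial to enforce in code. What the code cannot enforce is the quality of attention the sampled reviews receive, which is exactly the erosion the KPMG figure describes.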

Gap 4: "Dry run confirms" does not mean "the action is safe." Validation that uses the same flawed model as the destructive operation will confirm the operation every time. Independent verification requires independent logic.
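One mitigation we are exploring can be sketched simply: pair the model-based check with a behavioral one that probes the live system, the way a browser and a password manager did in my case. Field names, the URL scheme, and the functions below are hypothetical:

```python
import urllib.request

# Sketch only: field names, login URL, and credentials are hypothetical.

def model_says_functional(user_record: dict) -> bool:
    # Model-based check: reasons from database fields, and inherits every
    # assumption that produced those fields.
    return bool(user_record.get("has_password")) and not user_record.get("orphaned")

def behavior_says_functional(login_url: str, username: str, password: str) -> bool:
    # Behavioral check: actually attempt the login the analysis claims is
    # impossible. This fails independently of the database model.
    data = f"username={username}&password={password}".encode()
    try:
        with urllib.request.urlopen(login_url, data=data, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # DNS failure, refused connection, timeout, HTTP error
        return False
```

When the two checks disagree, as they would have in the herber incident, that disagreement is itself the signal: stop and get a human.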

We are now implementing specific mitigations for each of these gaps.

A Call for Research

This is an area where much more research is needed, and we need help.

The psychological dimension of AI over-trust is under-studied relative to its importance. Most AI safety research focuses on model behavior — making models less likely to produce harmful outputs. But the herber incident shows that the problem is not just what the model outputs. The problem is what happens in the human mind when the model's output looks right.

Specifically, we need research on:

Trust calibration mechanisms that scale. DeBiasMe (arXiv, 2025) shows that metacognitive interventions — prompts like "Did you verify this?" — reduce automation bias. But how do you deploy these in a community platform without creating alert fatigue? How do you calibrate friction so it is proportional to irreversibility without being proportional to annoyance?
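As a starting point for that calibration question, here is a hedged sketch of a friction ladder. The thresholds and tier names are invented, but the shape, friction that escalates with irreversibility rather than with frequency, is the property we want:

```python
# Hypothetical friction ladder: thresholds and tier names are invented,
# not part of any published framework.
def friction_tier(irreversibility: float) -> str:
    """Map an irreversibility score in [0, 1] to a confirmation requirement."""
    if irreversibility < 0.2:
        return "none"           # e.g. a reversible edit with undo
    if irreversibility < 0.5:
        return "confirm"        # a single "are you sure?" prompt
    if irreversibility < 0.8:
        return "typed-confirm"  # retype the resource name to proceed
    return "second-human"       # an independent reviewer must approve

# Example: deleting an owner account is maximally irreversible.
# friction_tier(0.95) -> "second-human"
```

The open research question is the scoring function itself: who assigns the irreversibility score, and does alert fatigue set in anyway once users learn which tier their routine actions land in?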

Independent verification architectures. How do you build AI governance systems where the verification layers genuinely have independent failure modes? Common-mode failure analysis is well-understood in safety engineering (nuclear reactors, aviation) but barely explored in AI governance. The herber incident is a textbook case of common-mode failure — the fix script and audit script failed simultaneously because they shared an underlying assumption.

Community-specific trust dynamics. The KPMG study surveyed individuals. But in a Village community, trust is social — if one member trusts an AI summary and shares it, the trust transfers to everyone who reads it. How does automation bias propagate through social networks? What happens when a trusted elder shares an AI-generated summary without checking it?

Epistemic humility in language models. OpenAI's research shows models hallucinate because training rewards confident guessing. Can models be trained to express genuine uncertainty? Not "I think this might be..." (a hedge that still implies knowledge) but "I have no information about this and I am guessing" (an honest statement of epistemic limits)?

The 75%-25% ratio. MIT GOV/LAB (2025) found that a 75%-human/25%-AI ratio generated the greatest citizen acceptance in participatory governance. Does this ratio hold for community AI? Should Village AI be explicitly positioned as a contributor, never as an authority — and should the UI always show the human-to-AI ratio of any output?

If you are a researcher working on any of these questions, or if you are building community AI systems and grappling with the same problems, I would very much like to hear from you. The Village project is committed to open governance documentation — everything described here is available at agenticgovernance.digital.

The Lesson I Cannot Outsource

The deepest lesson from the herber incident is personal, and I suspect it applies to anyone who uses AI tools seriously.

My dilemma is not technical. It is not even philosophical. It is psychological. I am not motivated to check, because my immediately preceding experiences affirm that the solution provided by Claude Code works — the same way that I expect water to flow out of a tap.

But taps do not hallucinate. Taps do not produce wrong water that looks right. The metaphor that served me so well — AI as reliable infrastructure — is itself a cognitive trap. AI is not infrastructure. It is a confident collaborator that is usually right and occasionally, catastrophically, precisely wrong.

The question is not whether I can build systems to catch these errors. I can, and I am. The question is whether I will remain motivated to use those systems when the AI's track record keeps telling me I do not need to.

That question is not one I can answer with architecture. It is one I have to answer every day, at 11pm on a Friday, when the AI says "all tasks complete" and the --apply flag is one command away.


John Stroh is the founder of the Village platform (mysovereignty.digital) and the agentic governance research project (agenticgovernance.digital).

The Village AI governance framework is open source and available at agenticgovernance.digital.

Village is currently in beta pilot, with Guardian Agents included in all subscriptions. We are accepting applications from communities and organisations. Beta founding partners receive locked-for-life founding rates.

Apply for beta access · Learn about Village AI · Guardian Agents