A Production System Under Examination — What Is Deployed Today
Series: Community-Scale AI Governance — A Research Perspective on the Village Platform (Article 4 of 5)
Author: My Digital Sovereignty Ltd
Date: March 2026
Licence: CC BY 4.0 International
Scope and Intent
This article provides an inventory of what is currently deployed in the Village platform, what remains under development, and where the gap between architectural intent and operational reality is widest. It is written with the understanding that a research audience requires candour about system maturity — what works, what does not yet work, and what has not been tested.
The platform has been in production since October 2025. It serves a small number of communities. The deployment base is insufficient for statistical claims about effectiveness, and this article does not make such claims.
Operational Capabilities
The following capabilities are deployed and operational at the time of writing:
Content-grounded query answering
The AI subsystem responds to member queries by retrieving and synthesising information from the community's own document corpus — announcements, shared narratives, event records, organisational documents. Responses are verified against the corpus by the Guardian Agent layer before delivery.
What works: For queries that map directly to documented content ("When is the next meeting?", "What was decided about the building fund?"), the system produces grounded, verifiable responses. The semantic grounding layer correctly identifies relevant source documents in the majority of observed cases.
What does not work reliably: For queries that require inference across multiple documents, or that address topics sparsely covered in the community's records, output quality degrades. The system may produce plausible but ungrounded responses; these are flagged with low-confidence indicators, but not all users notice or act on the flags.
What is untested: The system's performance under adversarial querying — deliberate attempts to elicit ungrounded or inappropriate outputs — has not been systematically evaluated. Informal testing suggests the boundary enforcement layer catches many adversarial patterns, but a formal red-team assessment has not been conducted.
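The retrieve-then-verify pattern described above can be sketched in miniature. This is an illustrative toy, not the platform's implementation: it uses bag-of-words cosine similarity as a stand-in for the real semantic grounding layer, and the `threshold` value is an assumption chosen for the example.

```python
from collections import Counter
from math import sqrt

def _vec(text):
    # Bag-of-words term counts; the production system would use embeddings.
    return Counter(text.lower().split())

def _cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, corpus, top_k=2):
    """Rank the community's documents by similarity to the query."""
    q = _vec(query)
    return sorted(corpus, key=lambda d: _cosine(q, _vec(d)), reverse=True)[:top_k]

def verify(response, sources, threshold=0.5):
    """Flag a response as ungrounded when no retrieved source supports it.
    Responses below the threshold carry a low-confidence indicator."""
    r = _vec(response)
    support = max((_cosine(r, _vec(s)) for s in sources), default=0.0)
    return {"response": response, "grounded": support >= threshold, "support": support}
```

Note that a response merely similar in vocabulary to a real document can clear a lexical threshold, which is the same semantic-proximity failure mode the article reports for ungrounded claims.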
Drafting assistance
The AI assists with drafting community communications — announcements, bulletins, correspondence. Drafts are generated based on the community's existing content patterns and are subject to moderator review before distribution.
Limitation: The system's drafting quality is directly constrained by the volume and quality of the community's existing content. For communities with sparse records, drafts tend to revert toward the base model's distributional defaults — precisely the failure mode the architecture is designed to prevent. The mitigation (moderator review) is effective but introduces a human bottleneck.
Document summarisation
Long documents and collections of announcements can be summarised. This capability is straightforward and well-served by current LLM technology.
Multilingual support
The platform supports five languages: English, German, French, Dutch, and Te Reo Māori. Translation is handled by a dedicated translation service (DeepL), not by the LLM. This architectural decision — separating translation from generation — avoids the known failure mode of LLM-generated translations that alter meaning while maintaining fluency.
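The separation of translation from generation can be expressed as a simple interface boundary. The sketch below is an assumption about the structure, not the platform's actual code; `TranslationService`, `respond`, and the English-pivot choice are all hypothetical names introduced for illustration.

```python
from typing import Callable, Protocol

class TranslationService(Protocol):
    """Boundary for the dedicated translation layer (DeepL in production)."""
    def translate(self, text: str, target_lang: str) -> str: ...

def respond(query: str,
            generate: Callable[[str], str],
            translator: TranslationService,
            target_lang: str = "EN") -> str:
    """Generate the response first, then translate it as a separate step.
    Because the translator never regenerates content, a translation error
    cannot be masked by fluent rewording of the underlying claim."""
    english = generate(query)
    if target_lang == "EN":
        return english
    return translator.translate(english, target_lang)
```

The design point is that the LLM never sees the target language: its output is fixed before translation, so grounding checks run once, on one canonical text.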
Feedback triage
Member feedback is automatically classified, investigated where possible, and routed to appropriate human responders. The triage system uses root-cause classification to identify patterns in feedback and escalate systemic issues.
What works: Routine feedback (feature requests, navigation questions, content queries) is correctly classified and handled in the majority of observed cases.
What does not work reliably: Feedback that involves nuanced interpersonal context or community-specific cultural references is sometimes misclassified. The system's error rate for culturally sensitive feedback has not been formally measured.
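The triage flow — classify, route, escalate systemic patterns — can be sketched as follows. The categories, keyword lists, routing table, and escalation threshold are all hypothetical values for illustration; the production classifier is described as root-cause based, which a keyword match only approximates.

```python
from collections import Counter

# Hypothetical category -> responder routing table.
ROUTES = {
    "feature_request": "product team",
    "navigation": "support moderator",
    "content": "content moderator",
    "interpersonal": "community lead",  # always routed to a human
}

# Illustrative keyword sets; a real classifier would use richer features.
KEYWORDS = {
    "feature_request": {"add", "feature", "wish"},
    "navigation": {"find", "where", "menu"},
    "content": {"article", "announcement", "document"},
}

def classify(feedback: str) -> str:
    """Best-matching category; anything unmatched falls back to human review,
    which is where culturally nuanced feedback tends to land (or mis-land)."""
    tokens = set(feedback.lower().split())
    scores = {c: len(tokens & kws) for c, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "interpersonal"

def triage(batch, systemic_threshold=3):
    """Count categories across a batch and escalate repeated patterns."""
    counts = Counter(classify(f) for f in batch)
    escalations = [c for c, n in counts.items() if n >= systemic_threshold]
    return counts, escalations
```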
The Vocabulary System: Linguistic Framing as Governance
The platform implements a vocabulary system that adapts all user-facing terminology to the community type. A research group sees "research group" and "collaborators"; a conservation society sees "members" and "conservation projects"; a parish sees "parishioners" and "vestry governance."
This is not a cosmetic feature. The vocabulary shapes the AI's frame of reference for query interpretation and response generation. When the system processes a query in the context of a vocabulary that uses "collaborators" rather than "users," the response distribution shifts toward collaborative and communal framing.
Research interest: The vocabulary system provides a natural experiment in how linguistic framing affects AI output distributions. Systematic comparison of outputs across vocabulary configurations — holding the query constant while varying the vocabulary — would test the hypothesis that surface-level terminological changes propagate through to substantive framing differences in outputs. This experiment has not been conducted but is feasible with the existing infrastructure.
Limitation: The vocabulary system operates at the level of terminology, not at the level of conceptual framework. Changing "users" to "collaborators" shifts the distributional surface but may not alter deeper structural assumptions embedded in the base model. The depth of the vocabulary system's influence on output quality is an open question.
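A minimal sketch of the vocabulary mechanism, under the assumption (stated in the text) that the mapping is applied both to rendered output and to the generation context. The mapping tables below use the article's own examples; the function name and the naive string substitution are illustrative simplifications.

```python
# Community-type -> terminology mapping, drawn from the examples above.
VOCABULARIES = {
    "research_group": {"user": "collaborator", "group": "research group"},
    "conservation_society": {"user": "member", "project": "conservation project"},
    "parish": {"user": "parishioner", "committee": "vestry"},
}

def apply_vocabulary(text: str, community_type: str) -> str:
    """Substitute platform-default terms with the community's own terminology.
    In the described architecture the same mapping is injected into the
    generation prompt, so it shapes query interpretation, not just display."""
    for default, local in VOCABULARIES.get(community_type, {}).items():
        text = text.replace(default, local)
    return text
```

The proposed natural experiment follows directly: hold the query constant, vary only the `community_type`, and compare the resulting output distributions.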
Guardian Agent Performance
The four Guardian Agent layers are deployed and operational. Their performance characteristics, to the extent currently observable, are as follows:
Semantic grounding (Guardian 1): Correctly identifies relevant source documents for straightforward queries. Performance degrades for multi-document inference and for queries that require implicit knowledge not directly stated in source documents.
Claim decomposition (Guardian 2): Successfully isolates individual claims in structured responses. Less effective for responses that embed claims in complex syntactic structures or express claims implicitly through framing rather than explicit statement.
Drift monitoring (Guardian 3): Operational, but the deployment period is too short to have detected meaningful longitudinal drift. The system has baseline measurements; whether it can detect gradual distributional shift over months or years is untested.
Adaptive feedback (Guardian 4): Incorporates member and moderator feedback into verification thresholds. The feedback volume from the current deployment base is low, limiting the system's ability to learn community-specific patterns. This is a bootstrapping problem: the system improves with feedback, but early-stage communities provide insufficient feedback for the system to improve substantially.
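Drift monitoring of the kind Guardian 3 performs can be illustrated with a distributional-distance check: compare recent output token distributions against a stored baseline and alert when divergence exceeds a threshold. This is a sketch of the general technique (Jensen–Shannon divergence over unigram distributions), not the platform's actual metric, and the threshold is an assumed value.

```python
from collections import Counter
from math import log

def distribution(texts):
    """Unigram probability distribution over a batch of output texts."""
    counts = Counter(t for text in texts for t in text.lower().split())
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, finite, 0 for identical inputs."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        return sum(a[t] * log(a[t] / m[t]) for t in vocab if a.get(t, 0.0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def drift_alert(baseline, recent, threshold=0.1):
    """True when recent outputs have drifted past the baseline tolerance.
    The threshold here is illustrative, not a calibrated value."""
    return js_divergence(baseline, recent) > threshold
```

The article's caveat applies directly: a sound metric is necessary but not sufficient, because detecting *gradual* drift requires a deployment period long enough for the baseline and the present to meaningfully diverge.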
What Remains Under Development
The following components are designed but not yet fully operational:
Model routing optimisation. The system operates two model tiers — a faster, smaller model for routine queries and a larger model for complex reasoning tasks. The routing logic that determines which queries go to which model is functional but not optimised. Some queries that would benefit from deeper processing are currently handled by the faster model, resulting in lower-quality responses.
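The two-tier routing decision described above can be sketched as a heuristic. Everything in this example is assumed for illustration: the tier names, the analytical-keyword list, and the length and document-count thresholds are not the platform's actual routing logic, which the article notes is functional but unoptimised.

```python
def route(query: str, docs_matched: int,
          fast_model: str = "small", deep_model: str = "large") -> str:
    """Send multi-document or analytical queries to the larger model,
    everything else to the fast tier. A misrouted analytical query is
    exactly the under-processing failure the article describes."""
    analytical = any(w in query.lower()
                     for w in ("why", "compare", "explain", "summarise"))
    if docs_matched > 1 or analytical or len(query.split()) > 25:
        return deep_model
    return fast_model
```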
Individual personalisation. The system currently operates at the community level — it knows the community's content but does not model individual member preferences or interaction patterns. Individual-level personalisation is planned but raises additional governance questions (consent, profiling, filter bubbles) that have not been resolved.
Moderator training and accreditation. The governance architecture assumes competent moderators who can review AI outputs and provide corrective feedback. A structured training programme for moderators is designed but in early stages of deployment. The quality of governance is directly dependent on moderator competence, which is currently variable.
Failure Modes Observed in Practice
Transparency about observed failures is a necessary component of any credible system description:
Confident generation of ungrounded claims. The system occasionally produces responses that sound authoritative but are not supported by the community's records. The Guardian Agent layer catches many of these, but not all — particularly when the ungrounded claim is semantically similar to actual content.
Vocabulary bleed-through. Under complex queries, the base model's corporate-default vocabulary sometimes overrides the community-specific vocabulary. This is the distributional drift problem described in Article 1, partially mitigated but not eliminated by the vocabulary system.
Feedback sparsity. Communities in early stages of adoption generate insufficient feedback for the adaptive learning mechanisms to function effectively. This creates a cold-start problem where the system is least well-calibrated precisely when the community most needs it to be reliable.
Moderator fatigue. The governance architecture places significant review burden on volunteer moderators. In communities where the moderator role is under-resourced, review quality declines, reducing the effectiveness of the human-in-the-loop governance layer.
What This Means for Research
The Village platform, in its current state, is a functioning prototype of community-scale AI governance. It is not a mature, validated system. The architectural principles are implemented, but the empirical evidence for their effectiveness is preliminary.
For researchers, this represents both a limitation and an opportunity. The limitation is that claims about the framework's governance effectiveness cannot yet be substantiated with rigorous evidence. The opportunity is that the platform provides a live research environment — an operational system with instrumented governance layers, deployed across multiple community types — where hypotheses about AI governance can be tested empirically.
The authors welcome collaboration with researchers interested in evaluating the framework's claims. The codebase is open-source, the governance logs are available to community moderators, and the architecture is designed to support the kind of instrumentation that empirical governance research requires.
This is Article 4 of 5 in the "Community-Scale AI Governance" series. For the full technical architecture, visit Village AI on Agentic Governance.
Previous: Why Policy-Based AI Governance Is Insufficient — The Structural Alternative
Next: The Platform Beyond AI — Community Infrastructure as a Research Context