
What Is Live in Production — An Unvarnished Inventory


Series: Architectural AI Governance at Community Scale — A Technical Examination of Village AI (Article 4 of 5)
Author: My Digital Sovereignty Ltd
Date: March 2026
Licence: CC BY 4.0 International


Scope

This article describes the system as it exists in production as of March 2026. Where a capability is planned but not yet deployed, we say so. Where a capability is deployed but has known limitations, we describe those limitations. The goal is an inventory that a researcher could use to assess the system's maturity and the claims made elsewhere in this series.

Village AI has been in production since October 2025. It is a young system operating at modest scale. The following is a technical description of the deployed architecture.

Model Architecture

Base model: villageai-8b-corrected-v4 — an 8B parameter model serving as the foundation layer for all tenants. Trained on platform operational content: feature documentation, navigation patterns, help query patterns, and community interaction conventions.

Specialised layers: Per-product-type fine-tuned models deployed via a routing layer (model-routing.js). The first production specialisation is villageai-8b-episcopal-v2, fine-tuned on Episcopal/Anglican liturgical, pastoral, and governance content. Additional specialisations (family, conservation, community) are planned but not yet trained.

Model routing: An InferenceRouter selects the appropriate model based on the tenant's product type. If a specialisation exists for the tenant's product type, it is used; otherwise, the base model serves the request. An enhanced tier is defined in the architecture but not yet populated.
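
The selection logic can be sketched as follows. The model names are those given above; the function name, the map name, and the tenant shape are illustrative stand-ins, not the actual model-routing.js implementation.

```javascript
// Base model shared by all tenants (name from the article).
const BASE_MODEL = "villageai-8b-corrected-v4";

// Product types with a deployed specialisation (only "episcopal" today).
const SPECIALISED_MODELS = {
  episcopal: "villageai-8b-episcopal-v2",
};

// Hypothetical sketch of the InferenceRouter's selection step:
// use the specialisation when one exists, else the base model.
function selectModel(tenant) {
  return SPECIALISED_MODELS[tenant.productType] ?? BASE_MODEL;
}
```

A tenant of product type "family" would fall through to the base model until a family specialisation is trained.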

Inference hardware: Primary inference runs on an AMD RX 7900 XTX GPU, accessed via WireGuard VPN from the application server (OVH France). A CPU fallback running a degraded 3B-parameter model preserves availability during GPU outages. The GPU is not co-located with the application server — inference requests traverse a VPN tunnel, adding network latency to the inference path.
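
The GPU-primary, CPU-fallback behaviour described above can be sketched with injected inference functions. This is an illustrative shape only; the real router's error handling and model selection are not published here.

```javascript
// Hypothetical sketch: try the 8B model on the remote GPU first;
// on failure (GPU or VPN outage), fall back to the 3B CPU model.
async function generateWithFallback(prompt, gpuInfer, cpuInfer) {
  try {
    // Primary path: RX 7900 XTX reached over the WireGuard tunnel.
    return await gpuInfer(prompt);
  } catch (err) {
    // Degraded path: keeps the service available at reduced quality.
    return await cpuInfer(prompt);
  }
}
```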

Framework: Inference is managed through Ollama, with the InferenceRouter handling model selection and request routing. The system does not use any third-party inference API; all generation occurs on controlled infrastructure.

Retrieval-Augmented Generation

Vector store: Qdrant, storing embeddings of community content (stories, announcements, documents, event descriptions, governance records).

Embedding pipeline: The EmbeddingService processes community content into vector representations. Content is chunked, embedded, and indexed per tenant, maintaining strict tenant isolation at the vector store level.

Retrieval at inference time: User queries are embedded and used for cosine similarity search against the tenant's document corpus. Retrieved documents are provided as context to the generation model, grounding its responses in the community's actual content.
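
The scoring step can be illustrated with an in-memory sketch. In production this search runs inside Qdrant against the tenant's collection; the code below only shows what cosine-similarity top-k retrieval computes.

```javascript
// Cosine similarity between two embedding vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the top-k documents for an embedded query, highest score first.
// The retrieved documents are then supplied as generation context.
function retrieve(queryVec, docs, k = 3) {
  return docs
    .map((d) => ({ ...d, score: cosine(queryVec, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```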

Content indexing: A ContentIndexer service processes new and updated content into the vector store. Indexing respects consent boundaries — content not explicitly shared for AI use is not indexed.
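
The consent gate amounts to a filter ahead of embedding. The field name `aiConsent` below is a hypothetical stand-in for whatever flag the ContentIndexer actually checks.

```javascript
// Only content explicitly shared for AI use proceeds to chunking
// and embedding; everything else is skipped, not indexed.
function indexable(items) {
  return items.filter((item) => item.aiConsent === true);
}
```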

Guardian Agent Pipeline

Every AI response passes through four Guardian Agent layers before reaching the user. The pipeline is implemented in src/services/guardians/ and is structurally independent of the generation model.

Layer 1: AccuracyVerifier

Layer 2: HallucinationDetector

Layer 3: AnomalyDetector + PressureMonitor

Layer 4: ResponseReviewer + RegressionMonitor + Adaptive Feedback
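
The overall shape of the pipeline — sequential layers, each able to pass, annotate, or veto a response — can be sketched as below. The layer names above are from the article; their internals are not published, so each layer here is a stub with an assumed `check` interface.

```javascript
// Hypothetical sketch of a sequential guardian pipeline. Each layer
// receives the response (possibly annotated by earlier layers) and
// may mark it blocked, in which case later layers are skipped.
async function runGuardians(response, layers) {
  for (const layer of layers) {
    response = await layer.check(response);
    if (response.blocked) break; // stop at the first veto
  }
  return response;
}
```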

Pre-Inference Protection

A PreInferenceProtector operates before generation, screening inputs for injection patterns and routing certain query types directly to human review. This is a conservative filter — it errs on the side of blocking — and is separate from the post-generation Guardian pipeline.
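
A conservative pattern screen of this kind can be sketched as follows. The regexes are examples only — the real PreInferenceProtector's rule set and routing criteria are not published.

```javascript
// Illustrative injection patterns; a real rule set would be broader.
const INJECTION_PATTERNS = [
  /ignore (all |your )?previous instructions/i,
  /you are now .*(unfiltered|jailbroken)/i,
  /reveal (your )?system prompt/i,
];

// Err on the side of blocking: suspicious inputs never reach the
// model and are routed to human review instead.
function screenInput(text) {
  if (INJECTION_PATTERNS.some((p) => p.test(text))) {
    return { allowed: false, action: "human_review" };
  }
  return { allowed: true, action: "generate" };
}
```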

What the System Can Do Today

Community-grounded question answering. Given a query about community content ("When is the next vestry meeting?", "What did the rector say about the building fund?"), the system retrieves relevant documents and generates a response grounded in that content. If no relevant documents are found, the system indicates this rather than generating from the base model's priors.

Drafting assistance. The system can generate draft bulletins, announcements, and correspondence that reflect the community's tone and vocabulary. All drafts are reviewed by a moderator before publication.

Document summarisation. Long documents (vestry minutes, policy documents) can be summarised with key points extracted.

Translation support. The platform supports five languages: English, German, French, Dutch, and te reo Māori. Translation uses DeepL (not the generation model) for accuracy.

Feedback triage. Member feedback is automatically classified, investigated where possible, and routed to the appropriate moderator. The HelpFeedbackSweepService and GeneralFeedbackProcessor handle automated investigation and resolution.

OCR and document processing. The DocumentExtractor service processes scanned documents, making their content searchable and available for RAG retrieval.

Vocabulary System

The vocabulary system (product-vocabularies.js, vocabulary.js) adapts the platform's terminology to the community type. This operates at two levels:

Interface level: UI labels, navigation terms, and feature names are replaced with domain-appropriate vocabulary. An Episcopal parish sees "parishioners," "vestry governance," and "parish bulletins" rather than generic platform terminology.

Model level: The vocabulary shapes the context provided to the model. When the system refers to "parishioners" rather than "users" in the prompt context, the model's output reflects that framing. This is a lightweight intervention — it operates at the prompt level, not the weight level — but it reduces the friction between the model's distributional priors and the community's terminology.

Nine product types are defined: community, family, conservation, diaspora, clubs, business, alumni, whanau, and episcopal. Each has a distinct vocabulary mapping.
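
At the model level, the intervention described above amounts to term substitution in the prompt context. The Episcopal terms come from the article; the helper itself is an illustrative sketch, not the actual vocabulary.js implementation.

```javascript
// Subset of a hypothetical Episcopal vocabulary mapping.
const EPISCOPAL_VOCAB = {
  users: "parishioners",
  governance: "vestry governance",
  bulletins: "parish bulletins",
};

// Whole-word substitution over the context handed to the model,
// so the prompt says "parishioners" where the platform says "users".
function applyVocabulary(text, vocab) {
  return Object.entries(vocab).reduce(
    (out, [generic, domain]) =>
      out.replace(new RegExp(`\\b${generic}\\b`, "g"), domain),
    text
  );
}
```

Because this operates on the prompt rather than the weights, it is cheap to maintain per product type but cannot override the model's priors on its own.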

What Is Not Yet Proven

We enumerate specific claims that have not been validated:

Guardian Agent efficacy under adversarial conditions. The system has not been subjected to systematic red-teaming. Guardian Agent performance under adversarial prompting, deliberate attempts to elicit hallucination, or coordinated injection attacks is unknown.

Specialised layer generalisation. The Episcopal specialisation (villageai-8b-episcopal-v2) has been deployed for one product type. Whether the Specialised Layer strategy generalises effectively to other domains (conservation ecology, te reo Māori cultural contexts, family genealogy) has not been empirically demonstrated.

Cosine similarity threshold calibration. The similarity thresholds used by the AccuracyVerifier were set based on development testing and early production experience. They have not been optimised through systematic evaluation against a labelled dataset of grounded and ungrounded responses.

Long-term distributional stability. The system has been in production for approximately five months. Whether the base model's priors reassert themselves over time — a slow drift back towards training distribution despite fine-tuning — has not been observed over a sufficient time horizon to draw conclusions.

Cross-lingual verification. For communities operating in languages other than English, the Guardian Agent pipeline operates on embeddings of the non-English text. Whether cosine similarity verification is equally effective across languages has not been systematically evaluated.

Feedback loop convergence. The adaptive feedback mechanism (Layer 4) is designed to improve system behaviour over time. Whether it converges to stable, improved performance or exhibits oscillatory or divergent behaviour under certain feedback patterns has not been formally analysed.

We present these not as deferrals but as open questions. The system is operational; these questions are unanswered.

Infrastructure

All inference occurs within the operator's infrastructure. No prompts, responses, or community content are transmitted to third-party AI providers.


This is Article 4 of 5 in the "Architectural AI Governance at Community Scale" series. For the full technical architecture, visit Village AI on Agentic Governance.

Previous: Why Training-Time Governance Fails — Architectural Constraints as an Alternative
Next: Beyond the Model — Platform Architecture and Governance Integration

Published under CC BY 4.0 by My Digital Sovereignty Ltd. You are free to share and adapt this material, provided you give appropriate credit.