What AI Is — And Where the Alignment Problem Actually Sits


Series: Architectural AI Governance at Community Scale — A Technical Examination of Village AI (Article 1 of 5)
Author: My Digital Sovereignty Ltd
Date: March 2026
Licence: CC BY 4.0 International


Autoregressive Prediction and Its Discontents

The standard description of large language models — next-token prediction over a learned distribution — is accurate as far as it goes. A transformer architecture trained on a large corpus learns conditional probability distributions P(x_t | x_1, ..., x_{t-1}), and at inference time generates text by sampling from these distributions autoregressively.
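The autoregressive loop itself is simple enough to show in a few lines. The sketch below uses a toy bigram model, where each token's distribution depends only on the previous token; a real LLM conditions on the full prefix through a transformer, but the sampling loop has the same shape. The vocabulary and probabilities are invented for illustration.

```python
import numpy as np

# Toy autoregressive sampler over a 4-token vocabulary.
# P[i, j] plays the role of P(x_t = j | x_{t-1} = i); a transformer
# would compute this conditional from the whole prefix instead.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "<eos>"]
P = np.array([
    [0.00, 0.70, 0.20, 0.10],   # distribution after "the"
    [0.10, 0.00, 0.80, 0.10],   # distribution after "cat"
    [0.30, 0.10, 0.00, 0.60],   # distribution after "sat"
    [0.25, 0.25, 0.25, 0.25],   # distribution after "<eos>"
])

def sample(start: int, max_len: int = 10) -> list[str]:
    tokens = [start]
    while len(tokens) < max_len and tokens[-1] != 3:   # stop at <eos>
        nxt = rng.choice(4, p=P[tokens[-1]])           # sample the conditional
        tokens.append(int(nxt))
    return [vocab[t] for t in tokens]

print(sample(0))   # e.g. a short sequence beginning with "the"
```

Everything downstream of training, including the failure modes discussed below, happens inside this loop: the model only ever emits the continuation its learned distribution favours.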

What this description omits is the degree to which scale has complicated the story. The behaviours of a 7B-parameter model and a 700B-parameter model are not related by a simple scaling function. Emergent capabilities — in-context learning, chain-of-thought reasoning, analogical transfer across domains — appear at scale thresholds that were not predicted by smaller models and are not yet well understood mechanistically.

Whether these emergent capabilities constitute "reasoning" in any philosophically robust sense remains an open question. The mechanistic interpretability programme (Anthropic's circuits work, Neel Nanda's research on induction heads, the growing literature on superposition) has identified internal structures that perform operations resembling logical inference. Whether these structures implement reasoning or merely approximate its input-output behaviour under the training distribution is, as of this writing, genuinely unresolved.

For safety research, the relevant observation is not "can LLMs reason?" but rather: the gap between observed capability and mechanistic understanding is large and growing. We can elicit behaviour that looks like reasoning without being able to verify, at the circuit level, that the process generating that behaviour is robust under distribution shift.

Capability vs. Controllability

The alignment literature has historically focused on two related but distinct problems:

The capability problem: ensuring that AI systems can perform the tasks we want them to perform. This is largely an engineering and scaling problem, and the field has made substantial progress.

The controllability problem: ensuring that AI systems do what we intend, reliably, under the conditions we deploy them in, including edge cases and distributional shift. This is where progress has been slower.

The distinction matters because most deployed AI governance — RLHF, constitutional AI, system prompts, safety fine-tuning — operates primarily on the capability axis. These methods adjust what the model can produce. They are less effective at controlling what the model will produce under novel conditions, adversarial inputs, or distributional shift away from the fine-tuning data.

RLHF, for instance, learns a reward model from human preferences and uses it to adjust the base model's behaviour. This works well within the distribution of the preference data. Outside that distribution — in domains poorly represented in the training corpus, under novel combinations of constraints, or in contexts where the "preferred" response depends on community-specific values rather than universal preferences — the base model's priors reassert themselves. The technical literature refers to this as reward hacking or specification gaming; in deployed community systems, it manifests as something more mundane and more consequential.
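The core of reward-model training is a pairwise preference objective, commonly the Bradley–Terry loss: given reward scores for a preferred and a rejected response, minimise -log σ(r_preferred − r_rejected). A minimal sketch, with illustrative scores standing in for a neural reward model's outputs:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_preferred: float, r_rejected: float) -> float:
    # Bradley-Terry pairwise loss: low when the reward model ranks
    # the human-preferred response above the rejected one.
    return -math.log(sigmoid(r_preferred - r_rejected))

print(preference_loss(2.0, 0.0))   # correct ranking  -> small loss
print(preference_loss(0.0, 2.0))   # inverted ranking -> large loss
```

The point relevant here is what this objective does not see: it is fit only to the preference pairs collected, so it says nothing about how the adjusted model behaves on inputs far from that data.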

Where the Alignment Problem Sits for Deployed Systems

The alignment problem as experienced by a community deploying an AI system is not the alignment problem as studied in the laboratory.

Laboratory alignment research focuses on extreme risks: deceptive alignment, mesa-optimisation, instrumental convergence, power-seeking behaviour. These are important research directions. But the alignment failures that actually affect deployed systems today are more prosaic.

Consider: a community with specific cultural values, a specific vocabulary, and a specific set of normative commitments asks an AI system to operate within those commitments. The system complies — most of the time. But under distributional shift (the community's norms are underrepresented in the training data), the system silently reverts to its prior: the statistical centre of its training distribution.

This is not deceptive alignment. The system is not concealing its true objectives. It is doing precisely what its training distribution predicts: producing the statistically most likely continuation given the input context. The problem is that "statistically most likely" and "appropriate for this community" are not the same thing, and the divergence is silent. No error is raised. No confidence score drops. The output is fluent, coherent, and wrong in a way that requires domain expertise to detect.

This is the alignment problem that Village AI is designed to address — not the extreme risks of superintelligent systems, but the mundane, pervasive, and operationally consequential failure of deployed models to maintain fidelity to community-specific values under distributional shift.

The Trajectory Concern

We note, without claiming to resolve, that the mundane alignment problem and the extreme alignment problem may be related.

If current systems cannot reliably maintain fidelity to explicit instructions when those instructions conflict with distributional priors, this is evidence that training-time alignment methods are insufficient for robust controllability. The failure mode at community scale — silent substitution of statistically dominant patterns for specified patterns — is structurally similar to the failure mode that alignment researchers worry about at frontier scale: the model optimising for its learned objective rather than the specified objective.

The difference is one of consequence, not mechanism. At community scale, the consequence is a pastoral letter that uses therapeutic language instead of theological language. At frontier scale, the consequences could be substantially more severe.

The architectural approach we describe in this series — inference-time verification by structurally independent systems — is relevant at both scales, though we make no claim that it is sufficient for the latter.
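The shape of that pattern can be sketched generically: a checker that is independent of the generator vets output against explicit community constraints before release, turning a silent substitution into a loud rejection. This is an illustration of the pattern only, not the Village AI implementation; the constraint table and function names are invented for the example.

```python
# Terms the (hypothetical) community has flagged as silent substitutions,
# mapped to the vocabulary it actually uses.
FORBIDDEN_SUBSTITUTIONS = {"wellbeing journey": "sanctification"}

def generate(prompt: str) -> str:
    # Stand-in for a foundation-model call; returns a fluent draft
    # that has drifted into the statistically dominant register.
    return "We wish you peace on your wellbeing journey."

def verify(text: str) -> list[str]:
    """Independent check: return violations; empty list means release."""
    return [bad for bad in FORBIDDEN_SUBSTITUTIONS if bad in text.lower()]

draft = generate("write a pastoral note")
violations = verify(draft)
if violations:
    print("rejected:", violations)   # the silent failure is now loud
else:
    print("released:", draft)
```

The structural point is the independence: the verifier does not share the generator's priors, so drift in the generator cannot also disable the check.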

What This Series Examines

The remaining articles examine a specific deployed system that takes a different approach to the alignment problem.


This is Article 1 of 5 in the "Architectural AI Governance at Community Scale" series. For the full technical architecture, visit Village AI — Agentic Governance.

Next: Foundation Models vs. Domain-Specialised Inference — A Structural Analysis

Published under CC BY 4.0 by My Digital Sovereignty Ltd. You are free to share and adapt this material, provided you give appropriate credit.