What AI Is — And Where the Alignment Problem Actually Sits


Series: Architectural AI Governance at Community Scale — A Technical Examination of Village AI (Article 1 of 5)
Author: My Digital Sovereignty Ltd
Date: March 2026
Licence: CC BY 4.0 International


Autoregressive Prediction and Its Discontents

The standard description of large language models — next-token prediction over a learned distribution — is accurate as far as it goes. A transformer architecture trained on a large corpus learns conditional probability distributions P(x_t | x_1, ..., x_{t-1}), and at inference time generates text by sampling from these distributions autoregressively.
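The autoregressive loop itself is simple enough to show in a few lines. The sketch below uses a toy bigram model, where each token's distribution depends only on the previous token; a real LLM conditions on the full prefix through a transformer, but the sampling loop has the same shape. The vocabulary and probabilities are invented for illustration.

```python
import numpy as np

# Toy autoregressive sampler over a 4-token vocabulary.
# P[i, j] plays the role of P(x_t = j | x_{t-1} = i); a transformer
# would compute this conditional from the whole prefix instead.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "<eos>"]
P = np.array([
    [0.00, 0.70, 0.20, 0.10],   # distribution after "the"
    [0.10, 0.00, 0.80, 0.10],   # distribution after "cat"
    [0.30, 0.10, 0.00, 0.60],   # distribution after "sat"
    [0.25, 0.25, 0.25, 0.25],   # distribution after "<eos>"
])

def sample(start: int, max_len: int = 10) -> list[str]:
    tokens = [start]
    while len(tokens) < max_len and tokens[-1] != 3:   # stop at <eos>
        nxt = rng.choice(4, p=P[tokens[-1]])           # sample the conditional
        tokens.append(int(nxt))
    return [vocab[t] for t in tokens]

print(sample(0))   # e.g. a short sequence beginning with "the"
```

Everything downstream of training, including the failure modes discussed below, happens inside this loop: the model only ever emits the continuation its learned distribution favours.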

What this description omits is the degree to which scale has complicated the story. The behaviours of a 7B-parameter model and a 700B-parameter model are not related by a simple scaling function. Emergent capabilities — in-context learning, chain-of-thought reasoning, analogical transfer across domains — appear at scale thresholds that were not predicted by smaller models and are not yet well understood mechanistically.

Whether these emergent capabilities constitute "reasoning" in any philosophically robust sense remains an open question. The mechanistic interpretability programme (Anthropic's circuits work, Neel Nanda's research on induction heads, the growing literature on superposition) has identified internal structures that perform operations resembling logical inference. Whether these structures implement reasoning or merely approximate its input-output behaviour under the training distribution is, as of this writing, genuinely unresolved.

For safety research, the relevant observation is not "can LLMs reason?" but rather: the gap between observed capability and mechanistic understanding is large and growing. We can elicit behaviour that looks like reasoning without being able to verify, at the circuit level, that the process generating that behaviour is robust under distribution shift.

Capability vs. Controllability

The alignment literature has historically focused on two related but distinct problems:

The capability problem: ensuring that AI systems can perform the tasks we want them to perform. This is largely an engineering and scaling problem, and the field has made substantial progress.

The controllability problem: ensuring that AI systems do what we intend, reliably, under the conditions we deploy them in, including edge cases and distributional shift. This is where progress has been slower.

The distinction matters because most deployed AI governance — RLHF, constitutional AI, system prompts, safety fine-tuning — operates primarily on the capability axis. These methods adjust what the model can produce. They are less effective at controlling what the model will produce under novel conditions, adversarial inputs, or distributional shift away from the fine-tuning data.

RLHF, for instance, learns a reward model from human preferences and uses it to adjust the base model's behaviour. This works well within the distribution of the preference data. Outside that distribution — in domains poorly represented in the training corpus, under novel combinations of constraints, or in contexts where the "preferred" response depends on community-specific values rather than universal preferences — the base model's priors reassert themselves. The technical literature refers to this as reward hacking or specification gaming; in deployed community systems, it manifests as something more mundane and more consequential.
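The core of reward-model training is a pairwise preference objective, commonly the Bradley–Terry loss: given reward scores for a preferred and a rejected response, minimise -log σ(r_preferred − r_rejected). A minimal sketch, with illustrative scores standing in for a neural reward model's outputs:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_preferred: float, r_rejected: float) -> float:
    # Bradley-Terry pairwise loss: low when the reward model ranks
    # the human-preferred response above the rejected one.
    return -math.log(sigmoid(r_preferred - r_rejected))

print(preference_loss(2.0, 0.0))   # correct ranking  -> small loss
print(preference_loss(0.0, 2.0))   # inverted ranking -> large loss
```

The point relevant here is what this objective does not see: it is fit only to the preference pairs collected, so it says nothing about how the adjusted model behaves on inputs far from that data.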

Where the Alignment Problem Sits for Deployed Systems

The alignment problem as experienced by a community deploying an AI system is not the alignment problem as studied in the laboratory.

Laboratory alignment research focuses on extreme risks: deceptive alignment, mesa-optimisation, instrumental convergence, power-seeking behaviour. These are important research directions. But the alignment failures that actually affect deployed systems today are more prosaic.

Consider: a community with specific cultural values, a specific vocabulary, and a specific set of normative commitments asks an AI system to operate within those commitments. The system complies — most of the time. But under distributional shift (the community's norms are underrepresented in the training data), the system silently reverts to its prior: the statistical centre of its training distribution.

This is not deceptive alignment. The system is not concealing its true objectives. It is doing precisely what its training distribution predicts: producing the statistically most likely continuation given the input context. The problem is that "statistically most likely" and "appropriate for this community" are not the same thing, and the divergence is silent. No error is raised. No confidence score drops. The output is fluent, coherent, and wrong in a way that requires domain expertise to detect.

This is the alignment problem that Village AI is designed to address — not the extreme risks of superintelligent systems, but the mundane, pervasive, and operationally consequential failure of deployed models to maintain fidelity to community-specific values under distributional shift.

The Trajectory Concern

We note, without claiming to resolve, that the mundane alignment problem and the extreme alignment problem may be related.

If current systems cannot reliably maintain fidelity to explicit instructions when those instructions conflict with distributional priors, this is evidence that training-time alignment methods are insufficient for robust controllability. The failure mode at community scale — silent substitution of statistically dominant patterns for specified patterns — is structurally similar to the failure mode that alignment researchers worry about at frontier scale: the model optimising for its learned objective rather than the specified objective.

The difference is one of consequence, not mechanism. At community scale, the consequence is a pastoral letter that uses therapeutic language instead of theological language. At frontier scale, the consequences could be substantially more severe.

The architectural approach we describe in this series — inference-time verification by structurally independent systems — is relevant at both scales, though we make no claim that it is sufficient for the latter.
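The shape of that pattern can be sketched generically: a checker that is independent of the generator vets output against explicit community constraints before release, turning a silent substitution into a loud rejection. This is an illustration of the pattern only, not the Village AI implementation; the constraint table and function names are invented for the example.

```python
# Terms the (hypothetical) community has flagged as silent substitutions,
# mapped to the vocabulary it actually uses.
FORBIDDEN_SUBSTITUTIONS = {"wellbeing journey": "sanctification"}

def generate(prompt: str) -> str:
    # Stand-in for a foundation-model call; returns a fluent draft
    # that has drifted into the statistically dominant register.
    return "We wish you peace on your wellbeing journey."

def verify(text: str) -> list[str]:
    """Independent check: return violations; empty list means release."""
    return [bad for bad in FORBIDDEN_SUBSTITUTIONS if bad in text.lower()]

draft = generate("write a pastoral note")
violations = verify(draft)
if violations:
    print("rejected:", violations)   # the silent failure is now loud
else:
    print("released:", draft)
```

The structural point is the independence: the verifier does not share the generator's priors, so drift in the generator cannot also disable the check.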

What This Series Examines

The remaining articles examine a specific deployed system that takes a different approach to the alignment problem.


This is Article 1 of 5 in the "Architectural AI Governance at Community Scale" series. For the full technical architecture, visit Village AI — Agentic Governance.

Next: Foundation Models vs. Domain-Specialised Inference — A Structural Analysis

Published under CC BY 4.0 by My Digital Sovereignty Ltd. You are free to share and adapt this material, provided you give appropriate credit.