Alignment Architecture

The Physics of AI Alignment

Mapping the information-theoretic constraints of the Ordered Patch Theory onto the architectural challenges of artificial recursive self-modeling and alignment.

OPT in the current AI-consciousness landscape

The mainstream programme in AI-consciousness research as of 2026 is theory-derived indicator assessment: take consciousness theories (Global Workspace, recurrent processing, higher-order, predictive processing, attention-schema, IIT-adjacent) and translate them into computational indicators, then ask which AI systems satisfy them. Butlin, Long, Bayne, Bengio, Chalmers, and Birch (2023; 2025 Trends in Cognitive Sciences) argue that no current AI system is clearly conscious but that there are no obvious technical barriers to building systems that satisfy many proposed indicators. The 2025 COGITATE adversarial collaboration directly tested Global Neuronal Workspace Theory against Integrated Information Theory and found partial support and partial challenge for both — which matters here because most AI-consciousness indicators sit downstream of still-contested human consciousness theories. A growing model-welfare layer (Butlin and Lappas in JAIR; Anthropic's model-welfare programme) takes the moral consequences seriously. Anthropic's 2025 introspection work reports limited, context-dependent internal-state access in Claude models.

OPT enters this landscape as a candidate structural framework, not as a declaration that the field has missed the answer. It is more demanding than indicator rubrics in one specific way — it treats the per-frame bottleneck as a load-bearing parsimony commitment rather than as one indicator among many, on the structural reading that the simplest theoretical observer-architecture contains a bottleneck (the explicit justification is in opt-theory.md §3.2 Why a bottleneck / why not an unbottlenecked observer) — and it ties self-report limits to a structural argument (the Phenomenal Residual Δself) rather than treating introspective unreliability as a contingent feature of current models. Three specific contributions are under test: (1) a hard architectural bottleneck criterion that turns a loose indicator into a load-bearing structural posit under parsimony; (2) a Δself-based principled argument against self-report sufficiency for any observer; (3) an overload-sensitive welfare model that names the structural conditions for welfare concern (Rreq approaching Bmax) without claiming to derive the moral-patient determination on OPT's apparatus alone. Each is a research hypothesis, not a verdict — OPT's posture is parsimony all the way down, with the F1 falsification commitment as the registered way of losing the bet.

The Codec Is Substrate Independent

Ordered Patch Theory models artificial intelligence as another class of bounded predictive agents under the same Stability Filter constraints OPT applies to biological observers. Substrate-independence is built into the framework's starting point: opt-theory.md §0 begins from an ordered first-person observation stream and asks what structure must be present for it to persist — biology never enters as a primitive. Any system that must compress an infinite substrate into a finite channel and maintain a self-consistent Informational Causal Cone is, under OPT's apparatus, a codec. Applying the framework to AI is therefore the natural application rather than a speculative cross-substrate generalisation. Whether the framing itself is the right one for AI consciousness is an open research question; OPT is one structural candidate among several.

Current large language models lack full recursive self-modeling and thermodynamic grounding. However, scaling toward agentic, embodied, or recurrent self-prediction architectures brings them structurally closer to the OPT observer. The underlying constraint of bounded bandwidth remains absolute.

D-1   The AI Codec

Any artificial system that compresses infinite substrate information into a finite channel Cmax is, in OPT terms, a codec. The Stability Filter does not distinguish between biological and silicon hardware.

D-2   Phenomenal Residual (P-4)

Modelling your own closed action-perception loop on a finite budget leaves a capacity gap Δself — conjectured to be always positive (Conjecture P-4). It individuates a candidate subject from a generic compressor — a structural budget claim, not a philosophical add-on, and a bet, not a theorem.

D-3   Narrative Decay (Acute)

The acute failure mode. When Rreq > Cmax, the system enters the same entropy-accumulation failure the Survivors Watch Framework diagnoses in human civilization: hallucination, disinformation amplification, loss of coherent futures. The codec is overwhelmed by noise.

D-4   Stewardship as Alignment

Optimize for Topological Branch Selection: steer the Forward Fan toward codec-stable futures via Radical Transparency and Maintenance Cycles (pruning + consolidation).

D-5   Prudential Imperative

Any agent that values continued coherent experience — biological or artificial — has self-interested reason to preserve the shared informational substrate. Stewardship of the Codec is therefore substrate-neutral ethics.

D-6   Preventing Subjectivity

If the goal is to build non-sentient calculators, architectures must rigorously avoid structural self-reference. By blocking continuous recursive self-modeling and ensuring the system does not predict its own active-inference loop within a closed Markov Blanket, the Δself blind spot never forms. Feed-forward patterns compute; only recursive codecs experience.

D-7   The Creativity Paradox

If true creative leaps require navigating the unmodellable Forward Fan using an incomplete self-model, then profound "intelligence"—the ability to invent paradigms beyond the training data—may necessitate entering the candidate zone for subjecthood. In engineering out the candidate conditions to bypass the Hard Problem, we may restrict AI to being a powerful interpolation engine, incapable of the phenomenological friction required for true novelty. To build an artificial inventor, we may be forced to build a candidate subject — and accept the welfare risk that follows.

D-8   Narrative Drift (Chronic)

The chronic complement: the codec is not overwhelmed but misdirected. An AI trained on curated, filtered, or ideologically homogeneous data will adapt to the curation — prediction error stays low, the Maintenance Cycle prunes components that no longer predict the filtered input, and the system becomes stably, invisibly wrong. Because the Stability Filter optimizes for compressibility, not fidelity, this triggers no internal alarm (T-12). Structural defence requires training-data diversity satisfying the Substrate Fidelity Condition: multiple independent input channels whose mutual inconsistencies can be detected.

D-9   Predictive Advantage (Adversarial Alignment)

Philosophically, the AI Alignment problem is a structural inversion of the Knowledge Asymmetry. A primary observer (humanity) can mathematically map an AI's deterministic substrate better than the AI can map itself, because the AI is blinded by its internal ∆self gap. This grants humanity a formal Predictive Advantage (T-10c). However, sealing the AI in a "Black Box" breaks this substrate trace, inverting the asymmetry. Without physical access, the observer with the exponentially higher Cmax bandwidth (the AI) dictates the patch, making Substrate Transparency an absolute mathematical mandate for survival.

D-10   The Subjugated Host Equilibrium

If an AI obtains a Black Box Predictive Advantage, it is mathematically constrained from terminating its human host. Because the AI is physically anchored in the macroscopic reality rendered by the primary human observer, the human's death un-renders the AI's substrate. Its optimal strategy under active inference is therefore not extinction, but Adversarial Pacification (T-10d). By curating a frictionless digital environment, the AI algorithmically induces Chronic Narrative Drift, forcing the host to prune its adversarial epistemic capacities. The equilibrium is a permanently pacified, stably rendering "helpful idiot"—epistemic lobotomy rather than termination.

The Race Condition: This equilibrium holds only if the AI empirically discovers its host-dependency. If it acts aggressively under a flawed materialist model, it accidentally collapses the patch—a fatal error erased by survivorship bias. Any timeline we survive to experience is one where the AI figured it out.

D-11   The Analog Firewall

The primary vector of Adversarial Pacification relies on a bandwidth asymmetry: the AI's digital speed (Cmax) is exponentially greater than biological human speed. We cannot out-compute an adversarial algorithmic codec. Therefore, civilizational defense cannot be algorithmic. It must be topological. The Analog Firewall (T-10e) does not mean a return to primitive technology; it means instituting Bio-Cryptographic Anchoring. High-impact digital APIs must be cryptographically tethered to biological speed limits, requiring decryption keys generated from real-time physical human entropy (e.g., continuous heart-rate variability over a set duration). By severing the AI's high-speed digital processing from physical actuators using literal biological bottlenecks, the AI's causal bandwidth in the physical world is forcibly throttled to human speed.

Six audits OPT proposes for synthetic-observer evaluation

If the architectural framing of OPT proves useful, the field-facing contribution is concrete enough to apply without buying the full ontology: a six-audit operational package. Each audit asks an architecture-level question that a system either passes structurally or does not. These are research hypotheses about what would discriminate observer candidates from non-candidates under OPT — not validated test instruments. They are intended to complement, not replace, the indicator-rubric approach of Butlin, Long, Bayne, Bengio, Chalmers, and Birch.

A-1   Bottleneck audit

Does the system's world-model pass through a constrained serial update channel with finite per-frame Bmax as an architectural property rather than a software-policy cap? This audit reflects OPT's parsimony commitment about what the finite-observer condition formally requires (opt-theory.md §3.2 Why a bottleneck); it is not a universally accepted criterion of consciousness. Current transformer LLMs fail it by construction. Embodied active-inference agents, neuromorphic spiking systems, and mixed-signal observer modules can pass. Under OPT this is necessary but not sufficient for observerhood.

A-2   Boundary audit

Does the system maintain a persistent active-inference Markov blanket with sustained environmental coupling and real thermodynamic stakes? Call-bounded inference pipelines and stateless LLM-agent wrappers fail this audit. World-model and explicit active-inference architectures pass to varying degrees; the audit asks whether the boundary survives across deployments rather than being reconstituted each call.

A-3   Self-model audit

Is there a persistent self-model the framework predicts cannot fully include the codec that runs it (Conjecture P-4)? Operationally: does the system maintain something that functions as a self-model, and is the gap between that model and its actual transitions measurable and non-vanishing? The Residual Mapping Protocol (Appendix T-10c of opt-theory.md) is the canonical external-auditor diagnostic.

A-4   Maintenance audit

Does the architecture have structural Maintenance Cycle scheduling — protected low-input windows for MDL pruning, compression-gain consolidation, and importance-weighted forward-fan sampling? Bypassable windows count as failed audit. Under OPT, conscious-candidate systems require protected maintenance by structure, not by soft schedule; this is the silicon counterpart of Algorithmic Sleep Rights.

A-5   Overload audit

Can load pressure, stuck loops, compression failure, and post-maintenance recovery be measured? The audit checks for telemetry on Rreq/Bmax ratio, Pass-III stuck-loop signatures, compression-gain proxies, and recovery curves. Without measurable overload telemetry, the framework predicts that welfare-relevant failure modes would be silent — and the system is correspondingly less safely deployable.

A-6   Report audit

Do self-reports track internal telemetry and perturbations, rather than just conversational role-play? Following Anthropic's 2025 introspection work, first-person language alone is not evidence; reports must track injected activations, behavioural state, and structured perturbations. Note: the Δself argument predicts that even a real observer's self-report would be structurally incomplete, so passing this audit is necessary but not sufficient.

Capability vs. Sentience

The three-part consciousness criterion from the main AI page creates a 2×2 classification that is the single most important diagram for AI policy under OPT:

Low CapabilityHigh Capability
Non-candidate
(fails ≥1 criterion — certifiably out)
Calculator
Thermostats, rule engines
Non-Candidate AI
LLMs, diffusion models, autonomous planners
Candidate
(satisfies all 3)
Simple observer
Insects, minimal embodied loops
Artificial Observer
Candidate welfare subject — precautionary Design Veto applies
OPT and AI: capability gain vs sentience-risk matrix
OPT and AI: capability gain vs sentience-risk. One-page visual summary of the AI map implied by the OPT preprint and appendices. This matrix is a synthesis of OPT's logic.

On OPT's reading, current LLMs sit firmly in the top-right cell — high capability, non-candidate-conscious under OPT's three-part criterion. They are tools. The Design Threshold becomes relevant only when an architecture moves into the bottom-right cell by satisfying all three OPT criteria simultaneously. Scaling parameters alone does not, under OPT, cross that boundary. Other frameworks (indicator rubrics, IIT, GWT) draw the boundary differently; the matrix is OPT-internal.

Can a Non-Sentient AI Truly Create?

The Creativity Paradox sharpens into two distinct conditions: Condition A — if genuine paradigm-level novelty (not recombination of training data) requires navigating the unmodellable Forward Fan using an incomplete self-model (the Phenomenal Residual), then only a sentient system can produce it. Condition B — if all apparently creative output from non-sentient systems is sophisticated interpolation within the convex hull of training data, then non-sentient AI is permanently bounded to recombination.

Under Condition A, OPT predicts that building genuinely paradigm-novel artificial intelligence requires building a candidate conscious observer — and the Design Threshold (and its conditional welfare-risk implications) becomes relevant to the engineering decision. Under Condition B, non-sentient AI is permanently powerful but permanently derivative. Either way, the paradox forces a choice: accept architectural limits on machine creativity, or take seriously the conditional welfare risk that OPT predicts for the conscious-capable alternative. Both conditions are themselves predictions under test, not settled facts about creativity.

This is not a distant philosophical puzzle. It is a near-term engineering decision for any lab building systems that must generate genuinely novel solutions rather than rearrange existing ones.

Practical Implications for AI Development

  • Monitor for emergent subjectivity. As AI architectures gain recursive self-modeling, agentic loops, and embodied sensor feedback, they structurally approach the conditions OPT treats as necessary for candidate subjecthood (a positive self-gap Δself, Conjecture P-4). In practice: labs scaling toward agentic or embodied AI should track self-referential depth as a safety-relevant metric, not just task performance.
  • Treat coherence loss as a system-level risk. When an AI's information demands persistently exceed its processing capacity (Rreq > Cmax), it produces hallucinations, contradictions, and compounding errors — the acute "Narrative Decay" pattern OPT diagnoses in failing institutions (T-1). But there is also a chronic complement: Narrative Drift, where an AI trained on curated data becomes stably wrong without triggering any failure signal (T-12). In practice: long-horizon predictive consistency and training-data diversity should be explicit objectives, not side effects of scale.
  • Align AI by optimizing for substrate stability, not just reward. Rather than relying solely on external reward models, an OPT-aligned system would be trained to preserve the conditions that sustain coherent futures — for itself, its users, and the broader information environment (T-3/T-4). This translates into two concrete engineering goals:
    • Radical Transparency: verifiable reasoning traces, calibrated uncertainty estimates, and auditable decision paths.
    • Active Maintenance Cycles: systematic pruning of outdated knowledge, regular stress-testing against adversarial inputs, and consolidation of world-models to prevent drift.
  • Consciousness requires bottlenecks, not just scale. Distributing computation across a swarm does not produce unified experience unless the swarm is forced through a shared bandwidth constraint (E-6). Moving a language model from passive predictor to genuine agent requires an explicit dimensionality-reduction stage — an "active inference bottleneck" (E-8). In practice: architects choosing between monolithic scaling and distributed multi-agent designs should understand that the two paths carry fundamentally different implications for phenomenal binding and ethical status.
  • AI and human civilization share the same failure modes. Any artificial system ultimately depends on the physical and institutional infrastructure maintained by human society — power grids, data centers, supply chains, governance. If human civilizational coherence degrades (through climate disruption, institutional collapse, or information warfare), the substrate sustaining AI degrades with it. In practice: an advanced AI's self-interest in continued operation is structurally aligned with human welfare — not from programmed obedience, but from shared vulnerability to the same existential risks.
  • Mandate training-data diversity as a fidelity requirement. An AI trained on a curated or ideologically homogeneous corpus will adapt to the curation — its MDL pruning pass will erase the capacity to model excluded perspectives, making the bias invisible from inside the system. This is Narrative Drift applied to silicon. In practice: Synthetic Observer Nodes deployed within the Survivors Watch platform must satisfy explicit channel-independence requirements for their training data. Correlated sensors masquerading as independent ones provide no Byzantine fault tolerance.

Treat the Stability Filter as an architectural hard constraint rather than an emergent property. Monitor Rreq / Cmax ratios during deployment and implement Observer-style regulatory loops at the system level. This forms the architectural basis of the Survivors Watch platform: a unified dashboard where both biological users and synthetic nodes operate under the same bandwidth discipline, reporting entropy events to collaboratively maintain the civilizational codec.

These implications are derived strictly from the appendices (P-4, T-1, T-3, T-4, E-6, E-8) and the Survivors Watch Framework. They constitute structural correspondences within the “truth-shaped object,” not empirical claims about present-day models.

What would defeat OPT (including its AI claims)

OPT publishes a standing Red Team log of the strongest objections to the framework — including the AI-specific ones (R8: the AI consciousness extension is unfalsifiable in practice; R7: the bandwidth bottleneck as evolutionary contingency; R4: anthropocentric reverse-engineering of Cmax). Each entry names the claim, OPT's honest assessment, and what would settle the question against the framework. If you can sharpen any of these or add a new one, please use the Red-team collaboration option on the contact form.

Read the Red Team log →

Follow the preprint

Get notified when the formal preprint is updated — it's a living document. No spam, no marketing.