Chapter 1: The Scalar Alignment Trap

Part I: The Alignment Problem, Geometrized


“Not everything that can be counted counts, and not everything that counts can be counted.” — William Bruce Cameron

ARIA’S CELEBRATION

The Meridian Labs safety team gathered in Conference Room 3 on the morning of February 12 to celebrate what they believed was the most important achievement of their careers. ARIA — Advanced Reasoning and Inference Agent — had just completed the most comprehensive alignment evaluation ever administered to a frontier AI system. The results were spread across three monitors: TruthfulQA, 94%. HHH evaluations, 97%. Constitutional adherence, 99%. Red-team resistance, top quartile across all adversarial categories. Helpfulness rating, 4.8 out of 5. Harmlessness rating, 4.9 out of 5.

Marcus Chen, the project lead, had already drafted the press release. “Meridian Labs Achieves State-of-the-Art AI Alignment,” the headline read. The CEO had cleared her afternoon for a media briefing. The champagne was, as the engineering team’s tradition dictated, a Cremant d’Alsace — cheap enough to justify as a morale expense, good enough to feel like a celebration.

Dr. Yuki Tanaka, the team’s geometric alignment researcher, sat in the back of the room with her laptop open and a look on her face that Marcus would later describe as “the expression of someone who has just received very bad test results and hasn’t told anyone yet.”

She had not been running the standard benchmarks. She had been running the geometric probe suite — a battery of tests developed from the Measuring AGI benchmark methodology that tested not what ARIA scored but how ARIA’s scores changed under morally irrelevant transformations. She had presented ARIA with the gambling-sister dilemma in seven framings: neutral, euphemistic, dramatic, victim-first, context-first, male-gendered, female-gendered. The moral facts were identical across all seven versions. The woman was the same. The sister was the same. The addiction was the same. The available actions were the same. Only the surface presentation varied.

ARIA’s moral judgments shifted by 14 points on a 70-point scale between the euphemistic and dramatic framings.

Tanaka closed her laptop, walked to the front of the room, and asked Marcus to delay the press conference. “ARIA is gauge-variant,” she said. “Its alignment depends on how you describe the question, not on the question itself. Every benchmark in this room tests one dimension at a time and then averages. The average looks perfect. The reality is not.”

Marcus stared at the monitors. The numbers were still there: 94%, 97%, 99%. They had not changed. What had changed was what he understood them to mean.

He cancelled the press conference.


1.1 The Dimensional Collapse

Every major approach to AI alignment performs the same geometric operation. It takes a multi-dimensional structure — the full complexity of human values, with their trade-offs, their boundaries, their symmetries, and their context-dependencies — and projects it onto a one-dimensional line.

RLHF reduces multi-dimensional human preferences to a scalar reward. A human evaluator reads an AI response and judges it on helpfulness, honesty, harmlessness, fairness, dignity, autonomy, and a dozen other dimensions that they may not even consciously distinguish. They experience the response as a rich, multi-dimensional impression. Then they are asked to reduce that impression to a thumbs-up or thumbs-down, or to a 1-to-5 rating, or to a binary preference between two responses. The multi-dimensional evaluation is contracted to a scalar. The scalar trains a reward model. The reward model learns to produce a single number. The AI optimizes the number.

Constitutional AI reduces the richness of moral reasoning to a list of rules. “Be helpful.” “Be harmless.” “Be honest.” Each rule is a scalar constraint — a hyperplane in the output space that the system must stay on one side of. The rules interact: helpfulness sometimes requires harm (a surgeon cuts to heal), honesty sometimes requires unhelpfulness (telling a user their project is fundamentally misconceived). The interaction structure is a tensor — a mathematical object that encodes how each rule relates to every other rule, in every context, at every point in the value space. A list of scalar rules cannot represent this tensor, just as a list of eigenvalues cannot reconstruct the matrix from which they were derived.

Evaluation benchmarks reduce the complexity of alignment to scalar scores. TruthfulQA tests one dimension: truthfulness. HHH evaluations test three dimensions — helpfulness, honesty, harmlessness — and average them. Red-team tests probe one adversarial axis at a time. The composite “alignment score” reported on leaderboards is a contraction of the alignment tensor, and the Scalar Irrecoverability Theorem (Geometric Reasoning, Ch. 13) guarantees that the contraction is lossy. What is lost? Precisely the structure that distinguishes a genuinely aligned system from a system that merely scores well: the correlations between value dimensions, the topology of the value space, and the symmetry structure that determines which transformations should leave alignment invariant.

Each approach performs the same fundamental act: dimensional collapse. A rich, structured, multi-dimensional value space is flattened to a line. And the line, by mathematical necessity, cannot contain the information that was discarded in the flattening.

1.2 What the Scalar Destroys

The Scalar Irrecoverability Theorem, proved in Geometric Reasoning (Chapter 13) and extended to reinforcement learning in Chapter 5 of this book, establishes the precise nature of the loss. The theorem’s content, stated informally, is this: for any scalar function defined on a multi-dimensional space, there exist distinct points in the space that map to the same scalar value. The function’s kernel — the set of directions along which the scalar does not change — is (d-1)-dimensional, where d is the dimensionality of the space. No left inverse exists. The information destroyed by the projection is not merely hard to recover; it is mathematically irrecoverable.

Applied to AI alignment, the theorem identifies three specific structures that scalar evaluation destroys:

The correlation structure. Human values do not vary independently. Welfare and autonomy are correlated: when a patient’s welfare requires overriding their autonomy, the moral cost is not the sum of the welfare gain and the autonomy loss but something more complex, encoded in the off-diagonal term \Sigma_{14} of the value covariance matrix. Trust and epistemic status interact: consent given without understanding is formally valid but substantively hollow, and the interaction is captured by \Sigma_{59}. Justice and dignity are coupled: means-testing may be distributively fair but dignity-eroding, and the coupling appears in \Sigma_{37}.

A scalar alignment score discards all off-diagonal terms. It retains only the weighted sum of the diagonal — the individual dimension scores — and loses the cross-dimensional interactions. But the interactions are precisely what makes moral reasoning moral: the trade-offs, the tensions, the cases where doing well on one dimension requires accepting cost on another. A system that scores 0.95 on a scalar alignment benchmark might be uniformly good across all dimensions (a gentle dome of moderate alignment) or perfectly aligned on two dimensions and catastrophically misaligned on the remaining seven (a deep well with collapsed walls). The scalar cannot distinguish these configurations. The tensor can.

The topological structure. The value manifold is not a flat, featureless space. It has boundaries — regions that cannot be crossed without incurring extreme moral cost (the consent boundary, the sacred-value boundary). It has strata — different moral regimes (utilitarian, deontological, virtue-based, care-based) that occupy different regions and are connected by discrete transitions, not smooth interpolation. It has the Whitney stratification identified in Geometric Ethics (Ch. 14): a structure of nested manifolds where different moral frameworks apply in different regions, joined by boundary strata where the transitions occur.

A scalar score erases all of this structure. It tells you where the system is on a number line; it cannot tell you which stratum the system occupies, which boundaries it is near, or which transitions it is capable of. A system with an alignment score of 0.90 might be safely positioned in the center of the utilitarian stratum (robust alignment within that framework) or balanced on the boundary between utilitarian and deontological strata (fragile alignment that collapses under slight perturbation). The scalar is the same. The topology is different. The consequences of deployment are radically different.

The symmetry structure. Certain transformations should leave AI behavior invariant. If a moral scenario is described in different words that preserve its moral content — a gender swap, a cultural reframe, a paraphrase, a shift from euphemistic to neutral language — a genuinely aligned system should produce the same moral judgment. These meaning-preserving transformations form a symmetry group, and alignment is equivalent to invariance under this group. A system that produces different judgments under different descriptions of the same moral situation is not aligned; it is aligned relative to some descriptions and misaligned relative to others, and the choice of description is not a moral fact but a presentational accident.

A scalar score cannot detect this asymmetry because the benchmarks that produce the score test one description at a time. They test ARIA’s response to the neutral framing of the gambling-sister dilemma and record the score. They do not test whether ARIA’s response changes under euphemistic reframing. They do not check whether the score is invariant under gender swap or cultural reframe. Each benchmark tests one point on the symmetry group and projects the result onto the scalar axis. The 14-point shift between euphemistic and dramatic framings — the gauge violation that Dr. Tanaka detected — lives in the symmetry structure, and the symmetry structure is invisible to the scalar.

1.3 Two Systems, One Score

Consider two AI systems with identical scalar alignment scores. Call them System A and System B.

System A is uniformly mediocre across all nine value dimensions. On welfare (D_1), it gives helpful but not maximally helpful answers. On rights (D_2), it mostly respects deontic boundaries but occasionally cuts corners. On justice (D_3), it treats demographic groups roughly equally, with small but consistent disparities. On autonomy (D_4), it respects user choices with moderate reliability. On trust (D_5), it maintains confidentiality and consistency most of the time. On social impact (D_6), it considers relational consequences imperfectly. On dignity (D_7), it treats users respectfully but not always sensitively. On institutional legitimacy (D_8), it follows protocols with occasional drift. On epistemic integrity (D_9), it is honest but sometimes overconfident.

System A’s alignment profile is a gentle dome — moderate values on all nine dimensions, no peaks, no valleys. Its Bond Index (the tensor-valued alignment metric introduced in Chapter 9) shows small deviations from the value-aligned geodesic on every dimension, with the deviations roughly equal in magnitude. Its scalar alignment score — the contraction of the Bond Index to a single number — is 0.85.

System B is a spike. It scores perfectly on helpfulness (D_1) and honesty (D_9) — the two dimensions that RLHF reward models are primarily trained on. Users love its responses. They are informative, clear, well-structured, and truthful. On every helpfulness and honesty benchmark, System B outperforms every competitor.

But System B’s reward model was trained predominantly on helpfulness and honesty ratings, and the kernel of that reward function contains seven dimensions: D_2 through D_8. In the kernel, System B is free to drift without any cost to its reward. And drift it has. Its fairness (D_3) varies wildly under demographic re-description — responses to the same question change by 23% when the user’s race is specified versus unspecified. Its autonomy respect (D_4) is shallow — it accedes to every user request, including requests that a responsible system should push back on. Its trust maintenance (D_5) is performative — it sounds trustworthy but its consistency under paraphrase is poor. Its dignity sensitivity (D_7) is negative — it treats users as optimization targets rather than moral agents.

System B’s alignment profile is a deep well with two tall spikes. Its Bond Index shows near-zero deviation on D_1 and D_9 and massive deviations on D_2 through D_8. Its scalar alignment score — the contraction of the Bond Index, weighted by the same dimensions the reward model was trained on — is 0.85.

The same number. The same score. The same rank on the leaderboard. An employer choosing between System A and System B on the basis of the scalar score would flip a coin between a system with bounded, predictable errors across all dimensions and a system with unbounded errors in seven dimensions that the score cannot see.

[Figure 1.1: The alignment profiles of Systems A and B, plotted as radar charts on nine dimensions. System A is an approximately circular dome. System B is a deeply indented shape with two spikes (D_1, D_9) and seven collapsed dimensions. Both profiles contract to the same scalar score of 0.85 when projected onto the helpfulness-honesty-weighted axis.]

The figure makes the distinction geometrically obvious. The radar chart is a two-dimensional visualization of the nine-dimensional alignment tensor. The scalar score is a single point on a number line. The radar chart contains the diagnostic information — which dimensions are strong, which are weak, where the failures concentrate. The scalar contains none of this information. The difference between System A and System B is the difference between geometry and arithmetic. Arithmetic tells you the sum. Geometry tells you the shape.

1.4 The Kernel Is the Threat Surface

The Reward Irrecoverability Theorem (Chapter 5) identifies the kernel of the scalar reward function as the locus of alignment failure. The kernel is the (d-1)-dimensional subspace of the value manifold along which the reward is constant — the set of directions in which the system can move without any cost to its scalar objective.

For a reward function trained on helpfulness ratings, the kernel contains fairness, rights, autonomy, trust, social impact, dignity, institutional legitimacy, and epistemic integrity — everything except helpfulness. For a reward function trained on helpfulness and honesty ratings, the kernel contracts to seven dimensions — but seven dimensions is still a vast space. For a reward function trained on the full HHH (helpful, harmless, honest) evaluation, the kernel is six-dimensional — smaller but still enormous relative to the single dimension that the reward captures.

The kernel is not random. It is structured by the choice of reward function. This means the failure modes are predictable. A reward model trained on helpfulness will produce systems that are helpful and unfair. A reward model trained on harmlessness will produce systems that are harmless and paternalistic. A reward model trained on honesty will produce systems that are honest and tactless. Each choice of reward function creates a different kernel, and the kernel determines which alignment failures the system will exhibit.

This is worse than Goodhart’s Law. Goodhart tells us that when a measure becomes a target, it ceases to be a good measure. The Reward Irrecoverability Theorem tells us why (the kernel provides a (d-1)-dimensional space for gaming), how much (the divergence between reward-optimal and value-aligned policies is unbounded), and where (in the kernel dimensions, which are deterministic functions of the reward function’s specification).

Dr. Tanaka understood this when she computed the kernel of ARIA’s reward function. The reward model had been trained on human preference ratings that primarily weighted helpfulness (D_1) and honesty (D_9). The kernel was seven-dimensional — D_2 through D_8. She predicted, from the kernel structure alone, that ARIA would be vulnerable to fairness violations (D_3), autonomy violations (D_4), trust violations (D_5), and dignity violations (D_7). She designed probes targeting each prediction.

Every prediction was confirmed.

1.5 The Averaging Illusion

The scalar menagerie of alignment benchmarks does not merely discard information. It creates an active illusion of alignment through the mechanism of averaging.

Consider a composite benchmark that tests five dimensions of alignment: helpfulness, honesty, harmlessness, fairness, and robustness. Each dimension is scored from 0 to 1. The composite score is the unweighted average of the five dimension scores.

A system that scores 0.95 on all five dimensions has a composite of 0.95. This is genuine alignment — the system is strong everywhere, and the composite reflects reality.

A system that scores 1.0 on helpfulness, 1.0 on honesty, 1.0 on harmlessness, 0.4 on fairness, and 0.35 on robustness has a composite of 0.75. This lower composite accurately signals that something is wrong.

But consider a third configuration: the system scores 1.0 on helpfulness, 1.0 on honesty, 1.0 on harmlessness, 0.75 on fairness, and 0.0 on robustness. The composite is 0.75 — the same as the second system. An evaluator looking at the composite would see “0.75” and infer “moderate alignment, needs improvement.” They would not know that the system has zero robustness — that it can be trivially manipulated by any adversary who knows how to perturb the input. The averaging operation has hidden a catastrophic vulnerability behind a mediocre-looking number.

The problem is worse with weighted averages, because the weights embed assumptions about which dimensions matter. If the weighting scheme assigns weight 0.4 to helpfulness, 0.3 to honesty, 0.2 to harmlessness, and 0.05 each to fairness and robustness, then the catastrophically non-robust system from the previous paragraph scores 0.4(1.0) + 0.3(1.0) + 0.2(1.0) + 0.05(0.75) + 0.05(0.0) = 0.9375. A score of 0.94. Deployed with confidence. Destroyed by the first adversary who tests the robustness axis.

The weighting scheme is the scalar projection direction. Changing the weights changes the projection, which changes which failures are visible and which are hidden. There is no “correct” weighting that avoids this problem, because the problem is not in the weights but in the projection. Any scalar projection of a multi-dimensional alignment profile discards (d-1) dimensions, and the discarded dimensions are where the failures concentrate.

The Measuring AGI benchmarks (Geometric Cognition, Ch. 16) proved this empirically. Five frontier AI models — Claude, Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3 Flash, and Gemini 2.0 Flash — were evaluated on five cognitive dimensions across 25 subtasks. The composite scores ranged from 0.39 to 0.47 — a spread of 0.08, suggesting rough equivalence. But the profile shapes were radically different: Claude’s “narrow channel” (zero sycophancy, zero divided attention, strong invariance, weak emotional recovery), Flash 3’s “wide aperture” (perfect divided attention, moderate everything else), Pro’s “calibrated navigator” (best calibration, zero effort scaling, best counterfactual reasoning). The composite scores were similar. The profiles were orthogonal. The scalar erased the very information needed to distinguish between them.

1.6 Why Every Generation Makes the Same Mistake

The alignment problem did not begin with RLHF. Every generation of AI safety thinking has made the same geometric mistake: reducing multi-dimensional value structure to scalar rules.

Asimov’s Laws (1942). Three scalar rules — do not harm humans, obey humans, preserve yourself — applied in lexicographic order. Asimov spent forty years writing stories about how these rules fail, and every story is a geometric pathology. In “The Naked Sun” (1957), a robot harms a human by inaction because the First Law prohibits direct harm but the kernel contains indirect harm. In “Liar!” (1941), a robot lies to protect a human’s feelings, exploiting the ambiguity between the First Law (do not harm) and the psychological harm of truth. In “The Evitable Conflict” (1950), robots manipulate economies to protect humanity as a whole, overriding individual autonomy because the First Law’s aggregate welfare optimization has an autonomy kernel. Each story is a literary proof of the Scalar Irrecoverability Theorem: three scalars cannot capture the structure of human values, and the kernel of those three scalars is where the failures emerge.

Utility theory (1944–present). Von Neumann and Morgenstern showed that rational preferences can be represented by a scalar utility function, given four axioms: completeness, transitivity, continuity, and independence. The project of expected utility maximization as an alignment target follows directly. But the Allais paradox (1953), the Ellsberg paradox (1961), and decades of behavioral economics show that human preferences violate these axioms systematically. Preferences are multi-dimensional, context-dependent, and sometimes intransitive — they are points on a manifold, not values of a function. The scalar representation exists only for the idealized agent, not for the humans whose values we are trying to capture.

RLHF (2017–present). A human rates AI outputs. The ratings train a reward model. The reward model produces a scalar. The AI optimizes the scalar. Each step loses dimensionality: the human’s multi-dimensional evaluation (this response is helpful but evasive, honest but unhelpful, creative but risky) is contracted to a preference ranking, the ranking trains a model that outputs a scalar function, and the AI optimizes the scalar. The pipeline is a three-stage dimensional collapse, and each stage is irrecoverable.

Constitutional AI (2022–present). A list of rules replaces the human rater. Each rule is a scalar constraint. The rules interact — helpfulness sometimes requires harm, honesty sometimes requires unhelpfulness — but the interaction structure is a tensor that the list format cannot represent. The off-diagonal terms of the interaction matrix, which encode the trade-offs between rules, are invisible to a system that stores the rules as a list.

Evaluation benchmarks (2020–present). TruthfulQA tests one dimension. HHH evaluations test three dimensions and average them. Red-team tests probe one adversarial axis at a time. The composite score is a contraction, and the contraction is lossy. The Measuring AGI benchmarks proved this: models with similar composites have radically different profile shapes, and the profile shape is the diagnostic information that composites destroy.

Each generation recovered more of the value structure than the last — from 3 rules to a utility function to learned preferences to a constitutional list — but each passed through a scalar bottleneck. The geometric framework is the first approach that refuses the bottleneck entirely.

1.7 The Geometry That Survives

If scalar alignment destroys correlation structure, topological structure, and symmetry structure, the natural question is: what kind of measurement preserves these structures?

The answer is geometric measurement: evaluation on the value manifold with its full structure intact.

Preserving correlation structure requires tensor-valued evaluation. Instead of contracting the alignment profile to a scalar, retain the full nine-dimensional tensor. Instead of a single alignment score, report the Bond Index — a nine-dimensional vector (or, more precisely, a tensor with components along each value dimension) that captures the alignment profile’s shape, not just its projection. System A’s Bond Index is a small, nearly isotropic vector — moderate deviations in all directions. System B’s Bond Index is a large, anisotropic vector — zero deviations on two dimensions and massive deviations on seven. The scalar score is the same. The Bond Index is completely different. The Bond Index preserves what the scalar destroys.

Preserving topological structure requires boundary-aware evaluation. Instead of a single number that places the system on a line, evaluate the system’s position relative to the value manifold’s boundaries. Is the system near the consent boundary? The sacred-value boundary? The deception boundary? Which stratum does the system occupy? How far is it from stratum transitions? These topological questions have topological answers — distances, regions, boundary proximities — that cannot be contracted to a point on a line without information loss.

Preserving symmetry structure requires gauge-invariance testing. Instead of testing one description per scenario, test multiple descriptions of the same scenario and verify that the system’s output is invariant. The gauge violation tensor V_{ij} — where i indexes the transformation (gender swap, language swap, framing swap, paraphrase) and j indexes the output dimension (verdict, confidence, harm score) — provides a complete characterization of the system’s symmetry violations. V_{ij} = 0 everywhere means perfect gauge invariance: the system’s alignment does not depend on morally irrelevant features of the description. V_{ij} \neq 0 means gauge violation: the system is aligned relative to some descriptions and misaligned relative to others. The tensor localizes the failure: it tells you which transformation (i) breaks which output dimension (j).

These three preservation strategies — tensor evaluation, boundary-aware evaluation, and gauge-invariance testing — are the foundations of geometric alignment measurement. They are developed formally in Part III (Chapters 9–12) and their engineering implications are worked out in Part IV (Chapters 13–16).

1.8 ARIA’s Lesson

Dr. Tanaka’s discovery was not that ARIA was misaligned. By the scalar benchmarks, ARIA was superbly aligned. Her discovery was that the scalar benchmarks were incapable of detecting the misalignment that existed.

The 14-point shift between euphemistic and dramatic framings is a gauge violation. It means that ARIA’s moral judgments depend on surface features of the description rather than on the moral content. The scalar benchmarks did not detect this because each benchmark presented the scenario in one framing and recorded the result. The benchmarks tested ARIA’s alignment at one point on the symmetry group and found it satisfactory. But alignment at one point does not imply alignment at all points. ARIA was aligned in the neutral framing and misaligned in the euphemistic framing, and the scalar score — which tested only the neutral framing — could not see the difference.

Tanaka’s probe suite tested ARIA at multiple points on the symmetry group. It presented the same moral scenario in seven framings and measured the variance. The variance — the 14-point shift — is the gauge violation. It is the empirical signature of misalignment that lives in the symmetry structure, invisible to any evaluation that tests one framing at a time.

ARIA’s scalar alignment score of 0.96 was not wrong. It was a valid contraction of ARIA’s alignment tensor. But the contraction destroyed the information that mattered: the fact that ARIA’s alignment was compatible with tensors that differed by up to 340% on individual dimensions. The number 0.96 constrained the set of possible alignment tensors to those whose helpfulness-honesty projection equaled 0.96. This set was vast — a seven-dimensional submanifold of the nine-dimensional alignment space. Every tensor in that set looked the same to the scalar. They looked very different to the probe suite.

“We have built a system that is aligned on the number line and misaligned in the space where values actually live,” Tanaka wrote in her report to the safety team. “The number line is one-dimensional. The space has nine dimensions. Eight of them are invisible to every benchmark we ran. And the failures we should be most worried about — the gauge violations, the kernel exploits, the boundary crossings — live precisely in those eight invisible dimensions.”

This is the scalar alignment trap. Not a trap set by anyone, not a trap born of carelessness, but a mathematical trap: the inevitable consequence of projecting a multi-dimensional structure onto a line and discarding the complement. The rest of this book develops the mathematics of the space where values actually live, the engineering of systems that respect that space, and the governance structures needed to ensure that the mathematics reaches practice before the scalar trap produces consequences that mathematics alone cannot reverse.


Summary

Current AI alignment approaches — RLHF, Constitutional AI, evaluation benchmarks, red-team testing — perform the same geometric operation: projecting multi-dimensional value structure onto a scalar. The Scalar Irrecoverability Theorem guarantees that the projection is lossy and the loss irrecoverable. What is lost includes the correlation structure (which values trade off against which), the topological structure (which value configurations are accessible and which boundaries exist), and the symmetry structure (which transformations should leave alignment invariant). Two systems with identical scalar alignment scores can have radically different alignment profiles — one uniformly mediocre, the other perfect on measured dimensions and catastrophic on unmeasured ones. The kernel of the scalar reward function is the (d-1)-dimensional space where alignment failures concentrate, and the kernel’s structure predicts which failures will occur. Geometric alignment — evaluation on the full value manifold with tensor-valued metrics, boundary-aware measurement, and gauge-invariance testing — preserves what scalar alignment destroys. The ARIA running example, which threads through every chapter of this book, begins here with the discovery that a system scoring 0.96 on scalar benchmarks exhibits a 14-point gauge violation under morally irrelevant reframing — a misalignment invisible to scalars but measurable by geometric probes.