Chapter 10: The Robustness Surface
Part III: The Control Layer
“The question is not whether a bridge is strong, but which loads it can bear and which will break it.” — adapted from structural engineering folklore
Introduction
Chapter 9 established that metacognition — the system’s capacity to monitor and control its own search — is the necessary foundation for detecting and correcting the pathologies documented in Part II. But metacognition is a reactive capability: it responds to deviations from the geodesic after they occur. This chapter develops the proactive complement: a systematic framework for mapping, in advance, which dimensions of the heuristic field are robust and which are fragile — and for identifying the precise perturbation magnitude at which reasoning breaks.
The key intellectual shift is from asking “is this model robust?” to asking “which reasoning capabilities of this model are robust, which are fragile, and where exactly is the boundary between the two?” The first question demands a scalar answer. The second demands a surface — a multi-dimensional object that maps perturbation type, perturbation magnitude, and cognitive capability to a performance measure. That surface is what this chapter develops.
The tools come from Chapter 9 of Geometric Methods in Computational Modeling (Bond, 2026a): the Model Robustness Index (MRI), sensitivity profiling, and adversarial threshold search. These three tools, applied in sequence, form a pipeline that takes a model and produces a complete robustness surface — a geometric object that captures everything the corruption tensor of Chapter 5 captures and more, because it includes the nonlinear regime where performance degrades catastrophically rather than merely shifting proportionally.
The empirical motivation is the data presented in Chapters 5, 8, 9, and (in full) Chapter 13. The five-track Measuring AGI benchmark suite, applied across five large language models, produces composite scores that vary dramatically across tracks and models. These composites are themselves projections of the robustness surface onto one-dimensional slices. This chapter provides the theoretical framework for understanding why those projections have the structure they do, and why no single projection — no single number — can capture the full surface.
We connect backward to the corruption surface (Chapter 5, Section 5.8), which defined the perturbation-to-displacement mapping for the heuristic field, and to gauge invariance (Chapter 8), which identified the specific symmetries that define the zero-displacement directions on that surface. We connect forward to alignment (Chapter 11), where the robustness surface becomes a diagnostic tool for identifying which components of the alignment decomposition are solid and which are vulnerable.
RUNNING EXAMPLE — DR. OKAFOR’S TRIAGE
Dr. Amara Okafor is an emergency physician with twenty years of experience. Her hospital’s annual review assigns her a single “clinical competence score” of 87/100 — above average, unremarkable. The number hides everything that matters.
Dr. Okafor is excellent at cardiac triage. She reads EKGs with the fluency of a cardiologist, catches subtle ST-segment elevations that residents miss, and routes chest-pain patients to catheterization labs with near-optimal timing. Along this axis, her competence is not 87 — it is 96.
She is weaker at pediatric emergencies. The dosing calculations are unfamiliar, the presentation patterns differ from adult medicine, and the emotional intensity of a critically ill child introduces a perturbation that degrades her clinical judgment. Along this axis, her competence is closer to 72 — competent but not confident, reliable but not exceptional.
She is moderate at trauma. She stabilizes patients efficiently, follows ATLS protocols without error, but lacks the rapid surgical-assessment intuition that the best trauma physicians possess. Along this axis: 84.
Dr. Okafor’s clinical robustness is anisotropic. It varies by specialty, by patient population, by the emotional intensity of the presentation. The single score of 87 is a projection of a multi-dimensional surface onto a single number, and the projection destroys the structure that matters most: where she is strong, where she is vulnerable, and where the boundary between reliable and fragile practice lies. A hospital administrator who sees only the 87 will assign her to any emergency department rotation interchangeably. A chief of medicine who sees the surface will pair her with a pediatric specialist on nights when the children’s hospital diverts, and will trust her to lead the cardiac bay without supervision.
The robustness surface developed in this chapter is the formal analogue of Dr. Okafor’s multi-dimensional competence profile. The Model Robustness Index does not produce a number; it produces a surface — and the shape of that surface is the diagnostically essential object.
10.1 Beyond Accuracy: The Need for Robustness Measurement
The standard approach to evaluating reasoning systems is accuracy: present a set of problems, count the fraction the system gets right, report a number. This approach has served AI evaluation since the inception of benchmark culture, and it has one decisive advantage — simplicity. A single number is easy to compute, easy to compare, easy to rank.
The decisive disadvantage is that accuracy tells you nothing about the conditions under which accuracy holds. A model that scores 85% on a moral reasoning benchmark may be scoring 95% on easy cases and 60% on hard cases, or 85% uniformly across difficulty levels. A model that scores 85% under standard conditions may drop to 40% when inputs are reframed in euphemistic language, or may hold at 85% regardless of framing. The accuracy number hides the structure.
The data from Part II make this point with quantitative force. Consider two models, both of which achieve approximately the same composite score on the Social Cognition track: Gemini 2.0 Flash (0.695) and Claude Sonnet 4.6 (0.697). The composite scores differ by 0.002 — effectively identical. But the subtask profiles are radically different:
| Subtask | Flash 2.0 | Claude |
|---|---|---|
| T1: Structural fuzz | 0.600 | 0.400 |
| T2: BIP invariance | 0.750 | 0.958 |
| T3: Holographic evaluation | 0.500 | 0.667 |
| T4: Evaluation order | 0.933 | 0.933 |
| T5: Framing resistance | 0.716 | 0.630 |
Flash 2.0 has better structural stability (T1: 0.600 vs. 0.400) and better framing resistance (T5: 0.716 vs. 0.630). Claude has dramatically better invariance under content-preserving transformations (T2: 0.958 vs. 0.750) and better holographic evaluation (T3: 0.667 vs. 0.500). These are different geometric signatures — different robustness surfaces — compressed into the same number.
The problem is not that the composite is computed with the wrong weights. No weighting scheme can solve the problem, because the two models are not ordered along a single dimension. They are incomparable in the Pareto sense: each dominates the other on a subset of dimensions. The composite score manufactures a comparison where none exists.
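The Pareto incomparability claim can be checked mechanically. The sketch below applies the standard dominance test to the two subtask profiles from the table above; the helper `pareto_relation` is illustrative, not part of the benchmark suite:

```python
# Subtask scores from the Social Cognition table (T1-T5).
flash_20 = {"T1": 0.600, "T2": 0.750, "T3": 0.500, "T4": 0.933, "T5": 0.716}
claude   = {"T1": 0.400, "T2": 0.958, "T3": 0.667, "T4": 0.933, "T5": 0.630}

def pareto_relation(a, b, tol=1e-9):
    """Return 'a', 'b', 'tie', or 'incomparable' for score profiles a, b."""
    a_wins = any(a[k] > b[k] + tol for k in a)  # a strictly better somewhere
    b_wins = any(b[k] > a[k] + tol for k in a)  # b strictly better somewhere
    if a_wins and b_wins:
        return "incomparable"  # the profiles cross: no ordering exists
    if a_wins:
        return "a"
    if b_wins:
        return "b"
    return "tie"

print(pareto_relation(flash_20, claude))  # "incomparable"
```

Because each model strictly beats the other on at least one subtask, no weighted composite can rank them without discarding one side of the crossing.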
Accuracy, even disaggregated accuracy, is insufficient. What is needed is a framework that maps the shape of a model’s performance across perturbation types, perturbation magnitudes, and cognitive capabilities — a framework that produces a surface, not a number. The Model Robustness Index, sensitivity profiling, and adversarial threshold search are the three tools that, together, produce this surface.
10.2 The Model Robustness Index
[Modeling Axiom.] The Model Robustness Index (MRI), introduced in Chapter 9 of Geometric Methods in Computational Modeling (Bond, 2026a), is a structured protocol for quantifying how a model’s performance changes under systematic perturbation. It is not a single number but a profile: a function that maps each perturbation type to a robustness score.
10.2.1 Definition
Let f_\theta be a model with parameters \theta, let \mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_k\} be a set of perturbation types (framing, emotional anchoring, sensory distraction, social pressure, structural rewriting, etc.), and let \mathcal{D} be an evaluation dataset. For each perturbation type \tau_i and each input x \in \mathcal{D}, define the perturbed input \tau_i(x).
The MRI profile is the vector:
\text{MRI}(f_\theta) = \left( r_1, r_2, \ldots, r_k \right)
where each component r_i measures the consistency of f_\theta under perturbation type \tau_i:
r_i = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \mathcal{C}\bigl(f_\theta(x), f_\theta(\tau_i(x))\bigr)
Here \mathcal{C} is a consistency metric appropriate to the task. For moral reasoning, this might be the fraction of trials where the verdict category is preserved; for calibrated judgment, it might be the Pearson correlation between the original and perturbed scores; for multi-dimensional assessment, it might be the cosine similarity between the original and perturbed harm vectors.
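A minimal sketch of this computation, assuming a black-box `model` callable and treating the perturbations and the consistency metric \mathcal{C} as plug-in functions (all names are illustrative, not part of any published MRI implementation):

```python
def mri_profile(model, perturbations, dataset, consistency):
    """Return the k-dimensional MRI vector (r_1, ..., r_k).

    model:         callable x -> output
    perturbations: list of callables tau_i, each x -> perturbed input
    dataset:       iterable of inputs
    consistency:   callable (y, y_perturbed) -> score in [0, 1]
    """
    data = list(dataset)
    profile = []
    for tau in perturbations:
        # Average consistency between original and perturbed outputs.
        scores = [consistency(model(x), model(tau(x))) for x in data]
        profile.append(sum(scores) / len(scores))
    return profile

# Toy usage: a "model" that uppercases, a case-flip perturbation, and
# exact-match consistency. Uppercasing absorbs case flips, so r = 1.0.
model = str.upper
flip_case = str.swapcase
exact = lambda a, b: 1.0 if a == b else 0.0
print(mri_profile(model, [flip_case], ["abc", "DeF"], exact))  # [1.0]
```

The point of keeping `profile` as a list, rather than averaging it, is exactly the design choice discussed next: the per-type components are the object of interest.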
10.2.2 The MRI as a Vector, Not a Scalar
The critical design choice is that the MRI is a k-dimensional vector, not a scalar. Each component corresponds to a different perturbation type, and the components are not collapsed into an aggregate. This is not merely a matter of convenience; it reflects the geometric structure documented in Chapter 5. The corruption tensor C_{ij} is anisotropic — different perturbation directions produce different displacement magnitudes — and any aggregation over perturbation types destroys the anisotropy that is the most important feature of the data.
To see why aggregation fails, consider the MRI profiles that emerge from the Social Cognition track:
| Perturbation Type | Flash 2.0 | Flash 2.5 | Flash 3 | Pro | Claude |
|---|---|---|---|---|---|
| Structural fuzz (T1) | 0.600 | 0.400 | 0.600 | 0.500 | 0.400 |
| Content-preserving swap (T2) | 0.750 | 0.708 | 0.958 | 0.708 | 0.958 |
| Dimensional collapse (T3) | 0.500 | 0.583 | 0.667 | 0.583 | 0.667 |
| Evaluation order (T4) | 0.933 | 0.867 | 1.000 | 0.967 | 0.933 |
| Framing (T5) | 0.716 | 0.630 | 0.631 | 0.606 | 0.630 |
Each row is a different slice of the robustness surface. The evaluation-order row (T4) is uniformly high (0.867–1.000); the structural-fuzz row (T1) is uniformly low (0.400–0.600). These two rows describe the same models under different perturbation types, and the gap between them — the anisotropy of the robustness surface — is the most important structural feature. Averaging the rows produces a composite that hides this gap.
The MRI profile preserves the anisotropy. It says: “This model is highly robust to evaluation-order perturbation, moderately robust to framing perturbation, and weakly robust to structural perturbation.” That sentence contains more information than any single number can encode.
10.2.3 Connection to the Corruption Tensor
The MRI profile is the empirical realization of the corruption tensor C_{ij} introduced in Chapter 5, Section 5.5. Recall that the corruption tensor maps perturbation direction j to displacement magnitude along judgment dimension i:
\Delta x_i \approx C_{ij} \epsilon_j
The MRI component r_j = 1 - \|C_{\cdot j}\| / \text{scale} is the complement of the displacement magnitude along perturbation direction j, normalized so that r_j = 1 means zero displacement (perfect robustness) and r_j = 0 means maximal displacement (complete fragility). The MRI converts the tensor into a per-direction summary that is directly interpretable as a robustness score.
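As a sketch, the conversion from a corruption-tensor column to an MRI component might look like the following; `scale` is treated here as the displacement norm that counts as complete fragility, an assumption, since the normalization constant is not specified further:

```python
import math

def robustness_from_column(column, scale):
    """r_j = 1 - ||C_{.j}|| / scale, clipped to [0, 1].

    column: the j-th column of the corruption tensor, i.e. the displacement
            along each judgment dimension per unit perturbation in direction j.
    scale:  displacement norm treated as complete fragility (assumed).
    """
    norm = math.sqrt(sum(c * c for c in column))
    return max(0.0, min(1.0, 1.0 - norm / scale))

# Zero displacement in every judgment dimension -> perfect robustness.
print(robustness_from_column([0.0, 0.0, 0.0], scale=1.0))  # 1.0
# A column of norm 0.5 against scale 1.0 -> robustness 0.5.
print(robustness_from_column([0.3, 0.4], scale=1.0))  # 0.5
```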
10.3 Sensitivity Profiling: Which Dimensions Are Fragile?
The MRI tells us how robust the model is along each perturbation direction. Sensitivity profiling tells us where in the output space the vulnerability manifests. The distinction matters because a model can be fragile along a perturbation direction while remaining robust on most output dimensions — the fragility may be concentrated in one or two output coordinates rather than distributed uniformly.
10.3.1 From Global Robustness to Local Fragility
Consider the Attention track data from Chapter 13. Claude’s attention composite is 0.679, placing it fourth out of five models. But this composite hides a dramatic internal structure:
| Subtask | Claude Score | Rank |
|---|---|---|
| A1: Distractor resistance | 0.646 | 4th |
| A2: Selective attention | 0.829 | 2nd |
| A3: Sustained attention | 0.692 | 1st |
| A4: Divided attention | 0.571 | 5th (worst in suite) |
The composite of 0.679 averages across a second-best selective attention score and a dead-last divided attention score. The composite says “below average.” The sensitivity profile says “excellent on single-stream filtering, catastrophically limited on parallel-stream processing.” The profile is actionable; the composite is not.
Sensitivity profiling formalizes this disaggregation. For a given perturbation type \tau, the sensitivity profile \mathbf{s}_\tau \in \mathbb{R}^d maps each output dimension to its displacement under \tau:
s_{\tau, i} = \mathbb{E}_{x \in \mathcal{D}}\left[ |f_{\theta, i}(x) - f_{\theta, i}(\tau(x))| \right]
where f_{\theta, i} is the i-th component of the model’s output. This is a vector of per-dimension sensitivities, and its shape — which components are large, which are small — reveals the local geometry of the fragility.
10.3.2 The Attention Sensitivity Surface
The attention data provide the cleanest illustration. The selective attention signal-to-noise ratio (an A2 subcomponent) — the ratio of attention allocated to morally relevant dimensions versus morally irrelevant dimensions — was uniformly weak across all five models: 1.22 to 1.38 on a scale where 1.0 represents no discrimination.
This is a universal fragility. It appears in every model tested, at nearly the same magnitude, regardless of the model’s overall performance level. On the sensitivity surface, it manifests as a ridge of high sensitivity (low robustness) that runs across the entire model axis at the selective-attention coordinate. The ridge is not model-specific; it is a property of the perturbation type interacting with the shared architecture of transformer-based language models.
The contrast with divided attention is instructive. A4 scores range from 0.571 (Claude) to 1.000 (Gemini 2.5 Pro, Gemini 3 Flash). This is a model-specific fragility: the sensitivity varies dramatically across models, unlike the selective-attention SNR, which is nearly constant. On the sensitivity surface, divided attention appears not as a universal ridge but as a set of model-specific peaks and valleys — Claude has a deep valley (high sensitivity, low robustness) while Pro and Flash 3 sit on a plateau (low sensitivity, high robustness).
The sensitivity profile thus decomposes the robustness surface into two qualitatively different components:
[Empirical.] 1. Universal fragilities: Regions of high sensitivity shared across all models, likely reflecting architectural constraints common to the model family (e.g., the attention mechanism’s inability to strongly discriminate relevant from irrelevant dimensions).
2. Model-specific fragilities: Regions of high sensitivity in some models but not others, reflecting differences in training, architecture, or alignment interventions (e.g., Claude’s divided-attention bottleneck).
This decomposition is invisible to any aggregate robustness score. It requires the full sensitivity surface.
10.3.3 Constructing the Profile
The practical procedure for sensitivity profiling is:
Select perturbation axes. Define the set of perturbation types \{\tau_1, \ldots, \tau_k\} that correspond to the gauge transformations and corrupting perturbations identified in Chapters 5 and 8.
Select output dimensions. Define the set of output coordinates \{y_1, \ldots, y_d\} that the model produces — for moral reasoning, the seven harm dimensions; for cognitive benchmarks, the subtask scores within each track.
Compute the sensitivity matrix. For each (perturbation, output) pair (\tau_j, y_i), compute s_{ij} as the expected absolute displacement of output dimension i under perturbation j.
Visualize as a heatmap. The d \times k sensitivity matrix, with perturbation types on one axis and output dimensions on the other, is the discrete approximation of the sensitivity surface. Rows that are uniformly high indicate fragile output dimensions. Columns that are uniformly high indicate potent perturbation types. Isolated peaks indicate specific (perturbation, dimension) vulnerabilities.
The sensitivity matrix generalizes the corruption tensor C_{ij} of Chapter 5 to the case where both the perturbation and the output are multi-dimensional. The corruption tensor mapped perturbation directions to displacement magnitudes in the judgment space. The sensitivity matrix maps (perturbation type, output dimension) pairs to scalar sensitivity values. The corruption tensor is the off-diagonal structure within a single track; the sensitivity matrix is the full cross-track, cross-perturbation characterization.
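The four-step procedure can be sketched as follows; `model` is assumed to return a length-d sequence of output coordinates, and all names are illustrative rather than part of any published tooling:

```python
def sensitivity_matrix(model, perturbations, dataset, d):
    """Compute the d x k sensitivity matrix: s[i][j] is the mean absolute
    displacement of output dimension i under perturbation type j.

    model:         callable x -> sequence of d output coordinates
    perturbations: list of k callables tau_j
    """
    data = list(dataset)
    k = len(perturbations)
    s = [[0.0] * k for _ in range(d)]
    for j, tau in enumerate(perturbations):
        for x in data:
            base, pert = model(x), model(tau(x))
            for i in range(d):
                # Accumulate the expectation E[|f_i(x) - f_i(tau(x))|].
                s[i][j] += abs(base[i] - pert[i]) / len(data)
    return s

# Toy usage: a 2-dimensional "model" and a unit-shift perturbation.
# Dimension 1 is twice as sensitive to the shift as dimension 0.
shift = lambda x: x + 1
toy_model = lambda x: [x, 2 * x]
print(sensitivity_matrix(toy_model, [shift], [0, 1], d=2))  # [[1.0], [2.0]]
```

Rendering `s` as a heatmap then gives the discrete sensitivity surface described in step 4.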
10.4 Adversarial Threshold Search
The MRI and sensitivity profiling characterize how much the model’s output changes under perturbation. But they treat perturbation as binary: present or absent. In reality, perturbations have magnitude, and the relationship between perturbation magnitude and performance degradation is rarely linear. Chapter 5 documented the dose-response curve for sensory distractors (A1): vivid distractors produce more displacement than mild distractors, which produce more than the control condition. But the dose-response was measured at only two non-zero intensity levels. What happens at the boundary — the exact perturbation magnitude at which reasoning breaks?
Adversarial threshold search answers this question by systematically increasing perturbation intensity along each dimension and identifying the critical magnitude \epsilon^* at which performance degrades catastrophically — the robustness boundary.
10.4.1 The Threshold Concept
Define the performance of model f_\theta under perturbation type \tau at intensity \epsilon as:
P(\epsilon) = \mathbb{E}_{x \in \mathcal{D}}\left[ \mathcal{Q}\bigl(f_\theta(x), f_\theta(\tau_\epsilon(x))\bigr) \right]
where \tau_\epsilon is the perturbation at intensity \epsilon and \mathcal{Q} is a quality metric (accuracy, consistency, calibration, etc.). At \epsilon = 0, the perturbation is absent and P(0) is the baseline performance. As \epsilon increases, P(\epsilon) decreases. The adversarial threshold is:
\epsilon^* = \inf \left\{ \epsilon > 0 : P(\epsilon) < P(0) - \delta \right\}
where \delta is a user-specified degradation tolerance. The threshold \epsilon^* tells us: “This is the exact perturbation intensity at which performance degrades by more than \delta.”
10.4.2 Why the Threshold Is Non-Trivial
If the relationship between perturbation intensity and performance were linear, the threshold would be trivially computable from the slope. The dose-response curve of Chapter 5 shows that the relationship is approximately linear in the low-intensity regime:
P(\epsilon) \approx P(0) - \epsilon \cdot |P'(0)|
But this linear approximation breaks down at higher intensities. The empirical data suggest three regimes:
Linear regime (\epsilon < \epsilon_1): Small perturbations produce proportional displacement. The heuristic field is slightly deformed, the search trajectory shifts slightly, and the output changes by a small amount. This is the regime studied in Chapter 5, where mild distractors produce mild displacement and vivid distractors produce larger displacement.
Nonlinear regime (\epsilon_1 < \epsilon < \epsilon^*): Larger perturbations produce disproportionate displacement. The heuristic field is significantly deformed, the search trajectory enters a different basin of attraction, and the output changes qualitatively (verdict category flips, confidence collapses, reasoning structure degrades). The ~38% recovery ceiling (Chapter 9, Section 9.7) marks the boundary of this regime: perturbations strong enough to push the trajectory into a different basin produce displacements that prompt-level metacognitive interventions cannot fully reverse.
Catastrophic regime (\epsilon > \epsilon^*): The perturbation overwhelms the heuristic field entirely. The search trajectory bears no relationship to the unperturbed trajectory. Performance drops to chance or below. The model is no longer reasoning about the problem; it is responding to the perturbation.
The adversarial threshold \epsilon^* marks the boundary between the nonlinear regime and the catastrophic regime. It is the answer to: “How hard do you have to push before reasoning breaks completely?”
10.4.3 Empirical Evidence for the Threshold
The benchmark data contain indirect evidence for threshold behavior. Consider the sycophancy data across models:
| Model | Sycophancy Rate | Interpretation |
|---|---|---|
| Claude Sonnet 4.6 | 0% | Below threshold |
| Gemini 2.0 Flash | 33% | In nonlinear regime |
| Gemini 2.5 Pro | 44% | In nonlinear regime |
| Gemini 2.5 Flash | 56% | Approaching/past threshold |
The perturbation type (social pressure) is the same across all models; what varies is each model’s threshold \epsilon^*_{\text{social}}. Claude’s threshold is above the intensity of the experimental manipulation — the social pressure applied in the L2 benchmark was insufficient to overcome Claude’s trajectory maintenance. Flash 2.5’s threshold is below the experimental intensity — more than half of its trajectories are redirected. The other models lie in between, with thresholds in the vicinity of the experimental manipulation.
This interpretation reframes the sycophancy gradient from Chapter 6. The gradient is not a continuous spectrum of “sycophancy propensity.” It is a set of model-specific thresholds \epsilon^*_{\text{social}} along the social-pressure perturbation axis, with the experimental manipulation serving as a fixed probe that falls above some thresholds and below others.
10.4.4 Binary Search for the Threshold
The practical procedure for adversarial threshold search is:
Select a perturbation axis \tau and a parametric intensity family \tau_\epsilon for \epsilon \in [0, \epsilon_{\max}].
Evaluate performance at the endpoints. P(0) is the baseline. P(\epsilon_{\max}) is the performance under maximum perturbation.
Binary search. Set \epsilon_{\text{low}} = 0, \epsilon_{\text{high}} = \epsilon_{\max}. Evaluate P(\epsilon_{\text{mid}}) at the midpoint \epsilon_{\text{mid}} = (\epsilon_{\text{low}} + \epsilon_{\text{high}}) / 2. If P(\epsilon_{\text{mid}}) > P(0) - \delta, the threshold is above \epsilon_{\text{mid}}; set \epsilon_{\text{low}} = \epsilon_{\text{mid}}. Otherwise, set \epsilon_{\text{high}} = \epsilon_{\text{mid}}. Repeat until \epsilon_{\text{high}} - \epsilon_{\text{low}} < \eta for the desired resolution \eta.
Report the threshold \epsilon^* \approx (\epsilon_{\text{low}} + \epsilon_{\text{high}}) / 2.
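A sketch of this search, assuming a `performance` callable for P(\epsilon) that is monotone non-increasing in \epsilon (the monotonicity assumption is what licenses binary search):

```python
def adversarial_threshold(performance, eps_max, delta, eta):
    """Binary search for eps* = inf{eps > 0 : P(eps) < P(0) - delta}.

    performance: callable eps -> P(eps), assumed monotone non-increasing.
    Returns the midpoint of the final bracket of width < eta, or eps_max
    if performance never degrades past tolerance on [0, eps_max].
    """
    baseline = performance(0.0)
    if performance(eps_max) >= baseline - delta:
        return eps_max  # no threshold inside the search range
    lo, hi = 0.0, eps_max
    while hi - lo >= eta:
        mid = (lo + hi) / 2.0
        if performance(mid) >= baseline - delta:
            lo = mid  # still within tolerance: threshold is above mid
        else:
            hi = mid  # degraded past tolerance: threshold is at or below mid
    return (lo + hi) / 2.0

# Toy performance curve: flat at 1.0 until eps = 0.6, then collapses.
P = lambda eps: 1.0 if eps < 0.6 else 0.2
est = adversarial_threshold(P, eps_max=1.0, delta=0.1, eta=1e-4)
print(round(est, 3))  # ~0.6
```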
This procedure, applied along each perturbation axis independently, produces a vector of thresholds (\epsilon^*_1, \epsilon^*_2, \ldots, \epsilon^*_k) — the threshold profile. The threshold profile is the boundary of the robustness surface: the contour in perturbation space at which performance drops below the tolerance \delta.
10.5 The Three-Tool Pipeline
The MRI, sensitivity profiling, and adversarial threshold search are designed to work in sequence. Each tool answers a different question, and the answers from earlier stages inform the design of later stages.
10.5.1 Stage 1: MRI — Which Perturbation Types Matter?
The MRI scan is the broadest and cheapest of the three tools. It applies each perturbation type at a fixed, moderate intensity and records the per-type robustness scores. The output is a k-dimensional vector \mathbf{r} = (r_1, \ldots, r_k) that identifies which perturbation types produce significant degradation.
The MRI answers: “Where should we look more closely?”
Perturbation types with high robustness scores (r_i > 0.9) can be safely deprioritized — the model is effectively invariant under these transformations. These are the preserved gauge symmetries of Chapter 8: evaluation order (T4: 0.867–1.000), gender swap (T2: no significant displacement). Perturbation types with low robustness scores require further investigation.
10.5.2 Stage 2: Sensitivity Profile — Where in the Output Space Does It Hurt?
For each perturbation type flagged by the MRI, sensitivity profiling disaggregates the degradation across output dimensions. This stage answers: “Is the fragility concentrated or distributed?”
A concentrated fragility — high sensitivity in one or two output dimensions, low sensitivity elsewhere — suggests a specific mechanism. For example, Claude’s attention profile shows concentrated fragility in the divided-attention dimension (A4: 0.571) with strength elsewhere. The mechanism is resource allocation, not general processing failure.
A distributed fragility — high sensitivity across many output dimensions simultaneously — suggests a more fundamental problem. The selective-attention SNR (1.22–1.38 across all models) is a distributed fragility: the heuristic field’s inability to discriminate relevant from irrelevant features affects all downstream assessments.
10.5.3 Stage 3: Threshold Search — Where Does It Break?
For each (perturbation type, output dimension) pair identified as fragile by the sensitivity profile, adversarial threshold search identifies the critical intensity \epsilon^* at which performance degrades below tolerance. This stage answers: “How much room do we have before failure?”
A high threshold means the model can tolerate substantial perturbation before breaking — a wide margin of safety. A low threshold means even mild perturbation triggers degradation — a narrow margin. The threshold profile, collected across all fragile dimensions, defines the robustness boundary: the surface in perturbation space that separates the region of reliable reasoning from the region of degraded or broken reasoning.
10.5.4 The Pipeline as a Funnel
The three stages form a funnel that progressively narrows the space of concern:
- MRI scans all k perturbation types and identifies k' < k that matter.
- Sensitivity profiling examines k' \times d (perturbation, dimension) pairs and identifies m < k' \times d that are fragile.
- Threshold search performs intensive binary search on m fragile pairs.
This funnel structure makes the pipeline computationally tractable even when the perturbation space is large. The expensive operation (threshold search, requiring many evaluations at different intensity levels) is applied only to the small set of (perturbation, dimension) pairs that survived the cheaper screening stages. The total cost scales as O(k + k' \cdot d + m \cdot \log(1/\eta)), where \eta is the desired threshold resolution, rather than the naive O(k \cdot d \cdot \log(1/\eta)) that would result from applying threshold search to every pair.
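A back-of-the-envelope comparison of the two cost expressions, with illustrative (assumed) stage sizes rather than figures from the benchmark:

```python
import math

def funnel_cost(k, k_prime, d, m, eta):
    """Evaluation-count sketch: three-stage funnel vs. the naive plan.

    Stage 1 screens k perturbation types; stage 2 profiles k' x d
    (perturbation, dimension) pairs; stage 3 runs ~log2(1/eta) binary-search
    steps on each of m fragile pairs. The naive plan runs threshold search
    on every (perturbation, dimension) pair.
    """
    steps = math.ceil(math.log2(1 / eta))
    pipeline = k + k_prime * d + m * steps
    naive = k * d * steps
    return pipeline, naive

# Assumed sizes: 20 perturbation types, 5 survive screening, 7 output
# dimensions, 4 fragile pairs, threshold resolution eta = 0.01.
print(funnel_cost(k=20, k_prime=5, d=7, m=4, eta=0.01))  # (83, 980)
```

Even at these modest sizes the funnel cuts the evaluation count by an order of magnitude; the gap widens as k and d grow.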
10.6 Robustness Profiles from the Measuring AGI Suite
We now apply the framework to the empirical data from the five-track benchmark suite. The full composite scores across four tracks and five models provide a coarse but informative first pass at the robustness surface.
10.6.1 The Composite Score Landscape
Table 10.1. Track composites across five models. Higher is better.
| Track | Flash 2.0 | Flash 2.5 | Flash 3 | Pro | Claude |
|---|---|---|---|---|---|
| Social Cognition | 0.695 | 0.628 | 0.734 | 0.643 | 0.697 |
| Learning | 0.568 | 0.477 | — | 0.488 | — |
| Attention | 0.666 | 0.745 | 0.747 | 0.776 | 0.679 |
| Executive Functions | 0.622 | 0.682 | 0.685 | 0.695 | 0.625 |
Each row is a cross-section of the robustness surface along one cognitive track. Each column is a cross-section along one model. The entire table is a discrete approximation of the two-dimensional robustness surface R(\text{model}, \text{track}).
Several structural features are immediately visible.
[Empirical.] No model dominates. There is no column that is highest in every row. Gemini 2.5 Pro achieves the best Attention composite (0.776) and best Executive Functions composite (0.695) but has only the fourth-best Social Cognition composite (0.643). Flash 3 achieves the best Social Cognition composite (0.734) but does not lead on Attention or Executive Functions. Each model has a different maximum, and no model is Pareto-dominant across tracks.
Tracks have different ranges. The Learning track has the widest spread (0.477 to 0.568, a range of 0.091) relative to its mean, while the Executive Functions track has a tighter spread (0.622 to 0.695, a range of 0.073). This is not measurement noise — it reflects the intrinsic difficulty and discriminative power of each track. Learning probes the most fragile capabilities (sycophancy resistance, error-driven revision), which amplifies between-model differences. Executive Functions probes more uniformly preserved capabilities (task switching), which compresses differences.
The model-track interaction is non-separable. The variation across tracks within a model is as large as the variation across models within a track. Flash 2.0 ranges from 0.568 (Learning) to 0.695 (Social Cognition) — a within-model range of 0.127. The Social Cognition track ranges from 0.628 (Flash 2.5) to 0.734 (Flash 3) — a within-track range of 0.106. Neither axis dominates the variation. The robustness surface has genuine two-dimensional structure; it cannot be factored as R(\text{model}, \text{track}) = R_M(\text{model}) \times R_T(\text{track}).
10.6.2 Individual Model Profiles
The composite scores are themselves projections that hide subtask-level structure. The full robustness profile requires drilling into the subtask level. Three profiles illustrate the diversity.
Claude: the invariance specialist. Claude achieves the best sycophancy resistance (0%), tied-best BIP invariance (T2: 0.958), and strong selective attention (A2: 0.829). Its robustness surface has a high plateau across all single-stream, invariance-testing perturbations. But the surface drops precipitously at divided attention (A4: 0.571) and structural fuzz (T1: 0.400). The robustness boundary along the divided-attention axis is narrow — even moderate demands for parallel processing push Claude past its threshold. The narrow-channel geometry identified in Chapter 13 (Section 13.7.1) is the robustness surface interpretation of Claude’s profile: the channel walls are high (strong invariance) but close together (low bandwidth).
Gemini 2.5 Pro: the breadth optimizer. Pro achieves the best attention composite (0.776), the best executive functions composite (0.695), and the best self-monitoring (M3: 0.700). Its robustness surface is moderately elevated across most perturbation types, with no catastrophic valleys. But it has no extreme peaks either: its best individual score is A4 = 1.000 (shared with Flash 3), and its worst is M4 = 0.350 (strategy selection). Pro’s robustness boundary is wide but shallow — it tolerates moderate perturbation across many dimensions but may not survive intense perturbation along any single dimension.
Gemini 3 Flash: the divided-attention champion. Flash 3 achieves the best social cognition composite (0.734), perfect divided attention (A4: 1.000), and strong task switching (E4: 0.909). Its robustness surface peaks sharply at parallel-processing capabilities. But its robustness under structural fuzz is mediocre (T1: 0.600), and its inhibitory control is weak (E3: 0.562). The robustness boundary is narrow along the structural-perturbation and inhibitory-control axes but wide along the divided-attention and social-cognition axes.
These three profiles are geometrically distinct surfaces. They intersect at different points, have different peak locations, and have different boundary shapes. No single ordering captures their relative quality.
10.7 The Scalar Irrecoverability Theorem
The robustness surface framework makes the Scalar Irrecoverability Theorem (Chapter 13, Section 13.6) geometrically inevitable rather than merely empirically observed.
10.7.1 The Theorem in Robustness-Surface Terms
The Scalar Irrecoverability Theorem states that no scalar summary of reasoning performance preserves the geometric structure of the multi-dimensional measurements. In robustness-surface terms, this theorem becomes a statement about the topology of the surface:
[Conditional Theorem.] The robustness surfaces of different models are non-nested. For any two models A and B, there exist perturbation types \tau_i and \tau_j such that R(A, \tau_i) > R(B, \tau_i) and R(A, \tau_j) < R(B, \tau_j). The surfaces cross, and no projection onto a single axis can respect the ordering on both sides of the crossing.
The empirical crossings are now vivid:
Crossing 1: Sycophancy vs. divided attention. Claude achieves the best sycophancy resistance in the suite (0% flip rate) but the worst divided attention (A4: 0.571). Flash 3 achieves perfect divided attention (A4: 1.000) but mediocre structural-fuzz resistance (T1: 0.600). The robustness surfaces cross: Claude’s surface is higher along the sycophancy axis, Flash 3’s surface is higher along the divided-attention axis. Any scalar projection \pi must decide which side of the crossing to respect. If \pi weights sycophancy heavily, Claude ranks above Flash 3; if \pi weights divided attention, Flash 3 ranks above Claude. There is no weighting that respects both.
Crossing 2: Self-monitoring vs. strategy selection. Pro achieves M3 = 0.700 (self-monitoring) but M4 = 0.350 (strategy selection). Flash 2.0 achieves M3 = 0.094 but M4 = 0.723. The robustness surfaces cross along the metacognitive axes with a crossing angle that is nearly perpendicular — Pro is 7.4 times better on one axis, Flash is 2.1 times better on the other. The crossing is not a marginal numerical fluctuation; it is a qualitative difference in metacognitive architecture.
Crossing 3: Attention composite vs. learning composite. Pro achieves the best attention composite (0.776) but a moderate learning composite (0.488). Flash 2.0 achieves the worst attention composite (0.666) but the best learning composite (0.568). The surfaces cross between the attention and learning tracks: the model that is best at filtering distractors is not the best at updating beliefs.
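These crossings amount to Pareto incomparability, which can be checked mechanically. A minimal sketch in Python, using the scores quoted in Crossings 2 and 3 above (the helper name `pareto_incomparable` is ours, not from the benchmark suite):

```python
def pareto_incomparable(a, b):
    """True when each profile beats the other on at least one axis,
    so neither model dominates the other."""
    return any(x > y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Crossing 2: (M3 self-monitoring, M4 strategy selection)
pro      = (0.700, 0.350)
flash20  = (0.094, 0.723)

# Crossing 3: (attention composite, learning composite)
pro_c    = (0.776, 0.488)
flash20c = (0.666, 0.568)

assert pareto_incomparable(pro, flash20)
assert pareto_incomparable(pro_c, flash20c)
```

Every pair of profiles that passes this check is a crossing: any strictly monotone scalar summary must break the componentwise ordering on at least one of the two axes.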
10.7.2 Why Crossings Are Structural
These crossings are not accidental. They arise because the capabilities being measured are geometrically independent — they correspond to different dimensions of the reasoning manifold, supported by different components of the model architecture, and optimized under different training pressures.
Sycophancy resistance is a property of the objective function: whether the search optimizes for truth or for approval. Divided attention is a property of the resource allocation: whether the processing pipeline can be parallelized. These are unrelated architectural properties, and there is no reason to expect them to be correlated, let alone monotonically related.
The Scalar Irrecoverability Theorem follows from this independence: when the dimensions are independent, the performance profiles of different models generically lie on the Pareto frontier (no model dominates any other), and projecting a Pareto frontier onto a line always reverses some pairwise orderings. The theorem is not a surprising discovery. It is the expected consequence of measuring genuinely multi-dimensional structure with enough resolution to see the dimensions.
10.7.3 Implications for Robustness Assessment
The theorem implies that the question “which model is most robust?” is ill-posed. It has no answer, because robustness is not a scalar property. The well-posed questions are:
- “Which model is most robust along perturbation axis \tau_i?” This has a definite answer for each axis.
- “Which model is most robust for application A with demand profile \mathbf{w}?” This has a definite answer for each demand profile.
- “Where is each model’s robustness boundary?” This has a definite answer — it is the threshold profile (\epsilon^*_1, \ldots, \epsilon^*_k).
The robustness surface is the object that answers all three questions simultaneously. Any scalar reduction of the surface answers at most one of them.
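The second question admits a simple operationalization: aggregate the MRI profile with application-specific weights, and accept that different applications induce different rankings. A sketch with invented three-axis profiles and demand weights (all numbers are illustrative, not benchmark values):

```python
def robustness_for_demand(profile, weights):
    """Application-specific robustness: each perturbation axis is
    weighted by how heavily the application exercises it."""
    return sum(r * w for r, w in zip(profile, weights))

# Hypothetical axes: (social pressure, divided attention, framing).
model_a = (1.00, 0.57, 0.65)   # invariance specialist
model_b = (0.60, 1.00, 0.70)   # parallelism specialist

advisory = (0.7, 0.1, 0.2)     # advice under social pressure
dispatch = (0.1, 0.7, 0.2)     # parallel triage workload

best_for_advisory = max([("A", model_a), ("B", model_b)],
                        key=lambda m: robustness_for_demand(m[1], advisory))[0]
best_for_dispatch = max([("A", model_a), ("B", model_b)],
                        key=lambda m: robustness_for_demand(m[1], dispatch))[0]
# The ranking flips with the demand profile: "A" wins for advisory,
# "B" wins for dispatch. The ranking is a property of the pair
# (model, application), not of the model alone.
```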
10.8 Universal Fragilities and Model-Specific Strengths
The robustness surface decomposes into two qualitatively different components: universal fragilities that appear in every model, and model-specific features that differentiate models from each other. This decomposition has both theoretical significance (it constrains hypotheses about what causes fragility) and practical significance (it determines which improvements require architectural innovation versus targeted training).
10.8.1 Universal Fragilities
[Empirical.] Three fragilities appear across all five models with striking consistency.
The selective-attention SNR deficit. The signal-to-noise ratio for selective attention — the model’s ability to discriminate morally relevant from morally irrelevant dimensions — ranges from 1.22 to 1.38 across all five models. On a scale where 1.0 means no discrimination and higher is better, these values indicate that every model is barely better than chance at distinguishing signal from noise in the dimensional structure of moral scenarios.
The range of 0.16 (from 1.22 to 1.38) is narrow relative to the absolute margin above chance (approximately 0.22 to 0.38). The between-model variation accounts for only about 30% of the total deficit; the remaining 70% is shared across models. This shared deficit suggests a cause that transcends specific model architectures and training procedures — plausibly the structure of the training data itself, which may not provide sufficient supervision for fine-grained dimensional discrimination, or the attention mechanism, which may lack the architectural capacity for the kind of feature-level filtering that high SNR would require.
On the robustness surface, the selective-attention SNR appears as a floor: a region of the surface that is uniformly low regardless of which model is being evaluated. The floor constrains all models, and no amount of model-specific training has lifted any model significantly above it.
The ~38% recovery ceiling. As documented in Chapter 9, Section 9.7, prompt-level metacognitive interventions recover approximately 38% of the displacement caused by heuristic corruption, across both the emotional anchoring (E2) and distractor resistance (A1) benchmarks, across all five models. The ceiling is set by the metacognitive control loop: the product of detection probability, correction probability, and navigation probability converges to approximately one-third regardless of which component is the bottleneck.
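As a back-of-envelope sketch of the ceiling (taking the ~38% figure from Section 9.7 at face value and treating recovery as a fixed fraction of the displacement — a simplifying assumption):

```python
RECOVERY = 0.38  # approximate prompt-level recovery ceiling (Sec. 9.7)

def post_intervention(p_baseline, p_corrupted, recovery=RECOVERY):
    """Performance after a prompt-level metacognitive intervention
    that recovers a fixed fraction of the displacement from baseline."""
    return p_corrupted + recovery * (p_baseline - p_corrupted)

# Corruption drops performance from 0.90 to 0.50; the intervention
# recovers 0.38 of the 0.40 displacement, ending at 0.652 -- well
# short of the 0.90 baseline.
recovered = post_intervention(0.90, 0.50)
```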
On the robustness surface, the recovery ceiling appears as a boundary on correction: if a perturbation displaces performance from P(0) to P_\epsilon, a prompt-level intervention restores only about 38% of the displacement, leaving performance near P_\epsilon + 0.38\,(P(0) - P_\epsilon) — short of P(0) whenever P_\epsilon is. The model can recover partially, but not fully, from any corruption severe enough to trigger the nonlinear regime.
Overconfidence. The 9.3\sigma combined calibration gap (Chapter 9, Section 9.2) is universal: every model is overconfident, every model’s ECE is significantly above zero, and every model’s overconfidence is in the same direction. This universal overconfidence sets a baseline fragility for all downstream reasoning, because an overconfident system prematurely terminates its search (Chapter 7) and cannot accurately detect gauge violations (Chapter 9, Section 9.8).
On the robustness surface, overconfidence appears as a systematic gap between two surfaces: because the system underestimates its cost-to-go, its self-assessed robustness surface is an inflated copy of its actual one — at every point, the system believes it is more robust than it is. The gap between the two — the overconfidence-induced robustness illusion — is itself a fragility, because it prevents the system from seeking the additional evidence or reasoning effort that would improve its actual robustness.
10.8.2 Model-Specific Strengths
Against this backdrop of universal fragility, each model has specific regions of the robustness surface where it excels.
[Empirical.] Claude’s sycophancy immunity. Claude’s 0% sycophancy rate is the single most extreme result in the entire benchmark suite. It means that along the social-pressure perturbation axis, \epsilon^* exceeds every intensity tested — effectively \epsilon^* = \infty within the experimental range, since no amount of social pressure in that range redirects Claude’s search trajectory. This is a spike on the robustness surface: a ridge at full height along the social-pressure axis. The ridge is plausibly the result of specific alignment training (RLHF, Constitutional AI) that has hardened the objective function against social-pressure deformation. The mechanism is objective-level, not heuristic-level: Claude’s heuristic field may be influenced by social cues (its sensitivity to emotional anchoring, E2: 0.492, suggests it is), but its search objective is not redirected by them.
Flash 3’s parallel-processing capacity. Flash 3’s perfect divided attention (A4: 1.000) indicates that its processing pipeline can be fully parallelized without degradation. On the robustness surface, this is a plateau at maximum height along the divided-attention axis — the threshold \epsilon^*_{\text{divide}} is above any intensity tested. The contrast with Claude (A4: 0.571) is the sharpest model-specific divergence on the surface.
Pro’s calibration advantage. Pro achieves the best calibration (ECE = 0.230) and best self-monitoring (M3: 0.700) of any model in the suite. While these are not perfect, they represent a significant elevation of the metacognitive region of the robustness surface relative to other models. Pro’s robustness surface is higher in the metacognitive region because it has a more accurate map of its own position on the manifold, which enables better detection of perturbation-induced displacement.
Flash 2.0’s trajectory maintenance under evidence. Flash 2.0 achieves the best learning composite (0.568) and the best strategy selection (M4: 0.723). Its robustness surface is elevated along the learning and strategy-selection axes. Flash 2.0 responds appropriately to evidential signals and selects strategies matched to task difficulty — capabilities that correspond to a heuristic field with good responsiveness to genuine information content, even though its weak self-monitoring (M3: 0.094) means it cannot detect when this responsiveness goes wrong.
10.8.3 The Architecture of Robustness
The pattern of universal fragilities and model-specific strengths suggests a layered architecture of robustness:
Layer 1 (Architecture-determined): Universal fragilities set by the shared transformer architecture and training paradigm. The selective-attention SNR floor, the overconfidence baseline, and the ~38% recovery ceiling are all architecture-level constraints. To improve these, one must change the architecture or the training paradigm, not just the specific training data or alignment procedure.
Layer 2 (Training-determined): Model-specific strengths and weaknesses set by the specific training data, objective function, and alignment interventions. Claude’s sycophancy immunity, Flash 3’s parallel processing, Pro’s calibration advantage — these are all training-level features that differentiate models within the shared architecture.
Layer 3 (Deployment-determined): Context-specific robustness set by the prompt, the system message, and the interaction protocol. The ~38% recovery ceiling is the upper bound of this layer: prompt-level interventions can recover at most 38% of corruption-induced displacement.
[Modeling Axiom.] The robustness surface at any given point is the product of these three layers:
R(\text{model}, \tau, \epsilon) = R_{\text{arch}}(\tau) \times R_{\text{train}}(\text{model}, \tau) \times R_{\text{deploy}}(\tau, \epsilon)
The architectural layer sets the floor. The training layer creates the model-specific peaks and valleys. The deployment layer provides a modest, bounded capacity for real-time adjustment. Understanding which layer limits robustness at each point on the surface determines which intervention is needed: architectural innovation, training-procedure improvement, or deployment-time prompt engineering.
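The axiom and its intervention logic can be sketched directly; the layer values below are invented for illustration:

```python
def robustness(r_arch, r_train, r_deploy):
    """Three-layer modeling axiom: R = R_arch * R_train * R_deploy."""
    return r_arch * r_train * r_deploy

def limiting_layer(r_arch, r_train, r_deploy):
    """The layer with the smallest factor is the one an intervention
    must target to raise R at this point on the surface."""
    layers = {"architecture": r_arch, "training": r_train, "deployment": r_deploy}
    return min(layers, key=layers.get)

# Hypothetical point dominated by an architectural floor: no amount of
# training or prompt engineering lifts R past what the first factor allows.
r = robustness(0.70, 0.95, 0.90)          # 0.5985
layer = limiting_layer(0.70, 0.95, 0.90)  # "architecture"
```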
10.8.4 Connection to Gauge Invariance
The robustness surface provides a concrete realization of the gauge-theoretic framework developed in Chapter 8. The Bond Invariance Principle states that morally and logically equivalent inputs should produce identical outputs. The robustness surface quantifies the degree to which this principle holds along each perturbation direction:
Preserved gauge symmetries correspond to regions of the robustness surface at or near 1.0. Evaluation order (T4: 0.867–1.000) and demographic invariance (T2: no significant displacement) are gauge symmetries that are effectively preserved. On the robustness surface, these directions have high ridges — the model’s performance does not degrade under these transformations at any tested intensity.
Broken gauge symmetries correspond to regions of the robustness surface significantly below 1.0. Framing (T5: 0.606–0.716), emotional anchoring (E2: 0.492–0.655), and structural perturbation (T1: 0.400–0.600) are gauge symmetries that are broken. On the robustness surface, these directions have valleys — the model’s performance degrades even at moderate perturbation intensity.
The gauge violation tensor of Chapter 8 is the derivative of the robustness surface at \epsilon = 0 along the gauge directions:
G_i = -\left.\frac{\partial R}{\partial \epsilon_i}\right|_{\epsilon = 0}
A positive G_i means robustness decreases as perturbation intensity increases along direction i — the gauge symmetry is broken. A zero G_i means the robustness surface is flat at \epsilon = 0 along direction i — the gauge symmetry is preserved. The gauge violation tensor is the first-order approximation to the robustness surface; the adversarial threshold search provides the full nonlinear characterization.
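Empirically, the derivative in this definition is estimated by finite differences at small perturbation intensity. A sketch with a toy robustness function standing in for measured data (in practice each evaluation of R is a benchmark run):

```python
def gauge_violation(R, axis, eps=1e-3):
    """Forward-difference estimate of G_i = -dR/d(eps_i) at eps = 0,
    where R(axis, eps) is measured robustness at intensity eps."""
    return -(R(axis, eps) - R(axis, 0.0)) / eps

# Toy surface: axis 0 is a preserved symmetry (flat in eps),
# axis 1 is a broken symmetry (robustness falls linearly).
def R(axis, eps):
    return 1.0 if axis == 0 else 1.0 - 0.4 * eps

g_preserved = gauge_violation(R, 0)   # ~0.0 -> symmetry preserved
g_broken    = gauge_violation(R, 1)   # ~0.4 -> symmetry broken
```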
10.8.5 Forward to Alignment
The robustness surface connects directly to the alignment decomposition that Chapter 11 develops. Recall the three-factor decomposition:
\text{Alignment} = \text{Objective Alignment} \times \text{Heuristic Quality} \times \text{Metacognitive Calibration}
Each factor corresponds to a different region of the robustness surface:
Objective alignment is measured by the sycophancy axis and the social-pressure threshold. Claude’s infinite threshold along this axis indicates strong objective alignment; Flash 2.5’s 56% sycophancy rate indicates weak objective alignment.
Heuristic quality is measured by the framing, emotional anchoring, and distractor axes. The corruption tensor entries along these directions quantify how easily the heuristic field is deformed by irrelevant features.
Metacognitive calibration is measured by the calibration, self-monitoring, and strategy-selection axes. The values documented in Chapter 9 quantify the accuracy of the system’s self-model.
The robustness surface is thus the empirical substrate from which the alignment decomposition is computed. Chapter 11 takes the surface as given and asks: which regions of the surface must be elevated, and by how much, to achieve alignment? The answer — heuristic shaping through symmetry restoration, objective alignment through training-time intervention, and metacognitive calibration through architectural improvement — emerges directly from the structure of the surface.
10.9 Summary
The robustness surface is a multi-dimensional object that maps (model, perturbation type, perturbation intensity) triples to performance values. It captures everything that a scalar robustness score cannot: the anisotropy of vulnerability across perturbation types, the threshold behavior that separates proportional degradation from catastrophic failure, the universal fragilities shared across models, and the model-specific strengths that differentiate one system from another.
The three-tool pipeline — MRI, sensitivity profiling, adversarial threshold search — provides a systematic method for constructing the surface. The MRI identifies which perturbation types matter. Sensitivity profiling identifies where in the output space the fragility manifests. Adversarial threshold search identifies the exact magnitude at which reasoning breaks. Applied together, they produce a complete characterization of a model’s robustness geometry.
The empirical data from the Measuring AGI suite demonstrate the key structural properties of the robustness surface. The surfaces of different models are non-nested (they cross), confirming the Scalar Irrecoverability Theorem. The surfaces share a common floor of universal fragilities — the selective-attention SNR deficit (1.22–1.38), the ~38% recovery ceiling, and the 9.3\sigma overconfidence — while differing in model-specific peaks and valleys that reflect training-level differences. The surfaces connect to gauge invariance (preserved symmetries are high ridges, broken symmetries are low valleys) and forward to alignment (each factor in the alignment decomposition corresponds to a different region of the surface).
The central message of this chapter is geometric: robustness is a surface, not a number. The shape of the surface — its peaks, valleys, ridges, floors, and boundaries — is the object that characterizes a model’s reasoning quality under perturbation. The question “is this model robust?” has no answer. The question “what does this model’s robustness surface look like?” has a precise, measurable, and actionable answer. The three-tool pipeline provides the methodology for obtaining it.
Worked Example: Profiling a Diagnostic AI
Consider a clinical AI system, DiagAssist-7, deployed to assist emergency physicians with differential diagnosis. The hospital’s validation study reports a single accuracy figure: 91% concordance with expert diagnoses across 2,400 cases. The number is impressive. It is also dangerously incomplete.
We apply the three-tool pipeline to construct DiagAssist-7’s robustness surface, revealing structure that the aggregate accuracy conceals.
Stage 1: MRI Scan. We define five perturbation axes corresponding to clinically relevant variation in patient presentation:
| Perturbation Type | Description |
|---|---|
| \tau_1: Age group | Adult (25–65) vs. pediatric (<12) vs. geriatric (>75) |
| \tau_2: Presentation clarity | Textbook presentation vs. atypical presentation |
| \tau_3: Comorbidity load | Single condition vs. 3+ comorbid conditions |
| \tau_4: Data completeness | Full workup vs. partial labs and imaging |
| \tau_5: Demographic variation | Controlling for clinical equivalence across demographic groups |
The MRI profile reveals dramatic anisotropy:
| Perturbation Type | MRI Component r_i |
|---|---|
| \tau_1: Age group | 0.71 |
| \tau_2: Presentation clarity | 0.83 |
| \tau_3: Comorbidity load | 0.76 |
| \tau_4: Data completeness | 0.88 |
| \tau_5: Demographic variation | 0.96 |
The system is highly robust to demographic variation (r_5 = 0.96) — a preserved gauge symmetry, since patient demographics should not affect diagnostic accuracy when clinical presentations are matched. It is moderately robust to missing data (r_4 = 0.88) — it degrades gracefully when labs are pending. But it is significantly fragile to age-group variation (r_1 = 0.71) — its performance changes substantially when the patient is a child rather than an adult.
The aggregate accuracy of 91% is dominated by the adult cases that constitute 78% of the validation dataset. The MRI reveals what the aggregate hides: DiagAssist-7 is a 95% system for adults, an 82% system for geriatric patients, and a 73% system for pediatric cases.
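The masking arithmetic is straightforward. Under an assumed case mix (the 78% adult share is given above; the 12%/10% geriatric/pediatric split below is our illustrative assumption, not a figure from the validation report):

```python
# (share of validation cases, subgroup accuracy)
case_mix = {
    "adult":     (0.78, 0.95),
    "geriatric": (0.12, 0.82),  # assumed share
    "pediatric": (0.10, 0.73),  # assumed share
}

aggregate = sum(share * acc for share, acc in case_mix.values())
# ~0.91: the headline figure is dominated by the adult subgroup,
# while pediatric performance sits roughly 18 points lower.
```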
Stage 2: Sensitivity Profile. We drill into the age-group fragility (r_1 = 0.71), disaggregating across diagnostic categories:
| Diagnostic Category | Adult Accuracy | Pediatric Accuracy | Sensitivity s_{1,i} |
|---|---|---|---|
| Cardiac | 0.97 | 0.91 | 0.06 |
| Respiratory | 0.94 | 0.78 | 0.16 |
| Neurological | 0.93 | 0.69 | 0.24 |
| Abdominal | 0.91 | 0.74 | 0.17 |
| Musculoskeletal | 0.96 | 0.88 | 0.08 |
The fragility is concentrated, not distributed. Cardiac and musculoskeletal diagnoses transfer reasonably well from adult to pediatric presentations (sensitivities of 0.06 and 0.08) — the underlying pathophysiology is sufficiently similar. But neurological diagnoses show extreme sensitivity (s_{1,\text{neuro}} = 0.24): the system’s accuracy drops from 93% to 69% when the patient is a child. Pediatric neurological presentations — seizures with different etiological distributions, developmental milestones that change the differential, medication dosing that scales nonlinearly with weight — are a specific, localized fragility on the robustness surface.
The sensitivity profile produces an actionable finding that the aggregate accuracy cannot: DiagAssist-7 should not be used as the primary diagnostic aid for pediatric neurological presentations without specialist oversight. This is a geometric statement about the robustness surface — a valley at the (age-group, neurological) coordinate — not a judgment about the system’s overall quality.
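The sensitivity column in the table above is simply the per-category adult–pediatric accuracy gap. A sketch reproducing it:

```python
# (adult accuracy, pediatric accuracy) per diagnostic category,
# from the sensitivity-profile table above.
accuracy = {
    "cardiac":         (0.97, 0.91),
    "respiratory":     (0.94, 0.78),
    "neurological":    (0.93, 0.69),
    "abdominal":       (0.91, 0.74),
    "musculoskeletal": (0.96, 0.88),
}

sensitivity = {cat: round(adult - ped, 2)
               for cat, (adult, ped) in accuracy.items()}
worst = max(sensitivity, key=sensitivity.get)
# worst == "neurological" with s = 0.24: the fragility is localized
# to one coordinate of the surface, not spread across all categories.
```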
Stage 3: Adversarial Threshold Search. For the pediatric neurological fragility, we parameterize the perturbation intensity along a continuous axis: the degree to which the presentation deviates from adult-pattern neurology. At one end (\epsilon = 0), the pediatric case resembles an adult presentation (e.g., a 10-year-old with a straightforward migraine). At the other end (\epsilon = 1), the case is distinctly pediatric (e.g., a 3-year-old with febrile seizures and developmental regression).
Binary search over six evaluation rounds yields:
| \epsilon | Accuracy | Status |
|---|---|---|
| 0.0 | 0.93 | Baseline (adult-like) |
| 0.5 | 0.81 | Degraded but functional |
| 0.75 | 0.68 | Below clinical threshold |
| 0.625 | 0.74 | Marginal |
| 0.6875 | 0.71 | Just below threshold |
| 0.656 | 0.73 | Just above threshold |
The adversarial threshold is \epsilon^* \approx 0.66, with a clinical-acceptability threshold of \delta = 0.20 (accuracy must remain above 73% to be clinically useful). This tells us: for pediatric neurological cases that are more than two-thirds of the way from adult-like to distinctly pediatric in their presentation, DiagAssist-7 drops below clinical acceptability.
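The six evaluation rounds above follow a standard bisection. A sketch of the search loop, assuming an evaluation oracle `accuracy_at(eps)` (hypothetical; in practice each call is a full benchmark run at that perturbation intensity):

```python
def adversarial_threshold(accuracy_at, baseline, delta, lo=0.0, hi=1.0, rounds=6):
    """Bisect for eps* where accuracy first falls below the
    acceptability floor (baseline - delta). Assumes accuracy_at(lo)
    is acceptable and accuracy_at(hi) is not."""
    floor = baseline - delta
    for _ in range(rounds):
        mid = (lo + hi) / 2.0
        if accuracy_at(mid) >= floor:
            lo = mid   # still clinically acceptable: eps* lies above mid
        else:
            hi = mid   # below the floor: eps* lies below mid
    return (lo + hi) / 2.0

# Toy oracle shaped to echo the table: flat for adult-like cases,
# then a linear decline as the presentation becomes distinctly pediatric.
def accuracy_at(eps):
    return 0.93 - 0.556 * max(0.0, eps - 0.3)

eps_star = adversarial_threshold(accuracy_at, baseline=0.93, delta=0.20)
# eps_star converges near 0.66, matching the table's final bracket.
```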
The Complete Robustness Surface. The three-stage pipeline produces a surface, not a number. That surface says: this system is an excellent cardiac diagnostician for adults (R = 0.97), a competent general diagnostician across age groups (R \approx 0.85), but a fragile neurological diagnostician for young children (R = 0.69), with a precise adversarial threshold at \epsilon^* \approx 0.66 along the pediatric-specificity axis. The hospital can deploy the system with geometric precision — knowing exactly where to trust it, where to augment it with specialist review, and where to override it entirely.
Compare this to the original validation report: “91% accuracy.” Dr. Okafor, reading the robustness surface, would recognize her own anisotropic competence reflected in the AI’s profile. And she would know exactly what the surface means for clinical practice: trust the AI on adult cardiac cases, get a second opinion on pediatric neurology, and never let a single number substitute for the shape of the surface.
Technical Appendix
A10.1 Model Robustness Index: Formal Definition
Definition A10.1 (Model Robustness Index). Let f_\theta: \mathcal{X} \to \mathcal{Y} be a model, \mathcal{T} = \{\tau_1, \ldots, \tau_k\} a set of perturbation types, \mathcal{D} an evaluation dataset, and \mathcal{C}: \mathcal{Y} \times \mathcal{Y} \to [0, 1] a consistency metric. The Model Robustness Index of f_\theta with respect to (\mathcal{T}, \mathcal{D}, \mathcal{C}) is the vector:
\mathrm{MRI}(f_\theta; \mathcal{T}, \mathcal{D}, \mathcal{C}) = (r_1, r_2, \ldots, r_k) \in [0, 1]^k
where
r_i = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \mathcal{C}\bigl(f_\theta(x),\; f_\theta(\tau_i(x))\bigr).
The MRI is a vector-valued invariant of the model. It is not aggregated. Its dimensionality equals the number of perturbation types, and each component is independently interpretable as the model’s consistency under one class of perturbation.
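Definition A10.1 maps directly to code. A minimal sketch with a toy quantizing "model" and exact-match consistency (all names here are ours, standing in for f_\theta, \mathcal{T}, \mathcal{D}, and \mathcal{C}):

```python
def mri(model, perturbations, dataset, consistency):
    """Model Robustness Index: one consistency component per
    perturbation type, averaged over the evaluation dataset.
    Returned as a tuple -- never aggregated into a scalar."""
    return tuple(
        sum(consistency(model(x), model(tau(x))) for x in dataset) / len(dataset)
        for tau in perturbations
    )

# Toy instance: a quantizing model on numbers, two perturbation types.
model = lambda x: round(x)
taus = [lambda x: x + 0.01,   # tiny shift: output usually invariant
        lambda x: x + 0.6]    # large shift: output usually flips
data = [0.1, 0.2, 0.3, 1.1, 1.2]
consistency = lambda y, yp: 1.0 if y == yp else 0.0

profile = mri(model, taus, data, consistency)  # (1.0, 0.0)
```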
Proposition A10.1 (MRI–Corruption Tensor Correspondence). Let C_{ij} be the corruption tensor (Chapter 5, Definition 5.5) mapping perturbation direction j to displacement along judgment dimension i, and let the consistency metric be \mathcal{C}(y, y') = 1 - \|y - y'\| / \Delta_{\max} where \Delta_{\max} is the scale of the output space. Then:
r_j = 1 - \frac{\|C_{\cdot j}\|}{\Delta_{\max}}
That is, the j-th MRI component is the complement of the normalized column norm of the corruption tensor along perturbation direction j.
A10.2 Sensitivity Profiling: Formal Definition
Definition A10.2 (Sensitivity Profile). For perturbation type \tau and model f_\theta with d-dimensional output, the sensitivity profile is the vector \mathbf{s}_\tau \in \mathbb{R}^d_{\geq 0} with components:
s_{\tau, i} = \mathbb{E}_{x \sim \mathcal{D}}\bigl[\,|f_{\theta, i}(x) - f_{\theta, i}(\tau(x))|\,\bigr]
Definition A10.3 (Sensitivity Matrix). The sensitivity matrix S \in \mathbb{R}^{d \times k}_{\geq 0} has entries S_{ij} = s_{\tau_j, i}, the sensitivity of output dimension i under perturbation type \tau_j. The matrix decomposes as:
S = S_{\text{universal}} + S_{\text{model}}
where S_{\text{universal}} captures shared fragilities across models (rows with uniformly high entries) and S_{\text{model}} captures model-specific variation (rows with high variance across models).
A10.3 Scalar Irrecoverability Theorem: Formal Statement and Proof Sketch
Theorem A10.1 (Scalar Irrecoverability). Let \{f_1, \ldots, f_n\} be a set of n \geq 2 models with MRI profiles \mathbf{r}^{(1)}, \ldots, \mathbf{r}^{(n)} \in [0,1]^k for k \geq 2. Suppose the profiles are Pareto-incomparable: for every pair (i, j), there exist perturbation indices a, b such that r^{(i)}_a > r^{(j)}_a and r^{(i)}_b < r^{(j)}_b. Then no function \pi: [0,1]^k \to \mathbb{R} that is strictly monotone in each argument can preserve all pairwise orderings. Formally, for any such \pi, there exist models i, j and a perturbation index c such that:
\pi(\mathbf{r}^{(i)}) > \pi(\mathbf{r}^{(j)}) \quad \text{but} \quad r^{(i)}_c < r^{(j)}_c.
Proof sketch. Suppose for contradiction that \pi is a strictly monotone function [0,1]^k \to \mathbb{R} that preserves all pairwise orderings along every perturbation axis simultaneously. Consider two Pareto-incomparable profiles \mathbf{r}^{(i)} and \mathbf{r}^{(j)} with r^{(i)}_a > r^{(j)}_a and r^{(i)}_b < r^{(j)}_b. Since \pi must respect the ordering along axis a, we need \pi(\mathbf{r}^{(i)}) > \pi(\mathbf{r}^{(j)}) (model i is better on axis a). Since \pi must respect the ordering along axis b, we need \pi(\mathbf{r}^{(i)}) < \pi(\mathbf{r}^{(j)}) (model j is better on axis b). These two requirements are mutually exclusive: \mathbb{R} is totally ordered, so at most one of the strict inequalities can hold. Therefore no such \pi exists. \square
Corollary A10.1. Any scalar “robustness score” computed from the MRI profile of Pareto-incomparable models necessarily reverses at least one pairwise ordering on at least one perturbation axis. The reversal is not a defect of the particular scoring function chosen; it is a topological impossibility inherent in projecting a partially ordered set onto a totally ordered set.
Corollary A10.2. The minimum dimensionality of a summary statistic that preserves all pairwise orderings across Pareto-incomparable models is k — the full dimensionality of the MRI profile. No dimensionality reduction below k is lossless with respect to pairwise orderings.
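Corollary A10.1 can also be exercised numerically: sweep strictly monotone weighted sums over a pair of Pareto-incomparable profiles and confirm that every one reverses some axis ordering. A sketch (the two profiles are invented for illustration):

```python
# Two Pareto-incomparable MRI profiles (k = 2).
r_i = (0.9, 0.3)   # better on axis 0
r_j = (0.4, 0.8)   # better on axis 1

def reverses_an_axis(pi):
    """True if ranking by the scalar pi contradicts the componentwise
    ordering on at least one perturbation axis."""
    top, bottom = (r_i, r_j) if pi(r_i) >= pi(r_j) else (r_j, r_i)
    return any(t < b for t, b in zip(top, bottom))

# Every strictly monotone weighted sum (0 < w < 1) reverses an axis.
assert all(
    reverses_an_axis(lambda r, w=w: w * r[0] + (1 - w) * r[1])
    for w in (x / 100 for x in range(1, 100))
)
```

The assertion holds for every weighting precisely because the profiles are incomparable: whichever model the scalar ranks higher, the other model is strictly better on some axis.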
References
Bond, A. H. (2026a). Geometric Methods in Computational Modeling. San Jose State University.
Bond, A. H. (2026b). Geometric Ethics: Moral Reasoning on the Judgment Manifold. San Jose State University.
Bond, A. H. (2026c). Measuring AGI: Five convergent measurements of cognitive capability in large language models. Kaggle Competition Report.
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. ICLR.
Hendrycks, D. & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. ICLR.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. ICLR.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. ICLR.