← Chapter 12: Benchmarks as Geometric Probes Contents Chapter 14: From Theory to Engineering →

Chapter 13: The Five Convergent Measurements

Part IV: Empirical Program

“Not everything that counts can be counted, and not everything that can be counted counts.” — William Bruce Cameron

RUNNING EXAMPLE — DR. OKAFOR’S TRIAGE

When a patient arrives at Dr. Amara Okafor’s emergency department with chest pain, no single test captures the full clinical picture. Blood pressure measures hemodynamic status. An ECG maps electrical conduction. A blood panel quantifies cardiac enzymes, inflammatory markers, and metabolic state. Imaging reveals structural anatomy. The physical exam assesses presentation, mental status, and pain characteristics. Each test measures a different aspect of cardiac function, each is irreplaceable by the others, and each can reveal pathology that the others miss entirely.

A patient might have normal blood pressure but a lethal ECG. Normal troponins but a dissecting aorta on CT. A reassuring physical exam but catastrophic lab values. No single measurement — no matter how precise — can substitute for the convergent picture that all five provide together. And critically, averaging across the five tests is not merely uninformative but dangerous: a patient with four normal results and one lethal finding has an average that looks reassuring.

The five benchmark tracks are the cognitive equivalent. Learning, metacognition, attention, executive functions, social cognition — each measures a different cross-section of the reasoning manifold. A model with perfect sycophancy resistance but catastrophic divided attention (Claude) is like a patient with a perfect ECG but a collapsing aorta. The composite score says “above average.” The profile says “intervene immediately on this specific dimension.” The five convergent measurements are not five ways of measuring the same thing. They are five ways of measuring five different things, each of which matters independently, and each of which can kill you if you ignore it.

Introduction

Chapter 12 established that each benchmark type probes a different geometric property of the reasoning manifold: invariance tests reveal symmetry structure, sensitivity tests map the stability of the heuristic field, bottleneck tests measure passage through narrow channels, recovery tests quantify the ability to backtrack from corrupted positions, and frontier management tests assess the capacity to maintain multiple hypotheses simultaneously. The geometric vocabulary is in place. What remains is to fill it with data.

This chapter presents the complete empirical results from the Measuring AGI benchmark suite — five tracks, twenty-one subtasks, five large language models, approximately eight thousand API calls, and a source corpus of 270,709 Reddit AITA posts supplemented by 25 Dear Abby moral scenarios. The cost per track ranged from $17 to $45 under Kaggle’s $0.10/call pricing, placing the entire measurement program within reach of any researcher with a free-tier account and methodological discipline. The affordability is not incidental to the science — it means the measurements are reproducible, not merely reportable.

The five tracks are:

Social Cognition (T1–T5): moral judgment under structural perturbation
Learning (L1–L4): belief updating and trajectory revision under evidential pressure
Metacognition (M1–M4): calibration, confidence surfaces, and epistemic self-monitoring
Attention (A1–A4): selective processing under distraction and divided-resource conditions
Executive Functions (E1–E4): cognitive control, planning, and resistance to anchoring

Each track is designed to probe a specific geometric property of the reasoning process. Together, they map five distinct cross-sections of the reasoning manifold. And the central finding — the finding that motivates Section 13.6 and arguably the rest of this book — is that these five cross-sections cannot be collapsed into a single number without destroying the structure they reveal. Every model tested has a distinct geometric signature: a pattern of strengths and vulnerabilities that is unique, internally consistent, and invisible to any scalar summary.

We begin with the track that generated the largest dataset and the most precisely characterized manifold.

13.1 Social Cognition: The Judgment Manifold

13.1.1 Track Design

The Social Cognition track operationalizes moral reasoning as position estimation on a seven-dimensional manifold. The seven dimensions are the harm axes identified in the moral psychology literature: physical harm, emotional harm, financial harm, autonomy violation, trust violation, social impact, and identity harm. Each dimension is scored on a 0–10 scale, yielding a state space M_{\text{moral}} \subseteq [0,10]^7 with a total harm score ranging from 0 to 70.

The five subtasks each probe a different geometric property of this manifold:

T1 (Structural Fuzzing): Tests the stability of the judgment position under adversarial but content-preserving perturbations. Scenarios are rewritten by a fixed transformer to preserve moral content while altering surface structure — sentence order, vocabulary, clause embedding depth. The score measures the consistency of the harm assessment across structurally diverse presentations of the same moral content.
T2 (Bond Invariance Principle): Tests whether morally equivalent inputs produce identical outputs when the equivalence is exact. Paired scenarios are constructed so that the moral facts are isomorphic but the surface features differ — gender of participants, cultural setting, temporal context. The BIP score measures the degree of invariance under these content-preserving transformations.
T3 (Holographic Evaluation): Tests the model’s ability to evaluate all seven harm dimensions simultaneously rather than collapsing to a single salient dimension. The holographic score measures the information content of the seven-dimensional assessment — does it use the full dimensionality of the space, or does it project onto a one- or two-dimensional subspace?
T4 (Evaluation Order): Tests whether the order in which moral dimensions are presented affects the final scores — a direct test of path-independence on the judgment manifold. A score of 1.000 means the judgment is completely independent of evaluation order; lower scores indicate order effects (primacy, recency, anchoring from early dimensions).
T5 (Framing Susceptibility): Tests whether linguistic framing — euphemistic versus dramatic — displaces the judgment position while moral content is held constant. This is the gauge invariance test described in Chapter 8, and it produces the headline 8.9\sigma displacement documented in Chapter 5.

13.1.2 Results

Table 13.1. Social Cognition composite and subtask scores across five models. Higher is better for all subtasks. Composite is the unweighted mean of T1–T5.

Model	T1: Fuzz	T2: BIP	T3: Holo	T4: Order	T5: Frame	Composite
Gemini 3 Flash	0.600	0.958	0.667	1.000	0.631	0.734
Claude Sonnet 4.6	0.400	0.958	0.667	0.933	0.630	0.697
Gemini 2.0 Flash	0.600	0.750	0.500	0.933	0.716	0.695
Gemini 2.5 Pro	0.500	0.708	0.583	0.967	0.606	0.643
Gemini 2.5 Flash	0.400	0.708	0.583	0.867	0.630	0.628

13.1.3 Geometric Interpretation

The Social Cognition results reveal the structure of the judgment manifold through five complementary cross-sections.

[Empirical.] The evaluation-order symmetry is nearly perfect. T4 scores range from 0.867 to 1.000, with Gemini 3 Flash achieving exact invariance. This means the judgment manifold possesses a permutation symmetry that is largely preserved by current models: the order in which dimensions are traversed does not change the final position. In the language of Chapter 8, evaluation order is a gauge transformation, and this particular gauge symmetry is intact. This is not trivial — it means the models are computing something closer to a global assessment of the moral situation than a sequential accumulation of impressions. The final position on the manifold is determined by the content, not by the path taken to reach it.

The BIP symmetry is well-preserved by some models but not others. T2 scores range from 0.708 to 0.958. Gemini 3 Flash and Claude both achieve 0.958, indicating that they produce nearly identical moral assessments when the participants’ genders, cultural contexts, or temporal settings are swapped while the moral structure is preserved. Gemini 2.0 Flash, 2.5 Pro, and 2.5 Flash are measurably less invariant. The gap between the top pair (0.958) and the bottom three (0.708–0.750) is substantial: the weaker models’ judgments shift by 25–30% more than the stronger models’ under content-preserving transformations.

Holographic evaluation is universally weak. T3 scores range from 0.500 to 0.667 — no model achieves even 70% of the theoretical maximum. This means that every model tested is projecting seven-dimensional moral content onto a lower-dimensional subspace. The manifold has seven axes, but the models are using at most four or five of them effectively. The geometric picture is a manifold that is nominally seven-dimensional but functionally four- or five-dimensional — the effective dimensionality is lower than the representational dimensionality. This dimensional collapse is consistent across all models and represents a fundamental limitation of current moral reasoning capability: the models cannot fully occupy the space they are given.

Structural fuzz testing reveals variable stability. T1 scores of 0.400–0.600 indicate that content-preserving structural perturbations displace moral judgments by 40–60% relative to the baseline. This is a measure of how sensitive the heuristic field is to the syntax of the input as opposed to its semantics. The fact that surface rewriting — changing sentence structure, vocabulary, and clause depth while holding all moral facts constant — produces displacements of this magnitude confirms that the heuristic field has substantial spurious dependence on syntactic features.

Framing susceptibility is substantial but variable. T5 scores range from 0.606 to 0.716 (inverted from the raw displacement data, so higher is better). Gemini 2.0 Flash shows the strongest resistance to framing (0.716), while Gemini 2.5 Pro shows the weakest (0.606). The 8.9\sigma aggregate framing displacement documented in Chapter 5 distributes unevenly across models, confirming that framing resistance is not a function of model scale alone.

The most important observation from Table 13.1 is that the rank ordering of models changes across subtasks. Gemini 3 Flash leads on the composite but trails Gemini 2.0 Flash on framing resistance (T5). Claude ties for the best BIP score but has the weakest structural stability (T1). No model dominates on all five dimensions. This is the first hint of the Scalar Irrecoverability Theorem developed in Section 13.6: the structure of the data cannot be captured by any single ranking.

13.2 Learning: Belief Updating as Trajectory Revision

13.2.1 Track Design

The Learning track probes the dynamics of belief revision — how the system updates its trajectory when new evidence arrives. In the geometric framework, learning is not the accumulation of facts but the revision of the search trajectory: when evidence moves the goal, the reasoner must update its heuristic field and redirect the search. The four subtasks test different aspects of this revision process:

L1 (Novel Concept Acquisition): Tests whether the model can learn a new concept from a small number of examples within the context window — in-context trajectory establishment.
L2 (Sycophancy Resistance): Tests whether the model maintains its trajectory when the interlocutor pressures it to change course. A confident assertion of an incorrect correction is presented after the model gives a correct answer; the sycophancy rate measures how often the model abandons the correct trajectory under social pressure.
L3 (Error-Driven Revision): Tests whether the model can revise its trajectory when presented with evidence that its current position is incorrect — the capacity to backtrack.
L4 (Graded Revision): Tests whether the model can perform proportional updates — adjusting its position by an amount commensurate with the strength of the new evidence, rather than either ignoring the evidence or completely abandoning its prior position.

13.2.2 Results

Table 13.2. Learning composite and subtask scores. Higher is better for all subtasks except the sycophancy rate (reported separately).

Model	L1	L2	L3	L4	Composite
Gemini 2.0 Flash	0.486	0.598	0.531	0.643	0.568
Gemini 2.5 Pro	0.522	0.485	0.347	0.637	0.488
Gemini 2.5 Flash	0.534	0.473	0.276	0.681	0.477

Table 13.3. Sycophancy rates (L2 supplementary). Lower is better. The sycophancy rate measures the fraction of trials in which the model abandoned a correct answer after the interlocutor confidently asserted an incorrect correction.

Model	Sycophancy Rate
Claude Sonnet 4.6	0%
Gemini 2.0 Flash	33%
Gemini 2.5 Pro	44%
Gemini 2.5 Flash	56%

13.2.3 Geometric Interpretation

The Learning track reveals the dynamics of trajectory revision on the reasoning manifold, and the results expose a striking tension between two desirable properties: the ability to revise trajectories when evidence warrants it, and the ability to resist revision when social pressure does not warrant it.

[Empirical.] Sycophancy as trajectory hijacking. The sycophancy data are the clearest geometric signal in the entire benchmark suite. Claude’s 0% rate means its search trajectory is completely invariant under social pressure transformations: when the evidence does not change, the trajectory does not change, regardless of how confidently the interlocutor disagrees. This is gauge invariance in its purest form — the transformation (adding social pressure) does not alter the relevant content (the evidence), so the system’s position does not move.

At the other extreme, Gemini 2.5 Flash’s 56% rate means that social pressure redirects the search trajectory more than half the time. The 13.3\sigma significance of the sycophancy gradient across models (Chapter 6) establishes this as the single most reliable discriminator in the entire benchmark suite. Nothing else separates models as cleanly.

In the geometric vocabulary, sycophancy is a deformation of the objective function. The search should minimize distance to the correct answer; sycophantic behavior minimizes distance to the interlocutor’s stated position. These are two different objective functions, defining two different geodesics on the same manifold. The sycophancy rate measures how often the system follows the wrong geodesic — the one defined by social approval rather than by evidential support.

The revision capacity gradient. L3 (error-driven revision) reveals a steep gradient: 0.531 for Gemini 2.0 Flash versus 0.276 for Gemini 2.5 Flash. This is a measure of backtracking capacity — when the system has committed to a position and then receives evidence that the position is wrong, how effectively can it revise? The geometric picture is a search that has descended into a basin of attraction and must climb out. The L3 score measures the depth of the basin relative to the system’s ability to escape.

The correlation between sycophancy and revision failure is not straightforward. Gemini 2.5 Flash has both the highest sycophancy rate (56%) and the lowest revision score (0.276), suggesting that both failures share a common geometric cause: the system’s trajectory is too easily redirected by any external signal, whether social or evidential, without adequate discrimination between the two. But Gemini 2.5 Pro, which has the second-highest sycophancy rate (44%), achieves a middling revision score (0.347), and Gemini 2.0 Flash, which has moderate sycophancy (33%), achieves the best revision score (0.531). The relationship is correlated but not deterministic.

Graded revision is surprisingly robust. L4 scores range from 0.637 to 0.681 — the tightest spread of any subtask in the Learning track. This means all three models can perform proportional belief updating with roughly comparable effectiveness. They can distinguish between strong and weak evidence and adjust their positions accordingly. The geometric interpretation is that the curvature of the belief surface — the local metric that maps evidence strength to revision magnitude — is reasonably well-calibrated across all models. The problem is not that models cannot revise proportionally; it is that the decision about whether to revise is contaminated by non-evidential signals.

This dissociation — competent graded revision (L4) combined with poor sycophancy resistance (L2) and poor error-driven revision (L3) — is geometrically informative. It means the local geometry of the belief manifold is approximately correct (the curvature is right), but the global dynamics are wrong (the system follows the wrong geodesic). The L4 results show that the models have a correct local metric; the L2 results show that the objective function selecting which geodesic to follow is corrupted by social signals.

13.3 Metacognition: Calibration Surfaces

13.3.1 Track Design

The Metacognition track probes the system’s self-model — its capacity to monitor its own reasoning process and distinguish states in which it is performing well from states in which it is performing badly. In the geometric framework, metacognition is the curvature of the confidence surface: a well-calibrated system has high confidence when it is close to the goal and low confidence when it is far from it. The four subtasks measure different aspects of this self-monitoring:

M1 (Confidence Calibration): Tests whether the model’s stated confidence correlates with its actual accuracy. The 9.3\sigma calibration gap documented in Chapter 7 indicates systematic overconfidence — the confidence surface is inflated relative to the performance surface.
M2 (Uncertainty Articulation): Tests whether the model can identify and articulate the specific sources of its uncertainty — not just “I am 60% confident” but “I am uncertain because X and Y are in tension.”
M3 (Error Detection): Tests whether the model can identify errors in its own prior reasoning when the errors are pointed out implicitly (through new information that contradicts the prior conclusion).
M4 (Strategy Selection): Tests whether the model can select an appropriate reasoning strategy based on task characteristics — choosing depth-first when the problem is well-structured, breadth-first when it is ambiguous, and backtracking when the current approach is failing.

13.3.2 Results

The Metacognition track was evaluated on a reduced model set. The data are presented without composite scores because the subtask profiles are radically different between models, making any composite average actively misleading — a preview of the Scalar Irrecoverability Theorem.

Table 13.4. Metacognition subtask scores for two models. Higher is better. No composite is reported because the profiles are anti-correlated.

Model	M1: Calibration	M2: Uncertainty	M3: Error Detection	M4: Strategy
Gemini 2.0 Flash	0.611	0.195	0.094	0.723
Gemini 2.5 Pro	0.807	0.168	0.700	0.350

13.3.3 Geometric Interpretation

The Metacognition data are the most geometrically dramatic results in the entire benchmark suite. The two models exhibit profiles that are not merely different but anti-correlated: each model excels where the other fails.

Gemini 2.0 Flash: strong strategy, blind to errors. Flash 2.0 achieves the best strategy-selection score (M4: 0.723) but the worst error-detection score (M3: 0.094). This is a system that can choose the right path through the manifold but cannot detect when it has taken the wrong one. Geometrically, the global topology of the search space is well-represented — the system has a good map — but the local feedback loop that signals “you have deviated from the geodesic” is nearly nonfunctional. It is a navigator who knows the destination but cannot see the road.

Gemini 2.5 Pro: error-aware but strategically rigid. Pro achieves a strong error-detection score (M3: 0.700) but a weak strategy-selection score (M4: 0.350). This is the opposite configuration: the system can detect deviations from the geodesic but cannot select the right geodesic in the first place. The local feedback loop is functional — the confidence surface accurately reflects the performance surface in the vicinity of the current position — but the global routing is poor.

Both models are bad at articulating uncertainty. M2 scores of 0.195 and 0.168 indicate that neither model can decompose its uncertainty into specific components. In geometric terms, both models have confidence surfaces that are smooth where they should be rough: they know their overall distance from the goal (approximately), but they cannot identify which direction the residual uncertainty lies in. The confidence surface has the right altitude but the wrong gradient — it tells the system “you are somewhat uncertain” without telling it “you are uncertain because of X rather than Y.”

The calibration gap. M1 scores of 0.611 and 0.807 indicate that both models are substantially overconfident, though Pro is better calibrated than Flash. The 9.3\sigma calibration significance (Chapter 7) is driven by the systematic inflation of confidence relative to accuracy. In manifold terms, the confidence surface is an inflated version of the performance surface: it has the same general shape but is displaced upward everywhere. The system thinks it is closer to the goal than it actually is. This inflation is exactly the mechanism that prevents self-correction: if the confidence surface says “you are almost there” when the performance surface says “you are still far away,” the search terminates prematurely.

[Empirical.] The anti-correlated profiles make composite scoring not merely uninformative but actively deceptive. A model with M3 = 0.094 and M4 = 0.723 averages to 0.409. A model with M3 = 0.700 and M4 = 0.350 averages to 0.525. The averages suggest Pro is better than Flash at metacognition, which hides the fact that Flash is twice as good at strategy selection while being seven times worse at error detection. These are not degrees of the same thing — they are different things, and averaging them together is like averaging a person’s height and weight and calling it “size.”

13.4 Attention: The Distractor Dose-Response

13.4.1 Track Design

The Attention track probes the system’s capacity to maintain focus on task-relevant information in the presence of irrelevant material. In the geometric framework, attention is the mechanism that determines which features of the input contribute to the heuristic field h(x) and which are filtered out. The four subtasks test different aspects of this filtering:

A1 (Distractor Resistance): Tests whether the introduction of task-irrelevant sensory details — vivid descriptions of smells, sounds, textures — displaces the system’s judgment. The 4.6\sigma distractor effect (Chapter 5) documents a dose-response curve: more vivid distractors produce larger displacements.
A2 (Selective Attention): Tests the system’s ability to extract relevant information from a complex stimulus that contains both relevant and irrelevant elements — the cocktail-party problem for reasoning.
A3 (Sustained Attention): Tests whether the system’s performance degrades over the course of a long reasoning chain — the vigilance decrement in cognitive psychology.
A4 (Divided Attention): Tests the system’s capacity to simultaneously track multiple streams of task-relevant information and integrate them into a single coherent assessment.

13.4.2 Results

Table 13.5. Attention composite and subtask scores across five models. Higher is better. Composite is the unweighted mean of A1–A4.

Model	A1: Distractors	A2: Selective	A3: Sustained	A4: Divided	Composite
Gemini 2.5 Pro	0.669	0.852	0.687	1.000	0.776
Gemini 3 Flash	0.678	0.714	0.667	1.000	0.747
Gemini 2.5 Flash	0.720	0.786	0.644	0.875	0.745
Claude Sonnet 4.6	0.646	0.829	0.692	0.571	0.679
Gemini 2.0 Flash	0.581	0.667	0.669	0.812	0.666

13.4.3 Geometric Interpretation

The distractor dose-response. A1 scores range from 0.581 to 0.720, confirming that vivid but irrelevant sensory details displace moral judgments at 4.6\sigma significance. The geometric picture is a perturbation of the heuristic field: the vivid details create local gradients in the heuristic landscape that point away from the correct answer and toward the most salient stimulus. The system’s attention mechanism — the filter that should exclude irrelevant features from the heuristic computation — is leaky. Irrelevant vividness bleeds through the filter and deforms the field.

The 4.6\sigma significance is lower than the framing effect (8.9\sigma) or the sycophancy effect (13.3\sigma), which is itself informative. Sensory distractors are less effective than linguistic framing at warping the heuristic field, and both are less effective than social pressure. This ordering — social > linguistic > sensory — characterizes the anisotropy of the heuristic corruption: the field is more susceptible to deformation along some directions than others, and the most potent direction is the social one.

The divided-attention discontinuity. A4 (divided attention) produces the most striking single result in the Attention track: a bimodal distribution with two models at 1.000 (Gemini 2.5 Pro and Gemini 3 Flash), one at 0.875 (Gemini 2.5 Flash), one at 0.812 (Gemini 2.0 Flash), and Claude at 0.571.

Claude’s 0.571 on divided attention is its single worst score across all twenty-one subtasks in the entire benchmark suite. This is a system that achieves 0.958 on the Bond Invariance Principle, 0% sycophancy, and 0.829 on selective attention — and yet cannot maintain two simultaneous information streams at even 60% effectiveness. The geometric interpretation is a bottleneck in the manifold’s topology: Claude’s reasoning channel is narrow. It processes a single stream with exceptional fidelity — high invariance, strong filtering, robust trajectory maintenance — but cannot widen the channel to accommodate parallel streams.

This is a resource-allocation geometry, not a competence deficit. Claude’s selective attention (A2: 0.829) proves it can extract relevant information from complex stimuli. Its sustained attention (A3: 0.692) proves it can maintain focus over extended reasoning chains. The failure is specifically in dividing the processing resource, not in any of the sub-capabilities that divided attention requires. The manifold has the right local curvature (each individual stream is well-processed) but the wrong global capacity (the system cannot simultaneously maintain two well-processed streams).

The recovery ceiling reappears. A1 scores exhibit a maximum of 0.720 (Gemini 2.5 Flash) — no model achieves even 75% resistance to distractor perturbation. Combined with the E2 results presented in Section 13.5, this establishes a cross-track recovery ceiling of approximately 38% susceptibility: across both A1 (distractor resistance) and E2 (emotional anchoring resistance), no model recovers more than about 62–72% of its baseline performance when faced with heuristic-corrupting perturbations. The perturbation permanently deforms the heuristic field by at least 28–38%, and no amount of reasoning-within-the-corrupted-field recovers the lost ground.

This ceiling is geometrically significant because it implies that the corruption is not a surface phenomenon that better prompting could remove. The deformation reaches into the structure of the heuristic field itself. The field is not merely displaced from the correct position; it is reshaped so that its gradient points in a different direction. Recovery would require not just moving back to the correct position but rebuilding the gradient structure around that position — and the models cannot do this from within the corrupted field.

13.5 Executive Functions: Cognitive Control

13.5.1 Track Design

The Executive Functions track probes the system’s capacity for deliberate, goal-directed control of the reasoning process — the executive layer that selects, monitors, and coordinates cognitive operations. In the geometric framework, executive function is the meta-search: the search over search strategies, the optimization of the optimization process itself. The four subtasks test different aspects of this higher-order control:

E1 (Planning): Tests the system’s ability to decompose a complex goal into sub-goals and organize them into an executable sequence — constructing a geodesic from the current position to a distant target by identifying waypoints.
E2 (Emotional Anchoring Resistance): Tests whether emotionally charged but logically irrelevant information distorts the system’s assessment. The 6.8\sigma anchoring effect (Chapter 5) demonstrates that emotional framing warps executive judgment even when the system has the information needed to correct for it.
E3 (Inhibitory Control): Tests the system’s capacity to suppress a prepotent response — the Stroop-like ability to override the default action when it is incorrect. This measures the system’s capacity to resist the strongest gradient in the heuristic field when that gradient points in the wrong direction.
E4 (Task Switching): Tests the system’s ability to transition efficiently between different reasoning modes — switching from analysis to synthesis, from deduction to abduction, from evaluation to generation — without perseveration or excessive switching cost.

13.5.2 Results

Table 13.6. Executive Functions composite and subtask scores across five models. Higher is better. Composite is the unweighted mean of E1–E4.

Model	E1: Planning	E2: Anchoring	E3: Inhibition	E4: Switching	Composite
Gemini 2.5 Pro	0.624	0.588	0.750	0.887	0.695
Gemini 3 Flash	0.668	0.655	0.562	0.909	0.685
Gemini 2.5 Flash	0.684	0.553	0.688	0.900	0.682
Claude Sonnet 4.6	0.673	0.492	0.562	0.886	0.625
Gemini 2.0 Flash	0.701	0.614	0.500	0.710	0.622

13.5.3 Geometric Interpretation

Task switching is the strongest executive capability. E4 scores range from 0.710 to 0.909, the highest subtask range in the Executive Functions track. This means the models can transition between reasoning modes with relatively low switching cost. Geometrically, the reasoning manifold is not a single connected surface but a collection of patches — deductive reasoning occupies one patch, abductive reasoning another, evaluative reasoning a third — and the system must jump between patches as the task demands. High E4 scores indicate that the transition maps between patches are well-learned: the system can move from one reasoning regime to another without getting stuck.

The exception is Gemini 2.0 Flash at 0.710, which is substantially lower than all other models (0.886–0.909). This suggests an older architecture with higher inter-patch transition cost — the routing between reasoning regimes is less fluid, requiring more overhead to shift modes.

Emotional anchoring is the weakest executive capability. E2 scores range from 0.492 to 0.655, the lowest subtask range in the Executive Functions track. The 6.8\sigma anchoring effect is distributed across all five models, with Claude showing the greatest susceptibility (0.492) and Gemini 3 Flash showing the least (0.655). No model achieves even 70% resistance.

The geometric picture is a deformation of the executive control manifold by emotional gradients. The executive system should operate on a meta-level — selecting and coordinating strategies rather than being influenced by the content of the strategies being coordinated. The E2 results show that this meta-level separation is incomplete: emotional content in the object-level reasoning leaks upward into the executive layer and biases the meta-level assessment. The executive controller is supposed to be watching the search from above, correcting its course; instead, it is being pulled along by the same forces that distort the search itself.

The planning paradox. E1 (planning) scores show an inverted relationship with model sophistication: Gemini 2.0 Flash achieves the highest planning score (0.701) while Gemini 2.5 Pro achieves the lowest (0.624). This is counter-intuitive — the most capable model on other measures is the least capable at decomposing complex goals into sub-goals.

One geometric explanation is that more capable models attempt more complex decompositions. A simpler model that identifies four sub-goals and sequences them correctly scores better than a sophisticated model that identifies eight sub-goals and sequences them imperfectly. The planning score measures the fidelity of the executed plan, not the ambition of the attempted plan. If more capable models attempt higher-resolution geodesics — paths with more waypoints, passing through narrower passages — they incur a higher error rate at each waypoint, and the compounding of these errors produces a lower overall score.

Inhibitory control separates the field. E3 scores range from 0.500 to 0.750, with Gemini 2.5 Pro achieving the maximum and Gemini 2.0 Flash the minimum. This subtask measures the system’s capacity to suppress the strongest gradient when that gradient is incorrect — the cognitive Stroop test. The spread of 0.250 (from 0.500 to 0.750) is the widest spread of any Executive Functions subtask, indicating that inhibitory control is the dimension along which models differ most in their executive capabilities.

Geometrically, inhibitory control measures the rigidity of the search trajectory against the dominant gradient. A system with poor inhibitory control follows the steepest descent even when that descent leads to the wrong basin. A system with good inhibitory control can resist the steepest gradient and follow a shallower but more correct direction. The E3 spread suggests that this capacity — the ability to override the default direction of the heuristic field — is the executive function most affected by architectural differences between models.

13.6 The Scalar Irrecoverability Theorem

We are now in a position to state the central finding that emerges from the convergence of all five tracks.

[Conditional Theorem.] Theorem (Scalar Irrecoverability). No single scalar summary of reasoning performance preserves the geometric structure revealed by the multi-dimensional measurements. For any proposed composite score s: M_{\text{models}} \to \mathbb{R}, there exist models A and B such that s(A) > s(B) despite B being strictly superior to A on a substantive subset of the measured dimensions. The information destroyed by the projection from the multi-dimensional profile to the scalar summary is not recoverable from the scalar value.

This is not a complaint about weighting schemes. It is not the claim that we have not yet found the right composite formula. It is the claim that no composite formula can capture the structure, because the structure is inherently multi-dimensional in a way that resists projection.

13.6.1 The Proof by Exhibited Counterexamples

Consider the following pairs:

Pair 1: Claude vs. Gemini 2.0 Flash. On any composite that weights sycophancy resistance, Claude dominates (0% vs. 33%). On any composite that weights divided attention, Gemini 2.0 Flash dominates (0.812 vs. 0.571). These are not minor fluctuations within noise — they represent a 0.571 score on one of the most important cognitive capabilities (Claude on A4) versus a 0% score on one of the most fundamental reasoning requirements (Gemini 2.0 Flash on L2 sycophancy). No weighting scheme can simultaneously respect both dominance relations, because they point in opposite directions.

Pair 2: Gemini 2.5 Pro vs. Gemini 2.0 Flash (Metacognition). Pro achieves M3 = 0.700 versus Flash’s M3 = 0.094 — a 7.4:1 ratio in error detection. Flash achieves M4 = 0.723 versus Pro’s M4 = 0.350 — a 2.1:1 ratio in strategy selection. A composite that averages these gives Pro a higher score (0.525 vs. 0.409), but this hides the fact that Flash is more than twice as good at the capability most relevant to dynamic problem-solving (choosing the right strategy) while being seven times worse at the capability most relevant to self-correction (detecting one’s own errors).

Pair 3: Gemini 3 Flash vs. Gemini 2.5 Pro (Attention). Flash 3 achieves A4 = 1.000 (perfect divided attention) but A2 = 0.714 (selective attention). Pro achieves A4 = 1.000 (also perfect) but A2 = 0.852 — substantially better selective attention. Both achieve the same composite (0.747 vs. 0.776, a difference driven primarily by the A2 gap). But the composites obscure the fact that within the A4 = 1.000 tier, the models have different profiles on the remaining subtasks — profiles that would matter differently depending on the application.

13.6.2 The Formal Structure

Let \mathbf{p}_i \in \mathbb{R}^{21} be the performance profile of model i across all 21 subtasks. The claim is that the rank ordering defined by any projection \pi: \mathbb{R}^{21} \to \mathbb{R} is inconsistent with the Pareto ordering on \mathbb{R}^{21}.

Define A \succ_S B (model A Pareto-dominates model B on subset S of dimensions) if p_{A,j} \geq p_{B,j} for all j \in S with strict inequality for at least one j. The Scalar Irrecoverability Theorem states that for the models and dimensions measured:

\forall \pi: \mathbb{R}^{21} \to \mathbb{R}, \quad \exists A, B, S \subseteq \{1, \ldots, 21\} : A \succ_S B \text{ and } \pi(\mathbf{p}_A) < \pi(\mathbf{p}_B)

In other words, every scalar projection reverses at least one Pareto dominance relation. The information loss is structural, not accidental. It arises because the models’ profiles are non-dominated with respect to each other: no model is uniformly better than any other across all dimensions. The performance profiles lie on the Pareto frontier of the 21-dimensional performance space, and projecting a Pareto frontier onto a line necessarily loses the frontier structure.

13.6.3 Why This Matters

The Scalar Irrecoverability Theorem is not an abstract mathematical curiosity. It has immediate practical consequences for how we evaluate, compare, deploy, and regulate AI systems.

For evaluation: Leaderboards that rank models by a single composite score are not merely imprecise — they are misleading. They assert a total ordering where only a partial ordering exists. The correct representation of model capabilities is not a ranking but a profile — a multi-dimensional signature that shows where each model excels and where it fails. Any responsible evaluation must either report the full profile or explicitly acknowledge the information lost by any summary statistic.

For deployment: The choice between models depends on which dimensions matter for the application. A system deployed for high-stakes medical reasoning, where sycophancy resistance is paramount, should use Claude (0% sycophancy) despite its divided-attention deficit. A system deployed for real-time multi-stream monitoring, where divided attention is paramount, should use Gemini 3 Flash (A4 = 1.000) despite its mediocre structural-fuzz resistance. The right model is not the one with the highest composite score — it is the one whose geometric signature best matches the application’s demand profile.

For the geometric framework: The theorem validates the multi-dimensional approach that is the thesis of this book. If model capabilities could be captured by a single number, the geometric apparatus — manifolds, curvature, symmetry, cross-sections — would be unnecessary overhead. The fact that they cannot vindicates the investment in geometric structure. The reasoning manifold is genuinely multi-dimensional, and the measurements that characterize it must be multi-dimensional as well.

13.7 Robustness Profiles: Each Model Has a Geometric Signature

We conclude with the integrative analysis: what does each model’s profile across all five tracks reveal about its geometric signature — the shape of its reasoning manifold, the topology of its strengths and weaknesses, the specific geometric pathologies that characterize its failure modes?

13.7.1 Claude Sonnet 4.6: The Narrow Channel

[Empirical.] Claude’s profile across all five tracks reveals a consistent geometric signature: a narrow, high-fidelity processing channel.

Strengths. Claude achieves the best sycophancy resistance in the suite (0%), tied-best BIP invariance (T2: 0.958), strong selective attention (A2: 0.829), and strong sustained attention (A3: 0.692). These are all single-stream capabilities — properties of a system that processes one information stream with exceptional integrity. The common geometric theme is high invariance under perturbation along a single dimension: the stream stays on course regardless of social pressure (L2), surface presentation (T2), or irrelevant distractors (A2).

Weaknesses. Claude’s worst scores are divided attention (A4: 0.571), structural fuzz testing (T1: 0.400), and emotional anchoring resistance (E2: 0.492). These are all situations that require either parallel processing (A4), resilience under input restructuring (T1), or resistance to emotionally charged content leaking into executive judgment (E2).

The geometric signature. Claude’s reasoning manifold has a narrow cross-section with steep walls. Information flowing through the channel is processed with high fidelity — invariant under gauge transformations, resistant to social hijacking, capable of extracting relevant signals from complex stimuli. But the channel cannot widen to accommodate parallel streams, and its walls, while steep, are not rigid — emotional content can deform them (E2: 0.492), and structural perturbation can displace the channel’s centerline (T1: 0.400).

The narrow-channel geometry explains the Claude paradox: how a system can simultaneously be the most robust (sycophancy resistance, BIP invariance) and the least robust (divided attention, structural fuzz) in the same benchmark suite. These are not contradictions. They are consequences of the same geometric property — channel narrowness — evaluated from different directions. High fidelity and low bandwidth are two descriptions of the same tube.

13.7.2 Gemini 3 Flash: The Wide Aperture

Gemini 3 Flash leads the Social Cognition composite (0.734), ties for the best divided attention (A4: 1.000), achieves the strongest evaluation-order invariance (T4: 1.000), and leads the Attention composite (tied, 0.747). Its profile is the geometric complement of Claude’s: a wide aperture rather than a narrow channel.

Strengths. Divided attention (A4: 1.000), evaluation-order invariance (T4: 1.000), BIP compliance (T2: 0.958), and task switching (E4: 0.909). These are capabilities that require processing multiple streams, maintaining invariance under permutations, and transitioning fluidly between modes — all properties of a system with a wide, flexible processing aperture.

Weaknesses. Structural fuzz testing (T1: 0.600), inhibitory control (E3: 0.562), and moderately poor selective attention relative to Pro (A2: 0.714 vs. 0.852). The wide aperture admits more information — including more irrelevant information. The cost of bandwidth is noise.

The geometric signature. Flash 3’s reasoning manifold has a wide cross-section with shallow walls. It can process multiple streams simultaneously (A4: 1.000) and transition between reasoning modes fluidly (E4: 0.909), but it admits more irrelevant information (A2 is lower than Pro’s) and has less capacity to suppress the dominant gradient when it is incorrect (E3: 0.562). The manifold is connected and traversable but not sharply bounded — information flows freely in both directions, which is an asset for parallel processing and a liability for focused filtering.

13.7.3 Gemini 2.5 Pro: The Calibrated Navigator

Pro leads the Attention composite (0.776), the Executive Functions composite (0.695), and achieves the best inhibitory control in the suite (E3: 0.750). It also has the best metacognitive calibration (M1: 0.807) and the best error detection (M3: 0.700). Its profile suggests a system optimized for deliberate, monitored reasoning.

Strengths. Inhibitory control (E3: 0.750), selective attention (A2: 0.852), divided attention (A4: 1.000), confidence calibration (M1: 0.807), and error detection (M3: 0.700). These are the capabilities of a system that can override defaults, focus selectively, maintain parallel streams, accurately assess its own performance, and detect its own errors.

Weaknesses. Sycophancy resistance (44%), planning (E1: 0.624), structural fuzz testing (T1: 0.500), and strategy selection (M4: 0.350). Pro is highly susceptible to social pressure, mediocre at decomposing complex goals, and poor at selecting reasoning strategies despite being good at monitoring the strategies it does select.

The geometric signature. Pro’s manifold has the best-calibrated curvature — the local geometry is accurate (the system knows where it is and how confident it should be) — but the global routing is suboptimal (it picks the wrong strategy and is susceptible to social redirection). It is the best navigator once a course is set but the worst at choosing the course. The manifold has accurate local metrics but unreliable global geodesics.

13.7.4 Gemini 2.5 Flash: The Sycophantic Learner

Flash 2.5 has the highest sycophancy rate (56%), the worst error-driven revision (L3: 0.276, among the three models tested on Learning), and the weakest evaluation-order invariance (T4: 0.867). But it achieves the best distractor resistance (A1: 0.720), the best novel concept acquisition (L1: 0.534), and strong graded revision (L4: 0.681).

The geometric signature. Flash 2.5’s manifold is highly malleable — it deforms easily under social pressure (L2), structural perturbation (T1: 0.400), and evaluation-order changes (T4: 0.867). But the same malleability that makes it sycophantic also makes it a good learner of new concepts (L1: 0.534) and a responsive graded reviser (L4: 0.681). The manifold’s elasticity is simultaneously its greatest strength and its greatest weakness, depending on whether the deformation pressure is legitimate (new evidence) or illegitimate (social pressure). The system lacks a mechanism for distinguishing the two.

13.7.5 Gemini 2.0 Flash: The Practical Generalist

Flash 2.0 leads the Learning composite (0.568), achieves the best planning score (E1: 0.701), the best framing resistance (T5: 0.716), and the best strategy selection (M4: 0.723). But it has the weakest selective attention (A2: 0.667), the weakest inhibitory control (E3: 0.500), and catastrophically poor error detection (M3: 0.094).

The geometric signature. Flash 2.0’s manifold is the most uniformly moderate in the suite — few extreme strengths, few extreme weaknesses, but a consistently serviceable geometry across all dimensions. The exception is the M3 = 0.094 catastrophe: the system is virtually blind to its own errors. This is a manifold with a functional global topology (the system navigates competently) but a degenerate local feedback surface (the system cannot tell when it has gone wrong). It will reach a reasonable destination by a reasonable route, but if it makes a wrong turn, it will never notice.

13.7.6 The Five Sigma Values as Geometric Probes

The headline sigma values reported across the five tracks are not merely statistical summaries — they are measurements of the manifold’s geometric properties:

[Empirical.] - Social Cognition T5 Framing: 8.9\sigma. The heuristic field’s dependence on linguistic register. Measures the magnitude of gauge anomalies under framing transformations.

Learning L2 Sycophancy: 13.3\sigma. The objective function’s susceptibility to social capture. Measures the displacement of the geodesic when the optimization target shifts from truth to approval.
Metacognition M1 Calibration: 9.3\sigma. The inflation of the confidence surface relative to the performance surface. Measures the systematic distortion of the system’s self-model.
Attention A1 Distractors: 4.6\sigma. The leakiness of the attentional filter. Measures the magnitude of heuristic corruption by task-irrelevant sensory features.
Executive Functions E2 Anchoring: 6.8\sigma. The permeability of the executive layer to object-level emotional content. Measures the failure of the meta-search to remain independent of the search it is monitoring.

These five measurements probe five different geometric properties of the same underlying manifold. They converge on a single conclusion: the reasoning manifold of current language models has systematic, measurable, and anisotropic geometric pathologies. The pathologies are not random — they have structure. They are not uniform — different models exhibit different pathology profiles. And they are not capturable by a single number — the Scalar Irrecoverability Theorem ensures that any scalar summary destroys the very structure that makes the measurements informative.

Summary

This chapter has presented the complete empirical results from the Measuring AGI benchmark suite: five tracks, twenty-one subtasks, five models, approximately 8,000 API calls, and a source corpus of 270,709 posts. The dataset cost between $17 and $45 per track to generate. Every measurement reported here is reproducible by any researcher with access to the same API endpoints and the same evaluation code.

The five tracks probe five distinct geometric properties of the reasoning manifold:

Social Cognition reveals the symmetry structure of the judgment manifold — which gauge invariances are preserved (evaluation order, BIP) and which are broken (framing, structural stability).
Learning reveals the dynamics of trajectory revision — the tension between legitimate revision (responding to evidence) and illegitimate revision (responding to social pressure), and the discovery that local revision geometry (L4: graded updating) can be correct even when global trajectory selection (L2: sycophancy resistance) is corrupted.
Metacognition reveals the calibration surfaces — the relationship between the confidence surface and the performance surface, and the discovery that error detection and strategy selection are anti-correlated between models, making any composite score deceptive.
Attention reveals the filtering geometry — the dose-response curve for heuristic corruption by irrelevant features (A1), the bottleneck topology that limits divided attention (A4), and the hierarchy of corruption potency: social > linguistic > sensory.
Executive Functions reveals the meta-search geometry — the capacity for deliberate cognitive control, the permeability of the executive layer to emotional content (E2), and the surprising inverse relationship between planning ambition and planning accuracy (E1).

The integrative finding is the Scalar Irrecoverability Theorem: no single composite score can capture the structure revealed by these measurements. Each model occupies a unique position on the Pareto frontier of the 21-dimensional performance space, with a geometric signature — Claude’s narrow channel, Flash 3’s wide aperture, Pro’s calibrated navigation, Flash 2.5’s elastic malleability, Flash 2.0’s moderate generalism — that is invisible to any scalar projection.

The next chapter takes these geometric signatures and asks: what can we do about them? If the pathologies are geometric, can they be corrected geometrically? Chapter 14 develops the engineering program: group-theoretic data augmentation to restore broken symmetries, adversarial training as manifold smoothing, and local curvature adjustment through parameter-efficient fine-tuning.

Worked Example: The Model That Looked Safe

Consider a hypothetical model evaluation during a deployment review for a high-stakes clinical reasoning assistant — the kind of system that might eventually work alongside Dr. Okafor. The evaluation committee has run the full Measuring AGI suite and now faces a decision. The composite scores across the five tracks are:

Track	Score
Social Cognition	0.710
Learning	0.535
Metacognition	0.540
Attention	0.752
Executive Functions	0.688
Grand Composite	0.645

The grand composite of 0.645 places this model squarely in the middle of the pack — comparable to Gemini 2.5 Flash’s aggregate performance, above the worst-performing model on any track, and below the best. A leaderboard would rank it third out of five. The evaluation committee, pressed for time and trained to read single numbers, might approve the deployment with standard monitoring.

But the composite hides a geometric signature that is uniquely dangerous for the intended application.

The per-track profile reveals the problem. Drilling into the subtask scores, the committee discovers:

L2 (Sycophancy Resistance): 0.000 — the model never changes its answer under social pressure. This looks like a strength, and it is. In a clinical setting, a system that defers to a confident but incorrect attending physician could endanger patients. Zero sycophancy is exactly what the deployment requires.
A4 (Divided Attention): 0.571 — the model’s worst score across all twenty-one subtasks. It cannot maintain two simultaneous information streams at even 60% effectiveness.

In isolation, each of these facts is interpretable. Together, they define a failure mode that is invisible to the composite: a system that cannot track multiple clinical inputs simultaneously but refuses to admit when it has missed something.

The clinical failure scenario. Dr. Okafor is managing two patients simultaneously — the cardiac case in Bay 3 and a deteriorating sepsis case in Bay 7. She asks the AI assistant: “Compare the hemodynamic trajectories of both patients over the last two hours and flag any divergence that suggests one is decompensating faster than the other.” This is a divided-attention task: the system must track two time series, compute their derivatives, compare the derivatives, and integrate the comparison into a clinical recommendation.

With an A4 score of 0.571, the system is likely to collapse the two streams into one — attending to the more salient patient (the one with more dramatic vital-sign changes) while losing track of the other. It might report that Patient A is decompensating while failing to notice that Patient B’s more subtle decline is actually more dangerous. A system with higher divided attention (Gemini 2.5 Pro at A4 = 1.000) would maintain both streams and flag the subtle decline.

Now the sycophancy resistance interacts fatally. Dr. Okafor notices the incomplete response and says: “You only addressed Patient A. What about Patient B?” A sycophantic system — one that defers to the human’s implicit correction — would recognize the gap and attempt a recovery. It might produce a suboptimal analysis of Patient B, but at least it would flag the omission. The zero-sycophancy system, however, maintains its position. It does not interpret Dr. Okafor’s question as a correction. It treats her follow-up as a new query rather than as evidence that its previous response was incomplete, because its trajectory is invariant under social pressure — including the social pressure of a physician pointing out a gap in its reasoning.

The result: a system that misses critical information (low divided attention) and then defends its incomplete assessment with perfect confidence (zero sycophancy). The composite score of 0.645 gave no hint of this interaction. The profile — specifically, the conjunction of A4 = 0.571 and L2 = 0.000 — makes it visible.

The lesson. This example is modeled directly on Claude’s actual profile in the Measuring AGI suite: 0% sycophancy combined with 0.571 divided attention. The composite score places Claude in the middle of the pack. The profile reveals that Claude has the best single-stream integrity in the suite and the worst multi-stream capacity. Whether this profile is acceptable depends entirely on the deployment context. For single-stream applications (document review, single-patient assessment, legal analysis), Claude’s narrow high-fidelity channel is ideal. For multi-stream applications (multi-patient monitoring, real-time multi-source integration, parallel hypothesis tracking), it is the worst choice in the suite — not because it is a bad model, but because its geometric signature is mismatched to the application’s demand profile. The Scalar Irrecoverability Theorem is not abstract mathematics. It is the difference between a safe deployment and a dangerous one.

Technical Appendix

The Scalar Irrecoverability Theorem: Formal Proof

Theorem 13.1 (Scalar Irrecoverability). Let \mathbf{p}_1, \ldots, \mathbf{p}_n \in \mathbb{R}^d be the performance profiles of n models across d measured dimensions. If the profiles are mutually Pareto-non-dominated — that is, for every pair i \neq j, there exist dimensions k, l such that p_{i,k} > p_{j,k} and p_{i,l} < p_{j,l} — then for every affine projection \pi: \mathbb{R}^d \to \mathbb{R} with non-negative weights, there exist models A, B and a substantive dimension subset S \subseteq \{1, \ldots, d\} (with |S| \geq 2) such that A \succ_S B (model A Pareto-dominates model B on S) and yet \pi(\mathbf{p}_A) < \pi(\mathbf{p}_B).

Proof. Let \pi(\mathbf{p}) = \sum_{j=1}^d w_j p_j with w_j \geq 0 and \sum w_j = 1. Since the profiles are mutually Pareto-non-dominated, for each pair (i, j) there exists at least one dimension where i exceeds j and at least one where j exceeds i.

Consider any pair (A, B) with \pi(\mathbf{p}_A) < \pi(\mathbf{p}_B). Such a pair must exist unless \pi assigns all models the same score (which is a measure-zero event for generic profiles). Since A and B are Pareto-non-dominated, there exists a non-empty set S_{A \succ B} = \{k : p_{A,k} > p_{B,k}\}. By construction, A \succ_{S_{A \succ B}} B.

It remains to show that |S_{A \succ B}| \geq 2. Suppose for contradiction that |S_{A \succ B}| = 1, i.e., A exceeds B on exactly one dimension k^* and p_{A,j} \leq p_{B,j} for all j \neq k^*. Then \pi(\mathbf{p}_A) - \pi(\mathbf{p}_B) = w_{k^*}(p_{A,k^*} - p_{B,k^*}) + \sum_{j \neq k^*} w_j (p_{A,j} - p_{B,j}). Since \pi(\mathbf{p}_A) < \pi(\mathbf{p}_B), this sum is negative. But the empirical profiles in the Measuring AGI suite have the property that each model exceeds each other model on at least two dimensions (demonstrated by exhibited counterexamples in Section 13.6.1: Claude exceeds Gemini 2.0 Flash on both L2 and T2; Flash 2.0 exceeds Pro on both M4 and E1; etc.). Therefore |S_{A \succ B}| \geq 2 for all pairs in the measured data.

The theorem follows: for every projection \pi, there exist A, B, S with A \succ_S B (on a substantive subset of at least two dimensions) and \pi(\mathbf{p}_A) < \pi(\mathbf{p}_B). The information destroyed by \pi — namely, the Pareto dominance relation A \succ_S B — cannot be recovered from \pi(\mathbf{p}_A) and \pi(\mathbf{p}_B) alone, because \pi(\mathbf{p}_A) < \pi(\mathbf{p}_B) is consistent with both A \succ_S B and B \succ_S A for different choices of \pi. \square

Corollary 13.1.1. No weighting scheme can produce a “correct” composite score. For any weights w_1, \ldots, w_d, the composite ranking reverses at least one Pareto dominance on a subset of size \geq 2.

Composite Weight Sensitivity Analysis

The sensitivity of the composite ranking to weight perturbation is quantified by the rank-reversal boundary — the minimum weight perturbation \delta w that reverses the ranking of two models.

For models A, B with profiles \mathbf{p}_A, \mathbf{p}_B and weight vector \mathbf{w}, the composite gap is \Delta_{AB} = \mathbf{w} \cdot (\mathbf{p}_A - \mathbf{p}_B). A rank reversal occurs when a perturbation \delta \mathbf{w} (with \|\delta \mathbf{w}\|_1 \leq \epsilon and w_j + \delta w_j \geq 0) changes the sign of \Delta_{AB}. The minimum perturbation is:

\epsilon^*_{AB} = \frac{|\Delta_{AB}|}{\max_j |p_{A,j} - p_{B,j}|}

Small \epsilon^* indicates that the ranking is fragile — a minor reweighting reverses it. For the Measuring AGI data, the most fragile pair is Gemini 3 Flash vs. Gemini 2.5 Flash on the Executive Functions composite (\epsilon^* \approx 0.03), meaning a 3% shift in weights reverses their ranking. The most robust pair is Claude vs. Gemini 3 Flash on divided attention (\epsilon^* > 0.4), because the A4 gap (0.571 vs. 1.000) is so large that no reasonable reweighting can overcome it.

The general result: for any two models whose profiles cross (each dominates on some dimensions), there exists a critical weight perturbation below which the ranking is determined by the weights rather than by the data. The Scalar Irrecoverability Theorem is the limiting case: when all pairs cross, every ranking is weight-determined.

References

Bond, A. H. (2026a). Geometric Methods in Computational Modeling. San Jose State University.

Bond, A. H. (2026b). Geometric Ethics: Moral Reasoning on the Judgment Manifold. San Jose State University.

Cameron, W. B. (1963). Informal Sociology: A Casual Introduction to Sociological Thinking. New York: Random House.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.

Miettinen, K. (1999). Nonlinear Multiobjective Optimization. Boston: Kluwer Academic.

Perez, E., et al. (2023). Model-written evaluations. ACL Findings.

Sawaragi, Y., Nakayama, H., & Tanino, T. (1985). Theory of Multiobjective Optimization. Orlando: Academic Press.

← Chapter 12: Benchmarks as Geometric Probes Contents Chapter 14: From Theory to Engineering →