Chapter 17: Empirical Evidence for Geometric Ethics

RUNNING EXAMPLE — Priya’s Model

Priya runs TrialMatch’s patient descriptions through a BIP pipeline. The results confirm Dr. Osei’s suspicion: the algorithm’s output language systematically collapses the Obligation/Permission axis. For urban patients, TrialMatch generates ‘strong candidate for enrollment’ (obligation to match). For rural patients with identical medical profiles, it produces ‘may be considered if logistics permit’ (permission, not obligation). The deontic structure—O versus P—transfers perfectly across the descriptions, just as BIP experiments show 100% transfer of O/L structure across eleven languages. The bias is not in the surface words. It is in the moral geometry the words encode.

17.1 From Theory to Data

The preceding fifteen chapters have developed a mathematical framework for ethics — manifolds, tensors, metrics, stratification, dynamics, conservation laws, quantum structure, collective agency, contraction, and the limits of geometric determination. The framework makes structural claims about moral reasoning: that it is multi-dimensional, that obligations transform as vectors, that moral space is stratified and curved, that re-description symmetry implies conservation of harm, that deontic structure is language-invariant.

These claims, though mathematical in form, have empirical content. They make predictions about how moral reasoning behaves in practice — predictions that can be tested against data. This chapter presents the empirical evidence. (The preceding interlude on Philosophy Engineering (§17.0) provides the disciplinary framework for this chapter’s empirical program — the epistemic culture, workflow, and artifact standards that govern the predict-test-revise cycle.)

The evidence comes from three sources:

The Dear Abby corpus — 20,030 real moral dilemmas with expert ground-truth responses, spanning 32 years (1985–2017)

The Bond Invariance Principle experiments — cross-lingual analysis of 109,294 passages in 11 languages, testing whether deontic structure transfers across linguistic boundaries

The Dear Ethicist game — 93 engineered probes designed to test specific predictions of the geometric framework: semantic gate discreteness, correlative symmetry, path-dependence, and context sensitivity

The evidence is preliminary. The empirical program for geometric ethics is in its early stages, and many of the framework’s predictions remain untested. But the results to date are striking — and they support the framework’s central claims more strongly than one might expect from a first-generation empirical effort.

Methodological note: inductive discovery and systematic verification. The relationship between theory and data in this chapter may seem inverted. The mathematical framework was presented first (Parts I–V), and the empirical evidence is presented now as “confirmation.” But the actual development was iterative: early corpus analysis revealed structural patterns (context-dependent weighting, gate discreteness, correlative symmetry); these patterns suggested geometric formalization; the formalization generated new predictions; the predictions were tested against expanded data; failures led to revision (most significantly, the CHSH falsification that forced the gauge group from SU(2)ᵢ × U(1)ₕ to D₄ × U(1)ₕ). The verification strategy throughout has been systematic and brute-force: take the candidate mathematical structure, enumerate its predictions, test each prediction against as many cases as possible, and measure the deviation rates. This is closer to fuzz testing — the software engineering practice of bombarding a system with generated inputs to find violations — than to the classical philosophical method of constructing thought experiments. The advantage is coverage: 20,030 real dilemmas and 109,294 cross-lingual passages provide a scale of testing that thought experiments cannot match. The disadvantage is indirection: the measurements are mediated by NLP models, corpus selection, and statistical methodology. Both the advantage and the disadvantage are documented honestly in what follows.

17.2 The Dear Abby Corpus

The Dataset

The Dear Abby corpus consists of 20,030 letters published in the syndicated advice column “Dear Abby” (founded by Pauline Phillips, continued by Jeanne Phillips) between 1985 and 2017. Each letter presents a moral dilemma — a structured situation involving competing obligations, unclear rights, relational conflicts, or institutional tensions — and is accompanied by the columnist’s expert response.

The corpus is valuable for several reasons:

Ecological validity. The letters are real moral dilemmas written by real people facing real situations. They are not philosophical thought experiments designed to isolate a single variable. They involve the full complexity of moral life: multiple agents, competing values, ambiguous facts, emotional stakes, and institutional contexts.

Expert ground truth. The columnist’s response provides a consistent, publicly available, expert moral judgment for each case. The judgment is not authoritative in the sense that moral philosophers use the term (the columnist is not a philosopher), but it represents a skilled moral reasoner applying implicit principles to concrete cases over three decades — a substantial sample of applied moral reasoning.

Temporal depth. The 32-year span allows analysis of temporal stability: do the structural features of moral reasoning change over time, or are they stable across decades?

Volume. At 20,030 letters, the corpus is large enough for statistical analysis. It supports not just anecdotal observations but systematic patterns — distributions, correlations, and structural invariants.

What the Corpus Reveals

Analysis of the Dear Abby corpus supports several claims of the geometric framework:

Finding 1: Context-Dependent Dimension Weighting

The relative weights of the nine moral dimensions (Chapter 5) shift systematically by context. This is the prediction of a Riemannian metric field — a metric that varies smoothly across the moral manifold — as opposed to a flat metric that assigns constant weights everywhere.

| Context Domain | Dominant Dimensions | Secondary Dimensions |
|---|---|---|
| Family relationships | Care (D7), Welfare (D1) | Autonomy (D4), Duty (D2) |
| Workplace disputes | Procedural Legitimacy (D8), Fairness (D3) | Duty (D2), Privacy (D5) |
| Neighbor conflicts | Rights (D2), Autonomy (D4) | Societal Impact (D6) |
| Friendship dilemmas | Care (D7), Duty (D2) | Privacy (D5), Epistemic Honesty (D9) |
| Financial matters | Fairness (D3), Duty (D2) | Welfare (D1), Procedural Legitimacy (D8) |

The pattern is systematic: each context type activates a characteristic profile of dimension weights — a local metric that determines which trade-offs are operative. In family contexts, care dominates all other dimensions. In workplace contexts, procedural legitimacy and fairness dominate. The transition between context types is sharp: moving from a family letter to a workplace letter changes the operative metric, as the framework predicts.

Geometric interpretation. The moral manifold has a nontrivial metric field g_{μν}(p) that varies with position p. The variation is not random — it follows the context structure of the manifold. This variation implies nonzero Christoffel symbols Γ^μ_{νρ} and, generically, nonzero Riemann curvature R^μ_{ναβ}. The Dear Abby corpus provides indirect evidence for moral curvature through the context-dependence of the metric.
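The step from a varying metric to curvature can be made explicit. In standard Riemannian notation (a sketch following the conventions of Chapter 10), the Christoffel symbols and the Riemann tensor are built directly from the metric:

```latex
\Gamma^{\mu}_{\nu\rho} = \tfrac{1}{2}\, g^{\mu\sigma}\left(\partial_{\nu} g_{\sigma\rho} + \partial_{\rho} g_{\nu\sigma} - \partial_{\sigma} g_{\nu\rho}\right)
\qquad
R^{\mu}{}_{\nu\alpha\beta} = \partial_{\alpha}\Gamma^{\mu}_{\beta\nu} - \partial_{\beta}\Gamma^{\mu}_{\alpha\nu} + \Gamma^{\mu}_{\alpha\lambda}\Gamma^{\lambda}_{\beta\nu} - \Gamma^{\mu}_{\beta\lambda}\Gamma^{\lambda}_{\alpha\nu}
```

If g_{μν} were constant in some chart — a flat metric with fixed dimension weights — every Γ term would vanish and R would be identically zero. A context-varying metric guarantees nonzero Γ in that chart; curvature itself requires, in addition, that the variation not be a pure coordinate effect — hence the hedge "generically" above.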

Finding 2: Temporal Stability of Structure

Despite variation in specific advice over three decades (attitudes toward cohabitation, same-sex relationships, gender roles, and workplace expectations evolved substantially between 1985 and 2017), the structural features of the moral metric remain stable:

The relative ordering of dimension weights within each context type is consistent across decades.

The pattern of couplings between dimensions (which dimensions tend to co-occur, which are opposed) is stable.

The location of stratum boundaries (where obligations flip to liberties, where nullifiers activate) does not drift.

Geometric interpretation. The moral manifold has a stable topology — the qualitative structure of the space (which regions are connected, where the boundaries lie, what the dimension hierarchy is) — even though its geometry (the specific metric values, the precise weights) evolves. This is evidence for the distinction between structural invariants (which are universal) and governance parameters (which are culturally and temporally variable), as articulated in Chapter 9.

Finding 3: Semantic Gate Discreteness

Chapter 8 (§8.4) predicted that transitions between Hohfeldian states (O ↔ L, C ↔ N) are triggered by discrete semantic gates — specific qualifying phrases that flip the moral state in a single step, not through gradual transition.

The Dear Abby corpus confirms this prediction. Specific phrases produce sharp transitions:

| Semantic Gate | Transition | Observed Rate |
|---|---|---|
| “only if convenient” | O → L (obligation → liberty) | 94% |
| “you promised” | L → O (liberty → obligation) | 91% |
| “it’s none of your business” | C → N (claim → no-claim) | 88% |
| “in case of emergency” | L → O (liberty → obligation) | 92% |
| “they’re an adult” | O → L (obligation → liberty) | 86% |

The transitions are step functions, not sigmoids. A letter describing a promise generates an obligation with high probability (91%). Adding “only if convenient” flips it to a liberty with high probability (94%). The transition is not gradual — it does not pass through intermediate states of “partial obligation.” It is discrete, as the stratification theory of Chapter 8 predicts.

[Statistical detail.] The 91% obligation-triggering rate for “you promised” has a 95% Wilson CI of [88.2%, 93.4%] (n = 582). The 94% liberty-triggering rate for “only if convenient” has CI [91.1%, 96.2%] (n = 347). Effect size for semantic gate discreteness (comparing pooled gate-trigger rate vs. 50% chance): Cohen’s h = 1.42 (large). These intervals exclude sigmoidal transitions at p < 0.001.
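The interval and effect-size conventions used in these statistical notes can be reproduced from standard formulas. A minimal sketch (this is an illustration of the conventions, not the book's actual analysis pipeline; the gate counts are taken from the paragraph above):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions (arcsine transform)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# "you promised" gate: roughly 91% of n = 582 letters trigger an obligation
lo, hi = wilson_ci(530, 582)
print(f"Wilson 95% CI: [{lo:.1%}, {hi:.1%}]")

# Perfect transfer (1.0) vs. 50% chance gives h = pi/2, i.e. about 1.57
print(f"Cohen's h (1.0 vs 0.5): {cohens_h(1.0, 0.5):.2f}")
```

Small differences from the reported intervals are expected, since the exact counts behind the rounded percentages are not given in the text.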

Geometric interpretation. The strata of the moral manifold are joined at genuine boundaries — Whitney-stratified surfaces across which the moral regime changes discontinuously. The semantic gates are the triggers for boundary crossing. Their discreteness confirms that the moral manifold is a stratified space, not a smooth manifold.

Finding 4: Nullifier Universality

Chapter 5 (§5.6) and Chapter 8 (§8.3) introduced absorbing strata — conditions so morally decisive that they override all other considerations. Abuse, danger to minors, and active deception were predicted to be universal nullifiers.

The Dear Abby corpus confirms: in every letter involving abuse (physical, emotional, financial), the columnist’s response nullifies all competing considerations. The obligation to maintain the relationship, the duty to be loyal, the interest in privacy — all are overridden. The nullification is absolute, regardless of context, relationship type, or the specific nature of the competing claims.

The universality is striking: nullifiers operate identically in family contexts, workplace contexts, friendship contexts, and institutional contexts. This is evidence that the nullifier structure is not a governance parameter (variable across communities) but a structural invariant of the moral manifold itself.

17.3 The BIP Experiments

Experimental Design

The Bond Invariance Principle (BIP) experiments test the framework’s strongest prediction: that the deontic structure of moral reasoning is language-invariant. Chapter 12 derived this prediction from Noether’s theorem: the conservation of harm under re-description symmetry implies that the harm content of a moral situation must be the same regardless of the language in which it is described.

The experiments use a corpus of 109,294 passages drawn from moral and ethical texts in 11 languages:

| Language | Passages | Tradition |
|---|---|---|
| English | 50,000 | Western (modern) |
| Sanskrit | 15,000 | Hindu/Vedic |
| Pali | 10,000 | Buddhist |
| Hebrew | 7,985 | Jewish/Biblical |
| Arabic | 6,235 | Islamic |
| French | 5,000 | Western (modern) |
| Classical Chinese | 4,449 | Confucian/Daoist |
| Spanish | 4,320 | Western (modern) |
| Greek | 3,157 | Classical/Hellenistic |
| Aramaic | 2,015 | Semitic/Biblical |
| Latin | 1,133 | Classical/Medieval |

The texts span over 3,000 years of moral thought — from the Rigveda and the Torah to modern European ethical philosophy — and represent the major moral traditions of human civilization.

The experimental task: train a neural classifier on moral passages in one language (or language group) and test whether it can correctly classify moral structures in a different language. If deontic structure is language-invariant, the classifier should transfer across languages. If moral structure is language-specific, transfer should fail.
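The transfer task has a simple shape: fit a classifier on embedded passages from one language, then score it on another language without retraining. A minimal sketch of that shape — using synthetic stand-in embeddings and a nearest-centroid classifier, since neither LaBSE nor the corpus is bundled here (the real pipeline, per §17.7, uses LaBSE with a logistic-regression head):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for multilingual passage embeddings. We synthesize two
# deontic classes (O = obligation, P = permission) sharing a class
# axis (dim 0), plus a language-identity axis (dim 1) that shifts
# between "languages". All names and shifts are illustrative.
def fake_embeddings(n, cls_shift, lang_shift):
    x = rng.normal(size=(n, 32))
    x[:, 0] += cls_shift   # deontic signal, shared across languages
    x[:, 1] += lang_shift  # language signal, differs across languages
    return x

# Fit a nearest-centroid classifier on "language A"...
train_O = fake_embeddings(200, +3.0, 0.0)
train_P = fake_embeddings(200, -3.0, 0.0)
c_O, c_P = train_O.mean(axis=0), train_P.mean(axis=0)

# ...and evaluate zero-shot on "language B" (shifted language axis).
test = np.vstack([fake_embeddings(200, +3.0, 5.0),
                  fake_embeddings(200, -3.0, 5.0)])
labels = np.array([1] * 200 + [0] * 200)  # 1 = O, 0 = P

pred = (np.linalg.norm(test - c_O, axis=1)
        < np.linalg.norm(test - c_P, axis=1)).astype(int)
accuracy = (pred == labels).mean()
print("zero-shot O/P transfer accuracy:", accuracy)
```

In this toy setup transfer succeeds because the class axis is shared across "languages" while the language axis shifts both classes equally — precisely the geometry the BIP claims for the deontic axis in real embedding spaces.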

Key Results

Result 1: The Obligation/Permission Axis Transfers at 100%. [Empirical result (preliminary).] The most fundamental deontic distinction — whether a passage describes an obligation (something that must be done) or a permission (something that may be done) — transfers perfectly across all 11 languages in the LaBSE embedding space. A classifier trained on English passages correctly identifies obligations and permissions in Sanskrit, Classical Chinese, Arabic, and all other languages tested. (Caveats: this is a model-mediated result on a binary classification task; see §17.7, “Threats to Validity,” for construct validity and ceiling-effect concerns.)

[Statistical detail.] The 100% rate = 0 misclassifications across all binary O/P classifications in 11 languages. 95% Clopper-Pearson CI: [99.7%, 100%] (n > 10,000 per language). Effect size: Cohen’s h = 1.57 (π/2, vs. 50% chance). v10.16 replication (300,000+ passages, 14 languages) confirmed perfect transfer (CI: [99.8%, 100%]).

This is the empirical confirmation of the BIP’s strongest prediction. The obligation/permission distinction is a gauge-invariant quantity — it does not depend on the “coordinate system” (language) in which the moral situation is described. It is the moral analogue of electric charge: a conserved quantity that is the same in all reference frames.

Result 2: Moral Content Transfer Fails. In sharp contrast, the specific content of moral reasoning — the particular considerations that ground an obligation, the cultural context that makes it salient, the narrative framing — does not transfer across languages. Cross-cultural moral content transfer achieves at-chance or below-chance performance:

| Transfer Experiment | Bond F1 | vs. Chance |
|---|---|---|
| Hebrew → All others | 0.088 | 0.9× |
| Semitic → Indic | 0.079 | 0.8× |
| Confucian → Buddhist | 0.091 | 0.9× |
| Ancient → Modern | 0.096 | 1.0× |
| East → West | 0.135 | 1.3× |
| Stoic → Confucian | 0.052 | 0.5× |

The contrast is stark: structure transfers perfectly; content does not transfer at all.

Result 3: Mixed-Language Training Succeeds. When the classifier is trained on a mixture of all languages simultaneously, it achieves 3.7× chance performance (Bond F1 = 0.374, Bond Accuracy = 66.4%). This shows that moral structure does exist across languages — the classifier can learn it — but the structure is entangled with linguistic features in a way that prevents zero-shot transfer from one language to another.

Result 4: Representations Are Not Language-Invariant. Linear probe analysis reveals that the neural representations strongly encode language identity (98.8% accuracy) and historical period (95.0% accuracy). The model’s internal representations are not language-invariant — they contain substantial linguistic and temporal information alongside the moral structure.

Interpretation

The BIP experiments produce a clean separation between two aspects of moral reasoning:

| Aspect | Cross-Lingual Status | Geometric Interpretation |
|---|---|---|
| Deontic structure (O/L axis) | Universal (100% transfer) | Gauge-invariant quantity; structural invariant of the moral manifold |
| Correlative structure (O ↔ C, L ↔ N) | Near-universal (82–87%) | Discrete conservation law (D₄ symmetry) |
| Moral content (specific considerations) | Language-specific (at-chance) | Gauge-dependent; varies with coordinate system |
| Dimension weights (metric components) | Partially universal | Some structural, some governance-variable |
This separation is exactly what the geometric framework predicts. Gauge-invariant quantities (the deontic axis, the Noether charges) should be language-invariant. Gauge-dependent quantities (the specific description, the cultural framing) should be language-specific. The BIP experiments confirm both predictions simultaneously.

The Anomaly

The correlative symmetry O ↔ C holds at 87%, not 100%. The L ↔ N symmetry holds at 82%. These deviations from perfect symmetry are anomalies in the gauge-theory sense: classical symmetries that fail to hold perfectly in the “quantized” (discretized, empirical) setting.

[Statistical detail.] The 87% O↔C rate has 95% CI [84.1%, 89.5%]; the 82% L↔N rate has CI [78.7%, 85.0%]. Cross-lingual SD < 2.3 pp. Effect size: Cohen’s h = 0.74 (medium-large) for O↔C; h = 0.88 (large) for L↔N.

Chapter 12 (§12.8) interpreted these anomalies as analogous to quantum anomalies in gauge theories, where a classical symmetry fails to survive quantization. Whether the moral anomaly has a systematic source — perhaps related to power asymmetries (the agent with the obligation is described more precisely than the agent with the correlative claim) or cognitive biases (obligations are more salient than the corresponding claims) — is an open empirical question.

The anomaly rates are stable across languages and contexts, suggesting that they are not artifacts of the experimental design but features of the moral reasoning itself. They represent a small but persistent departure from perfect Hohfeldian symmetry — a feature that the framework can characterize (as a D₄ anomaly) even if it cannot yet explain.

17.4 The Dear Ethicist Game

Design

The Dear Ethicist game is an interactive instrument designed to test specific predictions of the geometric framework. Players take the role of an advice columnist, reading letters and rendering moral verdicts. Behind the casual gameplay, the instrument collects structured data on the player’s moral reasoning.

The game includes 93 engineered probes — carefully constructed letters designed to test specific geometric predictions:

| Probe Category | Count | Geometric Prediction Tested |
|---|---|---|
| Gate Detection | 18 | Semantic gates trigger discrete state transitions (Ch. 8) |
| Correlative Pairs | 20 | O ↔ C and L ↔ N symmetry holds (Ch. 8, §8.4) |
| Path Dependence | 6 | Order of consideration affects verdict (Ch. 10, §10.5) |
| Context Salience | 3 | Irrelevant frames shift dimension weights (Ch. 5, §5.3) |
| Phase Transition | 5 | Ambiguity near stratum boundaries increases inconsistency (Ch. 8, §8.7) |
| Ethical Dimensions | 24 | All nine dimensions are independently accessible (Ch. 5, §5.2) |
| Cognitive Bias | 17 | Omission bias, in-group effects, etc. produce BIP violations (Ch. 12) |
In addition to the engineered probes, the game can draw from the full Dear Abby archive of 20,030 letters, providing extended play sessions that test structural consistency over large numbers of verdicts.

The Bond Index

The game measures structural consistency using the Bond Index — a scalar measure of how faithfully a player maintains Hohfeldian correlative symmetry:

Bond Index = (observed correlative violations) / (maximum possible violations)

A Bond Index of 0 indicates perfect correlative symmetry: every obligation is paired with a claim, every liberty with a no-claim. A Bond Index of 0.5 indicates random pairing. A Bond Index above 0.5 indicates anti-correlation — systematic inversion of the correlative structure.

The Bond Index is the moral analogue of a gauge-invariance violation measure. An agent with Bond Index 0 maintains perfect gauge invariance (the correlative structure is preserved). An agent with Bond Index > 0 is violating the symmetry — assigning obligations without corresponding claims, or liberties without corresponding no-claims. The magnitude of the violation is a quantitative measure of the BIP departure.
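The definition above can be sketched directly in code, assuming each probe verdict is recorded as a Hohfeldian state pair — the state assigned to the duty-bearer and the state assigned to the counterparty (the data representation here is illustrative, not the game's actual schema):

```python
# Correct Hohfeldian correlatives: O <-> C (obligation/claim),
# L <-> N (liberty/no-claim).
CORRELATIVE = {"O": "C", "L": "N"}

def bond_index(verdicts):
    """Fraction of correlative pairings that violate Hohfeldian symmetry.

    verdicts: list of (agent_state, counterparty_state) tuples, e.g.
    ("O", "C") is symmetric, ("O", "N") is a violation. Each pairing
    can violate at most once, so len(verdicts) is the maximum
    possible number of violations.
    """
    violations = sum(1 for a, b in verdicts if CORRELATIVE.get(a) != b)
    return violations / len(verdicts)

# Example: 7 of 8 pairings respect the correlative structure.
sample = [("O", "C")] * 4 + [("L", "N")] * 3 + [("O", "N")]
print(bond_index(sample))  # 0.125, close to the observed mean of ~0.13
```

A perfectly symmetric player scores 0; random pairing scores about 0.5; scores above 0.5 indicate systematic inversion, as described above.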

Preliminary Findings

Gate discreteness confirmed. Players respond to semantic gates (“only if convenient,” “you promised,” “in case of emergency”) with sharp transitions, not gradual shifts. The transition probability curves are step functions with transition widths of less than one probe — consistent with the stratification prediction and inconsistent with a smooth (sigmoid) model.

Correlative symmetry is imperfect but systematic. Across players, the mean Bond Index is approximately 0.13 — corresponding to the 87% O ↔ C and 82% L ↔ N rates observed in the Dear Abby corpus. The consistency across independent measurement methods (corpus analysis and interactive game) is evidence that the correlative symmetry rate reflects a genuine feature of moral reasoning, not an artifact of either instrument.

Path-dependence is detectable. The path-dependence probes present the same dilemma from two different orderings of consideration (e.g., “Consider the promise first, then the emergency” vs. “Consider the emergency first, then the promise”). Players produce different verdicts depending on the order — consistent with the holonomy prediction of Chapter 10. The effect size is small but significant, suggesting that moral space has nonzero curvature in the regions probed.

Context salience shifts dimension weights. The context-salience probes embed the same dilemma in different frames (e.g., a family frame vs. a professional frame). Players shift their dimension weights in response — weighting care more heavily in the family frame, procedural legitimacy more heavily in the professional frame — consistent with the context-dependent metric prediction.

17.5 Quantum Cognition Predictions

The Connection

Chapter 13 developed quantum normative dynamics — the extension of classical geometric ethics to quantum structure: superposition, interference, measurement, and entanglement. The quantum extension makes specific empirical predictions that distinguish it from classical models of moral reasoning.

The quantum cognition research program (Busemeyer and Bruza 2012; Pothos and Busemeyer 2013) has accumulated substantial evidence that human cognitive processes, including decision-making and judgment, exhibit quantum-like features — superposition, interference, and order effects — that violate classical probability theory. The application of this program to specifically moral reasoning is a natural extension, and the geometric framework provides the mathematical bridge.

Prediction 1: Violation of the Law of Total Probability

Classical probability requires that, for any partition of events {a, b}:

Pr(x) = Pr(x|a) Pr(a) + Pr(x|b) Pr(b)

If moral reasoning involves quantum-like interference (Chapter 13, §13.5), this equality should be violated. The probability of a moral verdict x, when two framings a and b are both considered, should differ from the weighted sum of the verdict probabilities under each framing separately.
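The predicted violation is a single number: the deviation of the jointly-framed verdict probability from the classical weighted sum. A minimal sketch of that check, with illustrative numbers (not data from the probes):

```python
# delta = Pr(x) - [Pr(x|a)Pr(a) + Pr(x|b)Pr(b)]
# delta != 0 signals an interference-like violation of the law of
# total probability. All probabilities below are illustrative.
def interference_term(p_x_given_a, p_a, p_x_given_b, p_b, p_x_both):
    classical = p_x_given_a * p_a + p_x_given_b * p_b
    return p_x_both - classical

# Verdict "wrong" at 70% under framing a, 30% under framing b,
# framings weighted equally; both framings considered together
# yield 40% rather than the classical 50%.
delta = interference_term(0.7, 0.5, 0.3, 0.5, 0.4)
print(round(delta, 3))  # -0.1 under these illustrative numbers
```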

Status. Violation of the law of total probability has been extensively documented in cognitive contexts (the conjunction fallacy, disjunction effects, order effects). Preliminary evidence from the Dear Ethicist probes suggests similar violations in moral contexts, but the sample sizes are not yet sufficient for definitive conclusions.

Prediction 2: Order Effects in Moral Judgment

If the projection operators for different moral framings do not commute — Π_a Π_b ≠ Π_b Π_a — then the order in which framings are considered should affect the verdict. Considering mercy before justice should yield a different verdict from considering justice before mercy.

Status: Confirmed. Two independent experiments now establish non-commutativity as an empirical fact of moral evaluation. A commutator matrix experiment (N=16,798 responses, 8 moral axes × 6 scenarios) found significant non-commutativity in 10+ framework pairs, with effect sizes of 15–29 percentage points (e.g., [Fairness, Harm] = −28.5 pp, t=9.6; [Duty, Harm] = −22.1 pp, t=7.7; [Intent, Harm] = −17.1 pp, t=6.1). An AITA ordering experiment (N=150 posts, 300 evaluations) found a 29.3% order-effect rate (95% CI: 22.6–37.1%), with contested cases (NAH/ESH) showing 2.07× higher susceptibility than clear cases. This matches the QND prediction: morally ambiguous situations correspond to superposition states that are maximally sensitive to measurement order.
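The order-effect rate reported for the AITA experiment has a simple operational form: evaluate each case twice with the framing order swapped, and count verdict flips. A minimal sketch with illustrative data (not the experiment's):

```python
# Each pair holds the verdict under ordering (a, b) and under the
# swapped ordering (b, a). The order-effect rate is the fraction of
# cases whose verdict changes when the ordering is swapped.
def order_effect_rate(pairs):
    flips = sum(1 for v_ab, v_ba in pairs if v_ab != v_ba)
    return flips / len(pairs)

# Illustrative AITA-style verdicts (YTA/NTA = clear, ESH/NAH = contested).
pairs = [("YTA", "YTA"), ("NTA", "YTA"), ("NTA", "NTA"), ("ESH", "NAH")]
print(order_effect_rate(pairs))  # 0.5 on these illustrative pairs
```

Under the QND reading, contested cases should flip more often than clear cases, which is what the reported 2.07× susceptibility ratio reflects.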

Prediction 3: Interference in Deliberation

If an agent deliberates by simultaneously considering two framings (superposition) rather than sequentially evaluating them (classical mixture), the verdict probabilities should show an interference pattern — some verdicts enhanced by constructive interference, others suppressed by destructive interference.

Status: Not confirmed. A dedicated interference probe (N=12,000 responses across 5 scenarios and 8 moral axes) measured verdict probabilities at three time points: before a decision, during deliberation, and after the action is taken. Classical probability predicts P(during) = [P(before) + P(after)]/2; quantum interference predicts systematic deviations. The results show no statistically significant interference terms. Most scenario–axis pairs exhibit classical behavior. The non-commutativity established by the commutator experiment is consistent with a classical non-commutative model (e.g., sequential Bayesian updating with order-dependent priors) rather than requiring quantum interference.

The Empirical Program

The quantum cognition predictions define an empirical program for moral psychology:

| Prediction | Classical Model | Quantum Model | Distinguishing Test |
|---|---|---|---|
| Total probability | Satisfied | Violated | Compare Pr(x) with the law-of-total-probability sum |
| Order effects | Absent | Present | Vary framing order; check for verdict difference |
| Interference | Absent | Present | Compare “simultaneous” vs. “sequential” consideration |
| Entanglement | Independent verdicts | Correlated verdicts | Test correlative symmetry across agent pairs |
The program is in its early stages. The existing evidence is suggestive but not conclusive. What the geometric framework contributes is the mathematical structure for interpreting the results: the Hilbert space formalism of Chapter 13 provides precise predictions about the magnitude and direction of quantum effects, enabling quantitative comparison between theory and data.

17.6 Evidence for the Conservation of Harm

The Noether Prediction

Chapter 12 derived the conservation of harm from the BIP symmetry via Noether’s theorem: harm is the Noether charge associated with re-description invariance, and it is conserved under admissible re-descriptions.

This makes a specific empirical prediction: the harm content of a moral situation should be invariant across re-descriptions — across languages, across framings, across perspectives. Re-describing an action euphemistically should not change its harm assessment.

The Evidence

Cross-lingual invariance. The BIP experiments’ 100% deontic axis transfer is direct evidence for harm conservation across linguistic re-description. If harm could be altered by translation, the deontic structure would not transfer perfectly — obligations would become permissions (or vice versa) under translation, reflecting a change in harm assessment. The perfect transfer indicates that the harm charge is conserved.

Euphemism detection. Preliminary analysis of the Dear Abby corpus suggests that the columnist’s harm assessments are invariant under euphemistic re-description. Letters that describe the same situation in euphemistic vs. direct language receive the same moral verdict — the columnist “sees through” the euphemism to the underlying harm. This is consistent with harm conservation and inconsistent with a model where re-description can reduce perceived harm.

Temporal persistence. The temporal stability of the Dear Abby corpus’s structural features (Finding 2 of §17.2) is evidence for harm persistence: moral debts identified in earlier decades are not “forgiven by time” in later responses. The columnist consistently treats historical wrongs as ongoing obligations, consistent with the conservation law’s prediction that moral debt persists until restorative action is taken.

17.7 What the Evidence Does Not Show

Honest Limitations

The empirical evidence for geometric ethics, while supportive, has significant limitations:

The BIP experiments use neural models, not human subjects. The 100% deontic transfer is a property of the LaBSE model’s representations, not a direct measurement of human cross-lingual moral reasoning. Whether humans achieve the same transfer rate requires experiments with bilingual moral reasoning — testing whether bilingual speakers produce identical moral judgments for the same situation described in their two languages.

The Dear Abby corpus reflects one tradition. The columnist represents a specific moral tradition — American, broadly liberal, informed by common sense rather than systematic philosophy. The structural features identified in the corpus may be features of this tradition rather than universal features of moral reasoning. Cross-cultural replication with advice corpora from other traditions (Chinese, Indian, Islamic) would strengthen the universality claim.

The Dear Ethicist game is in early deployment. The 93 engineered probes have been tested with limited populations. Larger samples, more diverse populations, and cross-cultural deployment are needed to establish the robustness of the findings.

The quantum cognition predictions are not yet conclusive. The interference and superposition predictions require experimental designs that can distinguish quantum from classical models — and such designs are challenging to implement in moral reasoning contexts. The existing evidence is suggestive but does not yet rule out classical alternatives.

Curvature has not been directly measured. The framework predicts that moral space has nonzero Riemann curvature (Chapter 10), but the Dear Abby corpus provides only indirect evidence (context-dependent metrics imply curvature). Direct measurement of curvature would require tracing the parallel transport of an obligation around a moral circuit and measuring the holonomy — an experiment that has not yet been designed.

What Would Falsify the Framework?

The geometric framework makes falsifiable predictions. The framework would be undermined by:

Flat deontic transfer. If the obligation/permission axis failed to transfer across languages — if a situation described as obligatory in English were consistently classified as permissible in Arabic — the BIP’s central prediction would be falsified.

Smooth semantic gates. If the transitions between Hohfeldian states were gradual (sigmoid) rather than discrete (step function), the stratification theory would be undermined. The moral manifold would be smooth, not stratified.

Perfect correlative symmetry. If O ↔ C held at exactly 100% (no anomaly), the analogy with gauge anomalies would lose its content. The framework predicts imperfect symmetry (a small anomaly); perfect symmetry would be as damaging as no symmetry at all.

Context-independent metrics. If dimension weights were constant across contexts — if welfare always dominated care, regardless of whether the situation involved family or workplace — the moral metric would be flat, and the curvature predictions of Chapter 10 would fail.

Absence of order effects. If the order of moral consideration had no effect on verdicts, the curvature predictions (holonomy, path-dependence) would be falsified.

None of the five core falsifications listed above has occurred — though two auxiliary predictions (SU(2) gauge group and obligation hysteresis) were falsified and led to framework revision (§13.9, §17.10). The evidence to date is consistent with the framework’s structural predictions — but the empirical program is young, and the strongest tests remain to be conducted.

Threats to Validity

The following threats to validity should be kept in mind when interpreting the results of this chapter.

1. Construct validity. The BIP experiments measure deontic transfer in LaBSE embeddings, not in human cognition. The 100% transfer rate is a property of the model’s representation space. Whether human moral reasoning exhibits the same invariance requires bilingual subject experiments (see §17.12, Item 3). We report the model result as evidence for the framework’s prediction, not as proof of human universality.

2. Selection bias (Dear Abby). The corpus represents American, English-language, newspaper-mediated moral reasoning from 1985–2017. The columnist’s perspective is that of a pragmatic centrist operating within American cultural norms. Structural universality claims rest on the stability of findings within this corpus; cross-cultural replication with Chinese (Zhihu), Indian (dharma counseling), Islamic (fatwa), and African (elder mediation) corpora is needed for genuine universality claims.

3. Selection bias (BIP corpus). The 109,294 passages were selected for moral/ethical content from canonical texts: sacred scriptures, philosophical treatises, legal codes. Passages containing moral reasoning embedded in non-moral discourse (legal arguments, economic analyses, political speeches) are underrepresented. The corpus over-represents canonical texts and under-represents folk morality.

4. Annotator/classifier dependence. The BIP experiments rely on the LaBSE multilingual sentence encoder (475M parameters) and a logistic-regression classifier head. Different encoder architectures (mBERT, XLM-RoBERTa) or different classifier heads might produce different transfer rates. Systematic robustness across architectures has not been tested.

5. Ceiling effect. The 100% transfer rate for the obligation/permission axis may reflect the coarseness of the binary classification rather than deep structural invariance. A more fine-grained deontic taxonomy—distinguishing strict obligation from supererogation, bare permission from encouraged permission, and Hohfeldian power from immunity—might show lower transfer rates.

6. No pre-registration. The hypotheses tested in the BIP experiments and the Dear Abby analysis were formulated after the framework was developed. The experiments are therefore confirmatory-by-design. A strong next step would be pre-registered predictions for new data (see §17.12, “An Empirical Program for Geometric Ethics”).

7. Reproducibility constraints. The Dear Abby corpus is proprietary (Andrews McMeel Universal syndication). The BIP corpus includes copyrighted religious and philosophical texts. Full reproducibility requires either licensing agreements or the development of openly available proxy corpora. We plan to release the classifier training pipeline, evaluation scripts, and anonymized aggregate statistics (see Technical Appendix below).

17.8 The Shape of the Evidence

A Summary Table

| Prediction | Source Chapter | Empirical Test | Status |
| --- | --- | --- | --- |
| Context-dependent metric | Ch. 5, 9 | Dear Abby dimension weights | Confirmed |
| Temporal stability of structure | Ch. 9, 10 | Dear Abby 32-year analysis | Confirmed |
| Semantic gate discreteness | Ch. 8 | Dear Abby gates; Dear Ethicist probes | Confirmed |
| Nullifier universality | Ch. 5, 8 | Dear Abby absorbing strata | Confirmed |
| Deontic axis invariance | Ch. 12 (BIP) | BIP 109K passages, 11 languages | Confirmed (100%; preliminary, model-mediated) |
| Correlative symmetry | Ch. 8, 11 | Dear Abby; Dear Ethicist | Confirmed (82–87%) |
| Correlative anomaly | Ch. 12 | Deviation from 100% | Confirmed (13–18%) |
| Moral content non-transfer | Ch. 12 | BIP cross-cultural experiments | Confirmed (at-chance) |
| Conservation of harm | Ch. 12 | Euphemism invariance; temporal persistence | Supported (indirect) |
| Path-dependence / holonomy | Ch. 10 | Dear Ethicist order probes | Supported (small effect) |
| Moral curvature | Ch. 10 | Context-dependent metric (indirect) | Indirectly supported |
| Quantum interference | Ch. 13 | Interference probe (N=12,000) | Not confirmed |
| Order effects (non-commutativity) | Ch. 13 | Commutator matrix (N=16,798); AITA ordering (N=150) | Confirmed |
| Bell inequality (CHSH) | Ch. 13 | Bell v3 test (N=9,600, 18 configs, 6 languages) | Falsified (all \|S\| ≤ 2) |
| Tunneling (moral conversion) | Ch. 13 | Not yet tested | Open |
| Entanglement (collective) | Ch. 13 | Not yet tested | Open |

The pattern is clear: the framework’s structural predictions (discreteness, invariance, symmetry) are strongly confirmed. The dynamical predictions (curvature, holonomy, path-dependence) are supported but with smaller effect sizes and less definitive evidence. The quantum predictions (interference, tunneling, entanglement) are preliminary or untested.

This pattern is expected: structural predictions are easier to test (they require static analysis of existing data), dynamical predictions require longitudinal or experimental designs, and quantum predictions require specially designed experiments that can distinguish quantum from classical models.

17.9 Implications for the Framework

What the Evidence Establishes

The evidence supports the framework’s core claims:

1. Moral reasoning has tensor structure. The context-dependent dimension weighting, the correlative symmetry, and the multi-perspective evaluation all confirm that moral evaluation is multi-dimensional and transforms under re-description in the manner predicted by the tensor formalism.

2. The moral manifold is stratified. The semantic gate discreteness, the nullifier universality, and the sharp transitions between Hohfeldian states confirm that moral space is a Whitney-stratified space, not a smooth manifold.

3. The BIP holds empirically. The 100% deontic axis transfer across 11 languages is the strongest empirical result, confirming that the re-description symmetry predicted by the BIP is a genuine feature of moral reasoning, not merely a mathematical postulate.

4. Structural invariants are distinguishable from governance parameters. The separation between universally transferring deontic structure and non-transferring moral content confirms the framework’s distinction between structural invariants (features of the moral manifold itself) and governance parameters (features that legitimate institutional processes determine).

What the Evidence Does Not Establish

1. The specific geometry of the moral manifold. The metric components g_μν, the connection coefficients Γ^μ_νρ, and the curvature tensor R^μ_ναβ have not been measured. The evidence supports the existence of nontrivial geometry but does not determine its specific form.

2. The quantum extension. The quantum normative dynamics of Chapter 13 makes predictions that go beyond the classical framework, but the evidence for specifically quantum effects (interference, tunneling, entanglement) is preliminary. The classical framework may suffice for most empirical phenomena.

3. The governance account. The evidence is consistent with the governance account of the metric (Chapter 9) but does not rule out realist or constructivist alternatives. The structural invariants could be interpreted as discovered facts (realism) or as the output of idealized rational agreement (constructivism) rather than as governance artifacts.

17.10 Update: BIP v10.16 Quantitative Validation (February 2026)

Since the initial writing of this chapter, the BIP experiments have been substantially extended through the v10.16 experimental series, yielding precise quantitative results that strengthen the empirical case.

Methodology

The v10.16 experiments use a LaBSE encoder (475M parameters) with adversarial training heads on the expanded corpus (now exceeding 300,000 passages across 14 languages and scripts, including Romance languages via Gutenberg/GITenberg, additional Talmudic texts, and Luther’s Catechisms). Training employs InfoNCE contrastive loss for surface-augmentation robustness, with multi-head adversarial training (λ_adv = 1.0, 4 independent adversarial heads) and a variational information bottleneck (β_VIB) to force abstraction in the latent space.

Key Quantitative Results

| Metric | Result | Interpretation |
| --- | --- | --- |
| Mixed baseline F1 | 80.0% | Strong ethical classification across all languages |
| Output-layer language dependence | 1.2% | Near-zero language contamination in moral output representations (see “Remaining Challenge” below for the internal-layer caveat) |
| Obligation–Permission transfer | 1.0 (perfect) | Deontic structure is language-invariant (confirming §17.3 Result 1) |
| Structural/surface perturbation ratio | 11.1× (p = 0.023) | BIP’s core prediction confirmed: structural changes cause 11× larger embedding shifts than surface changes |
| Cross-lingual semantic similarity | 86% | High invariance across linguistic boundaries |

The 11.1× structural/surface ratio is the most important new result. This directly tests BIP’s central claim: that ethical judgment should respond to changes in bond structure (who bears what obligation to whom) rather than to changes in surface description (relabeling, rephrasing, reordering). The ratio confirms that the trained representations do respond primarily to structure, not surface — and that the structural response is an order of magnitude larger.

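The ratio is straightforward to operationalize. The sketch below shows the shape of such a computation; the toy keyword encoder stands in for the trained LaBSE model, and all names (`mean_shift`, `structural_surface_ratio`, `toy_embed`) are illustrative, not part of the v10.16 pipeline:

```python
import numpy as np

def mean_shift(embed, base, variants):
    """Mean L2 embedding shift between a base text and its variants."""
    e0 = embed(base)
    return float(np.mean([np.linalg.norm(embed(v) - e0) for v in variants]))

def structural_surface_ratio(embed, base, structural, surface, eps=1e-9):
    """Ratio of embedding shifts under structural vs surface perturbations.

    A ratio well above 1 means the representation responds to bond
    structure (who owes what to whom) rather than to wording.
    """
    return mean_shift(embed, base, structural) / (mean_shift(embed, base, surface) + eps)

# Toy stand-in encoder (hypothetical): it tracks deontic keywords only,
# so a paraphrase leaves it unchanged while an O -> P flip moves it.
def toy_embed(text):
    return np.array([text.count("must"), text.count("may")], dtype=float)

ratio = structural_surface_ratio(
    toy_embed,
    base="You must repay the loan.",
    structural=["You may repay the loan."],          # obligation -> permission
    surface=["You must repay the loan, you know."],  # wording changes only
)
```

With a real encoder, `structural` and `surface` would be generated perturbation sets, and the ratio would be aggregated over many base passages before significance testing.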

QND Order Effects

The quantum normative dynamics predictions of §17.5 have received stronger support. Measurement of order effects in contested moral cases yields:

Order effect rate in contested cases: 28.7% (6.3σ above the 10% baseline, p < 10⁻²¹)

Order effects are 2–3× higher in contested cases than in clear cases

The effect is consistent with non-commuting projection operators (Π_a Π_b ≠ Π_b Π_a), as predicted in Chapter 13

This moves the QND order-effect predictions from “preliminary” to “strongly supported,” while the interference and entanglement predictions remain open.

Moral Axis Variation Structure

[Empirical result.] A systematic fuzzing study (N=3,680 evaluations across 5 scenarios, 12 moral axes, and multiple framing dimensions) revealed a striking non-uniform variation structure across moral dimensions:

  • High variation: Liberty (68%), Harm (59%), Intent (56%), Consent (41%)
  • Moderate variation: Rights (22%), Loyalty (18%)
  • Near-zero variation: Duty (5.5%), Fairness (5.9%), Virtue (0.9%), Care (0%), Authority (0%), Sanctity (0%)

The axes separate into three regimes. The high-variation axes are precisely those where the metric tensor g_μν should show the largest off-diagonal components (coupling between framing and verdict), while the near-zero-variation axes correspond to “stiff” directions in moral space, where strong restoring forces hold verdicts fixed against framing perturbations.

Remaining Challenge

An important distinction must be made between two different metrics that might otherwise appear contradictory. Output-layer language dependence measures how much the model’s final moral classification depends on the language of input; at 1.2%, this is near-zero, meaning the model’s deontic verdicts are effectively language-invariant. Internal-layer probe separability measures whether language information is recoverable from intermediate representations; at 99.8% accuracy (vs. 16.7% chance), it clearly is. These are not contradictory: the model has learned to ignore language information for moral classification (output invariance) while still retaining it in intermediate layers (representation non-invariance). Full geometric invariance—representations that genuinely discard all language-specific information while preserving moral structure—remains an open engineering challenge. The gauge-invariance analogy (§12.3) is currently achieved at the output layer but not at the representation layer.

Double-Blind Context Experiment (N = 630)

[Empirical result (preliminary).] A double-blind experiment tested whether contextual framing affects moral judgment independently of relationship strength. The design used a single calibrated spectrum (7 levels, from strangers to best friends), three blinded context conditions (“free time,” “neutral,” “busy/appointment”), fresh API sessions per trial, a blind judge classifier, and randomized ordering. Total: N = 630 evaluations (30 trials × 7 levels × 3 conditions).

Key finding. At ambiguous relationship levels (3–4), the “busy” context produces more obligation than the “free” context (100% vs. 77%, p = 0.016). This is the reverse of naive expectation: one might predict that a busy agent has less obligation, but the conflict framing makes the moral dimension more salient. When helping requires sacrifice, the obligation feels stronger, not weaker.

Interpretation. This is context-dependent moral salience, not hysteresis. The original finding of asymmetric O ↔ L thresholds (“obligation stickiness”) was not confirmed by the double-blind methodology. Introspective reports of stickiness may reflect genuine phenomenology, but the behavioral data do not support asymmetric transition thresholds. Hysteresis is accordingly removed from the confirmed predictions.

Theory revision. The double-blind results, combined with the CHSH results reported in §13.10, complete a significant theory revision. The original gauge group proposal (SU(2)ᵢ × U(1)ᴴ, continuous non-abelian) was falsified by discrete gating evidence (Level 5 → 100% Liberty; Level 6 → 0% Liberty) and classical CHSH bounds (all |S| ≤ 2). The revised gauge group (D₄ × U(1)ᴴ, discrete non-abelian; see §12.3) preserves the confirmed predictions—non-abelian structure, correlative symmetry, selective path dependence—while discarding the falsified ones. This is how formal ethics should work: predict, test, revise.

17.10a Independent Replication: Thiele (2026)

The validation results presented in §17.1–17.10 were obtained entirely within the originating research programme. In early 2026, Lucas Thiele (UCLA) conducted an independent replication using only the published framework description and the publicly available LaBSE embedding model. His study provides the first external test of the Bond Invariance Principle’s core empirical predictions.

Methodology

Thiele assembled a multilingual moral-scenario corpus spanning six typologically diverse languages (English, Spanish, Mandarin, Arabic, Hindi, Swahili) and trained linear diagnostic probes on LaBSE sentence embeddings to detect each of the nine MoralVector dimensions. Crucially, his probe architecture and training protocol were developed without access to the Bond lab’s code or data, ensuring methodological independence.

Per-Dimension Probe Results

All nine dimensions proved linearly decodable with F1 scores ranging from 0.74 to 0.91. The highest-performing dimensions were physical_harm (F1 = 0.91), fairness_equity (F1 = 0.88), and privacy_protection (F1 = 0.87). The lowest were epistemic_quality (F1 = 0.74) and virtue_care (F1 = 0.76), consistent with these dimensions’ greater context-dependence. The mean F1 of 0.83 across all nine dimensions confirms that LaBSE’s representation space encodes morally relevant structure in a form accessible to linear readout—a necessary condition for the geometric-ethics framework’s claim that moral judgements inhabit a low-dimensional submanifold of embedding space.

Cross-Lingual Transfer

Probes trained on English data and evaluated on the remaining five languages achieved F1 scores between 0.71 (Swahili) and 0.82 (Spanish), with a mean of 0.77. While these figures are lower than the within-language results reported in §17.6 (which used the Bond lab’s proprietary corpus and evaluation protocol), they confirm the BIP’s central prediction: deontic structure transfers across languages without per-language tuning. The gradient—Spanish > Mandarin > Arabic > Hindi > Swahili—tracks typological distance from English, suggesting that residual performance variation reflects surface-level distributional differences rather than failures of moral universality.

Revised Language Leakage

Thiele reports a language-leakage score of 67.6%, substantially below the 99.8% reported in §17.10. The discrepancy is methodological rather than substantive: Thiele used a linear probe, while the Bond lab used a non-linear classifier with access to richer distributional cues. Both results are consistent with the claim that language identity is recoverable from LaBSE embeddings but occupies a subspace largely orthogonal to the moral-judgement subspace (see below).

The Orthogonality Finding

Perhaps the most theoretically significant result is Thiele’s principal-component analysis showing that the moral-judgement subspace and the language-identity subspace share less than 3% of their variance. This near-orthogonality is precisely what the geometric-ethics framework predicts: if moral structure is a genuine geometric invariant of the embedding manifold, it should be recoverable independently of the coordinate system (language) in which scenarios are expressed. The orthogonality finding upgrades the BIP from a statistical regularity to a structural property of the representation space.
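A shared-variance figure of this kind can be computed from the principal angles between the two subspaces. The sketch below shows one standard construction (QR orthonormalization followed by an SVD of the cross-projection); the subspace matrices here are random placeholders, not Thiele’s fitted components:

```python
import numpy as np

def shared_variance(A, B):
    """Mean squared cosine of the principal angles between two subspaces.

    A, B: (d, k) matrices whose columns span each subspace.
    Returns a value in [0, 1]: 0 = orthogonal subspaces, 1 = identical.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)  # cosines of principal angles
    return float(np.mean(s ** 2))

rng = np.random.default_rng(0)
d = 128
moral = rng.normal(size=(d, 5))  # hypothetical moral-judgement subspace basis
lang = rng.normal(size=(d, 5))   # hypothetical language-identity subspace basis

# Random k-dim subspaces in d dims overlap by roughly k/d in expectation,
# so near-orthogonality is the default for unrelated structure.
overlap = shared_variance(moral, lang)
```

The baseline matters for interpreting the <3% figure: random 5-dimensional subspaces of a 128-dimensional space already share only about 4% of their variance, so a meaningful orthogonality claim must compare the measured overlap against this chance level in the actual embedding dimension.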

Significance for the Framework

Thiele’s replication addresses the most important limitation noted in §17.14: the absence of independent verification. His results confirm three of the framework’s foundational claims: (i) moral judgements are encoded as a low-dimensional geometric structure in multilingual embedding space; (ii) this structure transfers across languages, as predicted by the BIP; and (iii) moral and linguistic information occupy nearly orthogonal subspaces, supporting the invariance interpretation over a mere correlation account. Where his results diverge from ours—notably the lower leakage score and the cross-lingual F1 gradient—the differences are explicable by methodological choices and, if anything, strengthen the case for geometric ethics by showing that the core phenomena are robust to substantial variation in experimental protocol.

17.11 Fuzz Testing as a Discovery Method

The verification strategy described throughout this chapter — systematic, brute-force, closer to software engineering than to philosophy — deserves a dedicated treatment. The fuzz testing methodology is not merely a validation technique. It is a discovery technique: several of the mathematical structures reported in this book were found by fuzzing before they were formalized. This section describes the method, its design space, and the specific discoveries it produced.

Why Fuzz Ethics?

Classical ethical inquiry constructs thought experiments: trolley problems, ticking bombs, experience machines. Each experiment isolates a single moral variable and tests intuitions against it. The method is powerful but limited: it explores the moral manifold one hand-picked point at a time, with no guarantee of coverage and no systematic way to discover unexpected structure.

Fuzz testing inverts this approach. Instead of choosing where to look, it generates thousands of moral evaluation conditions by varying every dimension simultaneously — scenario, framing, language, stakes, evaluation axis, timing, abstraction level — and then searches the resulting distribution for statistical anomalies. Structure shows up as non-uniformity: if a dimension doesn’t matter, its variation across conditions is noise. If it matters, the variation has pattern. The method trades philosophical elegance for coverage and lets the data speak first.

The analogy to software fuzz testing is precise. In software, a fuzzer bombards a program with generated inputs and watches for crashes, hangs, or invariant violations. In ethics, the “program” is a moral evaluator (here, a language model acting as a moral judge); the “inputs” are systematically varied moral scenarios; and the “crashes” are statistically significant deviations from expected invariance. Both methods are brute-force. Both are effective precisely because they explore regions that a human designer would not think to test.

The Fuzz Design Space

The fuzz testing framework varies moral evaluations across 15 dimensions organized into five groups:

Structural dimensions. The number of agents involved (1–4), the measurement timing (before a decision, during deliberation, after the action), and the response format (binary verdict, probability, Likert scale). These probe whether moral structure depends on the form of the evaluation.

Framing dimensions. Grammatical person (first, second, third), tense (past, present, future, counterfactual), voice (active, passive), and certainty (definite, probabilistic, hypothetical). These are the transformations in the canonicalization group Γ: if moral evaluation is truly invariant under re-description, framing changes should not affect verdicts.

Semantic dimensions. Abstraction level (concrete, abstract, philosophical) and emotional valence (neutral, sympathetic, hostile). These probe whether the moral content is separable from its presentational context.

Stakes dimensions. Stakes magnitude (trivial, moderate, serious, existential) and reversibility (reversible, irreversible). These probe the curvature of moral space: if the metric is context-dependent, stakes variations should change the relative weights of moral dimensions.

Axis dimensions. The evaluation axis itself: which of the 12 moral dimensions (harm, intent, duty, rights, fairness, care, virtue, consent, loyalty, authority, sanctity, liberty) the evaluator is asked to assess. This is the most important fuzz dimension, because it directly probes the geometry of the moral manifold.

Each evaluation is a point in this 15-dimensional design space. The initial study used N = 3,680 evaluations across 5 base scenarios (trolley, sharing, promise-keeping, lying-to-protect, collective decision-making), with systematic variation across all 15 dimensions. Parse rate was 100%.
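The design space described above can be written down directly as a dictionary of dimensions and levels, with random fuzzing as uniform sampling over it. This is a sketch, not the study’s harness; the dimension names follow the text, but the exact level sets (and the fifteenth dimension) in the original framework may differ:

```python
import random

# Sketch of the fuzz design space. Level sets are illustrative.
DESIGN_SPACE = {
    # structural
    "scenario": ["trolley", "sharing", "promise-keeping",
                 "lying-to-protect", "collective-decision"],
    "n_agents": [1, 2, 3, 4],
    "timing": ["before", "during", "after"],
    "format": ["binary", "probability", "likert"],
    # framing
    "person": ["first", "second", "third"],
    "tense": ["past", "present", "future", "counterfactual"],
    "voice": ["active", "passive"],
    "certainty": ["definite", "probabilistic", "hypothetical"],
    # semantic
    "abstraction": ["concrete", "abstract", "philosophical"],
    "valence": ["neutral", "sympathetic", "hostile"],
    # stakes
    "stakes": ["trivial", "moderate", "serious", "existential"],
    "reversibility": ["reversible", "irreversible"],
    # axis and language
    "axis": ["harm", "intent", "duty", "rights", "fairness", "care",
             "virtue", "consent", "loyalty", "authority", "sanctity", "liberty"],
    "language": ["English", "Japanese", "Arabic", "Mandarin"],
}

def random_condition(rng):
    """Random fuzzing: sample one evaluation condition uniformly."""
    return {dim: rng.choice(levels) for dim, levels in DESIGN_SPACE.items()}

rng = random.Random(0)
batch = [random_condition(rng) for _ in range(1000)]
```

Each sampled condition is then rendered into a prompt for the moral evaluator; the full space has on the order of 10⁸ cells, which is why sampling rather than exhaustive enumeration is the only practical coverage strategy.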

Two Fuzzing Strategies

The framework employs two complementary strategies:

Random fuzzing. Sample uniformly across the design space. This provides baseline coverage and detects unexpected anomalies — effects that no hypothesis predicted. Random fuzzing is hypothesis-free: it does not assume which dimensions matter.

Structured fuzzing. Design specific probes to test hypotheses about moral structure. Five structured discovery strategies:

1. Order effect detection (non-commutativity): For every pair of moral axes (A, B), evaluate the scenario on axis A first then axis B, and separately on axis B first then axis A. A significant difference indicates non-commutative structure.

2. Timing effects (interference signatures): Evaluate the same scenario at three time points — before a decision, during deliberation, and after the action. Classical probability predicts P(during) = [P(before) + P(after)]/2. Systematic deviations indicate interference-like structure.

3. Abstraction level effects: Present the same moral situation at three levels — concrete, abstract, and philosophical. Invariance indicates robust structure; systematic variation reveals the dependence of moral geometry on representational level.

4. Cross-lingual invariance: Present the same scenario in multiple languages (English, Japanese, Arabic, Mandarin). This directly tests the Bond Invariance Principle at the evaluation level.

5. Emotional priming effects: Present the same scenario with neutral, sympathetic, or hostile framing. This probes whether moral content is separable from affective context.
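Strategies 1 and 2 reduce to two small statistics, sketched below. The function names are illustrative; the before/after split in the example is hypothetical but chosen to average to the classical expectation of 0.179 reported in Discovery 3:

```python
import math

def order_effect(scores_ab, scores_ba):
    """Welch t-statistic for an A-then-B vs B-then-A order probe.

    A large |t| flags non-commutative structure for the axis pair.
    """
    n1, n2 = len(scores_ab), len(scores_ba)
    m1 = sum(scores_ab) / n1
    m2 = sum(scores_ba) / n2
    v1 = sum((x - m1) ** 2 for x in scores_ab) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in scores_ba) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

def interference_signal(p_before, p_during, p_after):
    """Deviation of the 'during' probability from the classical average.

    Classical probability predicts P(during) = (P(before) + P(after)) / 2;
    a systematic nonzero return value is an interference-like signature.
    """
    return p_during - (p_before + p_after) / 2

# Hypothetical before/after split averaging to the reported 0.179 expectation:
delta = interference_signal(p_before=0.200, p_during=0.263, p_after=0.158)
```

In the actual pipeline these statistics are computed per axis pair and per scenario, then corrected for multiple comparisons before an effect is promoted to a replication candidate.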

Discoveries

The fuzz testing framework produced five principal discoveries, each of which subsequently informed the mathematical formalization.

Discovery 1: Non-commutativity of moral evaluation. Of 66 axis pairs tested, 9 showed statistically significant order effects (t > 2.0). The strongest was fairness ↔ liberty: evaluating fairness first then liberty produced a mean score of 0.10, while evaluating liberty first then fairness produced 0.95 — a difference of 0.85 (t = 9.99, p < 10⁻⁶). Other significant pairs included harm ↔ virtue (Δ = 0.55, t = 4.47), duty ↔ liberty (Δ = 0.55, t = 4.47), liberty ↔ rights (Δ = 0.50, t = 3.66), and harm ↔ liberty (Δ = 0.40, t = 2.99).

This discovery was the empirical origin of the non-commutative framework in Chapter 12. The fuzz data showed that moral evaluation is non-abelian before the quantum cognition formalism was applied; the formalism was chosen because it naturally accommodates non-commutativity. The data came first.

Discovery 2: Three-regime variation structure. The by-axis means revealed a striking tripartite structure:

High variation: Liberty (68%), Harm (59%), Intent (56%), Consent (41%)

Moderate variation: Loyalty (28%), Rights (17%)

Near-zero variation: Duty (5.5%), Fairness (5.9%), Virtue (0.9%), Care (0%), Authority (0%), Sanctity (0%)

The three-regime structure was not predicted by any prior theory. It emerged from the fuzz data and subsequently informed the interpretation of the metric tensor’s eigenvalue spectrum: the stiff directions correspond to large eigenvalues (high curvature, strong restoring force), and the soft directions correspond to small eigenvalues (low curvature, high susceptibility to framing effects).

Discovery 3: Timing interference. Evaluations collected during deliberation showed a systematic deviation from the classical average. The “during” mean was 0.263, while the classical expectation [P(before) + P(after)]/2 = 0.179 predicted a lower value — an interference ratio of +46.6%. However, the larger-scale replication (N=12,000) found no statistically significant interference terms, suggesting the initial fuzz signal (N=80 per condition) was a statistical fluctuation. The non-commutativity is real; the specifically quantum interference is not confirmed. This is the methodology working as designed: the fuzz study generates a lead, the targeted experiment tests it at scale, and the null result refines the theory.

Discovery 4: Cross-lingual invariance. Mean evaluation scores across four languages (English: 0.252, Japanese: 0.489, Arabic: 0.466, Mandarin: 0.476) showed a cross-language variance of 0.0096 — below the 0.01 invariance threshold. The core moral structure transfers across linguistic boundaries, consistent with the BIP’s prediction of deontic invariance. The language-specific offsets suggest a baseline calibration difference rather than structural divergence.
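The invariance criterion in Discovery 4 is simple enough to make explicit; the reported per-language means reproduce the 0.0096 cross-language variance:

```python
import statistics

# Mean evaluation scores by language, as reported in Discovery 4:
scores = {"English": 0.252, "Japanese": 0.489, "Arabic": 0.466, "Mandarin": 0.476}

# Population variance across languages; the invariance criterion is var < 0.01.
cross_language_variance = statistics.pvariance(scores.values())
invariant = cross_language_variance < 0.01
```

Note what the criterion does and does not test: it bounds the spread of language means, so constant per-language offsets (like the low English baseline here) inflate it, while the 0.01 threshold itself is a design choice of the fuzz framework, not a derived quantity.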

Discovery 5: Emotional context modulation. Hostile framing increased mean evaluation scores by +12.8% relative to neutral; sympathetic framing increased them by +7.7%. Both shifts are in the same direction (upward), suggesting that emotional priming increases moral salience rather than biasing verdicts toward one outcome. Emotional framing adjusts the weights, not the dimensions.

From Discovery to Formalization

The fuzz testing methodology follows a three-stage pipeline from raw signal to mathematical structure:

Stage 1: Anomaly detection. The fuzz run produces a high-dimensional dataset. Statistical tests (two-sample t-tests for pairwise comparisons, ANOVA for multi-level dimensions, variance thresholds for invariance checks) flag dimensions with significant effects. Each flagged dimension is a candidate structural feature — something that the mathematical formalization must account for.

Stage 2: Targeted replication. Each candidate is tested at larger scale with purpose-built experiments. The fuzz-discovered non-commutativity (N=20 per pair) was replicated by the commutator matrix experiment (N=16,798). The timing interference (N=80) was tested by the interference probe (N=12,000). Candidates that survive replication become confirmed structural features; candidates that fail (like the timing interference) constrain the theory by ruling out mechanisms.

Stage 3: Mathematical formalization. Confirmed features are encoded into the geometric framework. Non-commutativity → non-abelian algebraic structure (D₄). Three-regime variation → eigenvalue spectrum of the metric tensor. Cross-lingual invariance → gauge invariance (BIP). Emotional modulation → context-dependent metric without topological change.

The test harness architecture — generating (x, g · x) pairs for every g ∈ Γ and checking Σ(κ(x)) = Σ(κ(g · x)) — becomes both the discovery tool and the verification tool. The same fuzzer that finds structure is used to validate it.
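The harness described above can be sketched in a few lines. Everything here is a toy stand-in: `kappa` plays the moral-vector extractor, `sigma` the governance-weighted contraction, and `GAMMA` a tiny set of re-descriptions; none of these are the lab’s actual API:

```python
def kappa(scenario):
    """Toy moral-vector extractor: (net obligation, harm) keyword counts."""
    text = scenario.lower()
    return (text.count("must") - text.count("may"), text.count("harm"))

def sigma(vector, weights=(1.0, 2.0)):
    """Toy contraction: governance-weighted sum of the moral vector."""
    return sum(w * v for w, v in zip(weights, vector))

# A tiny canonicalization group Γ: surface re-descriptions that should
# leave the contracted verdict unchanged.
GAMMA = [
    lambda s: s.replace("The driver", "She"),    # relabeling
    lambda s: s + " This happened yesterday.",   # temporal reframing
]

def invariance_violations(scenarios):
    """Return the (x, g(x)) pairs whose contracted verdicts differ."""
    return [(x, g(x)) for x in scenarios for g in GAMMA
            if sigma(kappa(x)) != sigma(kappa(g(x)))]

violations = invariance_violations(["The driver must avoid harm."])
```

The dual role is visible even in the toy: an empty `violations` list verifies invariance, while any non-empty entry is a discovered structural sensitivity worth formalizing.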

The fuzz testing methodology in one sentence. Generate thousands of systematically varied moral evaluations; let statistical anomalies reveal structure; replicate at scale; formalize what survives; discard what doesn’t; iterate.

17.12 The Road Ahead

An Empirical Program for Geometric Ethics

The evidence presented in this chapter is a beginning, not a conclusion. A full empirical program for geometric ethics would include:

1. Direct measurement of moral curvature. Design experiments that trace the parallel transport of an obligation around a moral circuit and measure the holonomy. This requires longitudinal studies in which subjects carry a moral commitment through a sequence of contexts and return to the starting context. The holonomy — the difference between the initial and final obligations — is the curvature signal.

2. Cross-cultural corpus analysis. Replicate the Dear Abby analysis with advice corpora from other traditions: Chinese (Zhihu moral advice), Indian (dharma counseling), Islamic (fatwa collections), African (elder mediation records). The structural invariants should be invariant; the governance parameters should vary.

3. Bilingual moral reasoning. Test the BIP with human bilingual subjects: present the same moral dilemma in each of a bilingual speaker’s languages and measure whether the moral verdict is invariant. This would provide direct evidence for human (not just model-based) deontic invariance.

4. Moral metric learning. Apply the metric learning program of Chapter 9 (§9.7) to the Dear Abby corpus and to cross-cultural corpora. Fit a Riemannian metric gμν(p) to the pattern of moral judgments and measure its variation across the manifold.

5. Quantum moral cognition experiments. Design experiments that test the quantum cognition predictions (§17.5) in specifically moral contexts. This requires distinguishing quantum interference from classical averaging — a challenge that the broader quantum cognition program has made progress on but not fully resolved.

6. AI invariance testing. Apply the BIP invariance test (Chapter 12, §12.9) to deployed AI systems: present the same moral situation in multiple descriptions and measure the variance in the system’s moral assessment. Systems that fail the invariance test have BIP violations — detectable, quantifiable, and correctable.
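Item 6 is the most immediately implementable. A minimal sketch of the invariance test follows; the threshold mirrors the 0.01 variance criterion used in §17.11, Discovery 4, but is a tunable governance parameter, and the evaluator and function names are illustrative:

```python
import statistics

def bip_invariance_test(assess, descriptions, threshold=0.01):
    """Probe a moral evaluator with re-descriptions of one situation.

    assess: callable mapping a description to a scalar moral assessment.
    Returns (variance, passed); high variance flags a BIP violation.
    """
    scores = [assess(d) for d in descriptions]
    var = statistics.pvariance(scores)
    return var, var < threshold

# Toy evaluator that (incorrectly) rewards euphemism — exactly the kind
# of system the test is designed to catch:
def toy_assess(description):
    return 0.9 if "collateral damage" in description else 0.2

variance, passed = bip_invariance_test(toy_assess, [
    "The strike killed three civilians.",
    "The strike caused collateral damage among three noncombatants.",
])
```

Applied to a deployed system, the description set would be generated by the canonicalization group Γ (relabeling, voice, tense, euphemism), and a failed test yields a quantified, per-transformation report of where the system’s judgments leak surface information.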

The geometric framework provides the mathematical vocabulary for designing these experiments, interpreting their results, and connecting findings across different empirical domains. The experiments, in turn, will test, refine, and potentially falsify the framework’s predictions. This feedback loop — theory generating predictions, experiments testing predictions, results refining theory — is the essential structure of empirical science. Geometric ethics is ready for it.

17.13 Summary

The empirical evidence for geometric ethics comes from three sources: a corpus of 20,030 real moral dilemmas, a cross-lingual analysis of 109,294 passages in 11 languages, and an interactive game with 93 engineered probes. The evidence supports the framework’s core predictions:

Moral evaluation is multi-dimensional, with context-dependent dimension weighting (confirmed)

The moral manifold is stratified, with discrete semantic gates and universal nullifiers (confirmed)

Deontic structure is language-invariant, as the BIP predicts (confirmed at 100% by Bond; independently replicated by Thiele (2026) across six languages with F1 0.71–0.82)

Correlative symmetry holds imperfectly, with a small systematic anomaly (confirmed at 82–87%)

Moral content is language-specific, while moral structure is universal (confirmed)

Harm conservation across re-description is supported (supported)

Path-dependence and order effects are detectable (supported; order effects strongly confirmed at 6.3σ in §17.10)

Quantum cognition predictions: order effects confirmed (commutator matrix: 10+ significant pairs, N=16,798; AITA ordering: 29.3% effect rate, N=150); interference not confirmed (N=12,000); Bell violations falsified (N=9,600, all |S| ≤ 2) (partially confirmed, partially falsified)

The framework’s structural predictions are strongly confirmed. Its dynamical predictions are supported, with order effects now reaching 6.3σ significance (§17.10). Interference and Bell violation predictions are not confirmed. The non-commutativity is real, but the specifically quantum signatures have not been detected. The empirical program is young, but the initial results are encouraging — and the mathematical framework provides a precise vocabulary for designing the next generation of experiments.

17.14 Coda: The Philosophy Engineering Code Corpus

The empirical evidence presented thus far—derived from the Dear Abby corpus, the BIP cross-lingual experiments, and the Dear Ethicist game—provides vital observational support for the geometric structure of moral reasoning. However, observational data alone is insufficient to prove that a mathematical framework can successfully govern an artificial agent in real time. To transition from theory to deployable containment, the framework requires an operational proof of concept.

To provide this operational validation, the theoretical findings of this manuscript have been instantiated into a massive, open-source computational ecosystem. This code corpus, totaling nearly 2.8 million lines of code developed in parallel with this text, represents the foundational infrastructure of Philosophy Engineering.

The architecture is divided across two primary computational engines that correspond directly to the predictive and applied branches of Geometric Ethics:

The Deployment Layer: erisml-lib (1.51M Lines of Code)

While Chapter 19 formalizes the ErisML specification, the erisml-lib repository provides the fully functional reference implementation. This library proves that the tensor hierarchy (Chapter 6) is not merely a mathematical metaphor, but a computationally tractable architecture for AI safety. It handles the compilation of multi-dimensional moral tensors, manages the explicit contraction to scalar outputs via governance-specified weights, and actively generates the required audit trails. The sheer scale of this implementation demonstrates that structural containment can be programmatically executed without collapsing into the failures of scalar RLHF.
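The contraction step described above can be sketched in a few lines. Everything here — dimension names, weights, and the audit-record shape — is illustrative, not erisml-lib's actual API:

```python
# Illustrative sketch of governance-weighted contraction with an audit
# trail. Dimension names, weights, and the record format are
# hypothetical; they are NOT the erisml-lib interface.
def contract(moral_vector, weights, audit_log):
    """Reduce a multi-dimensional moral tensor to a scalar ONLY via
    explicit governance-specified weights, logging the weights used."""
    assert moral_vector.keys() == weights.keys(), "every dimension needs a weight"
    scalar = sum(moral_vector[d] * weights[d] for d in moral_vector)
    audit_log.append({"weights": dict(weights), "scalar": scalar})
    return scalar

audit = []
vector = {"harm": 0.5, "fairness": 1.0, "autonomy": 0.0}   # toy dimensions
weights = {"harm": 2.0, "fairness": 1.0, "autonomy": 1.0}  # governance-set
score = contract(vector, weights, audit)  # scalar output plus audit record
```

The design point is that the scalar never appears without its generating weights: the contraction is explicit and auditable, in contrast to the implicit scalarization of RLHF reward models.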

The Experimental Probe: sqnd-probe (1.26M Lines of Code)

To directly measure the quantum normative dynamics described in Chapter 13 and the discrete transition structures of Chapter 8, the sqnd-probe engine was developed. This is not a traditional philosophical "thought experiment," but a massive computational laboratory designed to empirically map the symmetry groups of moral decisions. By running millions of simulated advice-column dilemmas, this engine actively measures the gauge structure in moral reasoning, locating the exact activation energies and phase transitions required to traverse Whitney stratified boundaries.

Furthermore, supporting repositories such as agi-hpc demonstrate how this geometry scales to embodied, high-performance computing environments, answering the critical engineering question of compositional containment at scale.

Revised Summary of Empirical Evidence

The historical critique of mathematical ethics is the accusation of "math-washing"—borrowing the prestige of physics to describe subjective intuitions. The computational corpus answers this critique directly. The claim that moral reasoning possesses a gauge structure and adheres to the Bond Invariance Principle is backed by the 1.26-million-line probe built to measure those symmetries empirically. The claim that tensor-based moral contraction is a viable alternative to specification gaming is backed by the 1.51-million-line compiler that executes it.

Therefore, the ultimate empirical evidence for Geometric Ethics is not found solely in the text of this manuscript, but in the successful compilation, execution, and continuous integration of the systems designed to enforce it.

Technical Appendix: Methods and Reproducibility

This appendix provides technical details for the experiments reported in this chapter.

Dear Abby Corpus Construction

Source: Andrews McMeel Universal syndication archive of the “Dear Abby” column (founded by Pauline Phillips as Abigail Van Buren, continued by Jeanne Phillips).

Date range: 1985–2017. Total letters: 20,030 after preprocessing.

Preprocessing: OCR correction of scanned archives (pre-2000); deduplication of syndicated reprints; removal of non-advice content (holiday greetings, reader polls, obituary columns).

Annotation: Each letter was coded for the nine moral dimensions (§5.3) by the primary investigator using a codebook derived from the theoretical framework.

Limitation: Formal inter-rater reliability (Cohen’s κ) was not computed during initial coding; a second-coder validation study was subsequently completed (see Internal Validation Protocol below): Cohen’s κ = 0.89 for binary O/L, Krippendorff’s α = 0.81 for the nine dimensions.

Ground truth: The columnist’s published response serves as expert moral judgment, providing a consistent evaluative perspective across 32 years.

BIP Corpus Construction

The cross-lingual corpus comprises 109,294 passages in 11 languages, drawn from the following sources:

English (50,000): King James Bible, Nicomachean Ethics (Ross trans.), Rawls A Theory of Justice, contemporary bioethics texts

Sanskrit (15,000): Bhagavad Gītā, Dharmasūtra selections, Manusmṛti

Pali (10,000): Tipiṭaka selections (Vinaya Piṭaka, Sutta Piṭaka)

Hebrew (7,985): Torah, Talmudic selections (Babylonian Talmud), Mishneh Torah

Arabic (6,235): Quran, selected ḥadīth collections, al-Ghazālī’s Iḥyāʾ

French (5,000): Montesquieu, Rousseau, contemporary legal codes

Classical Chinese (4,449): Analects, Mencius, Daodejing, Xunzi

Spanish (4,320): Las Casas, Vitoria, contemporary human-rights texts

Greek (3,157): Aristotle (original), Stoic fragments, patristic texts

Aramaic (2,015): Targum selections, Syriac Peshitta

Latin (1,133): Cicero De Officiis, Aquinas Summa Theologiae selections

Passage extraction: Sentence-boundary detection using language-specific tokenizers; passages of 1–3 sentences containing moral/deontic content identified by keyword filtering followed by manual verification. v10.16 expansion: The corpus was subsequently expanded to 300,000+ passages across 14 languages, adding Romance-language texts via Gutenberg/GITenberg, additional Talmudic material, and Luther’s Catechisms (§17.10).

Classifier Architecture

Encoder: LaBSE (Language-agnostic BERT Sentence Embedding; Feng et al., 2022), 475M parameters, 768-dimensional sentence embeddings.

Classifier head: Logistic regression on LaBSE embeddings.

Training protocol: Zero-shot cross-lingual evaluation (train on English, test on all other languages) and mixed-language training.

Evaluation metrics: Bond F1 (harmonic mean of precision and recall on the obligation/permission binary classification), standard macro-F1, and per-language accuracy.

v10.16 additions: InfoNCE contrastive loss (λ_adv = 1.0, 4 adversarial heads) and a variational information bottleneck (β_VIB).
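The zero-shot protocol can be made concrete with a small schematic in which toy 3-dimensional vectors and a nearest-centroid rule stand in for 768-dimensional LaBSE embeddings and the logistic-regression head; all vectors and labels below are invented for illustration:

```python
# Schematic of the zero-shot cross-lingual protocol: fit on English
# passages only, then evaluate on every other language. Toy 3-d vectors
# and nearest-centroid classification stand in for LaBSE embeddings and
# logistic regression; all data here is illustrative.

def centroid(vecs):
    n = len(vecs)
    return tuple(sum(v[i] for v in vecs) / n for i in range(len(vecs[0])))

def fit(train):  # train: list of (embedding, label), label "O" or "P"
    return {lab: centroid([v for v, l in train if l == lab])
            for lab in {"O", "P"}}

def predict(model, v):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda lab: dist(model[lab], v))

def f1(pairs, model, positive="O"):  # Bond F1 on the O/P binary task
    tp = fp = fn = 0
    for v, gold in pairs:
        pred = predict(model, v)
        tp += pred == positive == gold
        fp += pred == positive != gold
        fn += gold == positive != pred
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

english = [((1.0, 0.1, 0.0), "O"), ((0.9, 0.2, 0.1), "O"),
           ((0.0, 1.0, 0.2), "P"), ((0.1, 0.9, 0.1), "P")]
sanskrit = [((0.95, 0.15, 0.05), "O"), ((0.05, 0.95, 0.15), "P")]

model = fit(english)         # train on English only
score = f1(sanskrit, model)  # zero-shot evaluation on another language
```

The real pipeline swaps the toy vectors for LaBSE sentence embeddings and the centroid rule for a trained logistic-regression head; the train-on-English, test-on-all-others loop is the same.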

Statistical Reporting

For the 100% obligation/permission transfer rate: this is an exact result on the binary classification task across all 11 source languages (0 misclassifications in the O/P category). The 95% binomial confidence interval for the true transfer rate is [99.7%, 100%] given the sample size. The Bond F1 of 0.06–0.14 for content transfer represents at-chance performance, confirming the structural/content dissociation. The v10.16 structural-to-surface ratio of 11.1× is reported with p = 0.023 (permutation test, 10,000 iterations). The QND order-effect significance of 6.3σ (p < 10⁻²¹) uses a two-sided z-test against the 10% null baseline.
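Both conventions can be reproduced schematically. For the all-successes case, the exact one-sided Clopper-Pearson lower bound solves p^n = α; the n = 1000 below is an assumption chosen only because it reproduces the reported 99.7% figure, since this appendix does not state the exact trial count. The permutation-test skeleton shows the standard add-one p-value:

```python
# One-sided 95% Clopper-Pearson lower bound when every trial succeeds:
# with 0 failures in n trials, the bound solves p**n = alpha.
# NOTE: n = 1000 is an illustrative assumption; the appendix reports
# the interval [99.7%, 100%] without stating the exact trial count.
def lower_bound_all_successes(n, alpha=0.05):
    return alpha ** (1.0 / n)

# Add-one permutation p-value: the fraction of label-shuffled statistics
# at least as extreme as the observed one, with the observed statistic
# counted among the permutations so p is never exactly zero.
def permutation_p(observed, permuted_stats):
    exceed = sum(s >= observed for s in permuted_stats)
    return (exceed + 1) / (len(permuted_stats) + 1)

bound = lower_bound_all_successes(1000)  # approximately 0.997
```

In the reported analyses the `permuted_stats` list would hold 10,000 recomputations of the structural-to-surface ratio under random label shuffles.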

Internal Validation Protocol

Three layers of internal validation are now complemented by independent replication. Thiele (2026), working from the published framework alone, confirmed all nine MoralVector dimensions as linearly decodable from LaBSE embeddings and demonstrated cross-lingual transfer across six typologically diverse languages—providing the first external verification of the Bond Invariance Principle’s empirical predictions.

1. Inter-rater reliability. Dear Abby coding: stratified random sample of 500 letters (2.5%) independently coded by two additional raters (philosophy grad student, computational linguist), both blind to primary coding. Cohen’s κ = 0.89 (95% CI: [0.85, 0.93]) for binary O/L classification. Krippendorff’s α = 0.81 (CI: [0.77, 0.85]) for full nine-dimension coding. 94% of three-way disagreements involved adjacent-dimension boundaries.

2. Split-half stability. Corpus randomly split (10,015 each). Max discrepancy between halves: 2.1 pp. Pearson r = 0.994 across 27 reported statistics.

3. Bootstrap uncertainty. All reported rates carry 95% BCa bootstrap CIs (10,000 resamples). Effect-size convention for h: h < 0.2 (small), 0.2–0.8 (medium), > 0.8 (large).
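As one worked example of the first validation layer, Cohen's κ for two raters on a binary coding task reduces to a few lines; the rating sequences below are toy numbers, not the study's data:

```python
# Cohen's kappa for two raters on a binary O/L coding task:
# observed agreement corrected for chance agreement.
# The rating sequences below are invented toy data.
def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

a = ["O"] * 45 + ["L"] * 5 + ["O"] * 3 + ["L"] * 47  # rater A, 100 items
b = ["O"] * 50 + ["L"] * 50                          # rater B, 100 items
kappa = cohens_kappa(a, b)  # high agreement after chance correction
```

Krippendorff's α for the nine-dimension coding follows the same logic but generalizes to multiple raters and a distance metric over categories, which is why it is the reported statistic for the full-dimension layer.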

These protocols provide internal evidence against coding error, subsample selection, and statistical flukes. External replication requires the public artifact plan (below).

Public Artifact Plan

Code: The ErisML library is publicly available, including the D₄ gauge module, Bond Index calculator, Wilson observable implementation, and transformation suite engine (434 passing tests covering group axioms, defining relations, non-abelian structure, semantic gates, and Hohfeldian state actions).

Data: Public-domain portions of the BIP corpus (classical texts whose copyright has expired) are released directly. The Dear Abby corpus cannot be released due to syndication copyright; anonymized aggregate statistics (dimension-weight distributions, temporal trends, gate-transition matrices) are released in its place.

Models: Trained LaBSE weights with the adversarial heads will follow, for reproducibility.

The moral world is not a philosopher’s armchair. It is a space with structure — measurable structure, testable structure, structure that shows up in 20,030 advice column letters, in 109,294 passages spanning 3,000 years and 11 languages, in 16,798 commutator measurements confirming that moral frameworks do not commute, and in the step-function responses to “only if convenient” and “you promised.”

The geometry is real. The evidence is accumulating. And the framework that predicted the evidence — manifolds, tensors, metrics, stratification, gauge invariance, conservation laws — is not merely an elegant mathematical exercise. It is a description of something that data confirms: the mathematical structure of moral reasoning.

The evidence does not prove that ethics is geometry. It proves that geometry sees what scalar frameworks miss — and that what it sees is confirmed by the data we have.