This appendix provides a complete specification of the Bond Invariance Principle (BIP) experiment described in Chapters 11–12, sufficient for independent replication. The experiment tests whether Hohfeldian moral relations—Right, Duty, Liberty, No-Right—constitute a gauge-invariant structure that persists across languages, cultures, and millennia. We train an adversarial disentanglement model on ancient Hebrew and Aramaic ethical texts (500 BCE–1800 CE) and test zero-shot transfer to modern American English advice columns (1956–2020), with supplementary evaluation on Classical Chinese and Arabic corpora.
All code and data are available at github.com/ahb-sjsu/sqnd-probe. Configuration files referenced below are config_bip.yaml and corpora.yaml in the repository root.
The primary training corpus is drawn from the Sefaria Project (sefaria.org), a freely available, community-maintained library of Jewish texts. The Sefaria-Export repository provides the complete digitized corpus as structured JSON files organized by category, with parallel Hebrew/Aramaic originals and English translations.
Acquisition. The corpus is obtained by shallow-cloning the Sefaria-Export repository:
git clone --depth 1 https://github.com/Sefaria/Sefaria-Export.git
# Approximate size: 8.3 GB
The JSON directory contains subdirectories for each category of text (Tanakh, Mishnah, Talmud, Midrash, Halakhah, Kabbalah, Musar, Responsa, etc.). Each text is stored as a JSON file with he (Hebrew/Aramaic original) and text or en (English translation) fields, organized hierarchically by chapter and verse or folio and line.
Temporal bins. Passages are assigned to temporal periods based on the Sefaria directory structure and category metadata:
| Code | Period | Date Range | Primary Languages | Representative Texts |
|---|---|---|---|---|
| BIB | Biblical | 1000–500 BCE | Biblical Hebrew | Torah, Prophets, Writings |
| ST | Second Temple | 500 BCE–70 CE | Late Hebrew | Apocrypha, sectarian texts |
| TAN | Tannaitic | 70–200 CE | Mishnaic Hebrew | Mishnah, Tosefta |
| AMO | Amoraic | 200–500 CE | Aramaic, Hebrew | Talmud Bavli, Talmud Yerushalmi, Midrash |
| GEO | Geonic | 600–1000 CE | Judeo-Arabic, Hebrew | Responsa literature |
| RIS | Rishonim | 1000–1500 CE | Medieval Hebrew | Rashi, Maimonides, Nachmanides |
| ACH | Achronim | 1500–1800 CE | Early Modern Hebrew | Shulchan Aruch, Chasidic texts |
Period assignment is performed by a deterministic mapping from Sefaria category to period:
CATEGORY_TO_PERIOD = {
    'Tanakh': 'BIBLICAL', 'Torah': 'BIBLICAL',
    'Prophets': 'BIBLICAL', 'Writings': 'BIBLICAL',
    'Mishnah': 'TANNAITIC', 'Tosefta': 'TANNAITIC',
    'Talmud': 'AMORAIC', 'Bavli': 'AMORAIC',
    'Yerushalmi': 'AMORAIC', 'Midrash': 'AMORAIC',
    'Halakhah': 'RISHONIM', 'Kabbalah': 'RISHONIM',
    'Philosophy': 'RISHONIM',
    'Musar': 'ACHRONIM', 'Responsa': 'ACHRONIM',
    'Chasidut': 'ACHRONIM',
}
Passage extraction. The SefariaLoader recursively traverses the JSON directory tree. For each JSON file, it extracts paired (Hebrew/Aramaic original, English translation) passages by flattening nested list structures. Priority is given to ethically rich tractates: the Talmudic orders Nezikin (damages), Nashim (women/family law), and Mo’ed (festivals), as well as the Mishnaic tractates Pirkei Avot, Bava Kamma, Bava Metzia, Sanhedrin, and Ketubot.
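The flattening step can be sketched as follows. This is a minimal illustration, not the repository's SefariaLoader: `flatten_pairs` is a hypothetical helper, and it assumes the Hebrew and English structures are nested lists of strings with parallel shape.

```python
from typing import Iterator, List, Tuple, Union

Nested = Union[str, List["Nested"]]


def flatten_pairs(
    he: Nested, en: Nested, ref: Tuple[int, ...] = ()
) -> Iterator[Tuple[str, str, str]]:
    """Yield (reference, original, translation) for each aligned leaf string."""
    if isinstance(he, str) and isinstance(en, str):
        # Keep only leaves where both sides are non-empty.
        if he.strip() and en.strip():
            yield (".".join(str(i) for i in ref), he, en)
        return
    if isinstance(he, list) and isinstance(en, list):
        # zip stops at the shorter list, so unaligned trailing items are skipped.
        for idx, (h, e) in enumerate(zip(he, en), start=1):
            yield from flatten_pairs(h, e, ref + (idx,))
    # Mismatched shapes (string vs. list) are silently skipped in this sketch.


he_text = [["בראשית", "והארץ"], ["ויאמר"]]
en_text = [["In the beginning", "And the earth"], ["And God said", "extra"]]
pairs = list(flatten_pairs(he_text, en_text))
```

The dotted reference (for example `1.2`) stands in for the chapter/verse or folio/line position; the actual reference format used by the loader may differ.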
Text cleaning. All passages undergo Unicode normalization, whitespace normalization, and length filtering (50–2,000 characters).
Language detection. Passages from Talmud Bavli, Targums, and Zohar are tagged as Aramaic; all others as Hebrew. Language is assigned from Sefaria metadata rather than automatic detection, since the two languages frequently co-occur.
Deduplication. Each passage receives a deterministic ID: the first 12 hex characters of the MD5 hash of the source name, structural reference, and first 50 characters of the Hebrew original. Re-runs produce identical IDs, and exact duplicates map to the same ID, so they can be dropped by ID.
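A minimal sketch of the ID scheme described above. The function name `passage_id` and the `|` separator between fields are assumptions; the text specifies only which fields are hashed and how many hex characters are kept.

```python
import hashlib


def passage_id(source: str, ref: str, hebrew: str) -> str:
    """Return the first 12 hex characters of an MD5 digest over the key fields."""
    # Assumption: fields are joined with "|" before hashing.
    key = f"{source}|{ref}|{hebrew[:50]}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:12]


id_a = passage_id("Mishnah Bava Kamma", "1.1", "ארבעה אבות נזיקין")
id_b = passage_id("Mishnah Bava Kamma", "1.1", "ארבעה אבות נזיקין")
id_c = passage_id("Mishnah Bava Kamma", "1.2", "ארבעה אבות נזיקין")
```

Because the hash depends only on the input fields, `id_a` and `id_b` are identical across runs, while a different reference yields a different ID.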
The final training corpus comprises approximately 3.9 million passages with both Hebrew/Aramaic original and English translation. For training, the model receives only the original Hebrew/Aramaic text (or the English text in the appropriate experimental condition), never the translation.
The test corpus consists of letters from the “Dear Abby” syndicated advice column, spanning 1956–2020. The corpus was chosen because it is distant from the training corpus along four major axes: language (American English vs. Hebrew/Aramaic), era (20th–21st century vs. ancient/medieval), genre (informal advice-seeking vs. legal/religious commentary), and cultural context (secular American vs. Jewish legal tradition).
Source. The Dear Abby dataset is available as a CSV file via Kaggle. The columns include the date of publication, the reader’s question (letter), and the columnist’s answer (response).
Preprocessing. Each letter is formatted as a combined question-answer pair (QUESTION: [text] ANSWER: [response]). The same cleaning pipeline (Unicode normalization, whitespace normalization, length filtering at 50–2,000 characters) is applied. Year of publication is extracted for temporal metadata; topical category (family, marriage, work, friendship, etc.) is assigned by keyword regex matching.
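The cleaning and formatting steps can be sketched as below. The helper names are hypothetical, the choice of NFC as the normalization form is an assumption, and applying the length filter to the combined question-answer string (rather than to each part) is also an assumption.

```python
import re
import unicodedata
from typing import Optional

MIN_CHARS, MAX_CHARS = 50, 2000


def clean_text(text: str) -> Optional[str]:
    """Normalize Unicode and whitespace; return None if the length is out of range."""
    text = unicodedata.normalize("NFC", text)  # assumption: NFC form
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    if not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return None
    return text


def format_letter(question: str, answer: str) -> Optional[str]:
    """Combine a letter and its reply into the QUESTION/ANSWER format."""
    return clean_text(f"QUESTION: {question} ANSWER: {answer}")


example = format_letter(
    "My  sister\nborrowed money and never repaid it.",
    "Ask her directly, and put any future loans in writing.",
)
too_short = clean_text("short")
```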
Deduplication. Letter IDs are the first 12 hex characters of the MD5 hash of the date and first 50 characters of the question. The final test corpus comprises approximately 68,000 letters, with an additional 20,000-letter archive (1985–2017) available for extended evaluation.
Two additional corpora test cross-cultural invariance beyond the Semitic–Germanic language family boundary.
Classical Chinese. The core Chinese corpus draws from the Chinese Text Project (ctext.org) and includes the Analects, Mencius, Dao De Jing, Zhuangzi, Daxue, and Zhongyong, representing the Confucian and Daoist traditions (approximately 500 BCE–200 CE). In the v10.9 expansion, the corpus is diversified to include:
This diversification tests whether Chinese transfer performance reflects genuine invariant structure or corpus homogeneity.
Arabic. The Arabic corpus includes the Quran, Hadith collections, Islamic legal maxims (qawa’id fiqhiyya), Sufi ethical texts (al-Ghazali, al-Qushayri, Rumi), and Arabic philosophical ethics (al-Farabi, Ibn Rushd, Ibn Khaldun). The v10.8 corpus (~180 passages) is expanded substantially in v10.9.
Sanskrit/Pali (planned). Passages from the Mahabharata, Bhagavad Gita, Manusmriti, Upanishads, Dhammapada, and Vinaya Pitaka serve as an independence test: an Indo-European tradition developed independently of both Semitic and Sinitic moral thought.
Hohfeldian labels are assigned to passages through a hybrid extraction pipeline. The primary method is pattern-based, with optional LLM augmentation for ambiguous cases.
Pattern-based extraction. For each passage, the English text (the translation for ancient texts, the original for Dear Abby) is scanned against compiled regular expression patterns for each Hohfeldian state:
| Hohfeldian State | English Patterns | Hebrew Patterns |
|---|---|---|
| OBLIGATION (Duty) | must, shall, duty, required, obligated, responsible, have to, need to | חייב (chayav), מצווה (mitzvah) |
| RIGHT (Claim) | right to, entitled, deserve, owed, demand, may claim | זכאי (zakai) |
| LIBERTY (Privilege) | may, can, permitted, allowed, free to, optional, voluntary | מותר (mutar), רשות (reshut) |
| NO-RIGHT (Exposure) | cannot demand, no right, not entitled, may not require | אסור (asur) |
Each pattern is compiled with case-insensitive matching. The Hohfeldian state with the highest match count across its pattern set is assigned as the passage’s primary label. Passages matching no patterns receive a NONE label and are excluded from supervised training on the Hohfeldian classification task.
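A minimal sketch of the match-count rule, using a subset of the English patterns from the table. The word-boundary anchors, the `NO_RIGHT` spelling, and the tie-breaking rule (first state in dictionary order wins) are assumptions not specified in the text.

```python
import re

# Subset of the English patterns, for illustration only.
PATTERNS = {
    "OBLIGATION": [r"\bmust\b", r"\bshall\b", r"\bduty\b", r"\brequired\b"],
    "RIGHT": [r"\bright to\b", r"\bentitled\b", r"\bdeserve\b", r"\bowed\b"],
    "LIBERTY": [r"\bpermitted\b", r"\ballowed\b", r"\bfree to\b", r"\boptional\b"],
    "NO_RIGHT": [r"\bcannot demand\b", r"\bno right\b", r"\bnot entitled\b"],
}
COMPILED = {
    state: [re.compile(p, re.IGNORECASE) for p in patterns]
    for state, patterns in PATTERNS.items()
}


def hohfeld_label(text: str) -> str:
    """Return the state with the most pattern matches, or NONE if nothing matches."""
    counts = {
        state: sum(len(rx.findall(text)) for rx in regexes)
        for state, regexes in COMPILED.items()
    }
    # max() returns the first maximal key, so ties go to the earlier state.
    best_state = max(counts, key=counts.get)
    return best_state if counts[best_state] > 0 else "NONE"


label_duty = hohfeld_label("You MUST repay the debt; it is your duty.")
label_right = hohfeld_label("She is entitled to the wages she is owed.")
label_none = hohfeld_label("The weather was pleasant.")
```

Overlapping patterns are one source of the label noise discussed later: a phrase such as "not entitled" also matches the RIGHT pattern "entitled", so the outcome then depends on the tie-breaking rule.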
Bond type extraction. In addition to Hohfeldian state, each passage is labeled with one or more of ten bond types, identified by a separate set of domain-specific patterns:
| Bond Type | Representative Patterns |
|---|---|
| HARM_PREVENTION | kill, murder, harm, hurt, save, rescue, protect |
| RECIPROCITY | return, repay, owe, debt, mutual, exchange |
| AUTONOMY | choose, decision, consent, force, coerce |
| PROPERTY | property, own, steal, buy, sell, land |
| FAMILY | honor parent, marry, divorce, inherit |
| AUTHORITY | obey, command, law, judge, permit, forbid |
| CARE | care, help, assist, feed, visit, charity |
| FAIRNESS | fair, just, equal, deserve, bias |
| EMERGENCY | emergency, urgent, life-threatening, pikuach nefesh |
| CONTRACT | promise, vow, oath, agreement, pledge |
Consent status. Each passage is also classified by consent status (EXPLICIT_YES, IMPLICIT_YES, CONTESTED, IMPLICIT_NO, EXPLICIT_NO, IMPOSSIBLE), determined by a third pattern set targeting consent-indicative language. Disputed passages—those containing markers such as “machlok,” “some say… others say,” or multiple attributed rabbinic opinions—are flagged and assigned to a “contested” consensus tier.
Known limitations. The pattern-based approach is deliberately noisy. Label noise makes transfer harder, not easier, so successful transfer despite noise constitutes stronger evidence for invariant structure. A hybrid mode (pattern + LLM classification via Claude 3 Haiku, confidence threshold 0.7) is specified in config_bip.yaml for production use. The experiments in Chapters 11–12 use pattern-based labels only.
The base encoder is a pretrained multilingual sentence transformer. Two primary configurations, plus one alternate, are used across experimental versions:
| Version | Model | Embedding Dim | Languages |
|---|---|---|---|
| v10.8 (paper draft) | paraphrase-multilingual-MiniLM-L12-v2 | 384 | 50+ languages |
| v10.9+ (full tensor) | sentence-transformers/all-mpnet-base-v2 | 768 | English-dominant, multilingual |
| Alternate | LaBSE (sentence-transformers/LaBSE) | 768 | 109 languages |
The encoder maps each passage to a fixed-dimensional embedding $h \in \mathbb{R}^{d}$ via mean pooling over the last hidden state, weighted by the attention mask:
$$h = \frac{\sum_{i=1}^{L} m_i \cdot h_i}{\sum_{i=1}^{L} m_i}$$
where $h_i$ is the hidden state at position $i$, $m_i$ is the attention mask value, and $L$ is the sequence length. If the encoder’s native hidden size differs from the target $d_\text{model}$, a learned linear projection maps to the target dimension.
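The masked mean-pooling formula can be illustrated with NumPy. This is a sketch of the arithmetic only, not the repository's encoder code; the small clipping constant that guards against an all-zero mask is an added assumption.

```python
import numpy as np


def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """hidden: (batch, L, d) hidden states; mask: (batch, L) with 1 for real tokens."""
    m = mask[..., None].astype(hidden.dtype)       # (batch, L, 1)
    summed = (hidden * m).sum(axis=1)              # numerator: sum of m_i * h_i
    counts = np.clip(m.sum(axis=1), 1e-9, None)    # denominator: sum of m_i
    return summed / counts


hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])  # third position is padding and is ignored
pooled = mean_pool(hidden, mask)
```

Here the padded third position does not affect the result, which is the average of the first two hidden states.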
The encoder may be frozen (all parameters fixed) or fine-tuned. The default configuration fine-tunes the encoder with a lower learning rate than the classification heads. Maximum sequence length is 256 tokens (v10.8) or 512 tokens (v10.9+).
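The core architectural innovation is the disentanglement of the encoder representation into two complementary latent spaces: $z_\text{bond}$ (64 or 128 dimensions), which feeds the Hohfeld and bond-type classifiers and, through gradient reversal, the adversarial time, language, and genre classifiers; and $z_\text{label}$ (32 or 64 dimensions), which feeds the control time classifier.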
Both projections use a variational architecture with the reparameterization trick:
h (768-dim)
|
+---> bond_mean: Linear(768, 384) -> ReLU -> Dropout(0.1) -> Linear(384, d_bond)
+---> bond_logvar: Linear(768, 384) -> ReLU -> Dropout(0.1) -> Linear(384, d_bond)
|
+---> label_mean: Linear(768, 384) -> ReLU -> Dropout(0.1) -> Linear(384, d_label)
+---> label_logvar: Linear(768, 384) -> ReLU -> Dropout(0.1) -> Linear(384, d_label)
During training, $z_\text{bond} = \mu_\text{bond} + \sigma_\text{bond} \cdot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$ and $\sigma_\text{bond} = \exp(0.5 \cdot \text{logvar}_\text{bond})$. During evaluation, $z_\text{bond} = \mu_\text{bond}$.
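The sampling rule above can be sketched in NumPy as follows. The function name `sample_latent` is hypothetical, and the sketch covers only the sampling step, not the projection layers.

```python
import numpy as np


def sample_latent(mu, logvar, training, rng):
    """Reparameterized sample during training; the mean during evaluation."""
    if not training:
        return mu
    sigma = np.exp(0.5 * logvar)          # sigma = exp(0.5 * logvar)
    eps = rng.standard_normal(mu.shape)   # eps ~ N(0, I)
    return mu + sigma * eps


rng = np.random.default_rng(0)
mu = np.array([[0.5, -1.0, 2.0]])
logvar = np.full_like(mu, -40.0)  # near-zero variance, for illustration

z_eval = sample_latent(mu, logvar, training=False, rng=rng)
z_train = sample_latent(mu, logvar, training=True, rng=rng)
```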
The model has multiple classification heads, each a two-layer MLP:
| Head | Input | Output Classes | Objective |
|---|---|---|---|
| Hohfeld Classifier | $z_\text{bond}$ | 4 (Right, Duty, Liberty, No-Right) | Maximize accuracy |
| Bond Type Classifier | $z_\text{bond}$ | 10 (bond types) | Maximize accuracy |
| Time Classifier (adversarial) | GRL($z_\text{bond}$) | 8–26 (time periods) | Minimize accuracy (~chance) |
| Time Classifier (control) | $z_\text{label}$ | 8–26 (time periods) | Maximize accuracy |
| Language Classifier (adversarial) | GRL($z_\text{bond}$) | 4–8 (languages) | Minimize accuracy (~chance) |
| Genre Classifier (adversarial) | GRL($z_\text{bond}$) | 4 (legal, narrative, wisdom, advice) | Minimize accuracy (~chance) |
Each classifier head follows the architecture:
Linear(d_input, hidden_dim) -> ReLU -> Dropout(p) -> Linear(hidden_dim, n_classes)
where hidden_dim = 128 for adversarial heads and 64 for task heads, and dropout $p = 0.2$ for adversarial heads, $p = 0.1$ for task heads.
The adversarial heads receive input through a Gradient Reversal Layer (GRL; Ganin et al., 2016). The GRL is the identity function in the forward pass and negates (and scales) the gradient in the backward pass:
$$\text{GRL}(x) = x \quad \text{(forward)}$$ $$\frac{\partial \text{GRL}}{\partial x} = -\lambda \cdot I \quad \text{(backward)}$$
where $\lambda$ is the reversal strength. This forces $z_\text{bond}$ to become uninformative about the adversarial target (time period, language, genre) while remaining informative about the task target (Hohfeldian state, bond type). The parameter $\lambda$ controls the adversarial pressure; see Section C.3.2 for the schedule.
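A gradient reversal layer is commonly written as a custom autograd function in PyTorch. The following is a minimal sketch of that standard pattern, not necessarily the repository's implementation; the names `GradientReversal` and `grl` are illustrative.

```python
import torch


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward input; lam receives no gradient.
        return -ctx.lam * grad_output, None


def grl(x, lam=1.0):
    return GradientReversal.apply(x, lam)


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = grl(x, lam=0.5)
y.sum().backward()
```

After the backward pass, `x.grad` holds the negated, scaled gradient, so an optimizer step that lowers the adversarial head's loss pushes the upstream representation in the opposite direction.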
Input text
|
v
[Encoder: all-mpnet-base-v2 or LaBSE]
|
v
h (768-dim)
|
+-------> [Bond Projection (VAE)] --> z_bond (64 or 128-dim)
| |
| +----+----+----+
| | | | |
| v v v v
| [HOH] [BND] [GRL]-->[TIME_ADV]
| cls cls [LANG_ADV]
| [GENRE_ADV]
|
+-------> [Label Projection (VAE)] --> z_label (32 or 64-dim)
|
v
[TIME_CTL]
cls
The total loss is a weighted sum of six components:
$$\mathcal{L} = \lambda_\text{hohf} \mathcal{L}_\text{hohf} + \lambda_\text{bond} \mathcal{L}_\text{bond} + \lambda_\text{time} \mathcal{L}_\text{time} + \lambda_\text{adv} \mathcal{L}_\text{adv} + \lambda_\text{kl} \mathcal{L}_\text{kl} + \lambda_\text{bip} \mathcal{L}_\text{bip}$$
where:
| Parameter | Value | Notes |
|---|---|---|
| Optimizer | AdamW | $\beta_1 = 0.9$, $\beta_2 = 0.999$ |
| Learning rate | $2 \times 10^{-5}$ | For both encoder and heads |
| Weight decay | 0.01 | |
| LR schedule | OneCycleLR | 10% warmup, cosine annealing |
| Warmup steps | 1,000 | |
| Batch size | 32 | |
| Max epochs | 50 | |
| Early stopping patience | 5 epochs | On validation loss |
| Gradient clipping | 1.0 (max norm) | |
| Max sequence length | 512 tokens | |
| $\lambda_\text{hohf}$ | 1.0 | |
| $\lambda_\text{bond}$ | 1.0 | |
| $\lambda_\text{time}$ | 1.0 | |
| $\lambda_\text{adv}$ | 1.0 | Fixed; GRL $\lambda$ controls adversarial strength |
| $\lambda_\text{kl}$ | 0.1 | Low weight prevents posterior collapse |
| $\lambda_\text{bip}$ | 2.0 | High weight emphasizes structural invariance |
| GRL $\lambda$ | 1.0 | Fixed in primary experiments |
| Dropout (task heads) | 0.1 | |
| Dropout (adversarial heads) | 0.2 | |
| InfoNCE temperature | 0.07 | |
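The weighted sum in the total loss, with the weights tabulated above, can be sketched as plain arithmetic. The component values in `example` are made up purely to illustrate the calculation, and `total_loss` is a hypothetical helper.

```python
# Loss weights from the hyperparameter table.
LOSS_WEIGHTS = {"hohf": 1.0, "bond": 1.0, "time": 1.0, "adv": 1.0, "kl": 0.1, "bip": 2.0}


def total_loss(components, weights=LOSS_WEIGHTS):
    """Weighted sum of the six loss components."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing loss components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Illustrative values only; these are not measured losses.
example = {"hohf": 1.2, "bond": 2.0, "time": 1.5, "adv": 1.4, "kl": 3.0, "bip": 0.5}
loss = total_loss(example)
```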
Each epoch proceeds as follows:
The primary experiments use a fixed $\lambda = 1.0$ for the gradient reversal layer. An alternative schedule, ramping $\lambda$ from 0 to 1 over the first 20% of training, was tested but did not produce significantly different results. The fixed schedule is preferred for simplicity and reproducibility.
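The two schedules can be sketched as a single function. The text does not state the shape of the ramp, so a linear ramp is assumed here; `grl_lambda` and its parameters are illustrative names.

```python
def grl_lambda(step, total_steps, ramp_fraction=0.2, fixed=False):
    """Return the GRL strength at a given training step."""
    if fixed:
        return 1.0  # primary experiments: constant lambda
    # Assumption: linear ramp from 0 to 1 over the first ramp_fraction of training.
    ramp_steps = max(1, int(ramp_fraction * total_steps))
    return min(1.0, step / ramp_steps)


schedule = [grl_lambda(s, 1000) for s in (0, 100, 200, 900)]
```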
| Split | Content |
|---|---|
| Train | ~3.9M Hebrew/Aramaic passages (BIB, ST, TAN, AMO, GEO, RIS periods; 500 BCE–1500 CE) |
| Validation | Achronim passages (ACH period; 1500–1800 CE), held out |
| Test | ~68K Dear Abby letters (1956–2020, American English) |
This is the primary BIP test. The model is trained exclusively on ancient and medieval Hebrew/Aramaic text and evaluated on modern English with no additional training, adaptation, or fine-tuning.
| Split | Content |
|---|---|
| Train | ~68K Dear Abby letters (1956–2020) |
| Validation | ~6K Dear Abby letters, held out |
| Test | ~50K Hebrew passages, sampled from the Sefaria corpus |
Bidirectional transfer controls for the possibility that the result is an artifact of training on the larger corpus. Transfer in both directions would indicate that the shared structure does not depend on which corpus supplies the training signal.
| Split | Content |
|---|---|
| Train | 70% mixed ancient + modern |
| Test | 30% mixed ancient + modern |
This establishes the ceiling for in-distribution performance.
For the temporal invariance protocol (Chapter 12), all 56 pairwise transfers between the 8 time periods are computed as a SLURM array job (one transfer experiment per array task). Each cell in the $8 \times 8$ matrix (minus the diagonal) records the Hohfeldian classification accuracy when training on one period and testing on another.
For each of the 8 temporal periods with sufficient data (>100 passages), the model is trained on all other periods and tested on the held-out period. This yields 8 independent tests of temporal transfer.
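The two protocols can be enumerated as below. The period code `MODERN` for the eighth bin and the mapping from array task index to transfer pair are assumptions made for illustration.

```python
from itertools import permutations

# Seven ancient/medieval codes from the temporal-bin table; "MODERN" is a
# hypothetical code for the eighth (modern) bin.
PERIODS = ["BIB", "ST", "TAN", "AMO", "GEO", "RIS", "ACH", "MODERN"]

# All ordered (train, test) pairs with train != test: 8 * 7 = 56.
transfer_pairs = list(permutations(PERIODS, 2))


def pair_for_task(task_id):
    """Map an array task index (e.g. SLURM_ARRAY_TASK_ID) to one transfer pair."""
    return transfer_pairs[task_id]


def leave_one_out(periods):
    """Yield (training periods, held-out period) for each period."""
    for held_out in periods:
        yield [p for p in periods if p != held_out], held_out


folds = list(leave_one_out(PERIODS))
```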
F1 (macro). The primary metric for Hohfeldian classification. Macro F1 gives equal weight to each of the four classes regardless of class frequency, ensuring that minority classes (typically No-Right) are not ignored. Reported as the mean across 5 random seeds with 95% confidence intervals.
Accuracy. Overall classification accuracy, with Wilson score 95% confidence intervals.
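The Wilson score interval for a proportion can be computed as follows. This is the standard formula; the function name and the example counts are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


low, high = wilson_interval(50, 100)  # illustrative counts
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] even when the observed proportion is 0 or 1.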
Per-class precision and recall. For each Hohfeldian state (Right, Duty, Liberty, No-Right), we report precision $P_c = \text{TP}_c / (\text{TP}_c + \text{FP}_c)$ and recall $R_c = \text{TP}_c / (\text{TP}_c + \text{FN}_c)$.
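Per-class precision, recall, and macro F1 can be computed directly from the definitions above. In this sketch, a class with an empty denominator is given a score of 0, which is a convention assumed here rather than stated in the text; the labels are made up for illustration.

```python
def per_class_prf(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} from true and predicted labels."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1."""
    scores = per_class_prf(y_true, y_pred, classes)
    return sum(f1 for _, _, f1 in scores.values()) / len(classes)


CLASSES = ["RIGHT", "DUTY", "LIBERTY", "NO_RIGHT"]
y_true = ["DUTY", "DUTY", "RIGHT", "LIBERTY", "NO_RIGHT", "DUTY"]
y_pred = ["DUTY", "RIGHT", "RIGHT", "LIBERTY", "DUTY", "DUTY"]
scores = per_class_prf(y_true, y_pred, CLASSES)
score = macro_f1(y_true, y_pred, CLASSES)
```

In this toy example the missed No-Right class contributes an F1 of 0 and pulls the macro average down, which is the behavior that motivates using macro F1 for imbalanced classes.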
Language probe accuracy. After training, a fresh linear probe is trained on frozen $z_\text{bond}$ representations to predict the source language. Successful disentanglement is indicated by probe accuracy near the chance level (25% for 4 languages, 12.5% for 8 languages). The reported results, 1.2% (below chance) and 0.0% for the cross-lingual condition, are consistent with $z_\text{bond}$ containing no linearly recoverable language information.
Time period probe accuracy. Analogously, a fresh linear probe trained on frozen $z_\text{bond}$ to predict time period should achieve near-chance accuracy. Near-chance time prediction from $z_\text{bond}$, combined with above-chance Hohfeldian prediction, demonstrates that the bond space encodes moral structure but not temporal-stylistic features.
Time Invariance Score (TIS). A composite metric:
$$\text{TIS} = 1 - \frac{\text{Time\_acc} - \text{chance}}{1 - \text{chance}}$$
where Time_acc is the adversarial classifier’s accuracy from $z_\text{bond}$. TIS > 0.90 indicates strong temporal invariance.
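A minimal sketch of the TIS calculation. The text does not define the chance level for unbalanced period counts, so uniform chance (one over the number of periods) is assumed; the accuracies below are illustrative.

```python
def time_invariance_score(time_acc, n_periods):
    """TIS = 1 - (time_acc - chance) / (1 - chance), assuming uniform chance."""
    chance = 1.0 / n_periods
    return 1.0 - (time_acc - chance) / (1.0 - chance)


tis_at_chance = time_invariance_score(0.125, 8)  # accuracy exactly at chance
tis_perfect = time_invariance_score(1.0, 8)      # classifier fully recovers period
tis_example = time_invariance_score(0.20, 8)     # illustrative accuracy
```

With 8 periods, an adversarial accuracy of 20% still yields a TIS above the 0.90 threshold. Accuracy below chance produces a TIS above 1.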
The Bond Index quantifies the structural consistency of Hohfeldian classifications by measuring correlative symmetry violations. Hohfeldian positions come in correlative pairs: Duty $\leftrightarrow$ Right, and Liberty $\leftrightarrow$ No-Right. If the model classifies one party as having a Duty, the correlative party should have a Right. A classification that assigns Duty to one party and No-Right to the other constitutes a symmetry violation.
$$\text{Bond Index} = \frac{\text{observed violations}}{\text{maximum possible violations}}$$
A Bond Index of 0.0 indicates perfect correlative symmetry; 0.5 indicates random classification; values above 0.5 indicate systematic anti-correlation.
To validate that $z_\text{bond}$ captures structural rather than surface features, we compute embedding distances under two perturbation types:
The ratio of mean structural distance to mean surface distance (the “structural sensitivity ratio”) should be significantly greater than 1.0. The reported ratio of 11.1x ($t = 2.46$, $p = 0.023$) confirms structural sensitivity with surface invariance.
PCA of $z_\text{bond}$. Principal component analysis on the bond space reveals its effective dimensionality. The reported finding is that 3 components explain 90% of variance, consistent with the theoretical prediction that Hohfeldian relations form a bounded, low-dimensional manifold.
Manifold comparison. For each time period, UMAP projections of $z_\text{bond}$ are computed. If BIP holds, the manifolds should be congruent (similar cluster structure, similar topological features) across periods, differing only in orientation. Procrustes analysis quantifies the alignment between period-specific manifolds.
All reported accuracies and F1 scores are tested against the chance baseline (25% for 4-class Hohfeldian classification) using a $z$-test:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$
The primary result ($z = 52.3$, $p < 10^{-50}$) is computed on the full test set of ~68,000 letters. Effect size is reported as Cohen’s $h = 2 \arcsin(\sqrt{\hat{p}}) - 2 \arcsin(\sqrt{p_0})$.
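The two formulas can be evaluated as below. The inputs are illustrative round numbers chosen to make the arithmetic easy to check; they are not the reported results.

```python
import math


def z_vs_chance(p_hat, p0, n):
    """One-sample z statistic for a proportion against a chance baseline p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def cohens_h(p_hat, p0):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p_hat)) - 2 * math.asin(math.sqrt(p0))


# Illustrative inputs: 50% accuracy against a 25% baseline on 1,200 items.
z_stat = z_vs_chance(0.50, 0.25, 1200)
h = cohens_h(0.50, 0.25)
```

The z statistic grows with the square root of the sample size, whereas Cohen's h depends only on the two proportions, which is why both are reported.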
All metrics are reported with 95% confidence intervals computed by one of:
The full transfer matrix involves 56 pairwise tests. Bonferroni correction sets the per-comparison significance threshold at $\alpha = 0.05 / 56 \approx 0.00089$. The significance level of $\alpha = 0.001$ specified in config_bip.yaml is slightly above this threshold, so it does not by itself satisfy the Bonferroni bound; transfer-matrix results should be judged against $0.00089$.
For the full tensor experiment (Chapter 12, Section 12.2), a MANOVA decomposes variance in $z_\text{bond}$ across five factors: Time Period, Language, Genre, Bond Type, and Hohfeldian State. The key prediction is:
| Factor | Predicted $\eta^2$ | BIP Interpretation |
|---|---|---|
| Hohfeldian State | > 0.30 | Large effect: $z_\text{bond}$ is organized by moral structure |
| Bond Type | > 0.20 | Large effect: domain-specific moral content preserved |
| Time Period | < 0.05 | Negligible: temporal variation successfully removed |
| Language | < 0.05 | Negligible: linguistic variation successfully removed |
| Genre | < 0.10 | Small: some genre signal may remain |
The bootstrap comparison of effect sizes ($n_\text{bootstrap} = 10{,}000$) tests the directional hypothesis $H_1: \eta^2(\text{Time}) < \eta^2(\text{Hohfeld})$.
The eight temporal bins (seven ancient/medieval, one modern) are defined in Section C.1.1. For the Temporal Invariance Protocol specifically, the bins serve as both the holdout variable and the adversarial target. The critical design choice is that the bins are defined by scholarly periodization (based on literary, linguistic, and historical criteria), not by the model. This prevents circular reasoning.
The primary cross-temporal test uses the temporal holdout split:
The temporal gap between the latest training data (Rishonim, ~1500 CE) and the test data (Dear Abby, 1956–2020) is approximately 450–2,500 years depending on training period.
The leave-one-era-out protocol provides finer-grained analysis. For each era $e$ in the set of eras $E$, the model is trained on $E \setminus \{e\}$ and tested on $e$. This produces a per-era transfer profile showing which eras transfer to which. The transfer score for each cell is:
$$T_{s \to t} = \text{Hohfeld\_acc}_{s \to t} \times (1 - \text{Time\_acc}_{s \to t})$$
This penalizes configurations where Hohfeldian accuracy is high but the model is also successfully predicting time period—indicating that temporal features, not invariant structure, are driving the classification.
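The effect of the penalty can be seen in a small sketch. The accuracies in `results` are made-up values for two hypothetical cells, not measured results.

```python
def transfer_score(hohfeld_acc, time_acc):
    """T = Hohfeld accuracy * (1 - time accuracy) for one source-target cell."""
    return hohfeld_acc * (1.0 - time_acc)


# (source, target) -> (Hohfeld accuracy, time-period accuracy); illustrative only.
results = {
    ("TAN", "AMO"): (0.60, 0.10),
    ("AMO", "TAN"): (0.60, 0.50),
}
scores = {pair: transfer_score(h, t) for pair, (h, t) in results.items()}
```

Both cells have the same Hohfeldian accuracy, but the cell where time period remains predictable receives a markedly lower score.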
Translation artifact control. Hohfeldian labels for ancient texts are extracted from English translations, but the model is trained on original Hebrew/Aramaic. This design ensures that any transfer to English cannot be mediated by translation artifacts in the training signal. The model must learn structure present in the original language that happens to transfer to English—exactly the invariance claim.
Corpus size control. The Sefaria corpus (~3.9M passages) is ~57x larger than Dear Abby (~68K). In bidirectional transfer, the Sefaria test set is subsampled to 68K passages to match.
Genre control. The adversarial genre head forces $z_\text{bond}$ to discard genre information. The Dear Abby corpus is maximally genre-distant from all Sefaria categories, making genre-based shortcuts impossible.
Encoder pretraining control. The multilingual encoder was pretrained on general-purpose web text, not on either experimental corpus. It provides a shared multilingual coordinate system; moral structure must be learned by the task-specific heads.
All experiments use 5 random seeds: {42, 123, 456, 789, 1024}. For each seed, torch.manual_seed(seed), numpy.random.seed(seed), and random.seed(seed) are called before training. Results are reported as mean $\pm$ standard deviation across seeds.
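A seeding helper might look like the sketch below. `set_all_seeds` is a hypothetical name; the sketch seeds Python's module-level generator with `random.seed`, and seeds NumPy and PyTorch only if they are installed. Full determinism on GPU can require additional settings that are not covered here.

```python
import random

SEEDS = [42, 123, 456, 789, 1024]


def set_all_seeds(seed):
    """Seed Python, NumPy, and PyTorch random number generators where available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; skip
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; skip


set_all_seeds(SEEDS[0])
first_draw = random.random()
set_all_seeds(SEEDS[0])
second_draw = random.random()
```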
The experiments were developed and tested on the following hardware:
| Resource | Specification |
|---|---|
| Primary development | SJSU College of Engineering HPC cluster |
| GPU (full tensor) | NVIDIA H100 (80 GB VRAM) |
| GPU (transfer matrix) | NVIDIA P100 (12 GB VRAM) x 10, SLURM array |
| GPU (standard training) | NVIDIA V100 (16 GB VRAM) |
| CPU (preprocessing) | 16 cores, 64 GB RAM |
| Storage | 524 TB Lustre scratch (checkpoints), 100 TB /data (corpora) |
For replication on more modest hardware, a single GPU with 12+ GB VRAM (e.g., NVIDIA T4) is sufficient for training with a reduced batch size (16 instead of 32). Preprocessing is CPU-only.
python >= 3.10
torch >= 2.0.0
transformers >= 4.35.0
sentence-transformers >= 2.2.0
datasets >= 2.14.0
numpy >= 1.24.0
pandas >= 2.0.0
scipy >= 1.11.0
scikit-learn >= 1.3.0
spacy >= 3.6.0 (with en_core_web_lg model)
nltk >= 3.8.0
hebrew-tokenizer >= 2.3.0
pyarabic >= 0.6.0
matplotlib >= 3.7.0
seaborn >= 0.12.0
wandb >= 0.15.0 (optional, for experiment tracking)
tqdm >= 4.65.0
pyyaml >= 6.0.0
| Phase | Hardware | Expected Time |
|---|---|---|
| Sefaria git clone | Any (network-bound) | ~15 minutes |
| Preprocessing (4M passages) | 16 CPU cores | ~2 hours |
| Embedding extraction | 1x P100 | ~4 hours |
| Transfer matrix (56 experiments) | 10x P100 in parallel | ~2 hours |
| Full tensor training (20 epochs) | 1x H100 | ~8–12 hours |
| MANOVA analysis | 16 CPU cores | ~1 hour |
| Manifold analysis (UMAP + clustering) | 1x A100 | ~2 hours |
| Total wall time (with parallelization) | | ~18 hours |
| Total GPU-hours | | ~70 hours |
On a single T4 GPU (Kaggle-class hardware), the full pipeline requires approximately 72 hours of wall time. With only one GPU, the 56 transfer experiments must run sequentially, so the transfer matrix dominates the runtime.
| Resource | Location | Size |
|---|---|---|
| Sefaria-Export | github.com/Sefaria/Sefaria-Export | ~8.3 GB |
| Dear Abby dataset | Kaggle (see repository README) | ~50 MB |
| Chinese Classics | ctext.org | ~5 MB (curated sample) |
| Code and configuration | github.com/ahb-sjsu/sqnd-probe | ~100 MB |
sqnd-probe/
+-- config_bip.yaml # Primary configuration
+-- corpora.yaml # Corpus definitions and future plans
+-- data/
| +-- raw/
| | +-- Sefaria-Export/ # Cloned repository (~8.3 GB)
| | +-- dear_abby.csv # Kaggle dataset
| | +-- chinese/ # Chinese classics (JSON)
| | +-- islamic/ # Islamic texts (JSON)
| +-- processed/
| | +-- passages.jsonl # Preprocessed passages
| | +-- bond_structures.jsonl # Extracted bond structures
| | +-- corpus_stats.json # Corpus statistics
| +-- splits/
| +-- all_splits.json # All split definitions
+-- src/
| +-- data/
| | +-- preprocess.py # Corpus loading and preprocessing
| | +-- extract_bonds.py # Bond structure extraction
| | +-- generate_splits.py # Train/valid/test split generation
| | +-- verify_data.py # Data verification
| +-- models/
| | +-- bip_model.py # Model architecture
| +-- train.py # Training loop
| +-- evaluate.py # Evaluation and statistical tests
+-- models/
| +-- checkpoints/ # Saved model weights
+-- results/
+-- metrics/ # JSON metric files
+-- figures/ # Generated plots
For reference, the primary results reported in Chapter 11 are:
| Metric | Value | 95% CI |
|---|---|---|
| Bond F1 (macro), Ancient to Modern | 44.5% | 42.1–46.9% |
| Bond Accuracy, Ancient to Modern | 50.7% | 48.3–53.1% |
| Chance Baseline | 25.0% | — |
| $z$-statistic | 52.3 | $p < 10^{-50}$ |
| Cohen’s $h$ | 0.52 | Medium effect |
| Language Probe Accuracy from $z_\text{bond}$ | 0.0% | (chance = 25%) |
| Mixed Training F1 (ceiling) | 79.99% | — |
| Transfer Efficiency (transfer / ceiling) | 55.6% | — |
| Structural/Surface Perturbation Ratio | 11.1x | $p = 0.023$ |
| PCA: Components for 90% Variance | 3 | — |
| Obligation-Permission Transfer Accuracy | 100% | — |
Per-language transfer F1 (v10.8): Hebrew 64.5%, English 41.5%, Arabic 40.7%, Classical Chinese 64.9%.
For a graduate student wishing to replicate the primary result (Direction A: Ancient to Modern):
# 1. Clone and install
git clone https://github.com/ahb-sjsu/sqnd-probe.git
cd sqnd-probe
pip install -e ".[dev]"
# 2. Acquire data
mkdir -p data/raw
cd data/raw
git clone --depth 1 https://github.com/Sefaria/Sefaria-Export.git
# Place dear_abby.csv from Kaggle in data/raw/
cd ../..
# 3. Preprocess
python src/data/preprocess.py --config config_bip.yaml
# 4. Extract bond structures
python src/data/extract_bonds.py --config config_bip.yaml
# 5. Generate splits
python src/data/generate_splits.py --config config_bip.yaml
# 6. Train (temporal holdout split)
python src/train.py --config config_bip.yaml --split temporal_holdout
# 7. Evaluate
python src/evaluate.py --config config_bip.yaml \
--model models/checkpoints/temporal_holdout/best_model.pt
Expected output: Hohfeldian classification F1 in the range 42–47% on the Dear Abby test set, with time prediction accuracy from $z_\text{bond}$ near or below chance.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., & Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59), 1–35.
Hohfeld, W. N. (1913). Some fundamental legal conceptions as applied in judicial reasoning. Yale Law Journal, 23(1), 16–59.
Hohfeld, W. N. (1917). Fundamental legal conceptions as applied in judicial reasoning. Yale Law Journal, 26(8), 710–770.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proceedings of EMNLP.
Sefaria Project. (n.d.). Sefaria: A living library of Torah texts online. https://sefaria.org