Merged
27 commits
740b121
chore(multi): scaffold the multi-witness subpackage
bsesic May 22, 2026
304db3c
feat(multi): add GraphNode and GraphEdge types
bsesic May 22, 2026
fccd3cc
feat(multi): add VariantGraph container
bsesic May 22, 2026
6989eed
feat(multi): add VariantGraph.from_sequence and witness_path
bsesic May 22, 2026
3c86fe1
feat(multi): add VariantGraph.variants generator
bsesic May 22, 2026
830afc6
feat(multi): add AlignedTable, TableColumn, TableCell types
bsesic May 26, 2026
c1db620
feat(multi): add AlignedTable.re_anchor and format_text
bsesic May 26, 2026
4a33997
feat(multi): add GuideTreeNode, GuideTree, and format_text
bsesic May 26, 2026
874f7f2
feat(multi): add pairwise_distances using v0.1 align()
bsesic May 26, 2026
b4a0772
feat(multi): add UPGMA tree construction with deterministic tie-breaking
bsesic May 27, 2026
d1dad50
feat(multi): add post_order_witness_ids traversal helper
bsesic May 27, 2026
e080ba8
feat(multi): add node_match_score aggregation for sequence-vs-graph
bsesic May 27, 2026
db0d7b6
feat(multi): add topological-order helper for POA DP
bsesic May 27, 2026
9e392f5
feat(multi): add POA forward DP for sequence-vs-graph alignment
bsesic May 28, 2026
da5f1ed
feat(multi): add POA traceback over the backpointer table
bsesic May 28, 2026
ca377dc
feat(multi): add align_sequence_to_graph end-to-end merge
bsesic May 28, 2026
3d05e7a
feat(multi): add progressive_merge wrapper
bsesic May 28, 2026
8655126
feat(multi): add MultiAlignerConfig dataclass
bsesic May 28, 2026
6576bf3
feat(multi): add MultiAlignmentResult and align_multi()
bsesic May 28, 2026
a67e426
feat(api): re-export multi-witness types at top level
bsesic May 28, 2026
5d00806
feat(io): add MultiAlignmentResult JSON dump/load
bsesic May 28, 2026
a8b8771
test(multi): pin lossless reconstruction property
bsesic May 28, 2026
275276b
test(multi): pin permutation invariance property
bsesic May 28, 2026
773ab38
test(e2e): add synthetic Hebrew multi-witness golden test
bsesic May 28, 2026
a58d75f
docs(usage): add multi-witness alignment section
bsesic May 28, 2026
4ab6b02
docs: add multi-witness algorithm details and update roadmap
bsesic May 28, 2026
7bf848d
docs: revise README and docs for v0.2 multi-witness features
bsesic May 28, 2026
93 changes: 74 additions & 19 deletions README.md
@@ -1,6 +1,6 @@
# TRACE

**Textual Reuse, Alignment, and Collation Engine** — a Python library for pairwise philological alignment with pluggable language packs.
**Textual Reuse, Alignment, and Collation Engine** — a Python library for philological alignment with pluggable language packs. Pairwise (v0.1) and simultaneous multi-witness (v0.2) alignment.

[![CI](https://github.com/bsesic/trace/actions/workflows/workflow.yml/badge.svg)](https://github.com/bsesic/trace/actions/workflows/workflow.yml)
[![PyPI version](https://img.shields.io/pypi/v/tracealign.svg)](https://pypi.org/project/tracealign/)
@@ -17,20 +17,39 @@ TRACE is designed for textual criticism, manuscript witness comparison, and the

- **Tokenizer pipeline** with editorial-marker awareness (`[reconstructed]`, `⟦deletion⟧`, `〈insertion〉`, `(expanded)`, lacunae).
- **Tiered scoring** returning `(score, reason)` per token pair — `EXACT`, `NIQQUD_STRIPPED`, `PLENE_DEFECTIVE`, `ABBREVIATION`, `ORTHOGRAPHIC`, `INSERTION`, `OMISSION`, `NO_MATCH`.
- **Semi-global Needleman–Wunsch** with affine gap penalties (Gotoh) and a **multi-token abbreviation lookahead** (`ר"י` ↔ `רבי ישמעאל`).
- **Pairwise aligner** — semi-global Needleman–Wunsch with affine gap penalties (Gotoh) and a multi-token abbreviation lookahead (`ר"י` ↔ `רבי ישמעאל`).
- **Multi-witness aligner** (v0.2) — N witnesses aligned simultaneously into a canonical variant graph (DAG) plus a derived aligned table view, via pairwise distances → UPGMA guide tree → POA-based progressive merge. Determinism is pinned by a permutation-invariance property test; correctness by a lossless-reconstruction property test.
- **Hebrew language pack** with niqqud strip, plene/defective skeleton matching, gershayim/maqqef tokenizer hooks, and a seed lexicon of rabbinic abbreviations (extendable via `Lexica.merge()`).
- **I/O** for plain text, JSON (round-trip), eScriptorium exports (with bbox + line metadata), and TEI XML (`<tei:w>` mode + flow-text fallback).
- **Reproducible** — every `AlignmentResult` carries `trace_version` and `language_pack_version` in its params.
- **I/O** for plain text, JSON (round-trip for both pairwise and multi-witness results), eScriptorium exports (with bbox + line metadata), and TEI XML (`<tei:w>` mode + flow-text fallback).
- **Reproducible** — every `AlignmentResult` / `MultiAlignmentResult` carries `trace_version` and `language_pack_version` in its params.

## Installation

```bash
pip install tracealign
```

Requires Python 3.10+. Pulls `pydantic`, `numpy`, `lxml`, and `rapidfuzz`.
Requires Python 3.10, 3.11, or 3.12. Pulls `pydantic`, `numpy`, `lxml`, and `rapidfuzz`.

## Quick start
### From source

```bash
git clone https://github.com/bsesic/trace.git
cd trace
pip install -e ".[dev]"
```

The `dev` extra adds `pytest` and `flake8` (the project's quality gates). For documentation contributions, use `pip install -e ".[docs]"` to add Sphinx, furo, and myst-parser.

### Verifying the install

```bash
python -c "import tracealign; print(tracealign.__version__, tracealign.list_languages())"
```

This should print the current version and `['hbo']` (the Hebrew language pack registers itself on import).

## Quick start — pairwise

```python
import tracealign
@@ -62,31 +81,67 @@ summary: {EXACT: 3, NIQQUD_STRIPPED: 1, PLENE_DEFECTIVE: 1, ABBREVIATION: 1}
אמר ↔ אמר exact 1.00
```

See **[the documentation](https://tracealign.readthedocs.io/en/latest/)** for installation details, the full API, FAQs, and the design rationale.
## Quick start — multi-witness (v0.2)

```python
import tracealign

witnesses = {
"W1": tracealign.tokenize("שלום עולם רַבִּי דויד אמר", lang="hbo", seq_label="W1"),
"W2": tracealign.tokenize("שלום עולם רבי דוד אמר", lang="hbo", seq_label="W2"),
"W3": tracealign.tokenize("שלום עולם ר\"י אמר", lang="hbo", seq_label="W3"),
"W4": tracealign.tokenize("שלום עולם רבי דוד אמר טוב", lang="hbo", seq_label="W4"),
}

result = tracealign.align_multi(witnesses, lang="hbo")

print(result.guide_tree.format_text())
print(result.table.format_text())

for node in result.graph.variants():
readings = {wid: t.text for wid, t in node.tokens.items()}
print(node.id, readings)
```

The `MultiAlignmentResult` exposes a canonical `VariantGraph` (DAG with witness trails), a derived `AlignedTable` (re-anchorable to any witness for presentation), a `GuideTree` (UPGMA-built, carrying the original distance matrix — useful for downstream stemmatic work), and the same reproducibility-aware `params` snapshot the pairwise aligner produces.

JSON persistence works the same way as the pairwise aligner, in its own module:

```python
from tracealign.io import multi_result as mr_io

mr_io.dump(result, "alignment.json")
restored = mr_io.load("alignment.json")
```

See **[the documentation](https://tracealign.readthedocs.io/en/latest/)** for the full API, more usage examples, the algorithm details, FAQs, and the design rationale.

## Documentation

| Section | What it covers |
|---|---|
| [Installation](https://tracealign.readthedocs.io/en/latest/installation.html) | pip / from source / dev setup |
| [Usage](https://tracealign.readthedocs.io/en/latest/usage.html) | Tokenize, align, work with the result, custom lexica |
| [Details](https://tracealign.readthedocs.io/en/latest/details.html) | Tokenizer pipeline, scoring tiers, DP algorithm |
| [FAQ](https://tracealign.readthedocs.io/en/latest/faq.html) | Common questions about scope, language packs, performance |
| [Installation](https://tracealign.readthedocs.io/en/latest/installation.html) | pip / from source / dev setup / docs build |
| [Usage](https://tracealign.readthedocs.io/en/latest/usage.html) | Tokenize, pairwise align, multi-witness align, work with the result, custom lexica, I/O |
| [Details](https://tracealign.readthedocs.io/en/latest/details.html) | Tokenizer pipeline, scoring tiers, pairwise DP algorithm, multi-witness POA pipeline |
| [FAQ](https://tracealign.readthedocs.io/en/latest/faq.html) | Common questions about scope, language packs, performance, multi-witness semantics |
| [Contributing](https://tracealign.readthedocs.io/en/latest/contributing.html) | Development workflow, TDD discipline, branch model |

## Project status

| | |
|---|---|
| Current release | 0.1.1 |
| Roadmap | [docs/ROADMAP.md](docs/ROADMAP.md) |
| Design spec | [docs/superpowers/specs/2026-04-28-trace-v0.1-design.md](docs/superpowers/specs/2026-04-28-trace-v0.1-design.md) |
| Future sub-projects | Multi-witness master graph · Geniza anchor detection · Text-reuse · Critical edition / apparatus |
| Current PyPI release | 0.1.3 (v0.2.0 in flight on `feature/v0.2-multi-witness`) |
| Roadmap | [docs/ROADMAP.md](docs/ROADMAP.md) — ten-stage long-term vision |
| v0.1 design spec | [docs/superpowers/specs/2026-04-28-trace-v0.1-design.md](docs/superpowers/specs/2026-04-28-trace-v0.1-design.md) |
| v0.2 design spec | [docs/superpowers/specs/2026-05-21-trace-v0.2-multi-witness-design.md](docs/superpowers/specs/2026-05-21-trace-v0.2-multi-witness-design.md) |
| Released stages | 1 (pairwise + Hebrew pack) |
| In progress | 2 (master alignment graph / multi-witness) |
| Future sub-projects | Geniza anchor detection · Text-reuse · Apparatus / critical edition · Cross-tradition Hexapla · Stemmatic reconstruction · Allusion detection · Citation graphs · Reception history |

## License
## Citation

[MIT](LICENSE) © 2026 Benjamin Schnabel.
If you use TRACE in academic work, please cite via the [Zenodo concept DOI](https://doi.org/10.5281/zenodo.20315408) (always resolves to the latest archived release) or pick a specific version DOI from the Zenodo record. A `CITATION.cff` is at the repo root — GitHub's "Cite this repository" button generates APA / BibTeX / RIS automatically from it.

## Citation
## License

If you use TRACE in academic work, please cite the repository — a Zenodo DOI will follow with the first non-pre-release tag.
[MIT](LICENSE) © 2026 Benjamin Schnabel.
2 changes: 1 addition & 1 deletion docs/ROADMAP.md
@@ -34,7 +34,7 @@ The full ambition spans ten stages, each its own brainstorm → spec → plan
| # | Stage | Capability it unlocks | Status |
|---|---|---|---|
| 1 | **Pairwise aligner + Hebrew pack** | TRACE v0.1 — pairwise alignment kernel | ✅ released 0.1.3 |
| 2 | **Master alignment graph** | Simultaneous multi-witness alignment (Sifra full witness set, Tanhuma) | planned (v0.2) |
| 2 | **Master alignment graph** | Simultaneous multi-witness alignment (Sifra full witness set, Tanhuma) | in progress (v0.2 feature/v0.2-multi-witness) |
| 3 | **Geniza fragment anchor detection** | Matching small fragments against a large candidate pool (hundreds of Sifra Genizah fragments) | planned |
| 4 | **Text-reuse detection** | Finding recurring phrases and verbatim citations across a corpus (biblical citations in rabbinic literature, recurring rabbinic formulae) | planned |
| 5 | **Apparatus / critical-edition generation** | Producing publication-grade critical editions (lemmas, sigla, running text) directly from alignment output | planned |
52 changes: 52 additions & 0 deletions docs/details.md
@@ -198,3 +198,55 @@ src/tracealign/
escriptorium.py # eScriptorium JSON importer
tei.py # TEI XML importer
```

## Multi-witness alignment (v0.2)

`align_multi` extends the pairwise aligner to N witnesses. The pipeline is three-phase:

### Phase 1 — Pairwise distances

Every pair of witnesses is aligned with `tracealign.align()` (the v0.1 pairwise aligner) and the distance is computed as `1 − total_score`. The result is a symmetric `N × N` distance matrix; the diagonal is zero. Witness ids are sorted lexicographically before computing, making the matrix independent of dict insertion order.
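
A minimal, self-contained sketch of this phase follows. It does not call TRACE: `toy_score` is a stand-in similarity built on `difflib`, and `pairwise_distance_matrix` is a hypothetical helper name, not the library's `pairwise_distances` signature. It only illustrates the two points made above: distance as `1 − score`, and sorting witness ids so the matrix does not depend on dict insertion order.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_score(a: list[str], b: list[str]) -> float:
    """Stand-in similarity in [0, 1]. NOT the TRACE tiered scorer."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pairwise_distance_matrix(
    witnesses: dict[str, list[str]],
) -> tuple[list[str], list[list[float]]]:
    """Return (sorted ids, symmetric distance matrix with a zero diagonal)."""
    ids = sorted(witnesses)  # canonical order, independent of dict order
    n = len(ids)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):  # each unordered pair once
        d = 1.0 - toy_score(witnesses[ids[i]], witnesses[ids[j]])
        matrix[i][j] = matrix[j][i] = d  # symmetric by construction
    return ids, matrix
```

Because each unordered pair is scored once and mirrored, the matrix is symmetric even if the underlying scorer is not.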

### Phase 2 — UPGMA guide tree

A binary guide tree is built from the distance matrix using **UPGMA** (Unweighted Pair Group Method with Arithmetic Mean). At every iteration the closest cluster pair is merged. Ties are broken on the canonical `(min, max)` lexicographic order of cluster members, guaranteeing determinism. The tree's `height` field carries the cumulative UPGMA distance — a starting point for later stemmatic work.
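
The clustering loop can be sketched in plain Python as below. This is an illustration under stated assumptions, not the library's `build_upgma`: the nested-tuple tree shape, the function name `upgma_sketch`, and the exact reading of the tie-break rule (compare the smaller member tuple first, then the larger) are all hypothetical. Each inner node stores the average distance at which its two children were merged; how TRACE converts that into its `height` field is not shown here.

```python
from itertools import combinations


def upgma_sketch(ids, dist):
    """Build a binary tree: leaf = witness id, inner = (left, right, merge_distance)."""
    index = {wid: k for k, wid in enumerate(ids)}
    # Key: sorted tuple of member ids. Value: subtree built so far.
    clusters = {(wid,): wid for wid in sorted(ids)}

    def average(a, b):
        # UPGMA cluster distance: arithmetic mean over all leaf pairs.
        return sum(dist[index[x]][index[y]] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Smallest distance wins; ties fall back to the (min, max) member tuples.
        d, lo, hi = min(
            (average(a, b), min(a, b), max(a, b))
            for a, b in combinations(clusters, 2)
        )
        left, right = clusters.pop(lo), clusters.pop(hi)
        clusters[tuple(sorted(lo + hi))] = (left, right, d)

    (tree,) = clusters.values()
    return tree
```

Since the selection key is a total order over candidate pairs, two pairs at the same distance are always resolved the same way, whatever order the ids were supplied in.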

### Phase 3 — Progressive POA-based merge

The guide tree is walked in post-order to produce a canonical merge sequence (closely-related witnesses are merged first). The first witness seeds the graph as a linear chain. Each subsequent witness is aligned to the current graph via **partial-order alignment (POA)** — a DP over the topologically sorted graph nodes. The DP has three transitions:

| Transition | Effect on graph |
|---|---|
| Match | Merge the new token into an existing node's `tokens[witness_id]`; extend the witness set on the incoming edge. |
| Insertion in sequence (gap in graph) | Add a new node holding only this witness's token; new edge `prev → new`. |
| Deletion (skip graph node) | The new witness's path bypasses this node — recorded by an edge that skips it. |

`node_match_score` aggregates the per-constituent tiered score across the witnesses already in the target node. The default mode `"max"` is permissive (CollateX-aligned); `"mean"` and `"min"` are configurable.
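
To make the three modes concrete, here is a small sketch of the aggregation step. The function name `aggregate_node_score`, its signature, and the `toy_token_score` scorer are assumptions made for illustration; the real `node_match_score` works on TRACE tokens and the tiered scorer.

```python
from statistics import mean

# The three aggregation modes named in the text.
AGGREGATORS = {"max": max, "mean": mean, "min": min}


def toy_token_score(a: str, b: str) -> float:
    """Stand-in scorer: 1.0 for identical strings, 0.5 for a shared first letter."""
    if a == b:
        return 1.0
    return 0.5 if a[:1] == b[:1] else 0.0


def aggregate_node_score(token: str, node_tokens: dict[str, str], mode: str = "max") -> float:
    """Score `token` against every witness reading already stored in a node."""
    scores = [toy_token_score(token, other) for other in node_tokens.values()]
    return AGGREGATORS[mode](scores)


node = {"W1": "דוד", "W2": "דויד"}  # one node holding two spellings
print(aggregate_node_score("דוד", node, "max"))   # best-matching reading decides
print(aggregate_node_score("דוד", node, "mean"))  # average over readings
print(aggregate_node_score("דוד", node, "min"))   # worst-matching reading decides
```

With `"max"`, a new token can join a node as soon as it matches any one reading there, which is why that mode is the permissive one.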

### Correctness guarantees

Two properties are pinned by tests:

- **Lossless reconstruction.** For every input witness `w`, the path through the result graph yields exactly the original token sequence.
- **Permutation invariance.** The same set of witnesses in any input dict order produces the same alignment (same witness paths, same variant loci).
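
The lossless-reconstruction property can be illustrated on a hand-built toy graph. The data layout below (plain dicts, `START`/`END` sentinels, a `read_witness` helper) is hypothetical and far simpler than TRACE's `VariantGraph`; it only shows what "the path through the graph yields the original token sequence" means, including a skip edge for a witness that omits a word.

```python
START, END = "^", "$"  # sentinel node ids for this toy layout

# Node id -> {witness id: token text}. Node 2 is attested by W1 only.
nodes = {
    1: {"W1": "שלום", "W2": "שלום"},
    2: {"W1": "עולם"},
    3: {"W1": "רבי", "W2": "רבי"},
}
# (source, target) -> witnesses travelling along that edge.
edges = {
    (START, 1): {"W1", "W2"},
    (1, 2): {"W1"},
    (2, 3): {"W1"},
    (1, 3): {"W2"},  # W2 bypasses node 2 (the "deletion" transition)
    (3, END): {"W1", "W2"},
}


def read_witness(nodes, edges, wid):
    """Follow the edges labelled with `wid` and collect that witness's tokens."""
    successor = {src: dst for (src, dst), wids in edges.items() if wid in wids}
    tokens, current = [], successor[START]
    while current != END:
        tokens.append(nodes[current][wid])
        current = successor[current]
    return tokens


originals = {"W1": ["שלום", "עולם", "רבי"], "W2": ["שלום", "רבי"]}
for wid, expected in originals.items():
    assert read_witness(nodes, edges, wid) == expected
```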

### Data flow

```
align_multi(witnesses, lang, config)
pairwise_distances — Phase 1: N(N−1)/2 pairwise alignments
build_upgma — Phase 2: deterministic binary tree
progressive_merge — Phase 3: post-order POA-based merge
VariantGraph
├──► AlignedTable — derived view, re-anchorable
└──► MultiAlignmentResult (graph + table + guide_tree + summary + params)
```
41 changes: 40 additions & 1 deletion docs/faq.md
@@ -88,4 +88,43 @@ Not specced yet. Candidates from the v0.1 spec:
- Per-project editorial-bracket preset bundles.
- Performance pass (NumPy vectorization or Cython hot path).

Plus the four long-term sub-projects: master alignment graph, Geniza anchor detection, text-reuse, apparatus generation.
The master alignment graph (multi-witness alignment) shipped as v0.2 — see below. Future long-term stages: Geniza anchor detection, text-reuse, apparatus generation, cross-tradition Hexapla, stemmatic reconstruction, allusion detection, citation graphs, reception history.

## How does multi-witness alignment differ from pairwise?

`tracealign.align()` aligns exactly two witnesses. `tracealign.align_multi()` (v0.2) aligns N witnesses at once into a single canonical structure — a variant graph (DAG) where every witness has a trail through the graph, plus a derived aligned table view. Variant loci surface as nodes whose constituent witnesses disagree.

For two witnesses the two functions give similar information; for three or more the multi-witness graph is much more useful than running every pair separately, because it gives one consistent set of variant positions rather than O(N²) overlapping pairwise alignments.

## Is `align_multi` deterministic?

Yes. The result is independent of the dict insertion order of the witnesses. Three sources of order-stability are pinned by tests:

1. `pairwise_distances` sorts witness ids lexicographically before computing the matrix.
2. UPGMA tie-breaking uses the canonical `(min, max)` lexicographic order of cluster members.
3. The topological sort during sequence-vs-graph alignment is stable with respect to node id.

A dedicated property test (`test_permutation_invariance`) re-runs `align_multi` with reordered inputs and asserts that witness paths and variant loci are identical.
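
As an illustration of the third point, one common way to get a topological order that does not depend on insertion order is Kahn's algorithm with a min-heap of ready nodes. This is a sketch under that assumption; TRACE's own helper may be implemented differently, and `stable_topological_order` is a hypothetical name.

```python
import heapq


def stable_topological_order(nodes, edges):
    """Kahn's algorithm; among ready nodes the smallest id is always emitted first."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    ready = [n for n, deg in indegree.items() if deg == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)  # deterministic choice by node id
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order


# The same DAG given in two different insertion orders yields the same result.
print(stable_topological_order([3, 1, 2, 4], [(1, 3), (1, 2), (2, 4), (3, 4)]))
print(stable_topological_order([4, 2, 1, 3], [(3, 4), (2, 4), (1, 2), (1, 3)]))
```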

## How big can multi-witness alignments get?

The v0.2 target is Sifra-scale: 5–15 witnesses, 1000–5000 tokens each. Larger witness sets (NT-scale, hundreds of witnesses) need anchor-based decomposition, which is a future stage. Geniza fragments specifically are handled in their own future stage (anchor detection against a large candidate pool), not by adding them all to one master graph.

## Why UPGMA and not Neighbor-Joining for the guide tree?

UPGMA is simpler and gives a binary tree with clear cumulative-distance heights — useful as a draft stemma input for the eventual stemmatic-reconstruction stage. UPGMA's "molecular clock" assumption is a known limitation in phylogenetics but is acceptable for ordering the merge sequence in v0.2. Neighbor-Joining is a future v0.x candidate when proper stemmatic reconstruction goes live.

## Can I add a new witness to an existing alignment incrementally?

Not in v0.2.0 — `align_multi` builds the entire graph in a single call. An incremental "add one witness" API is a v0.2.x candidate; it builds naturally on the existing `align_sequence_to_graph` primitive but requires API design (e.g. should the guide tree be re-balanced? should existing alignment relationships be allowed to change?). Open a discussion or issue if you need this.

## How do I persist a multi-witness result?

```python
from tracealign.io import multi_result as mr_io

mr_io.dump(result, "alignment.json")
restored = mr_io.load("alignment.json")
```

`tracealign.io.multi_result` is a dedicated module separate from `tracealign.io.result` (the pairwise JSON I/O). The round-trip preserves the entire result, including the guide tree's distance matrix — important for later stages that reuse it.
11 changes: 6 additions & 5 deletions docs/index.md
@@ -1,17 +1,18 @@
# TRACE

**Textual Reuse, Alignment, and Collation Engine** — a Python library for pairwise philological alignment with pluggable language packs.
**Textual Reuse, Alignment, and Collation Engine** — a Python library for philological alignment with pluggable language packs. Pairwise (v0.1) and simultaneous multi-witness (v0.2) alignment.

TRACE is built for textual criticism, manuscript witness comparison, and the creation of digital synopses and critical editions. The core is language-agnostic; the first shipped language pack covers Biblical and Rabbinic Hebrew (`hbo`).

## At a glance

- **Tokenizer pipeline** with editorial-marker awareness (`[reconstructed]`, `⟦deletion⟧`, `〈insertion〉`, `(expanded)`, lacunae).
- **Tiered scoring** that returns *(score, reason)* per token pair — `EXACT`, `NIQQUD_STRIPPED`, `PLENE_DEFECTIVE`, `ABBREVIATION`, `ORTHOGRAPHIC`, `INSERTION`, `OMISSION`, `NO_MATCH`.
- **Semi-global Needleman–Wunsch** with affine gap penalties (Gotoh) and a multi-token abbreviation lookahead (`ר"י` ↔ `רבי ישמעאל`).
- **Pairwise aligner** — semi-global Needleman–Wunsch with affine gap penalties (Gotoh) and a multi-token abbreviation lookahead (`ר"י` ↔ `רבי ישמעאל`).
- **Multi-witness aligner** (v0.2) — N witnesses aligned simultaneously into a canonical variant graph plus a derived aligned table, via pairwise distances → UPGMA guide tree → POA-based progressive merge. Determinism and lossless reconstruction are pinned by property tests.
- **Hebrew language pack** with niqqud strip, plene/defective skeleton matching, gershayim/maqqef tokenizer hooks, and a seed lexicon of rabbinic abbreviations (extendable via `Lexica.merge()`).
- **I/O** for plain text, JSON (round-trip), eScriptorium exports, and TEI XML.
- **Reproducible**: every `AlignmentResult` carries `trace_version` and `language_pack_version` in its params.
- **I/O** for plain text, JSON (round-trip for both pairwise and multi-witness results), eScriptorium exports, and TEI XML.
- **Reproducible**: every `AlignmentResult` / `MultiAlignmentResult` carries `trace_version` and `language_pack_version` in its params.

## Get going

@@ -28,7 +29,7 @@ contributing

## Project status

TRACE is an early-stage research library. v0.1.x ships the pairwise aligner and the Hebrew pack; future sub-projects cover multi-witness master graphs, Geniza fragment anchor detection, text-reuse detection, and apparatus / critical-edition generation. See the [roadmap](https://github.com/bsesic/trace/blob/main/docs/ROADMAP.md) for the long-term plan.
TRACE is an early-stage research library. v0.1.x ships the pairwise aligner and the Hebrew pack; v0.2 adds the multi-witness master alignment graph. Future stages cover Geniza fragment anchor detection, text-reuse detection, apparatus / critical-edition generation, cross-tradition Hexapla-style alignment, stemmatic reconstruction, allusion detection, citation graphs, and multi-millennial reception history. See the [roadmap](https://github.com/bsesic/trace/blob/main/docs/ROADMAP.md) for the long-term ten-stage plan.

## License
