Cross-lingual alignment path: anchor-based mode + embedding-fill delegation hook #17

Description

@bsesic

Summary

Add a cross-lingual alignment path to TRACE: an anchor-based alignment mode plus a
delegation hook for filling the spans between anchors with an external embedding aligner.
This is the central problem in the scholarly-editions use case (Judeo-Arabic ↔ Tibbonide
Hebrew, Arabic ↔ Latin) and is roadmap Stage 6 (cross-tradition / Hexapla-style), pulled
forward by a concrete adopter need.

Motivation

TRACE today is monolingual: all scoring tiers assume same-language token comparison. A
translation shares essentially no surface form with its source, so Needleman-Wunsch (NW)
alignment over the token stream has nothing to score and does not apply. The robust,
defensible approach for this material is a hybrid:

  1. a high-precision anchor net — named entities, numerals, divine names, citation formulae,
    and a calque-cognate lexicon bootstrapped from the text (the Tibbonides' mechanical calquing
    makes this unusually tractable) — as the alignment skeleton;
  2. an embedding aligner (dicta / LaBSE / Bertalign-style) filling only the short monotonic
    spans between anchors, where out-of-distribution embeddings still behave.
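
Anchors only fence monotonic spans if no two of them cross. One possible way to enforce that, sketched here as an assumption and not as TRACE's method, is to keep the longest chain of anchor pairs that increases strictly in both sequences (a longest-increasing-subsequence pass). The name `monotonic_chain` and the `(index_a, index_b)` pair shape are hypothetical.

```python
from bisect import bisect_left

AnchorPair = tuple[int, int]  # (token index in sequence A, token index in sequence B)


def monotonic_chain(pairs: list[AnchorPair]) -> list[AnchorPair]:
    """Keep the longest subset of pairs that increases strictly in A and in B."""
    # Sort by A ascending and B descending, so two pairs that share an A index
    # can never both end up in a strictly increasing chain over B.
    ordered = sorted(set(pairs), key=lambda p: (p[0], -p[1]))
    tails: list[int] = []      # tails[k]: smallest B index ending a chain of length k + 1
    tail_pos: list[int] = []   # position in `ordered` of that chain end
    prev = [-1] * len(ordered)  # back-pointers for reconstructing the chain
    for pos, (_, b) in enumerate(ordered):
        k = bisect_left(tails, b)  # bisect_left gives a strict increase in B
        if k == len(tails):
            tails.append(b)
            tail_pos.append(pos)
        else:
            tails[k] = b
            tail_pos[k] = pos
        prev[pos] = tail_pos[k - 1] if k > 0 else -1
    chain: list[AnchorPair] = []
    pos = tail_pos[-1] if tail_pos else -1
    while pos != -1:
        chain.append(ordered[pos])
        pos = prev[pos]
    return chain[::-1]
```

Dropping crossing anchors trades recall for a guarantee that every between-anchor span is well-formed. A real implementation might prefer to weight anchors by extractor confidence instead of treating them all equally.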

TRACE's reason-tagged, standoff variant-graph model is the natural home for the anchor skeleton;
the embedding fill should be an external adapter TRACE delegates to, not a built-in dependency.

Scope

  • Anchor detection / alignment mode: a cross-lingual alignment entry point that aligns two
    sequences in different languages by first matching anchors (pluggable anchor extractors:
    numerals, NE lists, citation-formula patterns, a user-supplied bilingual lemma/calque
    lexicon), producing fixed correspondence points expressed as matches with a new ANCHOR
    reason tag.
  • Between-anchor fill via delegation: define a clean interface (a callback/protocol) so the
    caller supplies an embedding-aligner adapter for the spans between anchors. TRACE constrains
    it to short monotonic windows fenced by anchors; TRACE does not vendor any embedding model.
  • Output: anchored + filled alignment in the existing standoff/variant-graph representation,
    every match carrying its reason (ANCHOR vs. delegated-fill) and the producing adapter id, so
    the result stays auditable.
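
A minimal sketch of what pluggable anchor extractors and ANCHOR-tagged matches could look like. Everything here is an assumption for illustration: the `Match` record, the `AnchorExtractor` protocol, the extractor ids, the `"ar"`/`"he"` language codes, and the rule that a key only becomes an anchor when it occurs exactly once on each side (chosen to keep precision high). TRACE's actual standoff model may look different.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence

ANCHOR = "ANCHOR"  # the new reason tag proposed in this issue


@dataclass(frozen=True)
class Match:
    """Hypothetical match record, not TRACE's real data model."""
    index_a: int
    index_b: int
    reason: str
    producer: str  # id of the extractor that produced the match


class AnchorExtractor(Protocol):
    """Hypothetical plug-in interface: map a token to a language-neutral key."""
    extractor_id: str

    def key(self, token: str, lang: str) -> Optional[str]: ...


class NumeralExtractor:
    extractor_id = "numerals"

    def key(self, token: str, lang: str) -> Optional[str]:
        # str.isdecimal() and int() both accept Unicode decimal digits, so
        # Arabic-Indic and ASCII digit strings normalise to the same key.
        # Alphabetic numerals (e.g. Hebrew letter values) are not handled here.
        return str(int(token)) if token.isdecimal() else None


class LexiconExtractor:
    extractor_id = "lexicon"

    def __init__(self, pairs: Sequence[tuple[str, str]], lang_a: str, lang_b: str):
        # Each user-supplied (form_a, form_b) pair shares one key across languages.
        self._keys: dict[tuple[str, str], str] = {}
        for form_a, form_b in pairs:
            shared = f"{form_a}={form_b}"
            self._keys[(lang_a, form_a)] = shared
            self._keys[(lang_b, form_b)] = shared

    def key(self, token: str, lang: str) -> Optional[str]:
        return self._keys.get((lang, token))


def _unique_keys(seq: Sequence[str], lang: str, ex: AnchorExtractor) -> dict[str, int]:
    """Map each key that occurs exactly once in `seq` to its token index."""
    seen: dict[str, int] = {}
    repeated: set[str] = set()
    for i, token in enumerate(seq):
        k = ex.key(token, lang)
        if k is None:
            continue
        if k in seen:
            repeated.add(k)
        seen[k] = i
    return {k: i for k, i in seen.items() if k not in repeated}


def match_anchors(seq_a, seq_b, lang_a, lang_b, extractors) -> list[Match]:
    """Pair tokens whose key is unambiguous on both sides; tag them ANCHOR."""
    matches = []
    for ex in extractors:
        keys_a = _unique_keys(seq_a, lang_a, ex)
        keys_b = _unique_keys(seq_b, lang_b, ex)
        for k in keys_a.keys() & keys_b.keys():
            matches.append(Match(keys_a[k], keys_b[k], ANCHOR, ex.extractor_id))
    return sorted(matches, key=lambda m: (m.index_a, m.index_b))


# Tiny synthetic fixture: one numeral and one lexicon pair should anchor.
arabic = ["في", "سنة", "٣٤", "قال", "الفيلسوف"]
hebrew = ["בשנת", "34", "אמר", "הפילוסוף"]
extractors = [NumeralExtractor(), LexiconExtractor([("قال", "אמר")], "ar", "he")]
anchors = match_anchors(arabic, hebrew, "ar", "he", extractors)
```

Recording the extractor id on every match is what keeps the anchor skeleton auditable: a reviewer can see which rule fixed each correspondence point.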

Acceptance criteria

  • A documented cross-lingual API (e.g. align_crosslingual(seq_a, seq_b, lang_a, lang_b, anchors=..., fill_adapter=...)).
  • Anchor extractors for numerals and a user-supplied bilingual lexicon, with tests on a small synthetic Arabic↔Hebrew fixture.
  • A no-op / pluggable fill adapter interface with a reference stub (so the path is testable without an embedding model dependency).
  • ANCHOR reason tag added; provenance of fill matches recorded.
  • No new heavy/runtime ML dependency added to the core package (embedding adapters live behind the interface, optional extras at most).
  • Developed test-first (TDD); test suite green on Python 3.10, 3.11, and 3.12; flake8 clean.
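
The criteria above could be met by something shaped like the following sketch. It is illustrative only: the `FillAdapter` protocol, the `NoOpFill` stub, the `DELEGATED_FILL` tag name, the `Match` record, and the choice to pass `anchors` as precomputed `(index_a, index_b, producer_id)` triples are all assumptions, and only the `align_crosslingual(...)` parameter names come from this issue. The entry point hands the adapter nothing but the spans between consecutive anchors and rejects any fill that leaves its window or is not monotonic.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


@dataclass(frozen=True)
class Match:
    """Hypothetical match record, not TRACE's real data model."""
    index_a: int
    index_b: int
    reason: str    # "ANCHOR" or "DELEGATED_FILL" (the latter name is assumed)
    producer: str  # extractor id for anchors, adapter id for fills


class FillAdapter(Protocol):
    """Hypothetical interface the caller implements around an embedding aligner."""
    adapter_id: str

    def align(self, span_a: Sequence[str], span_b: Sequence[str],
              lang_a: str, lang_b: str) -> list[tuple[int, int]]:
        """Return span-relative (i, j) pairs, strictly increasing in both."""
        ...


class NoOpFill:
    """Reference stub: leaves every between-anchor span unaligned."""
    adapter_id = "noop"

    def align(self, span_a, span_b, lang_a, lang_b):
        return []


def align_crosslingual(seq_a, seq_b, lang_a, lang_b, *, anchors,
                       fill_adapter: Optional[FillAdapter] = None) -> list[Match]:
    adapter = fill_adapter if fill_adapter is not None else NoOpFill()
    # A closing boundary past the last token lets the loop also fill the tail span.
    bounds = [(a, b, p, True) for a, b, p in anchors]
    bounds.append((len(seq_a), len(seq_b), "", False))
    matches: list[Match] = []
    prev_a, prev_b = -1, -1
    for a, b, producer, is_anchor in bounds:
        if a <= prev_a or b <= prev_b:
            raise ValueError("anchors must be in range and strictly increasing")
        start_a, start_b = prev_a + 1, prev_b + 1
        span_a, span_b = seq_a[start_a:a], seq_b[start_b:b]
        if span_a and span_b:
            last_i, last_j = -1, -1
            for i, j in adapter.align(span_a, span_b, lang_a, lang_b):
                # Enforce the fence: pairs stay inside the window and never cross.
                if not (last_i < i < len(span_a) and last_j < j < len(span_b)):
                    raise ValueError(f"{adapter.adapter_id}: pair outside window or non-monotonic")
                matches.append(Match(start_a + i, start_b + j,
                                     "DELEGATED_FILL", adapter.adapter_id))
                last_i, last_j = i, j
        if is_anchor:
            matches.append(Match(a, b, "ANCHOR", producer))
        prev_a, prev_b = a, b
    return matches
```

With the no-op stub, the result contains only the anchors, so the whole path can be tested without any embedding model installed. A real adapter would wrap an external aligner behind the same `align` method.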

Notes / risks

Roadmap: brings forward Stage 6 (cross-tradition alignment).

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)
