Skip to content

Language-pack-driven clause/colon chunker (standoff, particle-tiered) #18

Description

@bsesic

Summary

Add an optional, language-pack-driven clause/colon chunking module that emits boundaries as
standoff offsets over the unsegmented base text (never baked into the text). Needed only for
the cross-lingual alignment path (#17); same-language collation should continue to
align the token stream directly without pre-segmentation.

Motivation

Cross-lingual cola alignment needs sense-unit segmentation, but medieval Hebrew/Arabic prose has
no interpunctuation and is paratactic — it marks structure with connective particles, not
punctuation. ML segmenters trained on Modern Standard Arabic / Modern Hebrew newswire perform
badly here. A transparent, debuggable, rule-based particle splitter matches how these texts
actually segment themselves and needs no training data.

Design principle: deliberately over-segment and let the aligner merge — too-small cola still
align; missing boundaries silently swallow misalignments. And record boundaries as standoff so
the scheme can be revised per genre without rebuilding the text.

Scope

  • A chunk(tokens, lang=...) API returning span annotations (start/end token indices or source
    offsets), not mutated text.
  • Tiered particle inventory per language pack — a new optional hook on LanguagePack:
    • Hard boundaries (split confidently): Arabic fa-, thumma, ḥattā, the ammā … fa-
      frame, qāla/wa-qāla pivots; Hebrew relative she-/asher, ki, clear infinitival/
      participial pivots.
    • Weak/candidate boundaries (licensed contexts only): bare wa- / conjunctive waw — split
      only in licensed contexts (e.g. wa- + finite verb, wa-kāna, wa-qāla), never wa- + noun.
  • Genre-tunable: the particle tiers should be parameterizable, since sajʿ / narrative / legal /
    philosophical prose want different rules.

Acceptance criteria

  • chunk() produces standoff spans over an unsegmented base; original token stream/text is unchanged.
  • Hebrew tiered splitter implemented against the existing hbo pack with tests; Arabic tiers land with the Arabic pack (Arabic language pack (ara) — proclitic tokenization + orthographic normalization #16).
  • The wa-/waw weak-boundary licensing is covered by explicit positive/negative tests.
  • Output integrates as an input to the cross-lingual aligner (cola feed), validated on a small synthetic fixture.
  • TDD; suite green on 3.10/3.11/3.12; flake8 clean.

Notes

Roadmap: supporting component for Stage 6; complements (does not replace) the token-stream
collation approach.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions