Summary
Add an optional, language-pack-driven clause/colon chunking module that emits boundaries as
standoff offsets over the unsegmented base text (never baked into the text). Needed only for
the cross-lingual alignment path (#17); same-language collation should continue to
align the token stream directly without pre-segmentation.
Motivation
Cross-lingual cola alignment needs sense-unit segmentation, but medieval Hebrew/Arabic prose has
no interpunctuation and is paratactic — it marks structure with connective particles, not
punctuation. ML segmenters trained on Modern Standard Arabic / Modern Hebrew newswire perform
badly here. A transparent, debuggable, rule-based particle splitter matches how these texts
actually segment themselves and needs no training data.
Design principle: deliberately over-segment and let the aligner merge — too-small cola still
align; missing boundaries silently swallow misalignments. And record boundaries as standoff so
the scheme can be revised per genre without rebuilding the text.
Scope
- A
chunk(tokens, lang=...) API returning span annotations (start/end token indices or source
offsets), not mutated text.
- Tiered particle inventory per language pack — a new optional hook on
LanguagePack:
- Hard boundaries (split confidently): Arabic
fa-, thumma, ḥattā, the ammā … fa-
frame, qāla/wa-qāla pivots; Hebrew relative she-/asher, ki, clear infinitival/
participial pivots.
- Weak/candidate boundaries (licensed contexts only): bare
wa- / conjunctive waw — split
only in licensed contexts (e.g. wa- + finite verb, wa-kāna, wa-qāla), never wa- + noun.
- Genre-tunable: the particle tiers should be parameterizable, since sajʿ / narrative / legal /
philosophical prose want different rules.
Acceptance criteria
Notes
Roadmap: supporting component for Stage 6; complements (does not replace) the token-stream
collation approach.
Summary
Add an optional, language-pack-driven clause/colon chunking module that emits boundaries as
standoff offsets over the unsegmented base text (never baked into the text). Needed only for
the cross-lingual alignment path (#17); same-language collation should continue to
align the token stream directly without pre-segmentation.
Motivation
Cross-lingual cola alignment needs sense-unit segmentation, but medieval Hebrew/Arabic prose has
no interpunctuation and is paratactic — it marks structure with connective particles, not
punctuation. ML segmenters trained on Modern Standard Arabic / Modern Hebrew newswire perform
badly here. A transparent, debuggable, rule-based particle splitter matches how these texts
actually segment themselves and needs no training data.
Design principle: deliberately over-segment and let the aligner merge — too-small cola still
align; missing boundaries silently swallow misalignments. And record boundaries as standoff so
the scheme can be revised per genre without rebuilding the text.
Scope
chunk(tokens, lang=...)API returning span annotations (start/end token indices or sourceoffsets), not mutated text.
LanguagePack:fa-,thumma,ḥattā, theammā … fa-frame,
qāla/wa-qālapivots; Hebrew relativeshe-/asher,ki, clear infinitival/participial pivots.
wa-/ conjunctive waw — splitonly in licensed contexts (e.g.
wa-+ finite verb,wa-kāna,wa-qāla), neverwa-+ noun.philosophical prose want different rules.
Acceptance criteria
chunk()produces standoff spans over an unsegmented base; original token stream/text is unchanged.hbopack with tests; Arabic tiers land with the Arabic pack (Arabic language pack (ara) — proclitic tokenization + orthographic normalization #16).wa-/waw weak-boundary licensing is covered by explicit positive/negative tests.flake8clean.Notes
for the Arabic side.
Roadmap: supporting component for Stage 6; complements (does not replace) the token-stream
collation approach.