Language-pack-driven clause/colon chunker (standoff, particle-tiered)

## Summary

Add an optional, language-pack-driven **clause/colon chunking module** that emits boundaries as
**standoff offsets over the unsegmented base text** (never baked into the text). Needed only for
the cross-lingual alignment path (#17); same-language collation should continue to
align the token stream directly without pre-segmentation.

## Motivation

Cross-lingual cola alignment needs sense-unit segmentation, but medieval Hebrew/Arabic prose has
no interpunctuation and is paratactic — it marks structure with connective particles, not
punctuation. ML segmenters trained on Modern Standard Arabic / Modern Hebrew newswire perform
badly here. A transparent, debuggable, **rule-based particle splitter** matches how these texts
actually segment themselves and needs no training data.

Design principle: **deliberately over-segment and let the aligner merge** — too-small cola still
align; missing boundaries silently swallow misalignments. And record boundaries as standoff so
the scheme can be revised per genre without rebuilding the text.

## Scope

- A `chunk(tokens, lang=...)` API returning span annotations (start/end token indices or source
  offsets), not mutated text.
- **Tiered particle inventory per language pack** — a new optional hook on `LanguagePack`:
  - *Hard boundaries* (split confidently): Arabic `fa-`, `thumma`, `ḥattā`, the `ammā … fa-`
    frame, `qāla`/`wa-qāla` pivots; Hebrew relative `she-`/`asher`, `ki`, clear infinitival/
    participial pivots.
  - *Weak/candidate boundaries* (licensed contexts only): bare `wa-` / conjunctive waw — split
    only in licensed contexts (e.g. `wa-` + finite verb, `wa-kāna`, `wa-qāla`), never `wa-` + noun.
- Genre-tunable: the particle tiers should be parameterizable, since sajʿ / narrative / legal /
  philosophical prose want different rules.

## Acceptance criteria

- [ ] `chunk()` produces standoff spans over an unsegmented base; original token stream/text is unchanged.
- [ ] Hebrew tiered splitter implemented against the existing `hbo` pack with tests; Arabic tiers land with the Arabic pack (#16).
- [ ] The `wa-`/waw weak-boundary licensing is covered by explicit positive/negative tests.
- [ ] Output integrates as an input to the cross-lingual aligner (cola feed), validated on a small synthetic fixture.
- [ ] TDD; suite green on 3.10/3.11/3.12; `flake8` clean.

## Notes

- Tokenization (proclitic splitting) is a prerequisite — depends on the Arabic pack (#16)
  for the Arabic side.
- This module is **opt-in**; the collation path must remain segmentation-free by default.

Roadmap: supporting component for **Stage 6**; complements (does not replace) the token-stream
collation approach.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Language-pack-driven clause/colon chunker (standoff, particle-tiered) #18

Summary

Motivation

Scope

Acceptance criteria

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Language-pack-driven clause/colon chunker (standoff, particle-tiered) #18

Description

Summary

Motivation

Scope

Acceptance criteria

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions