Add Arabic language pack (ara): proclitic tokenization + orthographic normalization #21

Open
bsesic wants to merge 8 commits into develop from feature/arabic-language-pack

bsesic (Owner) commented Jun 25, 2026

Summary

Adds the Arabic language pack (ara) — the first non-Hebrew language pack, exercising and validating the LanguagePack abstraction. Closes #16.

Rule-based and dependency-light (stdlib + existing rapidfuzz only), mirroring the structure of src/tracealign/lang/hebrew/.

What's included

  • Tokenization (tokenize.py): conservative proclitic splitting — splits the definite article ال, single-letter proclitics (و/ف/ب/ك) only before the article, and the لل (li- + article) contraction. Deliberately does not amputate word-initial radicals (وزير, باب, كتاب). Spans are contiguous (no separator character).
  • Normalization (normalize.py): strip_tashkil (NFC, remove combining marks + tatweel) into text; orthographic skeleton folding (alif variants → ا, ة → ه, ى → ي, hamza seats, bare hamza dropped). Diplomatic form preserved in Token.raw.
  • Scoring (scoring.py): four tiers, tried in order: EXACT → DIACRITICS_STRIPPED → ORTHOGRAPHIC_VARIANT (skeleton) → ORTHOGRAPHIC (fuzzy).
  • Reason enum: two new script-neutral values — DIACRITICS_STRIPPED, ORTHOGRAPHIC_VARIANT.
  • Pack + registration: ArabicLanguagePack (code="ara", aliases=("arabic",), version="ara-0.1.0", empty Lexica()), wired via the side-effect-import + _BUILTIN_PACK_MODULES pattern.
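
The tokenization and normalization rules listed above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in tokenize.py or normalize.py. Only `strip_tashkil` is a name taken from this PR; `skeleton`, `split_proclitics`, `pieces`, and `MIN_STEM` are hypothetical, and the hamza-seat targets, the handling of لل as a single prefix span, and the two-character minimum stem are assumptions made for the example.

```python
import unicodedata

TATWEEL = "\u0640"

def strip_tashkil(text: str) -> str:
    """NFC-normalize, then drop combining marks and tatweel."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if unicodedata.combining(ch) == 0 and ch != TATWEEL
    )

# Assumed folding table; the PR does not spell out the hamza-seat targets.
_FOLD = str.maketrans({
    "\u0622": "\u0627",  # alif with madda -> bare alif
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0671": "\u0627",  # alif wasla -> bare alif
    "\u0629": "\u0647",  # ta marbuta -> ha
    "\u0649": "\u064a",  # alif maqsura -> ya
    "\u0624": "\u0648",  # hamza on waw -> waw (assumed)
    "\u0626": "\u064a",  # hamza on ya -> ya (assumed)
    "\u0621": None,      # bare hamza is dropped
})

def skeleton(text: str) -> str:
    """Orthographic skeleton: diacritics stripped, then variants folded."""
    return strip_tashkil(text).translate(_FOLD)

ARTICLE = "\u0627\u0644"               # ال
LI_ARTICLE = "\u0644\u0644"            # لل (li- + article)
PROCLITICS = "\u0648\u0641\u0628\u0643"  # و ف ب ك
MIN_STEM = 2  # assumed guard: never leave a stem shorter than this

def split_proclitics(word: str) -> list[tuple[int, int]]:
    """Return contiguous (start, end) spans covering the whole word."""
    if not word:
        return []
    if word[0] in PROCLITICS and word.startswith(ARTICLE, 1):
        bounds = [0, 1, 3]   # proclitic + article + stem
    elif word.startswith(ARTICLE) or word.startswith(LI_ARTICLE):
        bounds = [0, 2]      # article (or لل) + stem
    else:
        bounds = [0]         # no split: initial letter may be a radical
    if len(word) - bounds[-1] < MIN_STEM:
        bounds = [0]         # stem too short, keep the word whole
    bounds.append(len(word))
    return list(zip(bounds, bounds[1:]))

def pieces(word: str) -> list[str]:
    """Convenience view of the spans as substrings."""
    return [word[a:b] for a, b in split_proclitics(word)]
```

Because a single-letter proclitic is split only when the article follows it, `pieces("وزير")` returns the word whole, while `pieces("والكتاب")` yields three contiguous spans with no separator between them.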

Design notes

  • Conservative splitting chosen over maximal: over-splitting damages alignment; under-splitting is recoverable by the fuzzy tier. This also means no guard lexicon / data files are needed.
  • Full design rationale: docs/superpowers/specs/2026-06-24-arabic-language-pack-design.md.
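
To illustrate why under-splitting is recoverable, here is a rough sketch of a four-tier cascade in which an unsplit proclitic still matches at the fuzzy tier. It is not the code in scoring.py: the `Reason` class below is a hypothetical stand-in for the project's enum, `classify` and the 0.8 threshold are assumptions, the folding table is reduced, and `difflib.SequenceMatcher` stands in for rapidfuzz so the example needs no third-party package.

```python
import unicodedata
from difflib import SequenceMatcher
from enum import Enum
from typing import Optional

class Reason(Enum):  # hypothetical stand-in for the project's Reason enum
    EXACT = "exact"
    DIACRITICS_STRIPPED = "diacritics_stripped"
    ORTHOGRAPHIC_VARIANT = "orthographic_variant"
    ORTHOGRAPHIC = "orthographic"

def _strip(text: str) -> str:
    # NFC, then drop combining marks and tatweel (U+0640).
    text = unicodedata.normalize("NFC", text)
    return "".join(
        c for c in text if unicodedata.combining(c) == 0 and c != "\u0640"
    )

# Reduced folding table, enough for the examples below.
_FOLD = str.maketrans({
    "\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627",  # alif variants
    "\u0629": "\u0647",  # ta marbuta -> ha
    "\u0649": "\u064a",  # alif maqsura -> ya
})

def _fold(text: str) -> str:
    return _strip(text).translate(_FOLD)

def classify(a: str, b: str, threshold: float = 0.8) -> Optional[Reason]:
    """Return the first tier at which a and b match, or None."""
    if a == b:
        return Reason.EXACT
    if _strip(a) == _strip(b):
        return Reason.DIACRITICS_STRIPPED
    if _fold(a) == _fold(b):
        return Reason.ORTHOGRAPHIC_VARIANT
    # Fuzzy tier: similarity ratio over the folded skeletons.
    if SequenceMatcher(None, _fold(a), _fold(b)).ratio() >= threshold:
        return Reason.ORTHOGRAPHIC
    return None
```

With these assumptions, وكتاب (left unsplit because no article follows the و) compared with كتاب has a similarity ratio of 8/9, so the pair still lands in the fuzzy `ORTHOGRAPHIC` tier rather than being lost.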

Testing

  • 227 tests passing; new Arabic package has 33 tests (tokenize incl. negative + لل + كال cases, normalize, scoring tiers, registry, end-to-end align with language_pack_version in params).
  • flake8 src/ tests/ clean.
  • Verified green on Python 3.10 and 3.12.

Follow-up (non-blocking, from review)

  • Cross-pack consistency of the fuzzy-tier details["layer"] label (Arabic "fuzzy" vs Hebrew "orthographic") — left as a deliberate design question.

bsesic added 8 commits June 24, 2026 21:35
Design for issue #16: first non-Hebrew language pack. Conservative
proclitic splitting (article + single-letter proclitics before the
article only; no radical amputation) and granular reason tags
(DIACRITICS_STRIPPED, ORTHOGRAPHIC_VARIANT). No guard lexicon needed.

Refs #16

Six TDD tasks: Reason enum, normalize, tokenize (proclitics), scoring,
pack + registration, full-suite verification.

Refs #16