Add Arabic language pack (ara) — proclitic tokenization + orthographic normalization by bsesic · Pull Request #21 · bsesic/trace

bsesic · 2026-06-25T11:28:54Z

Summary

Adds the Arabic language pack (ara) — the first non-Hebrew language pack, exercising and validating the LanguagePack abstraction. Closes #16.

Rule-based and dependency-light (stdlib + existing rapidfuzz only), mirroring the structure of src/tracealign/lang/hebrew/.

What's included

Tokenization (tokenize.py): conservative proclitic splitting — splits the definite article ال, single-letter proclitics (و/ف/ب/ك) only before the article, and the لل (li- + article) contraction. Deliberately does not amputate word-initial radicals (وزير, باب, كتاب). Spans are contiguous (no separator character).
Normalization (normalize.py): strip_tashkil (NFC, then removal of combining marks and tatweel) produces text; orthographic skeleton folding maps alif variants → ا, ة → ه, ى → ي, folds hamza seats, and drops bare hamza. The diplomatic form is preserved in Token.raw.
Scoring (scoring.py): four tiers — EXACT → DIACRITICS_STRIPPED → ORTHOGRAPHIC_VARIANT (skeleton) → ORTHOGRAPHIC (fuzzy).
Reason enum: two new script-neutral values — DIACRITICS_STRIPPED, ORTHOGRAPHIC_VARIANT.
Pack + registration: ArabicLanguagePack (code="ara", aliases=("arabic",), version="ara-0.1.0", empty Lexica()), wired via the side-effect-import + _BUILTIN_PACK_MODULES pattern.
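The conservative splitting rule in the tokenization item can be sketched roughly as follows. This is an illustration only, not the code in tokenize.py: the names Piece, split_proclitics and MIN_STEM are hypothetical, the length guard value is an assumption, and treating لل as a single prefix piece is one possible reading of the description.

```python
from dataclasses import dataclass

# Definite article.
ARTICLE = "ال"
# Single-letter proclitics, split only when the article follows directly.
PROCLITICS = ("و", "ف", "ب", "ك")
# li- + article contraction, kept here as one prefix piece (assumption).
LI_ARTICLE = "لل"
# Hypothetical guard: never leave a stem shorter than this.
MIN_STEM = 2


@dataclass(frozen=True)
class Piece:
    text: str
    start: int  # offset into the original word
    end: int    # exclusive end offset


def split_proclitics(word: str) -> list[Piece]:
    """Split article-related prefixes; leave everything else untouched."""
    cuts = [0]
    pos = 0
    # A proclitic is cut off only if the article comes right after it,
    # so word-initial radicals are never amputated.
    if (word[:1] in PROCLITICS and word.startswith(ARTICLE, 1)
            and len(word) - 3 >= MIN_STEM):
        pos = 1
        cuts.append(pos)
    if word.startswith(ARTICLE, pos) and len(word) - pos - 2 >= MIN_STEM:
        pos += len(ARTICLE)
        cuts.append(pos)
    elif pos == 0 and word.startswith(LI_ARTICLE) and len(word) - 2 >= MIN_STEM:
        pos = len(LI_ARTICLE)
        cuts.append(pos)
    if cuts[-1] != len(word):
        cuts.append(len(word))
    # Consecutive cut points give contiguous spans with no separator.
    return [Piece(word[a:b], a, b) for a, b in zip(cuts, cuts[1:])]
```

Under these assumptions, والكتاب yields three pieces (و, ال, كتاب), while وزير, باب and كتاب come back as a single piece each, since none of them has the article after its first letter.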

Design notes

Conservative splitting chosen over maximal: over-splitting damages alignment; under-splitting is recoverable by the fuzzy tier. This also means no guard lexicon / data files are needed.
Full design rationale: docs/superpowers/specs/2026-06-24-arabic-language-pack-design.md.
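To make the "recoverable by the fuzzy tier" argument concrete, here is a minimal sketch of the four-tier cascade. It is not the code in normalize.py or scoring.py: Tier is a local stand-in for the Reason enum, classify, skeleton and the 0.8 threshold are hypothetical, the mapping of hamza seats to their carrier letters is an assumption, and difflib replaces rapidfuzz only so that the example stays stdlib-only.

```python
import unicodedata
from difflib import SequenceMatcher
from enum import Enum
from typing import Optional

TATWEEL = "\u0640"

# Skeleton folding table; the hamza-seat targets are an assumption.
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda -> bare alif
    "\u0671": "\u0627",  # alif wasla -> bare alif
    "\u0629": "\u0647",  # ta marbuta -> ha
    "\u0649": "\u064A",  # alif maqsura -> ya
    "\u0624": "\u0648",  # hamza on waw -> waw
    "\u0626": "\u064A",  # hamza on ya -> ya
    "\u0621": None,      # bare hamza is dropped
})


class Tier(Enum):  # stand-in for the Reason values named in this PR
    EXACT = "exact"
    DIACRITICS_STRIPPED = "diacritics_stripped"
    ORTHOGRAPHIC_VARIANT = "orthographic_variant"
    ORTHOGRAPHIC = "orthographic"


def strip_tashkil(s: str) -> str:
    """NFC-normalize, then drop combining marks and tatweel."""
    s = unicodedata.normalize("NFC", s)
    return "".join(c for c in s if not unicodedata.combining(c) and c != TATWEEL)


def skeleton(s: str) -> str:
    return strip_tashkil(s).translate(_FOLD)


def classify(a: str, b: str, threshold: float = 0.8) -> Optional[Tier]:
    """Return the first tier that matches, or None if nothing does."""
    if a == b:
        return Tier.EXACT
    if strip_tashkil(a) == strip_tashkil(b):
        return Tier.DIACRITICS_STRIPPED
    sa, sb = skeleton(a), skeleton(b)
    if sa == sb:
        return Tier.ORTHOGRAPHIC_VARIANT
    # Fuzzy fallback (difflib here; the pack itself depends on rapidfuzz).
    if SequenceMatcher(None, sa, sb).ratio() >= threshold:
        return Tier.ORTHOGRAPHIC
    return None
```

With this sketch, an unsplit وكتاب still matches كتاب at the fuzzy tier (similarity 8/9 under difflib), which is the sense in which under-splitting stays recoverable, whereas a wrongly amputated radical would have produced two tokens that match nothing.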

Testing

227 tests passing; new Arabic package has 33 tests (tokenize incl. negative + لل + كال cases, normalize, scoring tiers, registry, end-to-end align with language_pack_version in params).
flake8 src/ tests/ clean.
Verified green on Python 3.10 and 3.12.

Follow-up (non-blocking, from review)

Cross-pack consistency of the fuzzy-tier details["layer"] label (Arabic "fuzzy" vs Hebrew "orthographic") — left as a deliberate design question.

Design for issue #16: first non-Hebrew language pack. Conservative proclitic splitting (article + single-letter proclitics before the article only; no radical amputation) and granular reason tags (DIACRITICS_STRIPPED, ORTHOGRAPHIC_VARIANT). No guard lexicon needed. Refs #16

bsesic added 8 commits June 24, 2026 21:35

docs(plan): Arabic language pack (ara) implementation plan

82c9ce3

Six TDD tasks: Reason enum, normalize, tokenize (proclitics), scoring, pack + registration, full-suite verification. Refs #16

feat(model): add DIACRITICS_STRIPPED and ORTHOGRAPHIC_VARIANT reasons

831bef9

Refs #16

feat(lang/arabic): tashkil strip and orthographic skeleton folding

0f912de

Refs #16

feat(lang/arabic): conservative proclitic splitting

31d3f65

Refs #16

feat(lang/arabic): tiered scoring predicates

d85e62e

Refs #16

feat(lang/arabic): ArabicLanguagePack and registration

2411ec3

Refs #16

test(lang/arabic): cover لل short-guard and كال proclitic+article; comment length guards

761e079

Refs #16

bsesic wants to merge 8 commits into develop from feature/arabic-language-pack
