bsesic · Jun 24, 2026 · Jun 25, 2026
diff --git a/docs/superpowers/plans/2026-06-24-arabic-language-pack.md b/docs/superpowers/plans/2026-06-24-arabic-language-pack.md
diff --git a/docs/superpowers/specs/2026-06-24-arabic-language-pack-design.md b/docs/superpowers/specs/2026-06-24-arabic-language-pack-design.md
@@ -0,0 +1,107 @@
+# Design: Arabic Language Pack (`ara`) — v0.1.0
+
+**Date:** 2026-06-24
+**Issue:** [#16](https://github.com/bsesic/trace/issues/16) — Arabic language pack (`ara`) — proclitic tokenization + orthographic normalization
+**Roadmap:** First non-Hebrew language pack; prerequisite for the cross-lingual alignment path (#17), the clause/colon chunker (#18), and the Judaeo-Arabic transliteration helper (#20). Validates the `lang/base.py` / `register_language` abstraction on its first non-Hebrew exercise.
+
+## Goal
+
+Add an Arabic language pack so that `tokenize(text, lang="ara")` and `align(..., lang="ara")` work end-to-end, mirroring the structure of `src/tracealign/lang/hebrew/`. The pack handles proclitic segmentation and Arabic orthographic normalization through tiered scoring, staying rule-based and dependency-light (no CAMeL Tools / ML dependency), consistent with the project's stdlib-leaning ethos.
+
+## Scope decisions (resolved during brainstorming)
+
+1. **Proclitic splitting strategy: conservative / high-precision.** Split only where the signal is unambiguous. Over-splitting damages alignment (spurious tokens mis-align); under-splitting is recoverable by the fuzzy tier. Precision over recall. This choice also means **no curated guard lexicon is needed**.
+2. **Reason vocabulary: granular — two new `Reason` values** (`DIACRITICS_STRIPPED`, `ORTHOGRAPHIC_VARIANT`). Each apparatus reason stays crisp for later critical-edition generation (Stage 5): an orthographic variant looks different from a fuzzy guess.
+
+## 1. Package structure
+
+Mirrors `lang/hebrew/`:
+
+```
+src/tracealign/lang/arabic/
+  __init__.py      # register_language(ArabicLanguagePack())  — side-effect import
+  pack.py          # ArabicLanguagePack(LanguagePack)
+  tokenize.py      # split_proclitics()  — post_tokenize hook
+  normalize.py     # strip_tashkil(), skeleton()
+  scoring.py       # arabic_scoring_tiers()
+```
+
+Registration: add `import tracealign.lang.arabic` alongside the Hebrew side-effect import in `src/tracealign/__init__.py` (currently line 37) and add `"tracealign.lang.arabic"` to `_BUILTIN_PACK_MODULES` (line 40) so the test-isolation reload helper restores it.
+
+**No `data/` directory.** The conservative splitting strategy requires no guard lexicon, so the pack sets `self.lexica = Lexica()` (empty). This is a deliberate consequence of decision (1) — no unused lexicon scaffolding is created (YAGNI).
+
+## 2. Pack metadata
+
+- `code = "ara"`
+- `aliases = ("arabic",)`
+- `version = "ara-0.1.0"`
+- `mid_word_chars = ""` — Arabic letters are Unicode category `L`, so the generic `pretokenize` handles them; `_DEFAULT_PUNCT` already contains the Arabic punctuation `،؛`.
+
+`version` is surfaced automatically: `align()` writes `language_pack_version: pack.version` into the result `params` via `needleman_wunsch.py:376`. No aligner change needed.
+
+## 3. Tokenization — `split_proclitics()` (post_tokenize)
+
+Arabic proclitics attach with **no separator character** (unlike the Hebrew maqqef). Spans are therefore contiguous: the cursor advances by `len(part)` with no `+1` gap between parts.
+
+Conservative rules — split only on unambiguous signals:
+
+| Input | Split | Rule |
+|---|---|---|
+| الكتاب | `ال` ǀ `كتاب` | Article `ال` (alif-lam) + remainder, when `len(token) > 2` |
+| والكتاب | `و` ǀ `الكتاب` | Single-letter proclitic (و/ف/ب/ك) **only when immediately followed by the article `ال`** |
+| بالبيت | `ب` ǀ `البيت` | same as above |
+| للكتاب | `ل` ǀ `لكتاب` | Special case: `li-` + article, alif elided (`لل`) → strip first `ل`, host keeps the reduced article form `لكتاب` |
+| وكتاب | — | bare و + radical → **no split** |
+| وزير، باب، كتاب | — | radical-initial → **no split** |
+
+Flags: the proclitic part gets flag `proclitic`; the host part gets `compound_part` (mirroring Hebrew's compound flag). The `لل` special case is covered by an explicit test.
+
+**Decision recorded:** the `لل` case strips the first `ل` and leaves the host as `لكتاب` (reduced-article form). We do not attempt to restore the elided alif; downstream scoring treats `لكتاب` as the host token's `raw`.
+
+## 4. Normalization
+
+- `raw` = diplomatic form (with tashkil), preserved unchanged.
+- `text` = NFC → remove combining marks (category `Mn`: fatha, kasra, damma, sukun, shadda, tanwin) **and** remove tatweel `ـ` (U+0640, category `Lm`, decorative elongation — stripped explicitly since it is not a combining mark).
+- `representations["skeleton"]` = orthographic folding applied on top of `text`:
+  - Alif variants: `أ إ آ ٱ → ا`
+  - Taa marbuta: `ة → ه`
+  - Alif maqsura / final ya: `ى → ي` (one canonical direction)
+  - Hamza seats: `ؤ → و`, `ئ → ي`, bare `ء` removed
+
+## 5. Scoring tiers — `arabic_scoring_tiers()`
+
+Enum extension in `src/tracealign/model.py`: add `DIACRITICS_STRIPPED` and `ORTHOGRAPHIC_VARIANT` to `Reason`.
+
+| Tier | Predicate | Score | Reason | `details.layer` |
+|---|---|---|---|---|
+| 1 | `a.raw == b.raw` | 1.0 | `EXACT` | — |
+| 2 | `a.text == b.text` ∧ `a.raw != b.raw` | 0.95 | `DIACRITICS_STRIPPED` | — |
+| 3 | `skeleton == skeleton` ∧ `a.text != b.text` | 0.90 | `ORTHOGRAPHIC_VARIANT` | `"skeleton"` |
+| 4 | `rapidfuzz.fuzz.ratio(a.text, b.text) / 100 ≥ 0.6` | `(ratio / 100) * 0.9` | `ORTHOGRAPHIC` | `"fuzzy"` |
+
+Tier predicates mirror `lang/hebrew/scoring.py` in shape and return `TierResult`. Score constants mirror the Hebrew ladder (0.95 / 0.90 / scaled fuzzy). No `ABBREVIATION` tier: Arabic abbreviation handling is out of scope for v0.1.0 (no abbreviation lexicon).
+
+## 6. Tests (TDD — written red first)
+
+- **tokenize:** every split case in §3, including the **negative cases** (وكتاب، وزير، باب must NOT split) and the `لل` special case; span correctness for contiguous parts.
+- **normalize:** tashkil stripping, tatweel removal, each folding rule individually; `raw` remains the diplomatic form.
+- **scoring:** one hit per tier with the correct `Reason` tag and `details.layer` where applicable.
+- **registry:** `list_languages()` includes `"ara"`; `get_language("arabic")` resolves via alias.
+- **end-to-end:** `tokenize(t, lang="ara")` and `align(a, b, lang="ara")` run; `params["language_pack_version"] == "ara-0.1.0"`.
+- Full suite green on the 3.10 / 3.11 / 3.12 matrix; `flake8 src/ tests/` clean.
+
+## 7. Out of scope (per issue #16)
+
+- Clause/colon boundary particle inventory → issue #18 (chunker).
+- Judaeo-Arabic written in Hebrew script → issue #20 (transliteration helper).
+- Any cross-lingual scoring → issue #17.
+- Syriac (`syr`) and Persian (`fas`) packs → separate follow-on issues once this pack lands and the abstraction is proven.
+
+## Acceptance criteria (from issue #16)
+
+- [ ] `list_languages()` includes `ara`; `tokenize`/`align` with `lang="ara"` work end-to-end.
+- [ ] Proclitic split separates `wa-`/`fa-`/`al-` etc. and does **not** split radical `w`/`f` (targeted tests).
+- [ ] Orthographic normalization collapses alif/hamza/taa-marbuta/ya variants into a skeleton; diplomatic form preserved in `raw`.
+- [ ] Tiered scoring returns reason tags consistent with the `Reason` enum (extended with two Arabic-relevant, script-neutral reasons, justified above).
+- [ ] Tests follow TDD; full suite green on 3.10/3.11/3.12; `flake8` clean.
+- [ ] `pack.version` set (`ara-0.1.0`) and surfaced in result `params`.
diff --git a/src/tracealign/__init__.py b/src/tracealign/__init__.py
@@ -33,11 +33,15 @@
 
 __version__ = "0.4.0.dev0"
 
-# Force Hebrew pack registration on first import.
+# Force Hebrew and Arabic pack registration on first import.
 import tracealign.lang.hebrew  # noqa: F401  -- side effect: registers HBO pack
+import tracealign.lang.arabic  # noqa: F401  -- side effect: registers ARA pack
 
 # Built-in pack module names; used to restore registrations after test resets.
-_BUILTIN_PACK_MODULES = ("tracealign.lang.hebrew",)
+_BUILTIN_PACK_MODULES = (
+    "tracealign.lang.hebrew",
+    "tracealign.lang.arabic",
+)
 
 
 def _reload_builtin_packs() -> None:

diff --git a/src/tracealign/lang/arabic/__init__.py b/src/tracealign/lang/arabic/__init__.py
@@ -0,0 +1,6 @@
+"""Arabic language pack — auto-registers on import."""
+
+from tracealign.lang.arabic.pack import ArabicLanguagePack
+from tracealign.lang.registry import register_language
+
+register_language(ArabicLanguagePack())
diff --git a/src/tracealign/lang/arabic/normalize.py b/src/tracealign/lang/arabic/normalize.py
@@ -0,0 +1,37 @@
+"""Arabic normalization: tashkil strip and orthographic skeleton folding."""
+
+from __future__ import annotations
+
+import unicodedata
+
+TATWEEL = "ـ"  # ARABIC TATWEEL (kashida), decorative elongation
+
+# Orthographic folding table applied on top of a tashkil-free string.
+# Alif variants -> bare alif; taa marbuta -> haa; alif maqsura -> ya;
+# hamza seats -> their carrier letter; bare hamza dropped.
+_FOLD = {
+    "أ": "ا",  # ALEF WITH HAMZA ABOVE  أ -> ا
+    "إ": "ا",  # ALEF WITH HAMZA BELOW  إ -> ا
+    "آ": "ا",  # ALEF WITH MADDA ABOVE  آ -> ا
+    "ٱ": "ا",  # ALEF WASLA            ٱ -> ا
+    "ة": "ه",  # TEH MARBUTA           ة -> ه
+    "ى": "ي",  # ALEF MAKSURA          ى -> ي
+    "ؤ": "و",  # WAW WITH HAMZA        ؤ -> و
+    "ئ": "ي",  # YEH WITH HAMZA        ئ -> ي
+    "ء": "",        # HAMZA                 ء -> (dropped)
+}
+
+
+def strip_tashkil(text: str) -> str:
+    """NFC-normalize, then remove combining marks (Mn) and tatweel."""
+    text = unicodedata.normalize("NFC", text)
+    return "".join(
+        ch
+        for ch in text
+        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
+    )
+
+
+def skeleton(text_no_tashkil: str) -> str:
+    """Apply orthographic folding to a tashkil-free string."""
+    return "".join(_FOLD.get(ch, ch) for ch in text_no_tashkil)
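Reviewer note on `normalize.py`: the layers compose as `raw` → `text` → `skeleton`. The sketch below is a minimal, self-contained illustration, not the pack's code. It copies the two functions from the hunk above with a trimmed fold table so it runs without `tracealign` installed, and the `layers` helper is hypothetical. It also shows what the NFC step contributes: a decomposed alif plus combining hamza (U+0627 U+0654) is composed to U+0623 before the `Mn` filter runs, so decomposed and precomposed inputs end up with the same `text`.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)

# Trimmed copy of the fold table in normalize.py (illustration only).
FOLD = {
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda above -> alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0649": "\u064a",  # alef maksura -> yeh
    "\u0621": "",        # bare hamza dropped
}


def strip_tashkil(text: str) -> str:
    """NFC-normalize, then drop combining marks (Mn) and tatweel."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


def skeleton(text_no_tashkil: str) -> str:
    """Fold orthographic variants on a tashkil-free string."""
    return "".join(FOLD.get(ch, ch) for ch in text_no_tashkil)


def layers(raw: str) -> tuple:
    """Hypothetical helper: return the (raw, text, skeleton) triple."""
    text = strip_tashkil(raw)
    return raw, text, skeleton(text)


# "madina" with fatha, kasra, fatha and a final teh marbuta.
vocalized = "\u0645\u064e\u062f\u0650\u064a\u0646\u064e\u0629"
raw, text, skel = layers(vocalized)

# Decomposed alef + HAMZA ABOVE versus the precomposed letter.
decomposed = strip_tashkil("\u0627\u0654")
precomposed = strip_tashkil("\u0623")
```

Here `text` keeps the teh marbuta while `skel` ends in heh, and `decomposed == precomposed`. Since none of the fold outputs is itself a key of the table, applying `skeleton` twice gives the same result as applying it once.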
diff --git a/src/tracealign/lang/arabic/pack.py b/src/tracealign/lang/arabic/pack.py
@@ -0,0 +1,45 @@
+"""ArabicLanguagePack."""
+
+from __future__ import annotations
+
+from tracealign.lang.arabic.normalize import skeleton, strip_tashkil
+from tracealign.lang.arabic.tokenize import split_proclitics
+from tracealign.lang.base import LanguagePack, ScoringTier
+from tracealign.model import Lexica, Token
+from tracealign.tokenize.base import RawToken
+
+
+class ArabicLanguagePack(LanguagePack):
+    code = "ara"
+    aliases = ("arabic",)
+    version = "ara-0.1.0"
+    word_chars = ""
+    mid_word_chars = ""
+
+    def __init__(self, lexica: Lexica | None = None) -> None:
+        # Conservative splitting needs no guard lexicon; an empty Lexica is
+        # intentional (see design spec).
+        self.lexica = lexica if lexica is not None else Lexica()
+
+    def post_tokenize(self, raws: list[RawToken]) -> list[RawToken]:
+        return split_proclitics(raws)
+
+    def normalize(self, raw: RawToken) -> Token:
+        # `id` and `position` are pack-local placeholders derived from the raw
+        # character span; the public `tokenize()` overrides both with
+        # sequence-index values keyed by `seq_label`.
+        text = strip_tashkil(raw.raw)
+        return Token(
+            id=f"ara:{raw.span[0]:06d}",
+            position=raw.span[0],
+            raw=raw.raw,
+            text=text,
+            representations={"skeleton": skeleton(text)},
+            flags=set(raw.flags),
+            source_span=raw.span,
+            metadata={},
+        )
+
+    def scoring_tiers(self) -> list[ScoringTier]:
+        from tracealign.lang.arabic.scoring import arabic_scoring_tiers
+        return arabic_scoring_tiers()
diff --git a/src/tracealign/lang/arabic/scoring.py b/src/tracealign/lang/arabic/scoring.py
@@ -0,0 +1,62 @@
+"""Arabic scoring-tier predicates."""
+
+from __future__ import annotations
+
+from rapidfuzz.fuzz import ratio
+
+from tracealign.lang.base import LanguagePack, ScoringTier, TierResult
+from tracealign.model import Reason, Token
+
+
+def exact_predicate(a: Token, b: Token, pack: LanguagePack) -> TierResult | None:
+    if a.raw == b.raw:
+        return TierResult(score=1.0)
+    return None
+
+
+def diacritics_stripped_predicate(
+    a: Token, b: Token, pack: LanguagePack
+) -> TierResult | None:
+    if a.text == b.text and a.raw != b.raw:
+        return TierResult(score=0.95)
+    return None
+
+
+def orthographic_variant_predicate(
+    a: Token, b: Token, pack: LanguagePack
+) -> TierResult | None:
+    sk_a = a.representations.get("skeleton")
+    sk_b = b.representations.get("skeleton")
+    if sk_a is None or sk_b is None:
+        return None
+    if sk_a == sk_b and a.text != b.text:
+        return TierResult(score=0.90, details={"layer": "skeleton"})
+    return None
+
+
+def orthographic_predicate(
+    a: Token,
+    b: Token,
+    pack: LanguagePack,
+    *,
+    threshold: float = 0.6,
+) -> TierResult | None:
+    r = ratio(a.text, b.text) / 100.0
+    if r < threshold:
+        return None
+    return TierResult(score=r * 0.9, details={"layer": "fuzzy", "ratio": r})
+
+
+def arabic_scoring_tiers() -> list[ScoringTier]:
+    return [
+        ScoringTier(reason=Reason.EXACT, predicate=exact_predicate),
+        ScoringTier(
+            reason=Reason.DIACRITICS_STRIPPED,
+            predicate=diacritics_stripped_predicate,
+        ),
+        ScoringTier(
+            reason=Reason.ORTHOGRAPHIC_VARIANT,
+            predicate=orthographic_variant_predicate,
+        ),
+        ScoringTier(reason=Reason.ORTHOGRAPHIC, predicate=orthographic_predicate),
+    ]
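Reviewer note on `scoring.py`: a compact way to read the ladder, assuming the aligner evaluates the tiers in list order and takes the first predicate that returns a non-`None` result (that ordering is an assumption here, not something this diff shows). The `Tok` class is a hypothetical three-field stand-in for `Token`, and `difflib.SequenceMatcher` is swapped in for `rapidfuzz.fuzz.ratio` to keep the sketch stdlib-only; the two similarity measures are close but not identical, so fuzzy scores from the real pack may differ.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Tok:
    """Hypothetical stand-in for tracealign.model.Token."""
    raw: str
    text: str
    skeleton: str


def score(a: Tok, b: Tok, threshold: float = 0.6):
    """Return (score, reason) from the first matching tier, else None."""
    if a.raw == b.raw:
        return 1.0, "exact"
    if a.text == b.text:
        # raw differs, text agrees: only diacritics or tatweel differ
        return 0.95, "diacritics_stripped"
    if a.skeleton == b.skeleton:
        # text differs, skeleton agrees: alif/hamza/taa-marbuta/ya variant
        return 0.90, "orthographic_variant"
    # Stand-in for rapidfuzz.fuzz.ratio(a.text, b.text) / 100.
    r = SequenceMatcher(None, a.text, b.text).ratio()
    if r >= threshold:
        return r * 0.9, "orthographic"
    return None


kitab = Tok("كتاب", "كتاب", "كتاب")
kutub = Tok("كتب", "كتب", "كتب")
bayt = Tok("بيت", "بيت", "بيت")

fuzzy_hit = score(kitab, kutub)   # similar enough for the fuzzy tier
no_match = score(kitab, bayt)     # below the 0.6 threshold
```

Under first-match-wins the `a.raw != b.raw` and `a.text != b.text` guards in the real predicates are implied by ordering; keeping them explicit, as the diff does, makes each predicate safe to call on its own. The fuzzy tier can never reach 0.90, because a ratio of 1.0 would mean identical `text`, which an earlier tier already claims.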
diff --git a/src/tracealign/lang/arabic/tokenize.py b/src/tracealign/lang/arabic/tokenize.py
@@ -0,0 +1,65 @@
+"""Arabic-specific tokenizer hooks: conservative proclitic splitting."""
+
+from __future__ import annotations
+
+from tracealign.tokenize.base import RawToken
+
+ALEF = "ا"
+LAM = "ل"
+ARTICLE = ALEF + LAM  # ال
+# Single-letter proclitics that we split only when they precede the article.
+_PROCLITIC_LETTERS = ("و", "ف", "ب", "ك")  # waw, fa, ba, kaf
+
+
+def _emit_split(r: RawToken, cut: int) -> list[RawToken]:
+    """Split RawToken `r` at character offset `cut` into proclitic + host.
+
+    Spans are contiguous (no separator character between Arabic proclitic
+    and host).
+    """
+    start = r.span[0]
+    proclitic = RawToken(
+        raw=r.raw[:cut],
+        span=(start, start + cut),
+        flags=set(r.flags) | {"proclitic"},
+    )
+    host = RawToken(
+        raw=r.raw[cut:],
+        span=(start + cut, r.span[1]),
+        flags=set(r.flags) | {"compound_part"},
+    )
+    return [proclitic, host]
+
+
+def _split_one(r: RawToken) -> list[RawToken]:
+    text = r.raw
+    # Rule 2: single-letter proclitic + article (e.g. والـ, بالـ).
+    # len > 3: proclitic + article (ال) + at least one host char
+    if (
+        len(text) > 3
+        and text[0] in _PROCLITIC_LETTERS
+        and text[1:3] == ARTICLE
+    ):
+        return _emit_split(r, 1)
+    # Rule 3: li- + article with elided alif (للـ).
+    if len(text) > 2 and text[0] == LAM and text[1] == LAM:
+        return _emit_split(r, 1)
+    # Rule 1: bare definite article.
+    # len > 2: article (ال) + at least one host char
+    if len(text) > 2 and text[:2] == ARTICLE:
+        return _emit_split(r, 2)
+    return [r]
+
+
+def split_proclitics(raws: list[RawToken]) -> list[RawToken]:
+    """Conservatively split Arabic proclitics off host words.
+
+    Splits only on unambiguous signals: the definite article, single-letter
+    proclitics that precede the article, and the li-+article (لل) contraction.
+    Bare proclitic letters before a non-article host are left attached to
+    avoid amputating word-initial radicals.
+    """
+    out: list[RawToken] = []
+    for r in raws:
+        out.extend(_split_one(r))
+    return out
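Reviewer note on `tokenize.py`: the three rules and the contiguous span arithmetic can be exercised in isolation with the sketch below. `Piece` is a hypothetical stand-in for `RawToken` (no flags), and `split` mirrors the rule order of `_split_one` but is not the pack's code.

```python
from dataclasses import dataclass

ALEF = "\u0627"
LAM = "\u0644"
ARTICLE = ALEF + LAM
PROCLITICS = "\u0648\u0641\u0628\u0643"  # waw, fa, ba, kaf


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for RawToken: text plus a half-open span."""
    raw: str
    span: tuple


def split(raw: str, start: int = 0) -> list:
    """Apply the conservative rules in the same order as _split_one."""
    if len(raw) > 3 and raw[0] in PROCLITICS and raw[1:3] == ARTICLE:
        cut = 1  # single-letter proclitic + article
    elif len(raw) > 2 and raw[:2] == LAM + LAM:
        cut = 1  # li- + article with elided alif
    elif len(raw) > 2 and raw[:2] == ARTICLE:
        cut = 2  # bare definite article
    else:
        return [Piece(raw, (start, start + len(raw)))]
    # No separator character: the host span starts where the proclitic ends.
    return [
        Piece(raw[:cut], (start, start + cut)),
        Piece(raw[cut:], (start + cut, start + len(raw))),
    ]


with_article = split("والكتاب")   # proclitic + article: split after 1 char
bare_waw = split("وكتاب")         # bare waw + radical: left intact
# Vocalized input: a fatha (U+064E) sits between the waw and the article.
vocalized = split("\u0648\u064e" + ARTICLE + "\u0643\u062a\u0627\u0628")
```

One thing the sketch makes visible: as far as this diff shows, matching is done on `raw` before `normalize` strips tashkil, so a vocalized token whose proclitic carries a vowel mark (the `vocalized` case above) is left unsplit. That is consistent with the precision-over-recall decision, but a test with vocalized input might be worth adding so the behavior is pinned down deliberately.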
diff --git a/src/tracealign/model.py b/src/tracealign/model.py
@@ -22,6 +22,8 @@ class Reason(str, Enum):
     PLENE_DEFECTIVE = "plene_defective"
     ABBREVIATION = "abbreviation"
     ORTHOGRAPHIC = "orthographic"
+    DIACRITICS_STRIPPED = "diacritics_stripped"
+    ORTHOGRAPHIC_VARIANT = "orthographic_variant"
     SCRIPT_VARIANT = "script_variant"
     INSERTION = "insertion"
     OMISSION = "omission"

diff --git a/tests/lang/arabic/__init__.py b/tests/lang/arabic/__init__.py
diff --git a/tests/lang/arabic/test_normalize.py b/tests/lang/arabic/test_normalize.py
@@ -0,0 +1,42 @@
+from tracealign.lang.arabic.normalize import skeleton, strip_tashkil
+
+
+def test_strip_tashkil_removes_vowel_marks():
+    # kataba with three fatha marks -> bare consonantal text
+    vocalized = "كَتَبَ"  # k-fatha t-fatha b-fatha
+    assert strip_tashkil(vocalized) == "كتب"
+
+
+def test_strip_tashkil_removes_tatweel():
+    assert strip_tashkil("كــتــاب") == "كتاب"
+
+
+def test_strip_tashkil_removes_shadda_and_tanwin():
+    assert strip_tashkil("مُحَمَّدٌ") == "محمد"
+
+
+def test_skeleton_folds_alif_variants():
+    assert skeleton("أحمد") == "احمد"
+    assert skeleton("إسلام") == "اسلام"
+    assert skeleton("آدم") == "ادم"
+
+
+def test_skeleton_folds_taa_marbuta_to_haa():
+    assert skeleton("مدينة") == "مدينه"
+
+
+def test_skeleton_folds_alif_maqsura_to_ya():
+    assert skeleton("على") == "علي"
+
+
+def test_skeleton_folds_hamza_seats():
+    assert skeleton("مؤمن") == "مومن"   # waw-hamza -> waw
+    assert skeleton("قائم") == "قايم"   # ya-hamza -> ya
+
+
+def test_skeleton_drops_bare_hamza():
+    assert skeleton("جزء") == "جز"
+
+
+def test_skeleton_leaves_plain_text_unchanged():
+    assert skeleton("كتاب") == "كتاب"