808 changes: 808 additions & 0 deletions docs/superpowers/plans/2026-06-24-arabic-language-pack.md

Large diffs are not rendered by default.

107 changes: 107 additions & 0 deletions docs/superpowers/specs/2026-06-24-arabic-language-pack-design.md
@@ -0,0 +1,107 @@
# Design: Arabic Language Pack (`ara`) — v0.1.0

**Date:** 2026-06-24
**Issue:** [#16](https://github.com/bsesic/trace/issues/16) — Arabic language pack (`ara`) — proclitic tokenization + orthographic normalization
**Roadmap:** First non-Hebrew language pack; prerequisite for the cross-lingual alignment path (#17), the clause/colon chunker (#18), and the Judaeo-Arabic transliteration helper (#20). Validates the `lang/base.py` / `register_language` abstraction on its first non-Hebrew exercise.

## Goal

Add an Arabic language pack so that `tokenize(text, lang="ara")` and `align(..., lang="ara")` work end-to-end, mirroring the structure of `src/tracealign/lang/hebrew/`. The pack handles proclitic segmentation and Arabic orthographic normalization through tiered scoring, staying rule-based and dependency-light (no CAMeL Tools / ML dependency), consistent with the project's stdlib-leaning ethos.

## Scope decisions (resolved during brainstorming)

1. **Proclitic splitting strategy: conservative / high-precision.** Split only where the signal is unambiguous. Over-splitting damages alignment (spurious tokens mis-align); under-splitting is recoverable by the fuzzy tier. Precision over recall. This choice also means **no curated guard lexicon is needed**.
2. **Reason vocabulary: granular — two new `Reason` values** (`DIACRITICS_STRIPPED`, `ORTHOGRAPHIC_VARIANT`). Each apparatus reason stays crisp for later critical-edition generation (Stage 5): an orthographic variant looks different from a fuzzy guess.

## 1. Package structure

Mirrors `lang/hebrew/`:

```
src/tracealign/lang/arabic/
    __init__.py     # register_language(ArabicLanguagePack()); side-effect import
    pack.py         # ArabicLanguagePack(LanguagePack)
    tokenize.py     # split_proclitics(), the post_tokenize hook
    normalize.py    # strip_tashkil(), skeleton()
    scoring.py      # arabic_scoring_tiers()
```

Registration: add `import tracealign.lang.arabic` alongside the Hebrew side-effect import in `src/tracealign/__init__.py` (currently line 37) and add `"tracealign.lang.arabic"` to `_BUILTIN_PACK_MODULES` (line 40) so the test-isolation reload helper restores it.
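
The side-effect registration pattern can be illustrated with a toy registry. This is only a sketch: `_REGISTRY`, `DemoPack`, and the `*_demo` functions are hypothetical stand-ins, not the contents of `tracealign.lang.registry`, and the real registry may index packs differently.

```python
# Hypothetical module-level registry; the real one lives in tracealign.lang.registry.
_REGISTRY: dict[str, object] = {}


class DemoPack:
    """Stand-in for a LanguagePack; only the two fields the sketch needs."""

    code = "ara"
    aliases = ("arabic",)


def register_language_demo(pack) -> None:
    # Index the pack under its code and every alias so either name resolves.
    for name in (pack.code, *pack.aliases):
        _REGISTRY[name.lower()] = pack


def get_language_demo(name: str):
    return _REGISTRY[name.lower()]


def list_languages_demo() -> list[str]:
    # Report canonical codes only, not aliases.
    return sorted({pack.code for pack in _REGISTRY.values()})


# Importing the pack package would run a call like this one as a side effect.
register_language_demo(DemoPack())
```

Under these assumptions, a test-isolation helper that clears the registry has to re-run the import of each built-in pack module, which is why the module name also goes into `_BUILTIN_PACK_MODULES`.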

**No `data/` directory.** The conservative splitting strategy requires no guard lexicon, so the pack sets `self.lexica = Lexica()` (empty). This is a deliberate consequence of decision (1) — no unused lexicon scaffolding is created (YAGNI).

## 2. Pack metadata

- `code = "ara"`
- `aliases = ("arabic",)`
- `version = "ara-0.1.0"`
- `mid_word_chars = ""` — Arabic letters are Unicode category `L`, so the generic `pretokenize` handles them; `_DEFAULT_PUNCT` already contains the Arabic punctuation `،؛`.

`version` is surfaced automatically: `align()` writes `language_pack_version: pack.version` into the result `params` via `needleman_wunsch.py:376`. No aligner change needed.

## 3. Tokenization — `split_proclitics()` (post_tokenize)

Arabic proclitics attach with **no separator character** (unlike the Hebrew maqqef). Spans are therefore contiguous: the cursor advances by `len(part)` with no `+1` gap between parts.

Conservative rules — split only on unambiguous signals:

| Input | Split | Rule |
|---|---|---|
| الكتاب | `ال` ǀ `كتاب` | Article `ال` (alif-lam) + remainder, when `len(token) > 2` |
| والكتاب | `و` ǀ `الكتاب` | Single-letter proclitic (و/ف/ب/ك) **only when immediately followed by the article `ال`** |
| بالبيت | `ب` ǀ `البيت` | same as above |
| للكتاب | `ل` ǀ `لكتاب` | Special case: `li-` + article, alif elided (`لل`) → strip first `ل`, host keeps the reduced article form `لكتاب` |
| وكتاب | — | bare و + radical → **no split** |
| وزير، باب، كتاب | — | radical-initial → **no split** |

Flags: the proclitic part gets flag `proclitic`; the host part gets `compound_part` (mirroring Hebrew's compound flag). The `لل` special case is covered by an explicit test.

**Decision recorded:** the `لل` case strips the first `ل` and leaves the host as `لكتاب` (reduced-article form). We do not attempt to restore the elided alif; downstream scoring treats `لكتاب` as the host token's `raw`.
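
The rule table and the contiguous-span arithmetic can be sketched on plain strings. The helper below is illustrative only: `split_with_spans` is a hypothetical name, it works on unvocalized words rather than `RawToken` objects, and it omits the flags.

```python
ARTICLE = "ال"
PROCLITIC_LETTERS = "وفبك"


def split_with_spans(word: str, start: int = 0) -> list[tuple[str, tuple[int, int]]]:
    """Return (part, (begin, end)) pairs for one unvocalized word."""
    end = start + len(word)
    if len(word) > 3 and word[0] in PROCLITIC_LETTERS and word[1:3] == ARTICLE:
        cut = 1  # single-letter proclitic followed by the article
    elif len(word) > 2 and word[:2] == "لل":
        cut = 1  # li- + article with elided alif: strip only the first lam
    elif len(word) > 2 and word[:2] == ARTICLE:
        cut = 2  # bare definite article
    else:
        return [(word, (start, end))]  # no unambiguous signal: leave intact
    # Contiguous spans: the host begins exactly where the proclitic ends.
    return [(word[:cut], (start, start + cut)), (word[cut:], (start + cut, end))]


for sample in ("الكتاب", "والكتاب", "للكتاب", "وكتاب", "وزير"):
    print(sample, split_with_spans(sample))
```

The last two samples come back as a single part, which is the high-precision behaviour the negative cases in the table require.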

## 4. Normalization

- `raw` = diplomatic form (with tashkil), preserved unchanged.
- `text` = NFC → remove combining marks (category `Mn`: fatha, kasra, damma, sukun, shadda, tanwin) **and** remove tatweel `ـ` (U+0640, category `Lm`, decorative elongation — stripped explicitly since it is not a combining mark).
- `representations["skeleton"]` = orthographic folding applied on top of `text`:
  - Alif variants: `أ إ آ ٱ → ا`
  - Taa marbuta: `ة → ه`
  - Alif maqsura / final ya: `ى → ي` (one canonical direction)
  - Hamza seats: `ؤ → و`, `ئ → ي`, bare `ء` removed
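
The three layers can be seen side by side in a small sketch. It assumes nothing beyond the rules listed above; `layers` and `FOLD` are illustrative names, and `str.translate` is used here as one possible way to apply the folding table.

```python
import unicodedata

TATWEEL = "\u0640"

# Folding table from the list above, written with code points for clarity.
FOLD = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda -> bare alif
    "\u0671": "\u0627",  # alif wasla -> bare alif
    "\u0629": "\u0647",  # taa marbuta -> haa
    "\u0649": "\u064A",  # alif maqsura -> ya
    "\u0624": "\u0648",  # waw with hamza -> waw
    "\u0626": "\u064A",  # ya with hamza -> ya
    "\u0621": None,      # bare hamza is dropped
})


def layers(raw: str) -> dict[str, str]:
    """Return the raw, text, and skeleton layers for one token."""
    nfc = unicodedata.normalize("NFC", raw)
    # Drop combining marks (category Mn) and the tatweel, which is category Lm.
    text = "".join(
        ch for ch in nfc if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )
    return {"raw": raw, "text": text, "skeleton": text.translate(FOLD)}


print(layers("مُحَمَّدٌ"))
```

One reason to normalize to NFC first: an alif followed by a combining hamza (U+0627 U+0654) composes to the single code point U+0623, so the hamza survives in `text` and is only folded away at the `skeleton` layer.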

## 5. Scoring tiers — `arabic_scoring_tiers()`

Enum extension in `src/tracealign/model.py`: add `DIACRITICS_STRIPPED` and `ORTHOGRAPHIC_VARIANT` to `Reason`.

| Tier | Predicate | Score | Reason | `details.layer` |
|---|---|---|---|---|
| 1 | `a.raw == b.raw` | 1.0 | `EXACT` | — |
| 2 | `a.text == b.text` ∧ `a.raw != b.raw` | 0.95 | `DIACRITICS_STRIPPED` | — |
| 3 | `a.representations["skeleton"] == b.representations["skeleton"]` ∧ `a.text != b.text` | 0.90 | `ORTHOGRAPHIC_VARIANT` | `"skeleton"` |
| 4 | `r = rapidfuzz.fuzz.ratio(a.text, b.text) / 100`, `r ≥ 0.6` | `r * 0.9` | `ORTHOGRAPHIC` | `"fuzzy"` |

Tier predicates mirror `lang/hebrew/scoring.py` in shape and return `TierResult`. Score constants mirror the Hebrew ladder (0.95 / 0.90 / scaled fuzzy). No `ABBREVIATION` tier: Arabic abbreviation handling is out of scope for v0.1.0 (no abbreviation lexicon).
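
The ladder can be sketched as a single function, assuming the aligner takes the first tier whose predicate matches. To stay dependency-free the sketch swaps `rapidfuzz.fuzz.ratio` for the stdlib `difflib.SequenceMatcher`, whose ratio is similar but not identical; `Tok` and `score_pair` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import NamedTuple, Optional


class Tok(NamedTuple):
    """Minimal stand-in for a token with its three comparison layers."""

    raw: str
    text: str
    skeleton: str


def score_pair(a: Tok, b: Tok, threshold: float = 0.6) -> Optional[tuple[float, str]]:
    """Walk the tiers top-down and return (score, reason) for the first hit."""
    if a.raw == b.raw:
        return 1.0, "exact"
    if a.text == b.text:
        return 0.95, "diacritics_stripped"
    if a.skeleton == b.skeleton:
        return 0.90, "orthographic_variant"
    # Stand-in for rapidfuzz.fuzz.ratio(a.text, b.text) / 100.
    r = SequenceMatcher(None, a.text, b.text).ratio()
    if r >= threshold:
        return r * 0.9, "orthographic"
    return None  # no tier matched


vocalized = Tok("كَتَبَ", "كتب", "كتب")
plain = Tok("كتب", "كتب", "كتب")
print(score_pair(vocalized, plain))
```

Because the fuzzy score is scaled by 0.9, it can reach 0.90 only when the two texts are identical, a case the earlier tiers have already claimed, so the reasons stay ordered by confidence.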

## 6. Tests (TDD — written red first)

- **tokenize:** every split case in §3, including the **negative cases** (وكتاب، وزير، باب must NOT split) and the `لل` special case; span correctness for contiguous parts.
- **normalize:** tashkil stripping, tatweel removal, each folding rule individually; `raw` remains the diplomatic form.
- **scoring:** one hit per tier with the correct `Reason` tag and `details.layer` where applicable.
- **registry:** `list_languages()` includes `"ara"`; `get_language("arabic")` resolves via alias.
- **end-to-end:** `tokenize(t, lang="ara")` and `align(a, b, lang="ara")` run; `params["language_pack_version"] == "ara-0.1.0"`.
- Full suite green on the 3.10 / 3.11 / 3.12 matrix; `flake8 src/ tests/` clean.

## 7. Out of scope (per issue #16)

- Clause/colon boundary particle inventory → issue #18 (chunker).
- Judaeo-Arabic written in Hebrew script → issue #20 (transliteration helper).
- Any cross-lingual scoring → issue #17.
- Syriac (`syr`) and Persian (`fas`) packs → separate follow-on issues once this pack lands and the abstraction is proven.

## Acceptance criteria (from issue #16)

- [ ] `list_languages()` includes `ara`; `tokenize`/`align` with `lang="ara"` work end-to-end.
- [ ] Proclitic split separates `wa-`/`fa-`/`al-` etc. and does **not** split radical `w`/`f` (targeted tests).
- [ ] Orthographic normalization collapses alif/hamza/taa-marbuta/ya variants into a skeleton; diplomatic form preserved in `raw`.
- [ ] Tiered scoring returns reason tags consistent with the `Reason` enum (extended with two Arabic-relevant, script-neutral reasons, justified above).
- [ ] Tests follow TDD; full suite green on 3.10/3.11/3.12; `flake8` clean.
- [ ] `pack.version` set (`ara-0.1.0`) and surfaced in result `params`.
8 changes: 6 additions & 2 deletions src/tracealign/__init__.py
@@ -33,11 +33,15 @@

 __version__ = "0.4.0.dev0"

-# Force Hebrew pack registration on first import.
+# Force Hebrew and Arabic pack registration on first import.
 import tracealign.lang.hebrew  # noqa: F401 -- side effect: registers HBO pack
+import tracealign.lang.arabic  # noqa: F401 -- side effect: registers ARA pack

 # Built-in pack module names; used to restore registrations after test resets.
-_BUILTIN_PACK_MODULES = ("tracealign.lang.hebrew",)
+_BUILTIN_PACK_MODULES = (
+    "tracealign.lang.hebrew",
+    "tracealign.lang.arabic",
+)


 def _reload_builtin_packs() -> None:
6 changes: 6 additions & 0 deletions src/tracealign/lang/arabic/__init__.py
@@ -0,0 +1,6 @@
"""Arabic language pack — auto-registers on import."""

from tracealign.lang.arabic.pack import ArabicLanguagePack
from tracealign.lang.registry import register_language

register_language(ArabicLanguagePack())
37 changes: 37 additions & 0 deletions src/tracealign/lang/arabic/normalize.py
@@ -0,0 +1,37 @@
"""Arabic normalization: tashkil strip and orthographic skeleton folding."""

from __future__ import annotations

import unicodedata

TATWEEL = "ـ"  # ARABIC TATWEEL (kashida), decorative elongation

# Orthographic folding table applied on top of a tashkil-free string.
# Alif variants -> bare alif; taa marbuta -> haa; alif maqsura -> ya;
# hamza seats -> their carrier letter; bare hamza dropped.
_FOLD = {
    "أ": "ا",  # ALEF WITH HAMZA ABOVE أ -> ا
    "إ": "ا",  # ALEF WITH HAMZA BELOW إ -> ا
    "آ": "ا",  # ALEF WITH MADDA ABOVE آ -> ا
    "ٱ": "ا",  # ALEF WASLA ٱ -> ا
    "ة": "ه",  # TEH MARBUTA ة -> ه
    "ى": "ي",  # ALEF MAKSURA ى -> ي
    "ؤ": "و",  # WAW WITH HAMZA ؤ -> و
    "ئ": "ي",  # YEH WITH HAMZA ئ -> ي
    "ء": "",  # HAMZA ء -> (dropped)
}


def strip_tashkil(text: str) -> str:
    """NFC-normalize, then remove combining marks (Mn) and tatweel."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch
        for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


def skeleton(text_no_tashkil: str) -> str:
    """Apply orthographic folding to a tashkil-free string."""
    return "".join(_FOLD.get(ch, ch) for ch in text_no_tashkil)
45 changes: 45 additions & 0 deletions src/tracealign/lang/arabic/pack.py
@@ -0,0 +1,45 @@
"""ArabicLanguagePack."""

from __future__ import annotations

from tracealign.lang.arabic.normalize import skeleton, strip_tashkil
from tracealign.lang.arabic.tokenize import split_proclitics
from tracealign.lang.base import LanguagePack, ScoringTier
from tracealign.model import Lexica, Token
from tracealign.tokenize.base import RawToken


class ArabicLanguagePack(LanguagePack):
    code = "ara"
    aliases = ("arabic",)
    version = "ara-0.1.0"
    word_chars = ""
    mid_word_chars = ""

    def __init__(self, lexica: Lexica | None = None) -> None:
        # Conservative splitting needs no guard lexicon; an empty Lexica is
        # intentional (see design spec).
        self.lexica = lexica if lexica is not None else Lexica()

    def post_tokenize(self, raws: list[RawToken]) -> list[RawToken]:
        return split_proclitics(raws)

    def normalize(self, raw: RawToken) -> Token:
        # `id` and `position` are pack-local placeholders derived from the raw
        # character span; the public `tokenize()` overrides both with
        # sequence-index values keyed by `seq_label`.
        text = strip_tashkil(raw.raw)
        return Token(
            id=f"ara:{raw.span[0]:06d}",
            position=raw.span[0],
            raw=raw.raw,
            text=text,
            representations={"skeleton": skeleton(text)},
            flags=set(raw.flags),
            source_span=raw.span,
            metadata={},
        )

    def scoring_tiers(self) -> list[ScoringTier]:
        from tracealign.lang.arabic.scoring import arabic_scoring_tiers
        return arabic_scoring_tiers()
62 changes: 62 additions & 0 deletions src/tracealign/lang/arabic/scoring.py
@@ -0,0 +1,62 @@
"""Arabic scoring-tier predicates."""

from __future__ import annotations

from rapidfuzz.fuzz import ratio

from tracealign.lang.base import LanguagePack, ScoringTier, TierResult
from tracealign.model import Reason, Token


def exact_predicate(a: Token, b: Token, pack: LanguagePack) -> TierResult | None:
    if a.raw == b.raw:
        return TierResult(score=1.0)
    return None


def diacritics_stripped_predicate(
    a: Token, b: Token, pack: LanguagePack
) -> TierResult | None:
    if a.text == b.text and a.raw != b.raw:
        return TierResult(score=0.95)
    return None


def orthographic_variant_predicate(
    a: Token, b: Token, pack: LanguagePack
) -> TierResult | None:
    sk_a = a.representations.get("skeleton")
    sk_b = b.representations.get("skeleton")
    if sk_a is None or sk_b is None:
        return None
    if sk_a == sk_b and a.text != b.text:
        return TierResult(score=0.90, details={"layer": "skeleton"})
    return None


def orthographic_predicate(
    a: Token,
    b: Token,
    pack: LanguagePack,
    *,
    threshold: float = 0.6,
) -> TierResult | None:
    r = ratio(a.text, b.text) / 100.0
    if r < threshold:
        return None
    return TierResult(score=r * 0.9, details={"layer": "fuzzy", "ratio": r})


def arabic_scoring_tiers() -> list[ScoringTier]:
    return [
        ScoringTier(reason=Reason.EXACT, predicate=exact_predicate),
        ScoringTier(
            reason=Reason.DIACRITICS_STRIPPED,
            predicate=diacritics_stripped_predicate,
        ),
        ScoringTier(
            reason=Reason.ORTHOGRAPHIC_VARIANT,
            predicate=orthographic_variant_predicate,
        ),
        ScoringTier(reason=Reason.ORTHOGRAPHIC, predicate=orthographic_predicate),
    ]
65 changes: 65 additions & 0 deletions src/tracealign/lang/arabic/tokenize.py
@@ -0,0 +1,65 @@
"""Arabic-specific tokenizer hooks: conservative proclitic splitting."""

from __future__ import annotations

from tracealign.tokenize.base import RawToken

ALEF = "ا"
LAM = "ل"
ARTICLE = ALEF + LAM  # ال
# Single-letter proclitics that we split only when they precede the article.
_PROCLITIC_LETTERS = ("و", "ف", "ب", "ك")  # و ف ب ك


def _emit_split(r: RawToken, cut: int) -> list[RawToken]:
    """Split RawToken `r` at character offset `cut` into proclitic + host.

    Spans are contiguous (no separator character between Arabic proclitic
    and host).
    """
    start = r.span[0]
    proclitic = RawToken(
        raw=r.raw[:cut],
        span=(start, start + cut),
        flags=set(r.flags) | {"proclitic"},
    )
    host = RawToken(
        raw=r.raw[cut:],
        span=(start + cut, r.span[1]),
        flags=set(r.flags) | {"compound_part"},
    )
    return [proclitic, host]


def _split_one(r: RawToken) -> list[RawToken]:
    # Rules match on the raw string, so a diacritic between the proclitic and
    # the article (e.g. a fatha after و) leaves the token unsplit.
    text = r.raw
    # Rule 2: single-letter proclitic + article (e.g. والـ, بالـ).
    # len > 3: proclitic + article (ال) + at least one host char
    if (
        len(text) > 3
        and text[0] in _PROCLITIC_LETTERS
        and text[1:3] == ARTICLE
    ):
        return _emit_split(r, 1)
    # Rule 3: li- + article with elided alif (للـ).
    if len(text) > 2 and text[0] == LAM and text[1] == LAM:
        return _emit_split(r, 1)
    # Rule 1: bare definite article.
    # len > 2: article (ال) + at least one host char
    if len(text) > 2 and text[:2] == ARTICLE:
        return _emit_split(r, 2)
    return [r]


def split_proclitics(raws: list[RawToken]) -> list[RawToken]:
    """Conservatively split Arabic proclitics off host words.

    Splits only on unambiguous signals: the definite article, single-letter
    proclitics that precede the article, and the li-+article (لل) contraction.
    Bare proclitic letters before a non-article host are left attached to
    avoid amputating word-initial radicals.
    """
    out: list[RawToken] = []
    for r in raws:
        out.extend(_split_one(r))
    return out
2 changes: 2 additions & 0 deletions src/tracealign/model.py
@@ -22,6 +22,8 @@ class Reason(str, Enum):
     PLENE_DEFECTIVE = "plene_defective"
     ABBREVIATION = "abbreviation"
     ORTHOGRAPHIC = "orthographic"
+    DIACRITICS_STRIPPED = "diacritics_stripped"
+    ORTHOGRAPHIC_VARIANT = "orthographic_variant"
     SCRIPT_VARIANT = "script_variant"
     INSERTION = "insertion"
     OMISSION = "omission"
Empty file added tests/lang/arabic/__init__.py
Empty file.
42 changes: 42 additions & 0 deletions tests/lang/arabic/test_normalize.py
@@ -0,0 +1,42 @@
from tracealign.lang.arabic.normalize import skeleton, strip_tashkil


def test_strip_tashkil_removes_vowel_marks():
    # kataba: three consonants, each carrying a fatha -> bare consonantal text
    vocalized = "كَتَبَ"  # k-fatha t-fatha b-fatha
    assert strip_tashkil(vocalized) == "كتب"


def test_strip_tashkil_removes_tatweel():
    assert strip_tashkil("كــتــاب") == "كتاب"


def test_strip_tashkil_removes_shadda_and_tanwin():
    assert strip_tashkil("مُحَمَّدٌ") == "محمد"


def test_skeleton_folds_alif_variants():
    assert skeleton("أحمد") == "احمد"
    assert skeleton("إسلام") == "اسلام"
    assert skeleton("آدم") == "ادم"


def test_skeleton_folds_taa_marbuta_to_haa():
    assert skeleton("مدينة") == "مدينه"


def test_skeleton_folds_alif_maqsura_to_ya():
    assert skeleton("على") == "علي"


def test_skeleton_folds_hamza_seats():
    assert skeleton("مؤمن") == "مومن"  # waw-hamza -> waw
    assert skeleton("قائم") == "قايم"  # ya-hamza -> ya


def test_skeleton_drops_bare_hamza():
    assert skeleton("جزء") == "جز"


def test_skeleton_is_idempotent_on_plain_text():
    assert skeleton("كتاب") == "كتاب"