feat: idoctags serialization and deserialization matching the iso proposal #457

PeterStaar-IBM · 2025-12-11T06:58:09Z

IDocTags Serialization Implementation

Overview

Implements bidirectional serialization between DoclingDocument and the IDocTags format, a specialized XML-based markup language for structured document representation with geometric and semantic annotations.

Serialization Features

Core Capabilities:

Two output modes: HUMAN_FRIENDLY (indented, readable) and LLM_FRIENDLY (compact, tokenizer-optimized)
Geometric encoding: Bounding boxes quantized to configurable resolution (default: 512×512) via tokens
Semantic markup: Rich vocabulary covering titles, headings, text, captions, lists, forms, tables, pictures, code, formulas
OTSL table structure: Optimized Table Sequence Language with cell types (fcel/ecel, ched/rhed/corn, lcel/ucel/xcel, nl)
Content control: add_content parameter allows structure-only serialization (omits text while preserving tags)
XML compliance mode: Optional escaping of special characters (&, <, >) for valid XML output
Deserialization: Round-trip support to reconstruct DoclingDocument from IDocTags
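
As an illustration of the geometric encoding described above, here is a minimal sketch of quantizing a bounding box onto a 512×512 grid. The function name `quantize_bbox`, the (left, top, right, bottom) argument order, and the floor-then-clamp rounding are assumptions made for this example; the serializer in this PR may round or clamp differently.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def quantize_bbox(
    bbox: BBox, page_width: float, page_height: float, resolution: int = 512
) -> Tuple[int, int, int, int]:
    """Map a (left, top, right, bottom) box in page units to integer grid indices.

    Hypothetical helper: each coordinate is scaled to [0, resolution) and
    clamped so that values on the far page edge land on the last grid index.
    """

    def to_index(value: float, extent: float) -> int:
        idx = int(value / extent * resolution)  # floor for non-negative values
        return max(0, min(resolution - 1, idx))

    left, top, right, bottom = bbox
    return (
        to_index(left, page_width),
        to_index(top, page_height),
        to_index(right, page_width),
        to_index(bottom, page_height),
    )


if __name__ == "__main__":
    # A box covering the lower-right quadrant of a 612x792 page.
    print(quantize_bbox((306.0, 396.0, 612.0, 792.0), 612.0, 792.0))
```

Each resulting integer would then be emitted as one location token, so a lower resolution shrinks the token vocabulary at the cost of coarser geometry.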

Current Test Coverage

Vocabulary helpers (create_closing_token validation, self-closing token handling)
Content suppression correctness (captions, table cells, list items)
Multi-mode serialization (with/without content, human/LLM-friendly)
Metadata serialization and XML escaping
Round-trip edge cases
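
To show why XML escaping matters for round-trips, the sketch below uses only the Python standard library. The `<text>` wrapper element is purely illustrative, and this is not the escaping code used in the PR.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "a < b & c > d"

# escape() replaces &, < and > with their XML entities.
escaped = escape(raw)

# Without escaping, the payload is not well-formed XML.
try:
    ET.fromstring(f"<text>{raw}</text>")
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False

# With escaping, the parser accepts it and returns the original text.
parsed_text = ET.fromstring(f"<text>{escaped}</text>").text

print(escaped)
print(unescaped_parses)
print(parsed_text == raw)
```

A round-trip test along these lines checks both directions: the escaped output must parse, and the parsed text must equal the original content.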

Outstanding Work (FIXMEs)

Inline groups with multi-provenance (idoctags.py:1093) — Current split-per-provenance logic may need enhancement for inline groups
Checkbox token generation (lines 1174, 1179) — Add dedicated create_selected_token() method to vocabulary
Catch-all label handling (lines 1183-1184) — Refine mapping for EMPTY_VALUE, HANDWRITTEN_TEXT, PARAGRAPH, etc.; EMPTY_VALUE may need FormItem representation
OTSL cell logic verification (lines 1564, 1570) — Validate rowstart/colstart conditions for UCEL/LCEL tokens
TABLE vs DOCUMENT_INDEX distinction (line 1602) — Check label to emit correct floating group type
FormItems still need to be handled.
Footnotes still need to be added to FloatingItem (captions are already handled).
The <thread id="int"> element and page breaks still need to be handled. This will likely require changes to BaseSerializer, so it is deliberately left out of this PR.
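
For the OTSL cell logic item above, the following sketch shows one way span tokens can be derived from a cell's start row and start column. It follows the OTSL convention as commonly described (fcel or ecel at the span origin, lcel to its right, ucel below it, xcel elsewhere inside the span). The function name, the tuple layout, and the omission of header tokens (ched/rhed) are assumptions for this example, not the PR's implementation.

```python
from typing import List, Tuple

# (row, col, row_span, col_span, has_text): hypothetical cell description
Cell = Tuple[int, int, int, int, bool]


def otsl_tokens(n_rows: int, n_cols: int, cells: List[Cell]) -> List[str]:
    """Emit a flat OTSL token sequence for a grid of possibly merged cells."""
    # Positions not covered by any cell default to an empty cell.
    grid = [["ecel"] * n_cols for _ in range(n_rows)]
    for row, col, row_span, col_span, has_text in cells:
        for r in range(row, row + row_span):
            for c in range(col, col + col_span):
                if r == row and c == col:
                    grid[r][c] = "fcel" if has_text else "ecel"  # span origin
                elif r == row:
                    grid[r][c] = "lcel"  # same row as origin: merge left
                elif c == col:
                    grid[r][c] = "ucel"  # same column as origin: merge up
                else:
                    grid[r][c] = "xcel"  # interior of a 2D span
    tokens: List[str] = []
    for row_tokens in grid:
        tokens.extend(row_tokens)
        tokens.append("nl")  # end of table row
    return tokens


if __name__ == "__main__":
    cells = [(0, 0, 2, 2, True), (0, 2, 1, 1, True), (1, 2, 1, 1, False)]
    print(otsl_tokens(2, 3, cells))
```

Comparing each position against the span's start row and start column is the condition the FIXME asks to verify: a wrong comparison swaps lcel and ucel without changing the token count, so shape-only checks will not catch it.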

Testing

Dump Mode Usage

Serialize DoclingDocuments from HuggingFace datasets to IDocTags format and generate a validation report:

python examples/convert_to_idoctags.py --mode dump [--config CONFIG.json] [--limit N]

What it does:

Loads documents from a HuggingFace dataset (default: docling-project/doclaynet-set-a)
Serializes them to IDocTags in multiple variants (LLM/human-friendly, with/without content, XML-compliant)
Generates an Excel report (./scratch/idoctags_report.xlsx) tracking success/failure for each serialization mode
Optionally limits processing to first N documents with --limit

Config file:

If --config is omitted, a default config (idoctags_dump_config.json) is auto-generated. Key settings: dataset_name, dataset_subset, output_dir, report_path, limit.

Use --write-default-config to generate the config template without running the dump.

Running

uv run python ./examples/convert_to_idoctags.py --mode dump --config ./idoctags_dump_config.json

produces:

Wrote report (Excel via pandas) to: scratch/idoctags_report.xlsx
Overview summary:
 - Total processed: 3544
 - Loaded DoclingDocument: 3544
 - Serialized IDocTags (human_friendly, xml_compliant=True, content=True): 3529
 - Serialized IDocTags (human_friendly, xml_compliant=True, content=False): 3544
 - Serialized IDocTags (human_friendly, xml_compliant=False, content=True): 3034
 - Serialized IDocTags (human_friendly, xml_compliant=False, content=False): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=True, content=True): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=True, content=False): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=False, content=True): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=False, content=False): 3544
 - Serialized HTML: 3541
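
The counts above can be turned into failure counts per mode with a few lines of Python. The numbers are copied from the summary; the helper name `failure_summary` is made up for this sketch and is not part of the example script.

```python
from typing import Dict, List, Tuple

TOTAL = 3544

# Success counts copied from the summary above.
succeeded: Dict[str, int] = {
    "human_friendly, xml_compliant=True, content=True": 3529,
    "human_friendly, xml_compliant=True, content=False": 3544,
    "human_friendly, xml_compliant=False, content=True": 3034,
    "human_friendly, xml_compliant=False, content=False": 3544,
    "llm_friendly, xml_compliant=True, content=True": 3544,
    "llm_friendly, xml_compliant=True, content=False": 3544,
    "llm_friendly, xml_compliant=False, content=True": 3544,
    "llm_friendly, xml_compliant=False, content=False": 3544,
    "HTML": 3541,
}


def failure_summary(total: int, ok_counts: Dict[str, int]) -> List[Tuple[str, int, float]]:
    """Return (mode, failed, failed_percent) for modes with failures, worst first."""
    rows = [
        (mode, total - ok, 100.0 * (total - ok) / total)
        for mode, ok in ok_counts.items()
        if ok < total
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


summary = failure_summary(TOTAL, succeeded)
for mode, failed, pct in summary:
    print(f"{mode}: {failed} failed ({pct:.2f}%)")
```

Only three modes fail at all. All of the IDocTags failures are in human_friendly mode with content enabled, and most of them occur with xml_compliant=False.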

Signed-off-by: Peter Staar <[email protected]>

mergify · 2025-12-11T06:58:18Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

github-actions · 2025-12-11T06:58:18Z

✅ DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉

codecov · 2025-12-11T06:59:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.



dosubot · 2025-12-12T16:26:17Z

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.
