@PeterStaar-IBM PeterStaar-IBM commented Dec 11, 2025

IDocTags Serialization Implementation

Overview

Implements bidirectional serialization between DoclingDocument and the IDocTags format, a specialized XML-based markup language for structured document representation with geometric and semantic annotations.

Serialization Features

Core Capabilities:

  • Two output modes: HUMAN_FRIENDLY (indented, readable) and LLM_FRIENDLY (compact, tokenizer-optimized)
  • Geometric encoding: Bounding boxes quantized to a configurable resolution (default: 512×512) and emitted as location tokens
  • Semantic markup: Rich vocabulary covering titles, headings, text, captions, lists, forms, tables, pictures, code, formulas
  • OTSL table structure: Optimized Table Sequence Language with cell types (fcel/ecel, ched/rhed/corn, lcel/ucel/xcel, nl)
  • Content control: add_content parameter allows structure-only serialization (omits text while preserving tags)
  • XML compliance mode: Optional escaping of special characters (&, <, >) for valid XML output
  • Deserialization: Round-trip support to reconstruct DoclingDocument from IDocTags
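The geometric encoding listed above can be sketched as follows. This is a minimal illustration and not the PR's implementation: the function names, the rounding and clamping strategy, and the exact token spelling are assumptions, loosely based on the `location` token with a `value` attribute in the range [0, resolution].

```python
def quantize_bbox(left, top, right, bottom, page_width, page_height, resolution=512):
    """Map page coordinates onto an integer grid with `resolution` steps per axis."""

    def quantize(value, extent):
        # Scale into [0, resolution] and clamp so rounding never leaves the grid.
        return max(0, min(resolution, round(value / extent * resolution)))

    return (
        quantize(left, page_width),
        quantize(top, page_height),
        quantize(right, page_width),
        quantize(bottom, page_height),
    )


def location_tokens(bbox):
    # Hypothetical token spelling: one self-closing location token per coordinate.
    return "".join(f'<location value="{v}"/>' for v in bbox)


# A box covering the lower-right quadrant of a 612 x 792 page.
bbox = quantize_bbox(306, 396, 612, 792, page_width=612, page_height=792)
print(location_tokens(bbox))
```

Quantizing to a fixed grid keeps the coordinate vocabulary small and independent of the page size, which is presumably why the resolution is configurable rather than tied to pixel dimensions.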

Current Test Coverage

  • Vocabulary helpers (create_closing_token validation, self-closing token handling)
  • Content suppression correctness (captions, table cells, list items)
  • Multi-mode serialization (with/without content, human/LLM-friendly)
  • Metadata serialization and XML escaping
  • Round-trip edge cases
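As an illustration of the XML-escaping case in this list, the sketch below escapes the three special characters with the Python standard library and checks that a parser recovers the original text. It does not use the PR's serializer; the `text` wrapper element and the sample string are chosen only for the example.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

raw = "R&D costs < 5% if x > y"

# escape() replaces &, < and > with their XML entities.
escaped = escape(raw)
wrapped = f"<text>{escaped}</text>"
print(wrapped)

# Parsing the escaped markup yields the original string again.
parsed = minidom.parseString(wrapped).documentElement.firstChild.data
print(parsed == raw)
```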

Outstanding Work (FIXMEs)

  1. Inline groups with multi-provenance (idoctags.py:1093) — Current split-per-provenance logic may need enhancement for inline groups
  2. Checkbox token generation (lines 1174, 1179) — Add dedicated create_selected_token() method to vocabulary
  3. Catch-all label handling (lines 1183-1184) — Refine mapping for EMPTY_VALUE, HANDWRITTEN_TEXT, PARAGRAPH, etc.; EMPTY_VALUE may need FormItem representation
  4. OTSL cell logic verification (lines 1564, 1570) — Validate rowstart/colstart conditions for UCEL/LCEL tokens
  5. TABLE vs DOCUMENT_INDEX distinction (line 1602) — Check label to emit correct floating group type
  6. FormItems still need to be implemented.
  7. Footnotes still need to be added to FloatingItem (captions are already handled).
  8. The <thread id="int"> token and page breaks still need to be handled. This will likely require updates to the BaseSerializer, so I would prefer to leave it out of this PR.
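For item 4, the rowstart/colstart conditions can be pictured with a small sketch. It follows the token meanings used in this PR (lcel merges with the left neighbor, ucel with the upper neighbor, xcel with both) but is not the code at the referenced lines: the function names are hypothetical, and header and empty cells (ched, rhed, corn, ecel) are ignored for brevity.

```python
def span_token(row, col, rowstart, colstart):
    """Pick the OTSL token for grid position (row, col) of a spanning cell
    whose top-left corner sits at (rowstart, colstart)."""
    if row == rowstart and col == colstart:
        return "fcel"  # origin of the cell, carries the content
    if row == rowstart:
        return "lcel"  # first row of the span, right of the origin
    if col == colstart:
        return "ucel"  # first column of the span, below the origin
    return "xcel"  # interior of a 2D span


def render_span(rowstart, colstart, rowspan, colspan):
    """Return the token grid covered by one spanning cell."""
    return [
        [span_token(r, c, rowstart, colstart) for c in range(colstart, colstart + colspan)]
        for r in range(rowstart, rowstart + rowspan)
    ]


for tokens in render_span(rowstart=1, colstart=2, rowspan=2, colspan=3):
    print(" ".join(tokens))
```

A check of this shape makes the two edge conditions explicit: a position emits ucel only when it is in the span's first column, and lcel only when it is in the span's first row.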

Testing

Dump Mode Usage

Serialize DoclingDocuments from HuggingFace datasets to IDocTags format and generate a validation report:

python examples/convert_to_idoctags.py --mode dump [--config CONFIG.json] [--limit N]

What it does:

  • Loads documents from a HuggingFace dataset (default: docling-project/doclaynet-set-a)
  • Serializes them to IDocTags in multiple variants (LLM/human-friendly, with/without content, XML-compliant)
  • Generates an Excel report (./scratch/idoctags_report.xlsx) tracking success/failure for each serialization mode
  • Optionally limits processing to first N documents with --limit

Config file:

If --config is omitted, a default config (idoctags_dump_config.json) is auto-generated. Key settings: dataset_name, dataset_subset, output_dir, report_path, limit.

Use --write-default-config to generate the config template without running the dump.
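A config with the key settings named above might look like the sketch below. Only the key names, the default dataset name, and the report path come from this description; the values for dataset_subset, output_dir, and limit are assumptions, and `build_config` is a hypothetical helper, not part of the example script.

```python
import json

DEFAULT_CONFIG = {
    "dataset_name": "docling-project/doclaynet-set-a",  # default named in this PR
    "dataset_subset": None,  # assumption: no subset selected
    "output_dir": "./scratch",  # assumption: same directory as the report
    "report_path": "./scratch/idoctags_report.xlsx",
    "limit": None,  # assumption: None means "process every document"
}


def build_config(overrides=None):
    """Merge user overrides into the defaults and reject unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}


# Print a config that limits the dump to the first 100 documents.
print(json.dumps(build_config({"limit": 100}), indent=2))
```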

Running

uv run python ./examples/convert_to_idoctags.py --mode dump --config ./idoctags_dump_config.json

produces:

Wrote report (Excel via pandas) to: scratch/idoctags_report.xlsx
Overview summary:
 - Total processed: 3544
 - Loaded DoclingDocument: 3544
 - Serialized IDocTags (human_friendly, xml_compliant=True, content=True): 3529
 - Serialized IDocTags (human_friendly, xml_compliant=True, content=False): 3544
 - Serialized IDocTags (human_friendly, xml_compliant=False, content=True): 3034
 - Serialized IDocTags (human_friendly, xml_compliant=False, content=False): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=True, content=True): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=True, content=False): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=False, content=True): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=False, content=False): 3544
 - Serialized HTML: 3541
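The summary is easier to read as failure counts. The snippet below only restates the numbers printed above for the three rows that did not reach 3544; the six remaining IDocTags variants succeeded on every document.

```python
TOTAL = 3544

# Rows from the summary that fall short of the total.
succeeded = {
    "IDocTags (human_friendly, xml_compliant=True, content=True)": 3529,
    "IDocTags (human_friendly, xml_compliant=False, content=True)": 3034,
    "HTML": 3541,
}

failures = {name: TOTAL - ok for name, ok in succeeded.items()}

for name, ok in succeeded.items():
    print(f"{name}: {failures[name]} failed, {ok / TOTAL:.2%} succeeded")
```

Both IDocTags shortfalls occur in human-friendly mode with content enabled, and the larger one (510 documents) is the variant without XML escaping.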

mergify bot commented Dec 11, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
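The title rule can be tried locally with Python's `re` module. The pattern is copied from the rule above; treating `~=` as a regular-expression search is an assumption about Mergify's semantics, and `is_conventional` is a hypothetical helper.

```python
import re

CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title):
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


for title in [
    "feat: align idoctags to iso",
    "feat(serializer)!: change token names",
    "align idoctags to iso",
]:
    print(f"{title!r}: {is_conventional(title)}")
```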

github-actions bot commented Dec 11, 2025

DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉

codecov bot commented Dec 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@PeterStaar-IBM PeterStaar-IBM marked this pull request as ready for review December 12, 2025 16:25

dosubot bot commented Dec 12, 2025

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.

@PeterStaar-IBM PeterStaar-IBM changed the title from "feat: aling idoctags to iso" to "feat: align idoctags to iso" Dec 15, 2025
@dolfim-ibm dolfim-ibm changed the title from "feat: align idoctags to iso" to "feat: idoctags serialization and deserialization matching the iso proposal" Dec 16, 2025
Comment on lines +221 to +298
| # | Category | Token | Self-Closing [Yes/No] | Parametrized [Yes/No] | Attributes | Description |
|---|----------|-------|-----------------------|-----------------------|------------|-------------|
| 1 | Root Elements | `doctag` | No | Yes | `version` | Root container; optional semantic version `version`. |
| 2 | Special Elements | `page_break` | Yes | No | — | Page delimiter. |
| 3 | | `time_break` | Yes | No | — | Temporal segment delimiter. |
| 4 | Metadata Containers | `head` | No | No | — | Document-level metadata container. |
| 5 | | `meta` | No | No | — | Component-level metadata container. |
| 6 | Geometric Tokens | `location` | Yes | Yes | `value`, `resolution?` | Geometric coordinate; `value` in [0, res]; optional `resolution`. |
| 7 | Temporal Tokens | `hour` | Yes | Yes | `value` | Hours component; `value` in [0, 99]. |
| 8 | | `minute` | Yes | Yes | `value` | Minutes component; `value` in [0, 59]. |
| 9 | | `second` | Yes | Yes | `value` | Seconds component; `value` in [0, 59]. |
| 10 | | `centisecond` | Yes | Yes | `value` | Centiseconds component; `value` in [0, 99]. |
| 11 | Semantic Tokens | `title` | No | No | — | Document or section title (content). |
| 12 | | `heading` | No | Yes | `level` | Section header; `level` (N ≥ 1). |
| 13 | | `text` | No | No | — | Generic text content. |
| 14 | | `caption` | No | No | — | Caption for floating/grouped elements. |
| 15 | | `footnote` | No | No | — | Footnote content. |
| 16 | | `page_header` | No | No | — | Page header content. |
| 17 | | `page_footer` | No | No | — | Page footer content. |
| 18 | | `watermark` | No | No | — | Watermark indicator or content. |
| 19 | | `picture` | No | No | — | Block image/graphic; at most one of `base64`/`uri`; may include `meta` for classification; `otsl` may encode chart data. |
| 20 | | `form` | No | No | — | Form structure container. |
| 21 | | `formula` | No | No | — | Mathematical expression block. |
| 22 | | `code` | No | No | — | Code block. |
| 23 | | `list_text` | No | No | — | List item content. |
| 24 | | `checkbox` | No | Yes | `selected` | Checkbox item; optional `selected` in {`true`,`false`} defaults to `false`. |
| 25 | | `form_item` | No | No | — | Form item; exactly one `key`; one or more of `value`/`checkbox`/`marker`/`hint`. |
| 26 | | `form_heading` | No | Yes | `level?` | Form header; optional `level` (N ≥ 1). |
| 27 | | `form_text` | No | No | — | Form text block. |
| 28 | | `hint` | No | No | — | Hint for a fillable field (format/example/description). |
| 29 | Grouping Tokens | `section` | No | Yes | `level` | Document section; `level` (N ≥ 1). |
| 30 | | `list` | No | Yes | `ordered` | List container; optional `ordered` in {`true`,`false`} defaults to `false`. |
| 31 | | `group` | No | Yes | `type?` | Generic group; no `location` tokens; associates composite content (e.g., captions/footnotes). |
| 32 | | `floating_group` | No | Yes | `class` in {`table`,`picture`,`form`,`code`} | Floating container that groups a floating component with its caption, footnotes, and metadata; no `location` tokens. |
| 33 | Formatting Tokens | `bold` | No | No | — | Bold text. |
| 34 | | `italic` | No | No | — | Italic text. |
| 35 | | `strikethrough` | No | No | — | Strike-through text. |
| 36 | | `superscript` | No | No | — | Superscript text. |
| 37 | | `subscript` | No | No | — | Subscript text. |
| 38 | | `rtl` | No | No | — | Right-to-left text direction. |
| 39 | | `inline` | No | Yes | `class` in {`formula`,`code`,`picture`} | Inline content; if `class="picture"`, may include one of `base64` or `uri`. |
| 40 | | `br` | Yes | No | — | Line break. |
| 41 | Structural Tokens (OTSL) | `otsl` | No | No | — | Table structure container. |
| 42 | | `fcel` | Yes | No | — | New cell with content. |
| 43 | | `ecel` | Yes | No | — | New cell without content. |
| 44 | | `ched` | Yes | No | — | Column header cell. |
| 45 | | `rhed` | Yes | No | — | Row header cell. |
| 46 | | `corn` | Yes | No | — | Corner header cell. |
| 47 | | `srow` | Yes | No | — | Section row separator cell. |
| 48 | | `lcel` | Yes | No | — | Merge with left neighbor (horizontal span). |
| 49 | | `ucel` | Yes | No | — | Merge with upper neighbor (vertical span). |
| 50 | | `xcel` | Yes | No | — | Merge with left and upper neighbors (2D span). |
| 51 | | `nl` | Yes | No | — | New line (row separator). |
| 52 | Continuation Tokens | `thread` | Yes | Yes | `id` | Continuation marker for split content; reuse same `id` across parts. |
| 53 | | `h_thread` | Yes | Yes | `id` | Horizontal stitching marker for split tables; reuse same `id`. |
| 54 | Binary Data Tokens | `base64` | No | No | — | Embedded binary data (base64). |
| 55 | | `uri` | No | No | — | External resource reference. |
| 56 | Content Tokens | `marker` | No | No | — | List/form marker content. |
| 57 | | `facets` | No | No | — | Container for application-specific derived properties. |
| 58 | Structural Tokens (Form) | `key` | No | No | — | Form item key (child of `form_item`). |
| 59 | | `value` | No | No | — | Form item value (child of `form_item`). |
"""
Member

I would remove the table from the code itself.

Later, one can just add a link/reference to the actual standard table.

Member Author

I will update with the latest new table, but it gives a lot of good context for the code generators.

tbl_provs = self._extract_provenance(doc=doc, el=otsl_el)
inner = self._inner_xml(otsl_el)
# Remove any location tokens from the OTSL content before parsing
inner = re.sub(r"<\s*location\b[^>]*/\s*>", "", inner)
Member

Why use regex when we already have the parsed content in el?

Applies to other regex usage occurrences too.
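One way to read this suggestion is to drop the location elements from the parsed tree instead of from the serialized string. The sketch below assumes an `xml.dom.minidom` tree (suggested by the `tagName` usage elsewhere in the diff); `strip_location_children` is a hypothetical helper, not a method from the PR, and the sample markup is invented for the example.

```python
from xml.dom import minidom


def strip_location_children(el):
    """Remove every <location> descendant of `el` in place and return `el`."""
    # Copy the node list first, since the tree is mutated while iterating.
    for node in list(el.getElementsByTagName("location")):
        node.parentNode.removeChild(node)
    return el


doc = minidom.parseString(
    '<otsl><location value="1"/><location value="2"/><fcel/>A<nl/></otsl>'
)
otsl_el = strip_location_children(doc.documentElement)

# Serialize the remaining children, comparable to an inner-XML helper.
inner = "".join(child.toxml() for child in otsl_el.childNodes)
print(inner)
```

Working on the tree avoids edge cases a pattern can miss, such as unusual whitespace or attribute values that contain a `>` character.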

)
for p in prov_list[1:]:
    item.prov.append(p)
return
Member

Using a single explicit if ... elif ... else would be more readable than the various if ... returns.

nm_child = node.tagName
if nm_child == IDocTagsToken.FACETS.value:
    facets_text = self._get_text(node).strip()
    if "=" in facets_text:
Member

Since we have the parsed elements we shouldn't have to manually parse text again.

"""IDocTagsVocabulary."""

# Allowed attributes per token (defined outside the Enum to satisfy mypy)
ALLOWED_ATTRIBUTES: ClassVar[dict[IDocTagsToken, set["IDocTagsAttributeKey"]]] = {
Member

Wondering if something more Pydantic-like (e.g. pydantic-xml) could help better organize the various validation rules.

If we keep these low-level rules for now, at least I would make them internal e.g. _ALLOWED_ATTRIBUTES.
