feat: idoctags serialization and deserialization matching the iso proposal #457
Signed-off-by: Peter Staar <[email protected]>
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates: This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: This rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
✅ DCO Check Passed: Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉
Codecov Report: ✅ All modified and coverable lines are covered by tests.
…cTags Signed-off-by: Peter Staar <[email protected]>
| # | Category | Token | Self-Closing [Yes/No] | Parametrized [Yes/No] | Attributes | Description |
|---|----------|-------|-----------------------|-----------------------|------------|-------------|
| 1 | Root Elements | `doctag` | No | Yes | `version` | Root container; optional semantic version `version`. |
| 2 | Special Elements | `page_break` | Yes | No | — | Page delimiter. |
| 3 | | `time_break` | Yes | No | — | Temporal segment delimiter. |
| 4 | Metadata Containers | `head` | No | No | — | Document-level metadata container. |
| 5 | | `meta` | No | No | — | Component-level metadata container. |
| 6 | Geometric Tokens | `location` | Yes | Yes | `value`, `resolution?` | Geometric coordinate; `value` in [0, res]; optional `resolution`. |
| 7 | Temporal Tokens | `hour` | Yes | Yes | `value` | Hours component; `value` in [0, 99]. |
| 8 | | `minute` | Yes | Yes | `value` | Minutes component; `value` in [0, 59]. |
| 9 | | `second` | Yes | Yes | `value` | Seconds component; `value` in [0, 59]. |
| 10 | | `centisecond` | Yes | Yes | `value` | Centiseconds component; `value` in [0, 99]. |
| 11 | Semantic Tokens | `title` | No | No | — | Document or section title (content). |
| 12 | | `heading` | No | Yes | `level` | Section header; `level` (N ≥ 1). |
| 13 | | `text` | No | No | — | Generic text content. |
| 14 | | `caption` | No | No | — | Caption for floating/grouped elements. |
| 15 | | `footnote` | No | No | — | Footnote content. |
| 16 | | `page_header` | No | No | — | Page header content. |
| 17 | | `page_footer` | No | No | — | Page footer content. |
| 18 | | `watermark` | No | No | — | Watermark indicator or content. |
| 19 | | `picture` | No | No | — | Block image/graphic; at most one of `base64`/`uri`; may include `meta` for classification; `otsl` may encode chart data. |
| 20 | | `form` | No | No | — | Form structure container. |
| 21 | | `formula` | No | No | — | Mathematical expression block. |
| 22 | | `code` | No | No | — | Code block. |
| 23 | | `list_text` | No | No | — | List item content. |
| 24 | | `checkbox` | No | Yes | `selected` | Checkbox item; optional `selected` in {`true`,`false`} defaults to `false`. |
| 25 | | `form_item` | No | No | — | Form item; exactly one `key`; one or more of `value`/`checkbox`/`marker`/`hint`. |
| 26 | | `form_heading` | No | Yes | `level?` | Form header; optional `level` (N ≥ 1). |
| 27 | | `form_text` | No | No | — | Form text block. |
| 28 | | `hint` | No | No | — | Hint for a fillable field (format/example/description). |
| 29 | Grouping Tokens | `section` | No | Yes | `level` | Document section; `level` (N ≥ 1). |
| 30 | | `list` | No | Yes | `ordered` | List container; optional `ordered` in {`true`,`false`} defaults to `false`. |
| 31 | | `group` | No | Yes | `type?` | Generic group; no `location` tokens; associates composite content (e.g., captions/footnotes). |
| 32 | | `floating_group` | No | Yes | `class` in {`table`,`picture`,`form`,`code`} | Floating container that groups a floating component with its caption, footnotes, and metadata; no `location` tokens. |
| 33 | Formatting Tokens | `bold` | No | No | — | Bold text. |
| 34 | | `italic` | No | No | — | Italic text. |
| 35 | | `strikethrough` | No | No | — | Strike-through text. |
| 36 | | `superscript` | No | No | — | Superscript text. |
| 37 | | `subscript` | No | No | — | Subscript text. |
| 38 | | `rtl` | No | No | — | Right-to-left text direction. |
| 39 | | `inline` | No | Yes | `class` in {`formula`,`code`,`picture`} | Inline content; if `class="picture"`, may include one of `base64` or `uri`. |
| 40 | | `br` | Yes | No | — | Line break. |
| 41 | Structural Tokens (OTSL) | `otsl` | No | No | — | Table structure container. |
| 42 | | `fcel` | Yes | No | — | New cell with content. |
| 43 | | `ecel` | Yes | No | — | New cell without content. |
| 44 | | `ched` | Yes | No | — | Column header cell. |
| 45 | | `rhed` | Yes | No | — | Row header cell. |
| 46 | | `corn` | Yes | No | — | Corner header cell. |
| 47 | | `srow` | Yes | No | — | Section row separator cell. |
| 48 | | `lcel` | Yes | No | — | Merge with left neighbor (horizontal span). |
| 49 | | `ucel` | Yes | No | — | Merge with upper neighbor (vertical span). |
| 50 | | `xcel` | Yes | No | — | Merge with left and upper neighbors (2D span). |
| 51 | | `nl` | Yes | No | — | New line (row separator). |
| 52 | Continuation Tokens | `thread` | Yes | Yes | `id` | Continuation marker for split content; reuse same `id` across parts. |
| 53 | | `h_thread` | Yes | Yes | `id` | Horizontal stitching marker for split tables; reuse same `id`. |
| 54 | Binary Data Tokens | `base64` | No | No | — | Embedded binary data (base64). |
| 55 | | `uri` | No | No | — | External resource reference. |
| 56 | Content Tokens | `marker` | No | No | — | List/form marker content. |
| 57 | | `facets` | No | No | — | Container for application-specific derived properties. |
| 58 | Structural Tokens (Form) | `key` | No | No | — | Form item key (child of `form_item`). |
| 59 | | `value` | No | No | — | Form item value (child of `form_item`). |
"""
I would remove the table from the code itself.
Later, one can just add a link/reference to the actual standard table.
I will update with the latest new table, but it gives a lot of good context for the code generators.
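As a reading aid for the OTSL rows of the table (`fcel`, `ecel`, `ched`, `rhed`, `corn`, `srow`, `lcel`, `ucel`, `xcel`, `nl`), here is a minimal sketch of how the merge tokens could be turned into row and column spans. It works on a plain list of token names rather than on XML, and the function names `split_rows` and `cell_spans` are hypothetical; this is one interpretation of the table descriptions, not the parser from this PR.

```python
# Tokens that start a new cell, per the table descriptions.
ORIGIN_TOKENS = {"fcel", "ecel", "ched", "rhed", "corn", "srow"}


def split_rows(tokens):
    """Split a flat OTSL token list into rows at each 'nl' token."""
    rows, current = [], []
    for tok in tokens:
        if tok == "nl":
            rows.append(current)
            current = []
        else:
            current.append(tok)
    if current:  # tolerate a missing trailing 'nl'
        rows.append(current)
    return rows


def cell_spans(tokens):
    """Return (row, col, rowspan, colspan, token) for every originating cell.

    'lcel' to the right of a cell extends its column span, 'ucel' below a
    cell extends its row span; 'xcel' fills the interior of a 2D span and
    needs no handling of its own here.
    """
    grid = split_rows(tokens)
    spans = []
    for r, row in enumerate(grid):
        for c, tok in enumerate(row):
            if tok not in ORIGIN_TOKENS:
                continue
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "lcel":
                colspan += 1
            rowspan = 1
            while (
                r + rowspan < len(grid)
                and c < len(grid[r + rowspan])
                and grid[r + rowspan][c] == "ucel"
            ):
                rowspan += 1
            spans.append((r, c, rowspan, colspan, tok))
    return spans


# A header spanning two columns, followed by one row of two content cells.
print(cell_spans(["ched", "lcel", "nl", "fcel", "fcel", "nl"]))
```

Keeping the span logic separate from XML handling means it can be exercised with plain lists, independent of how the tokens are serialized.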
tbl_provs = self._extract_provenance(doc=doc, el=otsl_el)
inner = self._inner_xml(otsl_el)
# Remove any location tokens from the OTSL content before parsing
inner = re.sub(r"<\s*location\b[^>]*/\s*>", "", inner)
Why use regex when we already have the parsed content in el?
Applies to other regex usage occurrences too.
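To illustrate the reviewer's point, the following sketch drops `location` children through the DOM instead of running a regex over the serialized string. It assumes the element is an `xml.dom.minidom` node (the `node.tagName` usage elsewhere in the diff suggests a DOM API, but that is an assumption), and the helper name `inner_xml_without_locations` is hypothetical.

```python
from xml.dom import minidom


def inner_xml_without_locations(element):
    """Remove all <location/> descendants, then serialize the remaining children."""
    # Copy the result into a list so removal does not disturb iteration.
    for loc in list(element.getElementsByTagName("location")):
        loc.parentNode.removeChild(loc)
    return "".join(child.toxml() for child in element.childNodes)


doc = minidom.parseString(
    '<otsl><location value="10"/><location value="20"/>'
    "<fcel/>A<fcel/>B<nl/></otsl>"
)
print(inner_xml_without_locations(doc.documentElement))
```

Working on the parsed tree avoids having to account for whitespace and attribute-order variations that a regex over the raw text must tolerate.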
)
for p in prov_list[1:]:
    item.prov.append(p)
return
Using a single explicit if ... elif ... else would be more readable than the various if ... returns.
nm_child = node.tagName
if nm_child == IDocTagsToken.FACETS.value:
    facets_text = self._get_text(node).strip()
    if "=" in facets_text:
Since we have the parsed elements we shouldn't have to manually parse text again.
"""IDocTagsVocabulary."""

# Allowed attributes per token (defined outside the Enum to satisfy mypy)
ALLOWED_ATTRIBUTES: ClassVar[dict[IDocTagsToken, set["IDocTagsAttributeKey"]]] = {
Wondering if something more Pydantic-like (e.g. pydantic-xml) could help better organize the various validation rules.
If we keep these low-level rules for now, at least I would make them internal e.g. _ALLOWED_ATTRIBUTES.
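A minimal sketch of the second suggestion, keeping the low-level rules but making them module-internal behind a small checking function. The enum `Token`, the mapping contents, and the helper `unexpected_attributes` are hypothetical stand-ins; only the attribute names are taken from the token table, and the real `IDocTagsToken` and `IDocTagsAttributeKey` types in the PR may look different.

```python
from enum import Enum


class Token(str, Enum):
    """Hypothetical subset of the token vocabulary."""

    HEADING = "heading"
    LOCATION = "location"
    CHECKBOX = "checkbox"
    LIST = "list"
    TEXT = "text"


# Internal by convention (leading underscore); callers go through the helper.
_ALLOWED_ATTRIBUTES = {
    Token.HEADING: frozenset({"level"}),
    Token.LOCATION: frozenset({"value", "resolution"}),
    Token.CHECKBOX: frozenset({"selected"}),
    Token.LIST: frozenset({"ordered"}),
}


def unexpected_attributes(token, attrs):
    """Return the sorted attribute names that are not allowed on `token`."""
    allowed = _ALLOWED_ATTRIBUTES.get(token, frozenset())
    return sorted(set(attrs) - allowed)


print(unexpected_attributes(Token.HEADING, {"level": "2"}))
print(unexpected_attributes(Token.TEXT, {"level": "1"}))
```

Tokens missing from the mapping are treated as accepting no attributes, which matches the rows of the table that list none.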
IDocTags Serialization Implementation
Overview
Implements bidirectional serialization between DoclingDocument and the IDocTags format, an XML-based markup language for structured document representation with geometric and semantic annotations.
Serialization Features
Core Capabilities:
Current Test Coverage
Outstanding Work (FIXMEs)
<thread id="int"> and page-breaks. This will likely require updates to the BaseSerializer, so I prefer not to include it in this PR.
Testing
Dump Mode Usage
Serialize DoclingDocuments from HuggingFace datasets to IDocTags format and generate a validation report:
What it does:
Config file:
If --config is omitted, a default config (idoctags_dump_config.json) is auto-generated. Key settings: dataset_name, dataset_subset, output_dir, report_path, limit.
Use --write-default-config to generate the config template without running the dump.
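The config behaviour described above can be sketched roughly as follows. Only the key names and the default file name `idoctags_dump_config.json` come from the description; every value in `DEFAULT_CONFIG` is a placeholder, and the function `load_or_create_config` is hypothetical rather than the script's actual code.

```python
import json
from pathlib import Path

# Key names from the PR description; all values are illustrative placeholders.
DEFAULT_CONFIG = {
    "dataset_name": "<dataset-name>",
    "dataset_subset": "<dataset-subset>",
    "output_dir": "idoctags_dump",
    "report_path": "idoctags_report.json",
    "limit": 10,
}


def load_or_create_config(path="idoctags_dump_config.json"):
    """Load the dump config, writing a default template first if none exists."""
    config_path = Path(path)
    if not config_path.exists():
        config_path.write_text(json.dumps(DEFAULT_CONFIG, indent=2))
    loaded = json.loads(config_path.read_text())
    # Keys missing from the file fall back to the defaults.
    return {**DEFAULT_CONFIG, **loaded}
```

Merging the loaded file over the defaults lets a user override a single setting such as `limit` without restating the rest.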
The result of,
is