
fix(gmail): preserve HTML body when text/html part has Content-ID (Outlook)#683

Open
malob wants to merge 1 commit into googleworkspace:main from malob:fix/html-body-content-id


malob commented Apr 7, 2026

Description

Fixes a bug where +reply, +reply-all, and +forward silently drop the HTML body from Outlook/Exchange messages, falling back to a plain-text conversion that loses all formatting, nested blockquotes, and inline images.

The problem

Outlook/Exchange wraps messages in multipart/related so the HTML body can reference signature images via cid: URLs. As part of this structure, Outlook adds a Content-ID header to the text/html body part itself:

multipart/related
├── multipart/alternative
│   ├── text/plain     (no Content-ID)
│   └── text/html      (Content-ID: <ABC@exchange.example.com>)  ← dropped
├── image/png          (Content-ID, attachmentId)  ← signature images
└── ...

The MIME payload walker's is_body_text_part classification required content_id_header.is_none(), so Outlook HTML bodies were silently skipped. body_html ended up None, and resolve_html_body fell back to converting the plain text with <br> tags.
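
The fallback path can be pictured with a small sketch. This is not the project's code: `resolve_html_body_sketch` and `escape_html` are hypothetical names, and the exact escaping and line-break handling in the real `resolve_html_body` may differ. It only illustrates why a missing `body_html` yields flat `<br>`-separated text.

```rust
/// Escape the characters that are significant in HTML text content.
fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Hypothetical stand-in for `resolve_html_body`: prefer the original HTML,
/// otherwise convert the plain text by turning line breaks into `<br>`.
fn resolve_html_body_sketch(body_html: Option<&str>, body_text: &str) -> String {
    match body_html {
        // The HTML part was found: keep it untouched, nesting and styling included.
        Some(html) => html.to_string(),
        // No HTML part: all structure is reduced to escaped text plus <br> tags.
        None => escape_html(body_text)
            .replace("\r\n", "\n")
            .replace('\n', "<br>\n"),
    }
}
```

With `body_html` set to `None`, a quoted line such as `> quoted` comes back as escaped text followed by `<br>`, which matches the flat output described above.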

How we found it

Replied to an Outlook/Exchange email using +reply. The quoted content in the sent message was a flat wall of <br>-separated text instead of the original's nested HTML blockquotes with styling. Inspected the original message via the Gmail API and found the Content-ID header on the text/html part.

The fix

Replace the content_id_header.is_none() condition with an explicit MIME type allowlist (text/plain or text/html). Body parts are now identified by mime type + inline body.data + no attachmentId + no filename, regardless of Content-ID.

This is a tighter predicate than the old code: previously, any text/* subtype without Content-ID (e.g. text/calendar) would have passed the outer is_body_text_part check, though the inner branch already only handled text/plain and text/html. The new check makes the outer condition honest about what the inner branch requires.
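
To make the change in the predicate concrete, here is a minimal sketch reconstructed from the description above. `LeafPart` and both function bodies are assumptions for illustration; the real `is_body_text_part` check operates on the Gmail API payload and may be structured differently.

```rust
/// Simplified, hypothetical view of one leaf part of a message payload.
#[derive(Debug, Default, Clone)]
struct LeafPart {
    mime_type: String,
    content_id: Option<String>,
    attachment_id: Option<String>,
    filename: Option<String>,
    inline_data: Option<String>,
}

/// Conditions shared by the old and new rule: inline data, no attachmentId, no filename.
fn looks_inline(part: &LeafPart) -> bool {
    part.inline_data.is_some()
        && part.attachment_id.is_none()
        && part.filename.as_deref().map_or(true, str::is_empty)
}

/// Old rule as described: any text/* leaf qualifies, but only without a Content-ID.
fn is_body_text_part_old(part: &LeafPart) -> bool {
    part.mime_type.starts_with("text/") && part.content_id.is_none() && looks_inline(part)
}

/// New rule as described: explicit allowlist, and Content-ID is ignored.
fn is_body_text_part_new(part: &LeafPart) -> bool {
    let allowed = part.mime_type.eq_ignore_ascii_case("text/plain")
        || part.mime_type.eq_ignore_ascii_case("text/html");
    allowed && looks_inline(part)
}
```

Under these assumptions, an Outlook `text/html` part carrying a Content-ID fails the old check and passes the new one, while a `text/calendar` part without a Content-ID does the opposite.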

Additional improvement: warning for dropped inline-data leaves

Non-text, non-hydratable leaf parts with inline body.data (e.g. an application/json part without an attachmentId) were previously dropped with no diagnostic. A warning is now logged to stderr with the MIME type, filename, and size, making silent data loss debuggable.
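
A rough sketch of such a diagnostic follows. The function names and the message wording are hypothetical; only the three reported fields (MIME type, filename, size) and the use of stderr come from the description above.

```rust
/// Build the diagnostic text for a leaf whose inline data is about to be dropped.
/// Keeping this separate from the printing makes the message easy to unit test.
fn dropped_leaf_warning(mime_type: &str, filename: &str, size_bytes: usize) -> String {
    // An empty filename is reported explicitly rather than as a blank field.
    let name = if filename.is_empty() { "(none)" } else { filename };
    format!(
        "warning: dropping inline part (mime type: {mime_type}, filename: {name}, size: {size_bytes} bytes)"
    )
}

/// Emit the diagnostic on stderr so that stdout output stays machine-readable.
fn warn_dropped_leaf(mime_type: &str, filename: &str, size_bytes: usize) {
    eprintln!("{}", dropped_leaf_warning(mime_type, filename, size_bytes));
}
```

For example, `warn_dropped_leaf("application/json", "", 26)` would write a single line to stderr naming the type, a placeholder for the missing filename, and the size.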

Verified with real Outlook email

Created forward drafts of the same Outlook message using the installed (broken) and fixed builds, then compared the raw HTML:

| | Blockquotes | HTML size | Result |
|---|---|---|---|
| Before fix | 0 | 5.9 KB | Flat <br> text, all nesting lost |
| After fix | 6 | 20 KB | Full HTML with nested blockquotes and styling |

Test coverage

704 total tests (3 new). New tests:

  • test_extract_payload_contents_plain_text_with_content_id: text/plain part with Content-ID is still recognized as body text (documents the behavioral change for both MIME types, not just text/html)
  • test_parse_original_message_html_with_content_id_end_to_end: End-to-end regression test through parse_original_message → resolve_html_body with a realistic Outlook MIME structure. Verifies HTML body is preserved, resolve_html_body returns actual HTML (not <br>-converted fallback), and inline signature images are still collected as parts.
  • test_extract_payload_contents_multiple_html_leaves_first_wins: Documents the current DFS first-wins heuristic when multiple text/html parts are eligible, without claiming it is semantically ideal (full multipart/related start= parameter support would be needed for that).
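
The first-wins behavior that the last test documents can be sketched as a plain depth-first search. The `Part` type and the helper names below are hypothetical and much simpler than the real payload walker; the sketch only shows why the HTML leaf nested under multipart/alternative is picked before a later sibling.

```rust
/// Hypothetical, simplified MIME tree node: either a leaf with data or a container.
#[derive(Debug, Default)]
struct Part {
    mime_type: String,
    data: Option<String>,
    parts: Vec<Part>,
}

/// Convenience constructor for a leaf part.
fn leaf(mime_type: &str, data: &str) -> Part {
    Part { mime_type: mime_type.to_string(), data: Some(data.to_string()), parts: Vec::new() }
}

/// Convenience constructor for a multipart container.
fn multipart(mime_type: &str, parts: Vec<Part>) -> Part {
    Part { mime_type: mime_type.to_string(), data: None, parts }
}

/// Depth-first, first-wins search for a text/html leaf that carries data.
fn first_html(part: &Part) -> Option<&str> {
    if part.parts.is_empty() {
        if part.mime_type == "text/html" {
            return part.data.as_deref();
        }
        return None;
    }
    // Children are visited in document order; the first hit ends the search.
    part.parts.iter().find_map(|child| first_html(child))
}
```

A tree with one HTML leaf inside multipart/alternative and a second HTML leaf as a later sibling returns the nested one, since it is reached first in document order.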

Known limitations (pre-existing, not introduced by this fix)

  • DFS first-wins ordering for multiple eligible HTML leaves. The walker takes the first text/html part it encounters in depth-first order rather than honoring multipart/related's start= parameter for root-part resolution. In practice this is correct for all standard email client MIME structures — the first text/html in multipart/alternative is always the intended body. The start= parameter is rarely used in email (it's more common in SOAP/MTOM). Documented and tested as a heuristic, not a specification guarantee.

  • decode_text_body assumes UTF-8. Non-UTF-8 text parts (ISO-8859-1, Windows-1252, Shift_JIS) are warned and dropped, falling back to plain text or snippet. The Gmail API does not normalize charset — the raw encoding from the sending MTA is preserved. A proper fix would read the charset parameter from the Content-Type header and transcode via a crate like encoding_rs. This primarily affects messages from older mail systems; most modern clients send UTF-8.

  • Non-body inline-data leaves trigger a warning and are dropped. Parts like text/calendar or application/json with inline body.data but no attachmentId are logged to stderr but not preserved as attachments. The OriginalPart pipeline assumes attachmentId for later binary hydration, so treating inline-data parts as attachments would require a different fetch path. In practice, the Gmail API almost always externalizes binary data behind attachmentId: we tested with a 26-byte file via messages.insert and it still got an attachmentId.
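
For the charset limitation, a charset-aware decode could look roughly like the following standard-library-only sketch. `charset_of` and `decode_text_body_sketch` are hypothetical names, not the project's `decode_text_body`. Only UTF-8 and strict ISO-8859-1 are handled here; a real fix would more likely delegate to a crate such as encoding_rs, as noted above. Treating a missing charset as UTF-8 is an assumption that mirrors the current behavior.

```rust
/// Extract the `charset` parameter from a Content-Type header value, lowercased.
fn charset_of(content_type: &str) -> Option<String> {
    // Skip the media type itself and scan the `name=value` parameters.
    content_type.split(';').skip(1).find_map(|param| {
        let (name, value) = param.split_once('=')?;
        if name.trim().eq_ignore_ascii_case("charset") {
            Some(value.trim().trim_matches('"').to_ascii_lowercase())
        } else {
            None
        }
    })
}

/// Decode a text body according to its declared charset.
/// Returns None for invalid input or for charsets this sketch does not support.
fn decode_text_body_sketch(bytes: &[u8], content_type: &str) -> Option<String> {
    let charset = charset_of(content_type).unwrap_or_else(|| "utf-8".to_string());
    match charset.as_str() {
        "utf-8" | "utf8" | "us-ascii" => String::from_utf8(bytes.to_vec()).ok(),
        // ISO-8859-1 maps each byte to the Unicode code point of the same value.
        "iso-8859-1" | "latin1" => Some(bytes.iter().map(|&b| b as char).collect()),
        // Windows-1252, Shift_JIS and others need real transcoding tables.
        _ => None,
    }
}
```

With this shape, a Latin-1 body that would be rejected as invalid UTF-8 decodes correctly, while unsupported charsets still fall through to the existing plain-text or snippet fallback.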

Checklist:

  • My code follows the AGENTS.md guidelines (no generated google-* crates).
  • I have run cargo fmt --all to format the code perfectly.
  • I have run cargo clippy -- -D warnings and resolved all warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have provided a Changeset file (e.g. via pnpx changeset) to document my changes.

Outlook/Exchange adds Content-ID to the text/html body part for
multipart/related referencing. The MIME walker's is_body_text_part
check required content_id_header.is_none(), causing Outlook HTML
bodies to be silently skipped. Replies fell back to plain-text
conversion, losing all formatting, nested blockquotes, and images.

Fix: replace the Content-ID exclusion with a positive MIME type
allowlist (text/plain or text/html). Body parts are identified by
mime type + inline data + no attachmentId + no filename, regardless
of Content-ID.

Also add a diagnostic warning when non-text, non-hydratable leaf
parts with inline data are dropped, making previously silent data
loss visible on stderr.
@changeset-bot

changeset-bot bot commented Apr 7, 2026

🦋 Changeset detected

Latest commit: 0253b04

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
|---|---|
| @googleworkspace/cli | Patch |


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug where HTML email bodies from Outlook/Exchange were being incorrectly discarded due to the presence of a Content-ID header. By refining the MIME part classification logic to focus on MIME types rather than the absence of Content-ID, the fix ensures that rich HTML content is preserved during reply and forward operations. Additionally, the change introduces diagnostic logging for dropped parts to improve future maintainability and debugging.

Highlights

  • MIME Walker Logic Update: Updated the MIME walker to explicitly allow text/plain and text/html parts as body content, even when they contain a Content-ID header, preventing Outlook/Exchange emails from losing HTML formatting.
  • Improved Debugging: Added a warning log to stderr for non-text, non-hydratable leaf parts that are dropped during processing, making silent data loss easier to diagnose.
  • Expanded Test Coverage: Added three new tests to verify correct handling of Content-ID in text parts, end-to-end HTML preservation, and DFS traversal behavior for multiple HTML parts.

googleworkspace-bot added the "area: core" label (Core CLI parsing, commands, error handling, utilities) on Apr 7, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes an issue where Gmail HTML bodies were incorrectly dropped if they contained a Content-ID header, a common occurrence in messages from Outlook or Exchange. The logic in extract_payload_recursive was updated to include text/plain and text/html parts as body content regardless of the presence of a Content-ID. Additionally, a warning is now logged when skipping unrecognized inline parts to aid in debugging, and several regression tests were added to verify the fix and document the behavior of the MIME walker. I have no feedback to provide.
