fix(gmail): preserve HTML body when text/html part has Content-ID (Outlook) #683
malob wants to merge 1 commit into googleworkspace:main
Outlook/Exchange adds Content-ID to the text/html body part for multipart/related referencing. The MIME walker's is_body_text_part check required content_id_header.is_none(), causing Outlook HTML bodies to be silently skipped. Replies fell back to plain-text conversion, losing all formatting, nested blockquotes, and images.

Fix: replace the Content-ID exclusion with a positive MIME type allowlist (text/plain or text/html). Body parts are identified by MIME type + inline data + no attachmentId + no filename, regardless of Content-ID.

Also add a diagnostic warning when non-text, non-hydratable leaf parts with inline data are dropped, making previously silent data loss visible on stderr.
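To make the allowlist predicate concrete, here is a minimal sketch of the check described in the commit message. The `PartView` struct and its field names are hypothetical stand-ins for illustration, not the project's actual types, and treating an empty filename as "no filename" is an assumption.

```rust
/// Hypothetical, simplified view of a leaf MIME part (not the project's real type).
#[allow(dead_code)]
#[derive(Debug, Default)]
struct PartView<'a> {
    mime_type: &'a str,
    filename: Option<&'a str>,
    attachment_id: Option<&'a str>,
    /// Inline `body.data`, if the part carries any.
    inline_data: Option<&'a str>,
    /// Present only to show that the predicate deliberately ignores it.
    content_id: Option<&'a str>,
}

/// Body text parts: allowlisted MIME type + inline data + no attachmentId
/// + no filename. Content-ID plays no role in the decision.
fn is_body_text_part(part: &PartView) -> bool {
    let allowed_type = part.mime_type.eq_ignore_ascii_case("text/plain")
        || part.mime_type.eq_ignore_ascii_case("text/html");
    // Assumption: an empty filename string counts as "no filename".
    let has_filename = part.filename.map_or(false, |f| !f.is_empty());

    allowed_type
        && part.inline_data.is_some()
        && part.attachment_id.is_none()
        && !has_filename
}
```

Under this sketch, an Outlook-style text/html part that carries a Content-ID still qualifies as body text, while a text/calendar part or a named text/plain attachment does not.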
🦋 Changeset detected

Latest commit: 0253b04

The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
Summary of Changes

This pull request addresses a bug where HTML email bodies from Outlook/Exchange were being incorrectly discarded due to the presence of a Content-ID header. By refining the MIME part classification logic to focus on MIME types rather than the absence of Content-ID, the fix ensures that rich HTML content is preserved during reply and forward operations. Additionally, the change introduces diagnostic logging for dropped parts to improve future maintainability and debugging.
Code Review
This pull request fixes an issue where Gmail HTML bodies were incorrectly dropped if they contained a Content-ID header, a common occurrence in messages from Outlook or Exchange. The logic in extract_payload_recursive was updated to include text/plain and text/html parts as body content regardless of the presence of a Content-ID. Additionally, a warning is now logged when skipping unrecognized inline parts to aid in debugging, and several regression tests were added to verify the fix and document the behavior of the MIME walker. I have no feedback to provide.
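For readers unfamiliar with the walker, the following is a rough sketch of a depth-first traversal that keeps the first eligible text/plain and text/html leaves and warns about other inline-data leaves it drops. The `Part` and `Extracted` types and the `walk` function are hypothetical simplifications, not the project's extract_payload_recursive; in particular, `data` is treated as already-decoded text.

```rust
/// Hypothetical, simplified MIME part tree (not the Gmail API schema verbatim).
#[derive(Debug, Default)]
struct Part {
    mime_type: String,
    data: Option<String>, // inline body data, assumed already decoded
    attachment_id: Option<String>,
    filename: String,
    parts: Vec<Part>,
}

#[derive(Debug, Default)]
struct Extracted {
    body_text: Option<String>,
    body_html: Option<String>,
    warnings: Vec<String>,
}

/// Depth-first walk: the first eligible leaf of each body type wins.
fn walk(part: &Part, out: &mut Extracted) {
    if !part.parts.is_empty() {
        for child in &part.parts {
            walk(child, out);
        }
        return;
    }

    let is_body = matches!(part.mime_type.as_str(), "text/plain" | "text/html")
        && part.data.is_some()
        && part.attachment_id.is_none()
        && part.filename.is_empty();

    if is_body {
        let slot = if part.mime_type == "text/html" {
            &mut out.body_html
        } else {
            &mut out.body_text
        };
        // Later eligible leaves are ignored once the slot is filled.
        if slot.is_none() {
            *slot = part.data.clone();
        }
    } else if part.attachment_id.is_none() {
        // Inline data with no attachmentId cannot be hydrated later, so
        // report it instead of dropping it silently.
        if let Some(data) = &part.data {
            let msg = format!(
                "dropping inline part: mime={} filename={:?} size={}",
                part.mime_type,
                part.filename,
                data.len()
            );
            eprintln!("warning: {}", msg);
            out.warnings.push(msg);
        }
    }
}
```

Collecting the warnings in `Extracted` as well as printing them is a choice made here so the behavior is easy to assert on in a test; the PR itself only states that the warning goes to stderr.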
Description
Fixes a bug where +reply, +reply-all, and +forward silently drop the HTML body from Outlook/Exchange messages, falling back to a plain-text conversion that loses all formatting, nested blockquotes, and inline images.

The problem
Outlook/Exchange wraps messages in multipart/related so the HTML body can reference signature images via cid: URLs. As part of this structure, Outlook adds a Content-ID header to the text/html body part itself.

The MIME payload walker's is_body_text_part classification required content_id_header.is_none(), so Outlook HTML bodies were silently skipped. body_html ended up None, and resolve_html_body fell back to converting the plain text with <br> tags.

How we found it
Replied to an Outlook/Exchange email using +reply. The quoted content in the sent message was a flat wall of <br>-separated text instead of the original's nested HTML blockquotes with styling. Inspected the original message via the Gmail API and found the Content-ID header on the text/html part.

The fix
Replace the content_id_header.is_none() condition with an explicit MIME type allowlist (text/plain or text/html). Body parts are now identified by MIME type + inline body.data + no attachmentId + no filename, regardless of Content-ID.

This is a tighter predicate than the old code: previously, any text/* subtype without Content-ID (e.g. text/calendar) would have passed the outer is_body_text_part check, though the inner branch already only handled text/plain and text/html. The new check makes the outer condition honest about what the inner branch requires.

Additional improvement: warning for dropped inline-data leaves
Non-text, non-hydratable leaf parts with inline body.data (e.g. an application/json part without an attachmentId) were previously dropped with no diagnostic. A warning is now logged to stderr with the MIME type, filename, and size, making silent data loss debuggable.

Verified with real Outlook email
Created forward drafts of the same Outlook message using the installed (broken) and fixed builds, then compared the raw HTML. The installed build produced <br> text, all nesting lost.

Test coverage
704 total tests (3 new). New tests:

- test_extract_payload_contents_plain_text_with_content_id: text/plain part with Content-ID is still recognized as body text (documents the behavioral change for both MIME types, not just text/html).
- test_parse_original_message_html_with_content_id_end_to_end: End-to-end regression test through parse_original_message → resolve_html_body with a realistic Outlook MIME structure. Verifies HTML body is preserved, resolve_html_body returns actual HTML (not <br>-converted fallback), and inline signature images are still collected as parts.
- test_extract_payload_contents_multiple_html_leaves_first_wins: Documents the current DFS first-wins heuristic when multiple text/html parts are eligible, without claiming it is semantically ideal (full multipart/related start= parameter support would be needed for that).

Known limitations (pre-existing, not introduced by this fix)
- DFS first-wins ordering for multiple eligible HTML leaves. The walker takes the first text/html part it encounters in depth-first order rather than honoring multipart/related's start= parameter for root-part resolution. In practice this is correct for all standard email client MIME structures: the first text/html in multipart/alternative is always the intended body. The start= parameter is rarely used in email (it's more common in SOAP/MTOM). Documented and tested as a heuristic, not a specification guarantee.
- decode_text_body assumes UTF-8. Non-UTF-8 text parts (ISO-8859-1, Windows-1252, Shift_JIS) are warned and dropped, falling back to plain text or snippet. The Gmail API does not normalize charset; the raw encoding from the sending MTA is preserved. A proper fix would read the charset parameter from the Content-Type header and transcode via a crate like encoding_rs. This primarily affects messages from older mail systems; most modern clients send UTF-8.
- Non-text inline-data leaves are warned and dropped. Parts like text/calendar or application/json with inline body.data but no attachmentId are logged to stderr but not preserved as attachments. The OriginalPart pipeline assumes attachmentId for later binary hydration, so treating inline-data parts as attachments would require a different fetch path. In practice, the Gmail API almost always externalizes binary data behind attachmentId: we tested with a 26-byte file via messages.insert and it still got an attachmentId.

Checklist:
- AGENTS.md guidelines (no generated google-* crates).
- cargo fmt --all to format the code perfectly.
- cargo clippy -- -D warnings and resolved all warnings.
- Changeset (pnpx changeset) to document my changes.
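As a follow-up to the decode_text_body limitation listed under known limitations, here is a minimal, standard-library-only sketch of reading the charset parameter from a Content-Type value and decoding accordingly. The function names `charset_param` and `decode_with_charset` are hypothetical, the parameter parsing is deliberately naive (it does not handle a `;` inside a quoted value), and only UTF-8 and strict ISO-8859-1 are covered; a real fix would more likely delegate to a crate such as encoding_rs, as the description suggests.

```rust
/// Extract the `charset` parameter from a Content-Type header value.
/// Naive parsing: splits on ';' and does not handle ';' inside quotes.
fn charset_param(content_type: &str) -> Option<String> {
    content_type
        .split(';')
        .skip(1) // the first segment is the media type itself
        .filter_map(|param| param.split_once('='))
        .find(|(name, _)| name.trim().eq_ignore_ascii_case("charset"))
        .map(|(_, value)| value.trim().trim_matches('"').to_ascii_lowercase())
}

/// Decode a text part using its declared charset.
/// Returns None for invalid input or for charsets this sketch does not cover.
fn decode_with_charset(bytes: &[u8], content_type: &str) -> Option<String> {
    match charset_param(content_type).as_deref() {
        // Assumption: treat a missing charset as UTF-8.
        None | Some("utf-8") | Some("us-ascii") => String::from_utf8(bytes.to_vec()).ok(),
        // ISO-8859-1 maps each byte directly to the code point of the same value.
        Some("iso-8859-1") => Some(bytes.iter().map(|&b| b as char).collect()),
        // Windows-1252, Shift_JIS, etc. need real transcoding tables.
        Some(_) => None,
    }
}
```

With this sketch, the bytes `63 61 66 E9` decode to "café" when the part declares `charset=iso-8859-1`, whereas the same bytes are rejected as invalid when decoded as UTF-8.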