perf: decode text-part transfer encoding only once #90
Merged
text/plain and text/html parts were transfer-decoded twice: once via get_body_raw() for attachments.content, then again via get_body() for text_plain/text_html, which re-runs the identical base64/quoted-printable decode before applying the charset.

Decode the transfer encoding once with get_body_raw() and reuse those bytes for both the attachment content and the text bodies, applying only the charset step via decode_charset() (a faithful copy of mailparse's internal get_body_as_string, using the same charset crate). Output is byte-identical.

~1.9x faster on base64/quoted-printable-encoded text bodies (8.84ms -> 4.71ms median on a ~2MB base64 text/html part); no measurable change on bodies that are not transfer-encoded. All 91 correctness tests pass.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
What
text/plain and text/html parts were transfer-decoded twice:

- get_body_raw() → bytes for attachments[].content
- get_body() → string for text_plain / text_html, which re-runs the same base64/quoted-printable decode before applying the charset.

This decodes the transfer encoding once (get_body_raw()), reuses those bytes for both the attachment content and the text bodies, and applies only the charset step via a new decode_charset() helper, a faithful copy of mailparse's own internal get_body_as_string using the same charset crate (already a transitive dep). Output is byte-identical.

Benchmark

Targeted input: a ~2 MB base64-encoded text/html body (the path this change touches), 200 iterations, same input both builds. Median time drops from 8.84 ms to 4.71 ms, ~1.9× faster on base64/quoted-printable-encoded text bodies.

The existing tests/benchmark (large_message.eml) is unchanged within noise: that corpus is dominated by base64 attachments, which carry a name param and were only ever decoded once.

Risk

- decode_charset replicates mailparse's charset logic exactly (same crate, same code path). For 7bit/8bit text it reduces to the prior get_as_string(raw); for base64/QP it reduces to get_decoded_as_string() minus the redundant transfer decode.
- Errors from the transfer decode still propagate (get_body_raw()?), preserving the existing ParseError contract.
- cargo clippy --release clean.
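
For readers who want the shape of the change without opening the diff, here is a minimal std-only sketch of the decode-once flow. It is not the code from this PR: TextPart, body_raw, and extract are hypothetical stand-ins, body_raw only plays the role of mailparse's get_body_raw() (handling quoted-printable alone), and this decode_charset knows just ISO-8859-1 and UTF-8, whereas the real helper delegates to the charset crate. The counter exists only to make the single transfer decode observable.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a parsed MIME text part (not a mailparse type).
struct TextPart<'a> {
    /// Body as it appears on the wire, quoted-printable encoded.
    wire: &'a str,
    /// Counts transfer decodes so the saving is observable.
    transfer_decodes: Cell<usize>,
}

impl<'a> TextPart<'a> {
    fn new(wire: &'a str) -> Self {
        TextPart { wire, transfer_decodes: Cell::new(0) }
    }

    /// Transfer-decode step only: quoted-printable -> raw bytes.
    /// Plays the role of get_body_raw() in this sketch.
    fn body_raw(&self) -> Result<Vec<u8>, String> {
        self.transfer_decodes.set(self.transfer_decodes.get() + 1);
        let src = self.wire.as_bytes();
        let mut out = Vec::with_capacity(src.len());
        let mut i = 0;
        while i < src.len() {
            if src[i] != b'=' {
                // Literal byte, copied through unchanged.
                out.push(src[i]);
                i += 1;
            } else if src[i + 1..].starts_with(b"\r\n") {
                // Soft line break: "=" followed by CRLF encodes nothing.
                i += 3;
            } else {
                // "=XX" escape: two hex digits encode one byte.
                let hex = src.get(i + 1..i + 3).ok_or("truncated escape")?;
                let hex = std::str::from_utf8(hex).map_err(|e| e.to_string())?;
                out.push(u8::from_str_radix(hex, 16).map_err(|e| e.to_string())?);
                i += 3;
            }
        }
        Ok(out)
    }
}

/// Charset step only: raw bytes -> String.
/// Simplified: ISO-8859-1 is handled explicitly, anything else is read as UTF-8.
fn decode_charset(raw: &[u8], charset: &str) -> String {
    match charset.to_ascii_lowercase().as_str() {
        // ISO-8859-1 maps each byte to the code point of the same value.
        "iso-8859-1" => raw.iter().map(|&b| char::from(b)).collect(),
        _ => String::from_utf8_lossy(raw).into_owned(),
    }
}

/// Decode the transfer encoding once and reuse the bytes for both outputs:
/// the attachment content (raw bytes) and the text body (String).
fn extract(part: &TextPart, charset: &str) -> Result<(Vec<u8>, String), String> {
    let raw = part.body_raw()?; // single transfer decode; errors propagate via `?`
    let text = decode_charset(&raw, charset); // charset step only, no second decode
    Ok((raw, text))
}
```

Calling extract on a part whose wire body is "caf=E9 au=\r\n lait" with charset "iso-8859-1" yields the raw bytes for the attachment content and the string "café au lait" for the text body, with body_raw invoked exactly once. The previous flow corresponds to calling body_raw a second time inside the string path.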