3 changes: 2 additions & 1 deletion .github/workflows/main.yml
@@ -29,7 +29,7 @@ jobs:
 curl -LsSf https://astral.sh/uv/install.sh | sh
 sudo apt-get -qq update
 sudo apt-get install -y libemail-outlook-message-perl
-uv sync
+uv sync --all-extras
 export PERL_MM_USE_DEFAULT=1
 sudo cpan -f -i Email::Outlook::Message

@@ -41,6 +41,7 @@ jobs:
 uv run mail-parser -v
 uv run mail-parser -h
 uv run mail-parser -f tests/mails/mail_malformed_3 -j
+uv run mail-parser -f tests/mails/mail_outlook_1 -o -j
 cat tests/mails/mail_malformed_3 | uv run mail-parser -k -j

 - name: Run pre-commit
44 changes: 34 additions & 10 deletions README.md
@@ -36,20 +36,44 @@ formats, making it versatile for diverse email ecosystems.
 **⚡ Production-Ready**: Trusted by security professionals and developers worldwide, with extensive
 test coverage and proven reliability in high-stakes environments.

-Additionally, mail-parser provides full support for parsing Outlook email formats (.msg). To enable
-this functionality on Debian-based systems, simply install the required system package:
+mail-parser is fully compatible with Python 3, ensuring modern performance and reliability.

-```bash
-apt-get install libemail-outlook-message-perl
-```
+## Parsing Outlook `.msg` files

-For further details about the package, you can run:
+mail-parser converts Outlook `.msg` files to standard `.eml` before parsing.
+Two conversion backends are supported:

-```bash
-apt-cache show libemail-outlook-message-perl
-```
+1. **`extract-msg` (recommended, pure Python).** No external tools required.
+Install the optional extra:

-mail-parser is fully compatible with Python 3, ensuring modern performance and reliability.
+```bash
+pip install mail-parser[outlook]
+```

+1. **`msgconvert` (deprecated, external Perl tool).** Requires the
+`libemail-outlook-message-perl` system package:
+
+```bash
+apt-get install libemail-outlook-message-perl # Debian-based systems
+apt-cache show libemail-outlook-message-perl # package details
+```
+
+**Backend precedence:** when `extract-msg` is installed it is used first.
+Only when it is *not* available does mail-parser fall back to the `msgconvert`
+external tool, logging a deprecation warning. If neither backend is available,
+`parse_from_file_msg()` raises `MailParserOSError` telling you to install
+either path.
+
+> **⚠️ Deprecated:** the `msgconvert` external-tool backend is deprecated and
+> will be removed in a future release. Migrate to the pure-Python backend with
+> `pip install mail-parser[outlook]`.
+
+**💥 BREAKING CHANGE:** the default `.msg` conversion backend changed.
+When `extract-msg` is installed it is now preferred over `msgconvert`. The two
+converters produce different intermediate `.eml` output, so some parsed fields
+(header ordering, encoding edge cases, attachment naming) can differ from the
+previous `msgconvert`-only behavior. Downstream code asserting on exact
+`.msg`-derived output may need updating.
+
 # Apache 2 Open Source License

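The backend precedence described in the README change can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in this pull request: `pick_outlook_backend` is a hypothetical helper, and probing for the Perl tool with `shutil.which` is an assumption (the patched `msgconvert()` simply tries to run the tool and turns an `OSError` into `MailParserOSError`).

```python
import importlib.util
import shutil
from typing import Callable, Optional


def pick_outlook_backend(
    find_spec: Callable[[str], object] = importlib.util.find_spec,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Hypothetical helper: name the backend that would convert a .msg file."""
    # Preferred: the pure-Python library from the optional "outlook" extra.
    if find_spec("extract_msg") is not None:
        return "extract-msg"
    # Deprecated fallback: the external Perl tool, if found on PATH (assumption).
    if which("msgconvert") is not None:
        return "msgconvert"
    # Neither backend found; in this situation the PR raises MailParserOSError.
    return None


if __name__ == "__main__":
    # Prints whichever backend is available on the current machine, or None.
    print(pick_outlook_backend())
```

The two lookups are injectable only so that each branch can be exercised without installing or uninstalling anything, which mirrors how the PR's tests patch `find_spec`.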
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -28,6 +28,9 @@ maintainers = [
 ]
 dependencies = []

+[project.optional-dependencies]
+outlook = ["extract-msg>=0.54"]
+
 [dependency-groups]
 dev = [
 "build>=1.2.2.post1",
22 changes: 21 additions & 1 deletion src/mailparser/core.py
@@ -18,6 +18,7 @@

 import base64
 import email
+import importlib.util
 import ipaddress
 import json
 import logging
@@ -27,6 +28,7 @@
 from mailparser.utils import (
     convert_mail_date,
     decode_header_part,
+    extract_msg_convert,
     find_between,
     get_addresses,
     get_header,
@@ -186,14 +188,32 @@ def from_file_msg(cls, fp):
         Init a new object from a Outlook message file,
         mime type: application/vnd.ms-outlook

+        Conversion backend precedence:
+            1. ``extract-msg`` (pure Python, optional ``outlook`` extra).
+            2. ``msgconvert`` external Perl tool — **deprecated** fallback,
+               used only when ``extract-msg`` is not installed.
+
         Args:
             fp (string): file path of raw Outlook email

         Returns:
             Instance of MailParser
+
+        Raises:
+            MailParserOSError: if no conversion backend is available
         """
         log.debug("Parsing email from file Outlook")
-        f, _ = msgconvert(fp)
+
+        if importlib.util.find_spec("extract_msg") is not None:
+            f, _ = extract_msg_convert(fp)
+        else:
+            log.warning(
+                "msgconvert backend is deprecated and will be removed "
+                "in a future release. Install the pure-Python Outlook "
+                "support with 'pip install mail-parser[outlook]'."
+            )
+            f, _ = msgconvert(fp)
+
         return cls.from_file(f, True)

     @classmethod
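A note on the availability check used in `from_file_msg`: `importlib.util.find_spec` only locates a module and does not import it, so the check stays cheap and `extract_msg` is not imported until a `.msg` file is actually converted. A minimal sketch of that probe follows; `is_installed` is a hypothetical name used only for illustration and is not part of mail-parser.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True when a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "json" ships with the standard library, so it is always found.
    print(is_installed("json"))
    # Found only when the optional "outlook" extra is installed.
    print(is_installed("extract_msg"))
```

Because the module is located but not imported, an installation that is present but broken could pass this check and only fail later, inside `extract_msg_convert`.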
76 changes: 71 additions & 5 deletions src/mailparser/utils.py
@@ -318,6 +318,66 @@ def fingerprints(data):
     return hashes(md5, sha1, sha256, sha512)


+def _new_outlook_tempfile():
+    """
+    Create an empty temporary file to hold a converted Outlook email.
+
+    The OS-level file handle is closed immediately; callers write to the
+    returned path with their own handle (a subprocess ``--outfile`` for
+    ``msgconvert`` or a plain ``open`` for the pure-Python backend).
+
+    Returns:
+        str: path of the new temporary ``.eml`` file
+    """
+    handle, path = tempfile.mkstemp(prefix="outlook_")
+    os.close(handle)
+    return path
+
+
+def extract_msg_convert(fp):
+    """
+    Convert an Outlook ``.msg`` file to ``.eml`` using the pure-Python
+    ``extract-msg`` library (no external Perl tool required).
+
+    The ``extract_msg`` import is performed lazily inside this function so
+    that the package keeps importing with zero runtime dependencies when
+    the optional ``outlook`` extra is not installed.
+
+    Args:
+        fp (string): file path of the Outlook ``.msg`` mail
+
+    Returns:
+        tuple: ``(eml_path, info)`` where ``eml_path`` is the path of the
+        converted ``.eml`` file and ``info`` is a short descriptive string
+
+    Raises:
+        ImportError: if the ``extract-msg`` library is not installed
+        MailParserOSError: if the ``.msg`` is not a convertible email
+        message (e.g. a contact or calendar item)
+    """
+    import extract_msg  # lazy: keep package import stdlib-only
+
+    log.debug("Started converting Outlook email with extract-msg")
+    msg = extract_msg.openMsg(fp)
+    try:
+        # openMsg() may return a non-email MSGFile (contact, calendar,
+        # task...) which cannot be rendered as an email message.
+        as_email_message = getattr(msg, "asEmailMessage", None)
+        if as_email_message is None:
+            raise MailParserOSError(
+                f"Outlook file {fp!r} is not a convertible email "
+                f"message (type {type(msg).__name__})"
+            )
+        eml = as_email_message()
+        info = f"{eml.get('From', '')} | {eml.get('Subject', '')}".strip()
+        temp = _new_outlook_tempfile()
+        with open(temp, "wb") as f:
+            f.write(eml.as_bytes())
+        return temp, info
+    finally:
+        msg.close()
+
+
 def msgconvert(email):
     """
     Exec msgconvert tool, to convert msg Outlook
@@ -329,9 +389,12 @@ def msgconvert(email):
     Returns:
         tuple with file path of mail converted and
         standard output data (str)
+
+    Raises:
+        MailParserOSError: if the ``msgconvert`` tool is not installed
     """
     log.debug("Started converting Outlook email")
-    temph, temp = tempfile.mkstemp(prefix="outlook_")
+    temp = _new_outlook_tempfile()
     command = ["msgconvert", "--outfile", temp, email]

     try:
@@ -343,17 +406,20 @@ def msgconvert(email):
         )

     except OSError as e:
-        message = f"Check if 'msgconvert' tool is installed / {e!r}"
+        message = (
+            "Cannot convert Outlook .msg: no conversion backend "
+            "available. Install pure-Python support with "
+            "'pip install mail-parser[outlook]', or install the "
+            "'msgconvert' Perl tool "
+            f"(libemail-outlook-message-perl). {e!r}"
+        )
         log.exception(message)
         raise MailParserOSError(message)

     else:
         stdoutdata, _ = out.communicate()
         return temp, stdoutdata.decode("utf-8").strip()
-
-    finally:
-        os.close(temph)


 def parse_received(received):
     """
93 changes: 92 additions & 1 deletion tests/test_mail_parser.py
@@ -18,16 +18,21 @@

 import datetime
 import hashlib
+import logging
 import os
 import shutil
 import sys
 import tempfile
 import unittest
 from unittest.mock import patch

+import pytest
+
 import mailparser
+from mailparser.exceptions import MailParserOSError
 from mailparser.utils import (
     convert_mail_date,
+    extract_msg_convert,
     fingerprints,
     get_addresses,
     get_header,
@@ -433,9 +438,10 @@ def test_bug_UnicodeDecodeError(self):
         self.assertIsInstance(m.mail, dict)
         self.assertIsInstance(m.mail_json, str)

+    @patch("mailparser.core.importlib.util.find_spec", return_value=None)
     @patch("mailparser.core.os.remove")
     @patch("mailparser.core.msgconvert")
-    def test_parse_from_file_msg(self, mock_msgconvert, mock_remove):
+    def test_parse_from_file_msg(self, mock_msgconvert, mock_remove, mock_find_spec):
         """
         Tested mail from VirusTotal: md5 b89bf096c9e3717f2d218b3307c69bd0

@@ -1301,3 +1307,88 @@ def test_get_addresses_multiple_with_email_as_name(self):
         self.assertEqual(len(result), 2)
         self.assertEqual(result[0], ("alice@example.com", "bob@example.com"))
         self.assertEqual(result[1], ("eve@example.com", "frank@example.com"))
+
+
+# ---------------------------------------------------------------------------
+# Outlook .msg conversion backends (extract-msg vs deprecated msgconvert)
+# ---------------------------------------------------------------------------
+
+
+def test_from_file_msg_prefers_extract_msg(mocker):
+    """extract-msg is preferred and msgconvert is NOT called when available."""
+    mocker.patch("importlib.util.find_spec", return_value=object())
+    extract = mocker.patch(
+        "mailparser.core.extract_msg_convert",
+        return_value=(mail_test_2, "info"),
+    )
+    msgconv = mocker.patch("mailparser.core.msgconvert")
+    remove = mocker.patch("mailparser.core.os.remove")
+
+    mailparser.parse_from_file_msg(mail_outlook_1)
+
+    extract.assert_called_once_with(mail_outlook_1)
+    msgconv.assert_not_called()
+    remove.assert_called_once_with(mail_test_2)
+
+
+def test_from_file_msg_fallback_warns(mocker, caplog):
+    """When extract-msg is absent, msgconvert runs and a deprecation warns."""
+    mocker.patch("importlib.util.find_spec", return_value=None)
+    msgconv = mocker.patch(
+        "mailparser.core.msgconvert",
+        return_value=(mail_test_2, None),
+    )
+    mocker.patch("mailparser.core.os.remove")
+
+    with caplog.at_level(logging.WARNING, logger="mailparser.core"):
+        mailparser.parse_from_file_msg(mail_outlook_1)
+
+    msgconv.assert_called_once_with(mail_outlook_1)
+    messages = [r.message for r in caplog.records]
+    assert any("deprecated" in m for m in messages)
+    assert any("mail-parser[outlook]" in m for m in messages)
+
+
+def test_from_file_msg_no_backend_raises(mocker):
+    """No backend at all → MailParserOSError mentioning both install paths."""
+    mocker.patch("importlib.util.find_spec", return_value=None)
+    mocker.patch(
+        "mailparser.utils.subprocess.Popen",
+        side_effect=OSError("no msgconvert"),
+    )
+
+    with pytest.raises(MailParserOSError) as exc:
+        mailparser.parse_from_file_msg(mail_outlook_1)
+
+    assert "mail-parser[outlook]" in str(exc.value)
+    assert "msgconvert" in str(exc.value)
+
+
+@pytest.mark.integration
+def test_outlook_backend_parity():
+    """mail_outlook_1 parses to the same result under both backends.
+
+    Requires both the optional ``extract-msg`` dependency and the
+    ``msgconvert`` Perl tool; skips otherwise. The two converters do not
+    emit byte-identical ``.eml`` files, so only the meaningful parsed
+    result is compared (key headers, attachment names/count). The raw
+    body is intentionally not compared: msgconvert and extract-msg differ
+    in line endings, MIME structure and RTF/HTML reconstruction.
+    """
+
+    # Force each backend explicitly via its util. from_file(..., True)
+    # removes the temporary converted .eml after parsing. from_file_msg()
+    # is not used: it would select extract-msg for both sides.
+    f_extract, _ = extract_msg_convert(mail_outlook_1)
+    parsed_extract = mailparser.MailParser.from_file(f_extract, True)
+    f_msgconv, _ = mailparser.utils.msgconvert(mail_outlook_1)
+    parsed_msgconv = mailparser.MailParser.from_file(f_msgconv, True)
+
+    for key in ("from", "to", "subject"):
+        assert parsed_extract.mail.get(key) == parsed_msgconv.mail.get(key)
+
+    assert parsed_extract.date == parsed_msgconv.date
+
+    extract_names = sorted(a["filename"] for a in parsed_extract.attachments)
+    msgconv_names = sorted(a["filename"] for a in parsed_msgconv.attachments)
+    assert extract_names == msgconv_names
15 changes: 11 additions & 4 deletions tests/test_main.py
@@ -245,10 +245,17 @@ def test_parse_file_outlook(self, parser, tmp_path):
         non_existent_file = str(tmp_path / "non_existent.msg")
         args = parser.parse_args(["--file", non_existent_file, "--outlook"])

-        # Mock msgconvert to raise OSError (simulating msgconvert unavailable)
-        with patch(
-            "mailparser.utils.subprocess.Popen",
-            side_effect=OSError("msgconvert not found"),
+        # Force the deprecated msgconvert fallback (extract-msg absent) and
+        # mock msgconvert to raise OSError (simulating msgconvert unavailable)
+        with (
+            patch(
+                "mailparser.core.importlib.util.find_spec",
+                return_value=None,
+            ),
+            patch(
+                "mailparser.utils.subprocess.Popen",
+                side_effect=OSError("msgconvert not found"),
+            ),
         ):
             with pytest.raises(MailParserOSError, match="msgconvert"):
                 parse_file(args)