Skip to content

Add import-mbox (supports HEY.com exports) with resume/checkpoints#103

Draft
ruphy wants to merge 39 commits intowesm:mainfrom
ruphy:codex/import-mbox
Draft

Add import-mbox (supports HEY.com exports) with resume/checkpoints#103
ruphy wants to merge 39 commits intowesm:mainfrom
ruphy:codex/import-mbox

Conversation

@ruphy
Copy link

@ruphy ruphy commented Feb 7, 2026

Why

msgvault currently supports Gmail sync, but many email providers (notably HEY.com) don’t offer IMAP/POP access. They do offer exports as MBOX (often delivered as a .zip containing one or more .mbox files). This PR adds an offline import path so those accounts can be brought into msgvault with the same local-first guarantees.

What’s in this PR

  • New CLI: msgvault import-mbox <identifier> <export-file>
    • Accepts .mbox/.mbx or a .zip containing .mbox files.
    • Records imported mail under a configurable --source-type (example: --source-type hey).
    • Optional --label to apply a label to newly imported messages.
    • Interrupt-safe imports with checkpoints and resume (default), with --no-resume to start fresh.
  • Deterministic ZIP extraction to dataDir/imports/mbox/<sha256(zip)> with a .done sentinel and stable ordering.
  • Streaming MBOX reader (internal/mbox) with mboxrd-style unescaping.
  • Importer (internal/importer) that stores:
    • raw MIME
    • parsed bodies (text/html)
    • recipients/participants
    • optional attachments (disk + DB)
    • FTS5 indexing when available
  • Resume behavior:
    • Cursor is stored as JSON {file, offset} and uses absolute offsets.
    • Multi-file zip resume restarts from the active file when interrupted mid-import.

How to test

Automated:

  • make test
  • go vet ./...
  • make lint

End-to-end CLI test added:

  • cmd/msgvault/cmd/import_mbox_e2e_test.go writes a real .mbox file to disk, runs the cobra command, and asserts DB + attachment file results.

Manual example (HEY.com export):

./msgvault init-db
./msgvault import-mbox you@hey.com /path/to/hey-export.zip --source-type hey --label hey

Notes

  • --no-attachments disables attachment storage (disk + DB). Existing-message fast-skip means reruns won’t backfill attachments/labels for already-imported messages.
  • Review-loop decisions are recorded in review-loop-decisions.md.

ruphy added 4 commits February 7, 2026 19:24
Support importing email exports from MBOX files or .zip bundles, including HEY.com. Includes streaming MBOX reader, importer with resume checkpoints, and CLI docs/tests.
- Track absolute MBOX offsets after Seek() so checkpoints resume correctly.
- Start multi-file zip imports from the active checkpoint file to avoid discarding progress.
ext := strings.ToLower(filepath.Ext(abs))
switch ext {
case ".mbox", ".mbx":
return []string{abs}, nil
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 Path traversal via unvalidated zip extraction (high severity)

extractMboxFromZip uses filepath.Base() to flatten paths but does not validate the result against directory traversal sequences before joining with destDir. A malicious zip could contain entries with names like '..%2f..%2fetc%2fpasswd' that survive filepath.Base() encoding. Use filepath.Clean() and check that the result does not escape destDir via filepath.Rel() or strings.HasPrefix() validation after cleaning.


Automated security review by Claude 4.5 Sonnet - Human review still required

if err == nil && len(files) > 0 {
return files, nil
}
// Sentinel exists but no files found; fall through to re-extract.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Uncontrolled resource consumption from zip extraction (medium severity)

extractMboxFromZip does not validate the size of individual zip entries or the total extracted size. A malicious zip could exhaust disk space or cause denial of service. Add limits on individual file size (e.g., check zf.UncompressedSize64) and cumulative extracted bytes before writing.


Automated security review by Claude 4.5 Sonnet - Human review still required

func storeAttachment(st *store.Store, attachmentsDir string, messageID int64, att *mime.Attachment) error {
if attachmentsDir == "" || len(att.Content) == 0 || att.ContentHash == "" {
return nil
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Attachments written with 0755 directory permissions (medium severity)

storeAttachment creates parent directories with os.MkdirAll(..., 0755), making them world-readable. Attachment content is sensitive (documents, images from 20+ years of email). Use fileutil.SecureMkdirAll(..., 0700) instead to match OAuth token directory permissions and prevent local multi-user disclosure.


Automated security review by Claude 4.5 Sonnet - Human review still required

@ruphy ruphy marked this pull request as draft February 7, 2026 21:42
continue
}
name := filepath.Base(zf.Name)
ext := strings.ToLower(filepath.Ext(name))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 Path traversal in zip extraction via malicious entry names (high severity)

The code flattens zip entry names to base name but still uses zf.Name in the collision hash, allowing an attacker to craft a zip with entries like '../../../.ssh/authorized_keys.mbox' that would be extracted to the intended directory but could cause issues. While filepath.Join prevents traversal above destDir, the logic should validate that zf.Name contains no path separators before processing. Use filepath.Base(zf.Name) consistently or reject entries with directory components.


Automated security review by Claude 4.5 Sonnet - Human review still required

storagePath := filepath.Join(subdir, att.ContentHash)
fullPath := filepath.Join(attachmentsDir, storagePath)

if err := os.MkdirAll(filepath.Dir(fullPath), 0755); err != nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Attachment path not validated before content-addressed storage (medium severity)

The storeAttachment function uses att.ContentHash[:2] as a subdirectory without validating that ContentHash is non-empty or has sufficient length, though line 386 checks len>0. However, if ContentHash is exactly 1 character (malformed), this would panic. Add explicit length validation: if len(att.ContentHash) < 2, return an error before using it for path construction.


Automated security review by Claude 4.5 Sonnet - Human review still required

@ruphy
Copy link
Author

ruphy commented Feb 7, 2026

Addressed the automated security review items:

  • Zip extraction hardening in cmd import-mbox:

    • Normalize zip entry names using ZIP path semantics and validate extracted paths stay under the destination dir.
    • Add per-entry + total extracted byte limits (very high defaults; tests override) to mitigate zip-bomb style DoS.
    • Added tests covering traversal flattening and size limit enforcement.
  • Attachment directory permissions:

    • Use owner-only attachment directories (0700) via fileutil.SecureMkdirAll for both MBOX importer and Gmail sync.

All checks re-run: go test -tags fts5 ./...
? github.com/wesm/msgvault/cmd/msgvault [no test files]
ok github.com/wesm/msgvault/cmd/msgvault/cmd (cached)
ok github.com/wesm/msgvault/internal/config (cached)
ok github.com/wesm/msgvault/internal/deletion (cached)
ok github.com/wesm/msgvault/internal/export (cached)
ok github.com/wesm/msgvault/internal/fileutil (cached)
ok github.com/wesm/msgvault/internal/gmail (cached)
ok github.com/wesm/msgvault/internal/importer (cached)
ok github.com/wesm/msgvault/internal/mbox (cached)
ok github.com/wesm/msgvault/internal/mcp (cached)
ok github.com/wesm/msgvault/internal/mime (cached)
ok github.com/wesm/msgvault/internal/oauth (cached)
ok github.com/wesm/msgvault/internal/query (cached)
? github.com/wesm/msgvault/internal/query/querytest [no test files]
ok github.com/wesm/msgvault/internal/search (cached)
ok github.com/wesm/msgvault/internal/store (cached)
ok github.com/wesm/msgvault/internal/sync (cached)
ok github.com/wesm/msgvault/internal/testutil (cached)
ok github.com/wesm/msgvault/internal/testutil/dbtest (cached)
ok github.com/wesm/msgvault/internal/testutil/email (cached)
? github.com/wesm/msgvault/internal/testutil/ptr [no test files]
ok github.com/wesm/msgvault/internal/testutil/storetest (cached)
? github.com/wesm/msgvault/internal/testutil/tbmock [no test files]
ok github.com/wesm/msgvault/internal/textutil (cached)
ok github.com/wesm/msgvault/internal/tui (cached)
ok github.com/wesm/msgvault/internal/update (cached)
? github.com/wesm/msgvault/scripts/mimeshootout [no test files], , Install golangci-lint: https://golangci-lint.run/usage/install/.

@ruphy
Copy link
Author

ruphy commented Feb 7, 2026

Follow-up security hardening pushed:

  • Zip extraction: collision hash now uses a normalized ZIP path string (forward-slash semantics) and extraction still enforces that output paths stay under the destination directory.
  • Attachments: validate attachment content hashes (len >= 2, lowercase hex) before slicing/using them for content-addressed storage.
    • Added regression test: internal/importer/mbox_import_security_test.go

Re-ran: make test, go vet ./..., make lint.

@ruphy
Copy link
Author

ruphy commented Feb 7, 2026

Addressed the latest automated note about mutable globals for zip limits:

  • Replaced package-level maxZipEntryBytes/maxZipTotalBytes vars with constants + a limits struct passed to an internal helper (extractMboxFromZipWithLimits).
  • Production path uses defaultMaxZipEntryBytes/defaultMaxZipTotalBytes; tests now call the helper with small limits.

Re-ran: make test, go vet ./..., make lint.

@wesm
Copy link
Owner

wesm commented Feb 7, 2026

I'll look at this more closely tomorrow, but thank you for working on this! Highly recommend using https://www.roborev.io/ to help you develop faster

@ruphy
Copy link
Author

ruphy commented Feb 7, 2026

great tip, doing it right now :)

ruphy added 11 commits February 7, 2026 23:22
- Validate From separator lines by parsing the date suffix to avoid
  incorrectly splitting on unescaped "From " lines in message bodies
- Normalize checkpoint file paths to absolute before comparison to
  handle relative/absolute mismatches during resume
- Ensure extracted zip files use absolute destination paths
- Compute attachment content hash when missing instead of silently
  skipping storage
- Set HasAttachments/AttachmentCount to false/0 when attachments
  are disabled
- Improve e2e test isolation by running through full CLI invocation
Changes:
- Fix `internal/mbox` reader to handle long lines (`bufio.ErrBufferFull`) by accumulating partial reads with a max-line cap.
- Accept `"From "` separators with named timezones (adds `"Mon Jan 2 15:04:05 MST 2006"`), and mirror this in importer date parsing.
- Route `import-mbox` output through Cobra’s configured stdout/stderr, and warn if the ZIP extraction sentinel can’t be written.
- Update importer to correct `has_attachments`/`attachment_count` based on attachments actually stored.
- Add tests for long-line MBOX parsing and named-timezone separators (plus importer date parsing).
Proof:
- `GOCACHE=/tmp/go-build go build ./...`
- `GOCACHE=/tmp/go-build go test ./...`
Changes:
- Prevent attachment DB upserts when attachment file `Stat` fails (importer + sync) to avoid records pointing at unwritten files
- Ensure zip extraction collision disambiguation loops until the output name is unique to avoid clobbering on crafted exports
- Batch MBOX import existence checks to avoid per-message `MessageExistsBatch(...[]{id})` DB round trips
- Clarify that Unix `fileutil.Secure*` helpers are best-effort wrappers (no symlink/TOCTOU hardening)
- Add tests for crafted zip name collisions, attachment stat errors, and resume across multi-mbox zip exports
Build/tests:
- `GOCACHE=/tmp/go-build go build ./...`
- `GOCACHE=/tmp/go-build go test ./...`
Changes:
- Validate MBOX inputs up front and fail the sync run on non-mbox files (`internal/importer/mbox_import.go`).
- Mark `sync_runs` as `failed` (not `completed`) when an import finishes with `errors_count > 0` (`internal/importer/mbox_import.go`).
- Print `Import complete (with errors).` when any errors occurred (`cmd/msgvault/cmd/import_mbox.go`).
- Dedupe per-flush pending messages by `SourceMsg` to avoid over-counting `MessagesAdded` for duplicates in the same batch (`internal/importer/mbox_import.go`).
- Add best-effort symlink checks for zip extraction dirs and attachment paths (`cmd/msgvault/cmd/import_mbox.go`, `internal/importer/mbox_import.go`, `internal/sync/sync.go`) plus targeted tests (`internal/importer/mbox_import_test.go`, `cmd/msgvault/cmd/import_mbox_test.go`).
Verification:
- `GOCACHE=/tmp/go-build go build ./...`
- `GOCACHE=/tmp/go-build go test ./...`
Roborev:
Changes:
- Prevent permanent “skips” after partial ingest by skipping only when `message_raw` exists (MBOX importer + Gmail sync).
- Fix `UpsertMessage` to always return the correct message ID after upserts (don’t rely on `LastInsertId` when an upsert resolves to UPDATE).
- `import-mbox`: reject non-regular input paths and sanitize extracted ZIP entry filenames for Windows.
- `import-mbox`: print `Imported (partial)` for the last file when interrupted mid-file.
- Centralize attachment content-hash validation on `export.ValidateContentHash` and avoid nested `append` chains when building address lists.
- Add tests for “partial ingest then rerun repairs raw data”, Windows-invalid ZIP names, and non-regular export path rejection.
- Verified `GOCACHE=/tmp/go-build go build ./...` and `GOCACHE=/tmp/go-build go test ./...`.
Changes:
- Harden zip extraction to avoid writing into pre-populated extraction dirs and to create extracted files exclusively (no clobber/write-through-symlink).
- Reject symlinks/special files when reusing cached zip extractions (via `.done`).
- Enforce a max per-message size in the MBOX reader and use it during import to avoid unbounded memory growth.
- Log a warning (and count an error) if the initial checkpoint save fails.
Verification:
- `GOCACHE=/tmp/go-build go build ./...`
- `GOCACHE=/tmp/go-build go test ./...`
Proof:
- Zip fixes: `cmd/msgvault/cmd/import_mbox.go:320`, `cmd/msgvault/cmd/import_mbox.go:393`, `cmd/msgvault/cmd/import_mbox.go:537`
Changes:
- ImportMbox: on existence-check errors, attempt ingest instead of silently skipping messages.
- ImportMbox: stop advancing (and saving) checkpoints past ingest failures so a resumed run retries the failed message.
- Zip extraction: handle name collisions case-insensitively on macOS/Windows and fix disambiguation base/ext handling.
- Attachments: allow symlinked attachments directories by resolving the symlink target (still reject symlinked subdirs/files).
Proof:
- `GOCACHE=/tmp/go-build go build ./...`
- `GOCACHE=/tmp/go-build go test ./...`
Changes:
- Fix incremental label updates/removals when a message exists but `message_raw` is missing by re-ingesting raw before applying labels.
- Prevent `import-mbox` checkpoint offsets from freezing after the first ingest error (resume offsets continue advancing) and log failed `source_msg`/offset.
- Add a regression test for label removal behavior when raw is missing.
Changes:
- Normalize saved MBOX checkpoint paths and match resume files via `os.SameFile` (importer + `import-mbox` CLI).
- Make MBOX `source_message_id` include an offset disambiguator (raw SHA-256 still used for thread fallback).
- Eliminate attachment write TOCTOU by using atomic `O_EXCL` creates (importer + sync), validating existing files are regular.
- Update MBOX import tests to cover symlink/realpath resume and the new `source_message_id` behavior.
- Document attachment permission expectations in `SECURITY.md`.
- Verified: `GOCACHE=/tmp/go-build go build ./...` and `GOCACHE=/tmp/go-build go test ./...`.
Changes:
- Prevent `import-mbox` checkpoints from advancing past a failed ingest so resume won’t skip failed messages
- Count attachment storage failures in import error totals/checkpoints (without aborting message ingest)
- Compute missing Gmail attachment SHA-256 hashes in `storeAttachment` before validating
- Harden zip extraction cleanup by resolving/import-dir symlink checks and validating extraction paths
- Add focused tests for checkpoint-on-failure and “missing attachment hash” behavior
Changes:
- Added shared `export.StoreAttachmentFile` used by importer/sync to validate existing attachment files (size + SHA-256) before reusing them, preventing silent corruption on dedupe.
- Fixed `import-mbox` signal handler defer order so `signal.Stop` runs before closing the `done` channel (avoids late-signal exit race during shutdown).
- Hardened zip extraction cache reuse by validating cached extracted `.mbox/.mbx` files against the zip’s expected output names + sizes, and rejecting unexpected cached files.
- Made zip filename collision handling always case-fold keys (covers case-insensitive Linux mounts too).
- Added tests for invalid attachment `ContentHash` in sync and for existing-file hash mismatch in attachment storage; verified with `GOCACHE=/tmp/go-build go build ./...` and `GOCACHE=/tmp/go-build go test ./...`.
ruphy added 20 commits February 8, 2026 10:24
Changes:
- Resume multi-file zip imports from the next file after the last completed sync (avoids rescanning already-finished files after an in-between-files interrupt).
- Clear the importer’s pending batch slice on flush to avoid retaining large `Raw` buffers longer than needed.
- Update the MCP export-attachment test to use a temp `HOME` with `Downloads/` so it doesn’t try to write to the real `~/Downloads`.
Verification:
- `GOCACHE=/tmp/go-build go build ./...`
- `GOCACHE=/tmp/go-build go test ./...`
Changes:
- Make `import-mbox` return a non-nil error when the import completes with `totalErrors > 0`.
- Advance the saved resume checkpoint past skippable MBOX reader errors (and log `next_offset`) to avoid re-hitting the same failing region after an interrupt.
- Store attachment `storage_path` with forward slashes (`/`) and convert with `filepath.FromSlash` when joining on disk.
- Fix `resolveMboxExport` unsupported-format error text to include `.mbx`.
- Document `copyWithLimit` overflow behavior (may consume one extra byte from `src`).
- Add tests covering non-zero `import-mbox` exit on partial failure and checkpoint advancement on reader error + interrupt.
- Verified `GOCACHE=/tmp/go-build go build ./...` and `GOCACHE=/tmp/go-build go test ./...`.
Changes:
- Only mark MBOX import sync runs as `failed` on hard ingest errors; complete runs with non-fatal errors so resume-between-files can use the last successful sync.
- Skip strict zip cache size validation when a zip entry reports unknown `UncompressedSize64` (0).
- Move MBOX importer test knobs (`MaxMessageBytes`, `IngestFunc`) into `MboxImportOptions` to avoid global mutation/data races.
- Harden attachment validation by hashing via a single opened FD (uses `O_NOFOLLOW` on Unix) to close the TOCTOU window.
- Add targeted tests for “soft errors still complete sync” and “zip cache validation with unknown size”.
Build: `GOCACHE=/tmp/go-build go build ./...`
Tests: `GOCACHE=/tmp/go-build go test ./...`
Changes:
- Extract ZIP MBOX exports into a fresh temp dir and `os.Rename` into place (avoids `RemoveAll(destDir)` TOCTOU)
- Broaden MBOX `From ` separator detection to accept more common date variants and ignore trailing tokens like `remote from ...` (plus tests)
- Remove redundant extension local in ZIP name collision disambiguation
- Document that `StoreAttachmentFile` resolves a symlinked `attachmentsDir`
Changes:
- Harden zip extraction cache validation: treat empty `.mbox/.mbx` entries as size-known and reject extra unexpected cache files (beyond `.done`).
- Add tests for empty-entry cache size enforcement and polluted cache directories.
- Strengthen attachment storage base path handling by creating `attachmentsDir`, resolving with `EvalSymlinks`, and validating the resolved directory.
- Fix `import-mbox` signal handler teardown to avoid late `os.Exit(130)` from queued signals (close `done` first, drain `sigChan`).
- Verified `GOCACHE=/tmp/go-build go build ./...` and `GOCACHE=/tmp/go-build go test ./...`.
Changes:
- Refused reuse of cached ZIP extraction when a zip entry’s uncompressed size is unknown (prevents accepting truncated/corrupt cached files).
- Fixed multi-file skip logic to only skip a file when `checkpointOffset == fileSize` (ignore last-sync cursor when `checkpointOffset > fileSize`).
- Tracked “repaired” ingests as `MessagesUpdated` (via `MessageExistsBatch`) instead of counting them as `MessagesAdded`, and surfaced `Updated` in CLI output.
- Normalized `attachmentsDir` with `filepath.Abs` before `EvalSymlinks`/writes to avoid surprising relative-path behavior.
- Verified: `GOCACHE=/tmp/go-build go build ./...` and `GOCACHE=/tmp/go-build go test ./...`.
Changes:
- Remove the ineffective symlink rejection after resolving `attachmentsDir` symlinks (still validates the resolved dir). (`internal/export/store_attachment.go`)
- Avoid hashing entire `.zip` exports on resume by using a size+mtime cache key for the extraction directory. (`cmd/msgvault/cmd/import_mbox.go`)
- Make `parseFromLineDate` parse only the expected 5/6-token prefix so `remote from ...` suffixes don’t break fallback date parsing. (`internal/importer/mbox_import.go`)
- Use SQLite `RETURNING id` in `UpsertMessage` to avoid an extra `SELECT id` per message. (`internal/store/messages.go`)
Proof:
- Relevant code: `internal/export/store_attachment.go#L45`, `cmd/msgvault/cmd/import_mbox.go#L365`, `internal/importer/mbox_import.go#L513`, `internal/store/messages.go#L120`
- Build: `GOCACHE=/tmp/go-build go build ./...`
- Tests: `GOCACHE=/tmp/go-build go test ./...`
Changes:
- Fix `StoreAttachmentFile` concurrent-writer false hash/size mismatches by writing to a temp file and renaming into place.
- Add a concurrent-writer unit test for attachment storage de-duping.
- Make `import-mbox` zip extraction cache key depend on zip entry names/sizes/CRC32 to avoid reusing stale extractions.
- Share mbox `"From "` separator date parsing via `mbox.ParseFromSeparatorDate` and use it for importer fallback dates.
- Add a fallback in `UpsertMessage` when SQLite doesn’t support `RETURNING` (Exec + SELECT).
Proof:
- `GOCACHE=/tmp/go-build go build ./...`
- `GOCACHE=/tmp/go-build go test ./...`
Changes:
- Recompute SHA-256 in `StoreAttachmentFile` and reject provided `ContentHash` values that don’t match the attachment bytes (plus a unit test).
- Make `copyWithLimit` fail fast on `(0, nil)` reads (returns `io.ErrNoProgress`) to avoid hangs (plus a unit test).
- Strengthen zip extraction cache validation by checking cached mbox CRC32 (plus a unit test).
- Verified `GOCACHE=/tmp/go-build go build ./...` and `GOCACHE=/tmp/go-build go test ./...`.
Changes:
- Make zip extracted-mbox cache validation O(1) by default (size/limit checks; optional CRC via `MSGVAULT_ZIP_CACHE_VALIDATE_CRC32`)
- Reuse cached extractions even when the zip central directory omits uncompressed sizes (no hard-fail on “unknown size”)
- Strengthen zip cache key by including zip file size + modtime in addition to entry metadata
- Fix zip name disambiguation to derive the extension from the sanitized output name (fallback to original entry)
Changes:
- Make attachment store failures abort ingest before writing `message_raw`, so reruns retry instead of skipping.
- Return `context.Canceled` for `import-mbox` interruptions and map cancellation to exit code 130.
- Enable CRC32 validation by default for small, size-known extracted ZIP cache files; add regression tests for cache corruption, interrupt status, and attachment retry.
Proof:
- Build: `GOCACHE=/tmp/go-build go build ./...`
- Tests: `GOCACHE=/tmp/go-build go test ./...`
- Added tests: `TestImportMbox_RerunRetriesAttachmentsAfterStoreFailure`, `TestImportMboxCmd_ReturnsCanceledWhenContextCanceled`, `TestExtractMboxFromZip_CacheValidationRejectsSameSizeCRCMismatchByDefault`
Changes:
- Run CRC32 cache validation for extracted `.mbox/.mbx` even when the zip central directory reports unknown uncompressed size (when under the CRC threshold), and update the unknown-size cache test to expect tamper rejection.
- Remove zip `ModTime` from `zipMboxCacheKey` to avoid redundant extraction dirs; add a test that touching the zip doesn’t invalidate the cache.
- Clarify the attachment-store rename race comment and document the Windows `openNoFollow` limitation.
Proof:
- `GOCACHE=/tmp/go-build go build ./...` (ok)
- `GOCACHE=/tmp/go-build go test ./...` (ok)
Changes:
- Treat zip entry `Close()` (CRC/integrity) errors as extraction failures and delete the partially written output.
- Warn when cached zip extraction CRC32 validation is skipped due to size.
- Add `ParseFromSeparatorDateStrict` and use it for sent_at fallback to avoid incorrect UTC from unknown TZ abbreviations.
- Add tests for zip checksum corruption and strict From-line TZ parsing.
- Verified: `GOCACHE=/tmp/go-build go build ./...`; `GOCACHE=/tmp/go-build go test ./...`.
Changes:
- Prefix MBOX `source_message_id` with a per-file discriminator to avoid cross-file collisions.
- Treat `import-mbox` CLI as non-failing for soft errors; exit non-zero only when the import reports hard ingest failures.
- On `os.Rename` failure during attachment storage, validate and reuse an existing destination file (better concurrent-writer handling on Windows).
- Document that permissive `From ` separator detection can mis-split unescaped body lines in edge cases.
Verification:
- `GOCACHE=/tmp/go-build go build ./...`
- `GOCACHE=/tmp/go-build go test ./...`
Changes:
- Normalize provided attachment `ContentHash` to lowercase before validation/compare so uppercase hex digests are accepted and canonicalized.
- Keep `messages.has_attachments` / `attachment_count` reflecting MIME reality even under `--no-attachments` (skip storing attachment rows/files) and clarify the flag help text.
- Remove the unused `attachmentErrors` return path by simplifying the MBOX ingest hook to return only `error`, updating call sites/tests.
- Refine ZIP cache `SizeKnown` heuristic to avoid relying on `CRC32 == 0`, and update the related cache validation test.
- Add regression tests for uppercase `ContentHash` acceptance and `--no-attachments` attachment-metadata behavior; verified with `GOCACHE=/tmp/go-build go build ./...` and `GOCACHE=/tmp/go-build go test ./...`.
Changes:
- `internal/importer/mbox_import.go`, `internal/sync/sync.go`: use `len(att.Content)` when upserting attachments so the DB size matches the validated/stored bytes.
- `internal/importer/mbox_import.go`: generate `source_message_id` using a per-message ordinal instead of byte offsets to avoid duplicates from tiny offset shifts.
- `cmd/msgvault/cmd/import_mbox.go`: always CRC32-validate cached extracted `.mbox/.mbx` files by default (remove size-based skip) so large cached files can’t be silently tampered with at same size.
- `internal/export/open_nofollow_unix.go`, `internal/export/open_nofollow_other.go`: narrow the Unix build tag and add a non-Unix fallback to avoid non-Windows/non-Unix build breakage.
Changes:
- Make MBOX `source_message_id` stable across resume by using `nextOffset` (file position) instead of a monotonic ordinal.
- Derive `fileDisc` from a cheap content fingerprint (size + first/last 64KiB) so re-imports are idempotent across path changes.
- Remove redundant `seenOK` duplicate-guarding in `flushPending`, and add tests covering resume-after-hard-error and path-change idempotency.
Proof:
- `GOCACHE=/tmp/go-build go build ./...`
- `GOCACHE=/tmp/go-build go test ./...`
Changes:
- Validate resumed MBOX checkpoint offsets against file size and fail sync when the offset is beyond EOF (with test).
- Reduce zip cache validation I/O by skipping CRC32 checks for size-known cached mbox files unless explicitly enabled (tests updated).
- Fix symlink rejection for the zip extraction imports dir by checking the pre-`EvalSymlinks` path (with test).
- Add an option to disable mboxrd `>From ` unescaping (with test) and document the behavior.
- Clarify in `import-mbox --no-attachments` help that attachments won’t be backfilled on later reruns.
Changes:
- Always CRC32-validate zip extraction cache hits to avoid reusing corrupted extracted `.mbox` content.
- Tighten attachment directory permissions on existing directories and cache successful attachment validations to avoid repeated full-file SHA256 re-hashing.
- Make `import-mbox` `source_message_id` stable by using a per-file message sequence number and persist that sequence in checkpoints for correct resume behavior.
Verification: `GOCACHE=/tmp/go-build go build ./...` and `GOCACHE=/tmp/go-build go test ./...`.
Changes:
- Avoid full-file CRC32 rescans on cached ZIP reruns by default; still CRC-validate when ZIP uncompressed size is unknown, and allow opt-in CRC validation via `MSGVAULT_ZIP_CACHE_VALIDATE_CRC32`.
- Make attachment validation cache safer by including file `mtime` in the cache key so rewrites trigger re-validation.
- Reword MBOX discriminator comment to reflect it’s a content-derived fingerprint, not true file identity.
- Only exit with code 130 on `context.Canceled` when the top-level signal context is actually canceled.
@github-actions
Copy link

github-actions bot commented Feb 8, 2026

Security Review: 4 High/Medium Issues Found

Claude's automated security review identified potential security concerns. Please review each finding below.

Note: 1 low severity issue(s) were omitted to reduce noise.


🚨 Potential path traversal via mboxFileDiscriminator fallback (high)

Location: cmd/msgvault/cmd/import_mbox.go:119

When mboxFileDiscriminator() fails, it falls back to a hash of cpFile (line 122-124). If cpFile contains attacker-controlled symlinks or path components, this could allow path confusion attacks. The fileDisc is later used in source_message_id generation, which could enable collision attacks if an attacker can influence the file path. Always validate that cpFile is within expected boundaries and consider using only the resolved absolute path for hashing.


⚠️ Zip extraction parent directory symlink check insufficient (medium)

Location: cmd/msgvault/cmd/import_mbox.go:489

Line 489 checks if destDir parent is a symlink via Lstat, but this is a TOCTOU race: an attacker could replace the directory with a symlink between the check and subsequent operations. The extraction then proceeds to write files under this potentially attacker-controlled location. Consider using O_NOFOLLOW-style operations or validating the entire path chain atomically before extraction.


⚠️ Zip entry uncompressed size validation bypassed when zero (medium)

Location: cmd/msgvault/cmd/import_mbox.go:663

Lines 663-666 allow extracting entries with UncompressedSize64==0 by skipping size validation and treating them as 'best-effort cacheable'. An attacker could craft a zip with zero uncompressed size fields to bypass extraction limits, then deliver arbitrarily large files when extracted. The code does validate against limits.MaxEntryBytes during extraction, but the cache validation path (line 799) may accept oversized files if the original extraction succeeded via a race or manual file placement.


⚠️ Attachment directory symlink check bypassed after mkdir (medium)

Location: internal/export/store_attachment.go:68

After SecureMkdirAll (line 61), the code resolves symlinks in baseDir (line 58-60) but then checks subdirPath at line 72 without re-validating baseDir. If an attacker replaces baseDir with a symlink between EvalSymlinks and the subdirPath check, files could be written outside the attachments directory. Move the symlink validation of baseDir to immediately before file operations, or use O_NOFOLLOW-style opens throughout.


Powered by Claude 4.5 Sonnet — this is an automated review, false positives are possible.

@ruphy
Copy link
Author

ruphy commented Feb 8, 2026

@wesm it did 31 review rounds and still not done, but at this point it seems like it's doing a whole review of the codebase which honestly seems to fall outside the scope of this PR ;)

feel free to merge at any step

@wesm
Copy link
Owner

wesm commented Feb 8, 2026

Oh, did you just run refine without any guardrails? You need to be a little more in the loop than that. I’ll take a look when I can.

@roborev-ci
Copy link

roborev-ci bot commented Feb 9, 2026

roborev: Combined Review

Synthesis unavailable. Showing raw review outputs.

Agent: codex | Type: security | Status: done

Summary
I reviewed the new MBOX import flow, zip extraction hardening, attachment storage, and sync changes with a security-first lens. I found 1 low-severity denial‑of‑service risk related to zip extraction entry count.

Findings

  • Low cmd/msgvault/cmd/import_mbox.go:~250 — Zip extraction enforces byte limits but does not cap the number of extracted .mbox/.mbx entries. A crafted zip containing a very large count of tiny .mbox files can bypass size limits and exhaust inodes/CPU/time (zip‑bomb variant).
    Remediation: Add a maximum entry count (configurable), and reject archives whose central directory contains more than that number of relevant files before extraction.
Agent: gemini | Type: security | Status: done

This
review covers a large set of commits introducing an import-mbox feature. This is a security-sensitive change as it involves parsing and processing user-provided files (MBOX and ZIP archives), which can be a vector for various attacks.

The review found that the new feature has been implemented with significant attention to security.
Protections against path traversal (Zip Slip), resource exhaustion (Zip Bombs), and symlink-based attacks are robustly implemented.

Furthermore, a refactoring related to the new functionality resolved a pre-existing high-severity TOCTOU (Time-of-Check, Time-of-Use) vulnerability in a
different part of the application.

Findings

  • Severity: High (Fixed)
  • File and line reference: internal/sync/sync.go
  • Description: The previous implementation of the storeAttachment function (in internal/sync/sync.go) was
    vulnerable to a TOCTOU race condition. The function first checked if a file existed with os.Stat and then, if it didn't, wrote the new attachment file. An attacker could create a symbolic link at the expected file path between the check and the write operation. This could lead to an arbitrary file write,
    allowing an attacker to overwrite files the user has write-access to.
  • Remediation: This vulnerability was fixed by refactoring the attachment handling logic. The new export.StoreAttachmentFile function is now used for all attachment storage. This hardened function avoids the race condition by writing to a temporary file and then using
    an atomic os.Rename operation. It also incorporates additional crucial security checks, such as using O_NOFOLLOW on Unix to prevent opening symlinks and validating that parent directories are not symlinks. This is a comprehensive fix that effectively mitigates the vulnerability.

No new issues found.

Agent: codex | Type: review | Status: done

Findings

  1. Mediuminternal/importer/mbox_import.go:~300
    The per‑file sequence counter (msgSeq) is only incremented on successful r.Next() calls. If a message read fails (e.g., ErrMessageTooLarge), msgSeq does not advance, but the reader has already moved past that message. If the user later re‑imports after raising MaxMessageBytes or fixing the file, subsequent messages get different source_message_ids and will be duplicated instead of updated.
    Suggested fix: increment msgSeq (and update lastCheckpointSeq) when r.Next() returns an error that still advanced to a new separator (e.g., errors.Is(err, mbox.ErrMessageTooLarge) or when r.NextFromOffset() moves forward). Add a regression test: fail on a large message, re‑run with a higher limit, and assert no duplicates.

  2. Lowcmd/msgvault/cmd/import_mbox.go:~210 (in extractMboxFromZipWithLimits / expectedMboxFilesFromZip)
    Zip extraction limits cap bytes but not entry count. A zip with millions of tiny .mbox files can still exhaust inodes/CPU (zip‑bomb variant) while staying under byte limits.
    Suggested fix: add a configurable max entry count (checked from the central directory before extraction) and a test that rejects zips above the limit.

Summary
Adds MBOX import support (CLI + importer), robust zip extraction and caching, a streaming MBOX reader with strict date parsing, attachment storage hardening with hash validation, and sync changes that repair missing raw data, plus extensive tests and doc updates.

Open Questions / Assumptions

  • I’m assuming source_message_id is intended to be stable across re‑imports even after earlier read errors. If re‑imports after errors are explicitly “best effort” and duplication is acceptable, the msgSeq issue may be moot.
Agent: gemini | Type: review | Status: done

This change adds an import-mbox command for importing emails from MBOX files, including those within zip archives, and refactors attachment handling to improve security.

Review Findings

  • Severity: Low
  • Location: cmd/msgvault/cmd/import_mbox.go (function extractMboxFromZipWithLimits)
  • Problem: The zip extraction logic limits the total byte size of extracted files but does not limit the number of file entries within
    the archive. A malicious zip file could contain an extremely large number of small or empty .mbox files, bypassing the size-based protection. This could lead to resource exhaustion (inodes, CPU, time) during extraction, causing a denial of service. This issue was noted in a previous review but was not addressed.
  • Fix:
    Before iterating through the zip file entries for extraction, add a check to ensure the total number of files (len(zr.File)) does not exceed a reasonable, configurable limit.

@wesm
Copy link
Owner

wesm commented Feb 13, 2026

I haven't forgotten about this. I have been in triage mode this week and will work on reviewing and testing this with my own .mbox files when I am able, hopefully this weekend!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants