
feat(transport): add HTTP retry with exponential backoff #1520

Draft

jpnurmi wants to merge 65 commits into master from jpnurmi/feat/http-retry

Conversation

jpnurmi (Collaborator) commented Feb 13, 2026

Add HTTP retry with exponential backoff for network failures, modeled after Crashpad's upload retry behavior.

Failed envelopes are stored as <db>/cache/<ts>-<n>-<uuid>.envelope and retried on startup after a 100ms throttle, then with exponential backoff (15min, 30min, 1h, 2h, 8h). When retries are exhausted and offline caching is enabled, envelopes are stored as <db>/cache/<uuid>.envelope instead of being discarded.

```mermaid
flowchart TD
    startup --> R{retry?}
    R -->|yes| throttle
    R -->|no| C{cache?}
    throttle -. 100ms .-> resend
    resend -->|success| C
    resend -->|fail| C2[&lt;db&gt;/cache/<br/>&lt;ts&gt;-&lt;n&gt;-&lt;uuid&gt;.envelope]
    C2 --> backoff
    backoff -. 2ⁿ×15min .-> resend
    C -->|yes| CACHE[&lt;db&gt;/cache/<br/>&lt;uuid&gt;.envelope]
    C -->|no| discard
```
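
A minimal, standalone sketch of the schedule above, mapping the retry count parsed from the filename to the next delay. The helper name and the table-based lookup are illustrative only, not the PR's actual implementation:

```c
#include <stdint.h>

/* Hypothetical helper: maps the retry count from
 * <ts>-<n>-<uuid>.envelope to the next backoff delay. The table
 * mirrors the schedule described above (15min, 30min, 1h, 2h, 8h). */
static uint64_t
retry_backoff_ms(int count)
{
    static const uint64_t intervals_min[] = { 15, 30, 60, 120, 480 };
    const int max_idx
        = (int)(sizeof(intervals_min) / sizeof(intervals_min[0])) - 1;
    if (count < 0) {
        count = 0; /* guard against corrupted filenames */
    } else if (count > max_idx) {
        count = max_idx;
    }
    return intervals_min[count] * 60 * 1000;
}
```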

See also: https://develop.sentry.dev/sdk/expected-features/#buffer-to-disk

Depends on:

See also:

github-actions bot commented Feb 13, 2026

Messages
📖 Do not forget to update Sentry-docs with your feature once the pull request gets approved.

Generated by 🚫 dangerJS against ffce486

jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from b083a57 to a264f66 on February 13, 2026 at 17:47
jpnurmi (Collaborator, Author) commented Feb 16, 2026

@sentry review

jpnurmi (Collaborator, Author) commented Feb 16, 2026

@cursor review

jpnurmi (Collaborator, Author) commented Feb 16, 2026

@sentry review

jpnurmi (Collaborator, Author) commented Feb 16, 2026

@cursor review

jpnurmi (Collaborator, Author) commented Feb 16, 2026

@cursor review

jpnurmi (Collaborator, Author) commented Feb 16, 2026

@cursor review

cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from 243c880 to d6aa792 on February 16, 2026 at 19:41
jpnurmi (Collaborator, Author) commented Feb 16, 2026

@cursor review

jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from fbbffb2 to abd5815 on February 17, 2026 at 08:34
jpnurmi (Collaborator, Author) commented Feb 17, 2026

@cursor review

jpnurmi force-pushed the jpnurmi/feat/http-retry branch from f030cf9 to 22e3fc4 on February 17, 2026 at 10:24
jpnurmi (Collaborator, Author) commented Feb 17, 2026

@sentry review

jpnurmi (Collaborator, Author) commented Feb 17, 2026

@cursor review

jpnurmi and others added 24 commits February 17, 2026 11:43
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store parsed fields (ts, count, uuid) alongside the path during the
filter phase so handle_result and future debug logging can use them
without re-parsing. Also improves sort performance by comparing
numeric fields before falling back to string comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
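
A sketch of the idea in this commit: carry the parsed fields next to the path and compare the numeric fields first. The struct and field names are placeholders, not the PR's actual types:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative shape of a pre-parsed retry file entry; the actual
 * struct and field names in the PR may differ. */
typedef struct {
    char *path;     /* <db>/cache/<ts>-<n>-<uuid>.envelope */
    uint64_t ts;    /* timestamp component of the filename */
    int count;      /* retry-count component of the filename */
    char uuid[37];  /* UUID component, NUL-terminated */
} retry_file_t;

/* Compare the numeric fields first and only fall back to a string
 * comparison when both are equal. */
static int
retry_file_cmp(const retry_file_t *a, const retry_file_t *b)
{
    if (a->ts != b->ts) {
        return a->ts < b->ts ? -1 : 1;
    }
    if (a->count != b->count) {
        return a->count < b->count ? -1 : 1;
    }
    return strcmp(a->uuid, b->uuid);
}
```
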
Log retry attempts at DEBUG level and max-retries-reached at WARN
level to make retry behavior observable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…writes

Three places independently constructed <database>/cache and wrote
envelopes there. Add cache_path to sentry_run_t and introduce
sentry__run_write_cache() and sentry__run_move_cache() to centralize
the cache directory creation and file operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
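
For illustration, a standalone POSIX sketch of the centralization idea; the real helpers operate on `sentry_run_t` and sentry's path abstraction, and their actual signatures are not shown in this thread:

```c
#include <stdio.h>
#include <sys/stat.h>

/* One function owns "<database>/cache" creation and the envelope
 * write, so the three former call sites no longer build the path
 * themselves. */
static int
write_cache_file(const char *database_path, const char *file_name,
    const char *data, size_t data_len)
{
    char cache_dir[1024];
    char file_path[1200];

    snprintf(cache_dir, sizeof(cache_dir), "%s/cache", database_path);
    /* EEXIST is fine; the directory may already be there. */
    mkdir(cache_dir, 0700);

    snprintf(file_path, sizeof(file_path), "%s/%s", cache_dir, file_name);
    FILE *f = fopen(file_path, "wb");
    if (!f) {
        return 1;
    }
    size_t written = fwrite(data, 1, data_len, f);
    fclose(f);
    return written == data_len ? 0 : 1;
}
```
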
CURLOPT_TIMEOUT_MS is a total transfer timeout that could cut off large
envelopes. Use CURLOPT_CONNECTTIMEOUT_MS instead so only connection
establishment is bounded. For winhttp, limit resolve and connect to 15s
but leave send/receive at their defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
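
A sketch of the libcurl side of this change; the timeout value is illustrative, and the WinHTTP counterpart (`WinHttpSetTimeouts`) is only referenced in a comment:

```c
#include <curl/curl.h>

/* Bound only connection establishment, not the whole transfer, so
 * large envelopes are not cut off mid-upload. */
static void
configure_connect_timeout(CURL *curl)
{
    /* CURLOPT_TIMEOUT_MS would cap the entire transfer; leave it unset. */
    curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT_MS, 30000L);
    /* On Windows, WinHttpSetTimeouts() plays the equivalent role,
     * bounding resolve/connect while leaving send/receive alone. */
}
```
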
Without this, sentry__retry_send overcounts remaining files, causing an
unnecessary extra poll cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure handle_result so "max retries reached" warnings only fire
on actual network failures, not on successful delivery at the last
attempt. Separate the warning logic from the cache/discard actions and
put the re-enqueue branch first for clarity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the `can_retry` bool on the transport with a `retry_func`
callback, and expose `sentry_transport_retry()` as an experimental
public API for explicitly retrying all pending envelopes, e.g. when
coming back online.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
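
A hypothetical usage sketch of the experimental API named in this commit; its exact signature is not visible in this thread, so the no-argument form below is an assumption:

```c
#include <sentry.h>

/* Called from application code when connectivity is restored. */
static void
on_network_restored(void)
{
    /* Assumed signature: retry all pending envelopes immediately
     * instead of waiting for the next backoff interval. */
    sentry_transport_retry();
}
```
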
Move retry envelopes from a separate retry/ directory into cache/ so
that sentry__cleanup_cache() enforces disk limits for both file formats
out of the box. The two formats are distinguishable by length: retry
files use <ts>-<count>-<uuid>.envelope (49+ chars) while cache files
use <uuid>.envelope (45 chars). Default http_retries to 0 (opt-in).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
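
A sketch of the length-based distinction described here: `<uuid>.envelope` is exactly 45 characters (36-character UUID plus `.envelope`), while `<ts>-<count>-<uuid>.envelope` is at least 49. The helper name is illustrative:

```c
#include <stdbool.h>
#include <string.h>

static bool
is_retry_file(const char *file_name)
{
    size_t len = strlen(file_name);
    /* "<uuid>.envelope" is 45 chars; anything >= 49 that still ends in
     * ".envelope" is treated as "<ts>-<count>-<uuid>.envelope". */
    return len >= 49 && strcmp(file_name + len - 9, ".envelope") == 0;
}
```
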
When bgworker is detached during shutdown timeout, retry_poll_task can
access retry->run->cache_path after sentry_options_free frees the run.
Clone the path so it outlives options and is freed with the bgworker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The bgworker_flush in sentry__retry_flush would delay its flush_task by
min(delayed_task_time, timeout) when a 15-minute delayed retry_poll_task
existed. This consumed the entire shutdown timeout, leaving 0ms for
bgworker_shutdown, which then detached the worker thread. On Windows,
winhttp_client_shutdown would close handles still in use by the detached
thread, causing a crash.

The flush is unnecessary because retry_flush_task is an immediate task
and bgworker_shutdown already processes all immediate tasks before the
shutdown_task runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit removed bgworker_flush from retry_flush, which
caused a race between WinHTTP connect timeout (~2s) and bgworker
shutdown (2s). Restore the flush and pass the full timeout to both
flush and shutdown — after flush drains in-flight work, shutdown
completes near-instantly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make retry count an internal constant (SENTRY_RETRY_ATTEMPTS = 5) and
expose only a boolean toggle. Enabled by default.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0 means infinite, not default. Pass 30000ms to match WinHTTP defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a 'scheduled' flag with atomic compare-and-swap to ensure at most
one retry_poll_task is queued at a time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
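
A standalone sketch of the one-task-at-a-time guard using C11 atomics; the PR's own code may use different primitives and names:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int g_poll_scheduled;

/* Only the caller that flips 0 -> 1 gets to enqueue the poll task. */
static bool
try_schedule_poll(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&g_poll_scheduled, &expected, 1);
}

/* Cleared when the task runs, allowing the next schedule. */
static void
poll_task_finished(void)
{
    atomic_store(&g_poll_scheduled, 0);
}
```
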
Move `sealed = 1` before `foreach_matching` in `retry_dump_queue` to
prevent the detached worker from writing duplicate envelopes via
`retry_enqueue` while the main thread is dumping the queue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop the delayed retry_poll_task before bgworker_flush to prevent it
from delaying the flush_task by min(retry_interval, timeout). Subtract
elapsed flush time from the shutdown timeout so the total is bounded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the bgworker is detached after shutdown timeout, retry_dump_queue
writes retry files and sets sealed=1. The detached thread could then
run retry_flush_task and re-send those files, causing duplicates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The retry system writes cache files directly via its own paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
retry_trigger_task recursively re-triggered itself on network failure,
bypassing exponential backoff (UINT64_MAX skips the backoff check) and
burning through all 5 retry attempts in milliseconds.

Since sentry__retry_send already processes all cached envelopes in a
single call, the re-trigger is only ever reached on network failure —
exactly the case where it's harmful. Make the trigger one-shot; failed
items are left for the regular poll task which respects backoff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cleanup_cache was gated on sentry__transport_can_retry, which checks
for retry_func. Since retry_func is unconditionally set for all HTTP
transports, this ran cleanup_cache even with http_retry disabled.
Check the option directly instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reject negative counts in parse_filename (a corrupted filename like
123--01-<uuid>.envelope parses count=-1 via strtol). Also clamp the
count in sentry__retry_backoff to prevent left-shift by a negative
amount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
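
A sketch of the two guards this commit describes; the function names and the clamp limit are illustrative:

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Reject a negative count parsed out of a corrupted filename such as
 * "123--01-<uuid>.envelope", where strtol() happily returns -1. */
static int
parse_count_field(const char *field)
{
    char *end = NULL;
    long count = strtol(field, &end, 10);
    if (end == field || count < 0 || count > INT_MAX) {
        return -1; /* signal "unparsable" to the caller */
    }
    return (int)count;
}

/* Clamp before shifting so a bad count can never produce undefined
 * behavior via a negative or oversized shift amount. */
static uint64_t
backoff_shift_ms(int count, uint64_t base_ms)
{
    if (count < 0) {
        count = 0;
    } else if (count > 5) {
        count = 5;
    }
    return base_ms << count;
}
```
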
…ters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jpnurmi force-pushed the jpnurmi/feat/http-retry branch from 22e3fc4 to ffce486 on February 17, 2026 at 10:49
jpnurmi (Collaborator, Author) commented Feb 17, 2026

@cursor review

cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR
