feat(transport): add HTTP retry with exponential backoff#1520
Draft
This was referenced Feb 16, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store parsed fields (ts, count, uuid) alongside the path during the filter phase so handle_result and future debug logging can use them without re-parsing. Also improves sort performance by comparing numeric fields before falling back to string comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
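A minimal sketch of the idea in this commit, assuming a hypothetical `retry_item_t` record that holds the fields parsed once from `<ts>-<count>-<uuid>.envelope`. The struct, the function names, and the oldest-first ordering are illustrative assumptions, not the PR's actual code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: fields parsed once during the filter phase. */
typedef struct {
    uint64_t ts; /* timestamp taken from the filename */
    int count;   /* retry attempt counter */
    char uuid[37]; /* 36-char UUID plus NUL */
} retry_item_t;

/* Compare the cheap numeric fields first; only fall back to a string
 * comparison when both numbers are equal. */
static int
compare_retry_items(const void *a, const void *b)
{
    const retry_item_t *x = a;
    const retry_item_t *y = b;
    if (x->ts != y->ts) {
        return x->ts < y->ts ? -1 : 1;
    }
    if (x->count != y->count) {
        return x->count < y->count ? -1 : 1;
    }
    return strcmp(x->uuid, y->uuid);
}

/* Sort ascending by (ts, count, uuid); the order is an assumption. */
static void
sort_retry_items(retry_item_t *items, size_t n)
{
    qsort(items, n, sizeof(*items), compare_retry_items);
}
```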
Log retry attempts at DEBUG level and max-retries-reached at WARN level to make retry behavior observable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…writes Three places independently constructed <database>/cache and wrote envelopes there. Add cache_path to sentry_run_t and introduce sentry__run_write_cache() and sentry__run_move_cache() to centralize the cache directory creation and file operations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CURLOPT_TIMEOUT_MS is a total transfer timeout that could cut off large envelopes. Use CURLOPT_CONNECTTIMEOUT_MS instead so only connection establishment is bounded. For winhttp, limit resolve and connect to 15s but leave send/receive at their defaults. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without this, sentry__retry_send overcounts remaining files, causing an unnecessary extra poll cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure handle_result so "max retries reached" warnings only fire on actual network failures, not on successful delivery at the last attempt. Separate the warning logic from the cache/discard actions and put the re-enqueue branch first for clarity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
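The restructured decision could look roughly like the sketch below. All names here (`retry_action_t`, `decide`, the `count + 1 < max_attempts` convention) are hypothetical; the point is only that the re-enqueue branch comes first and that the warning depends on a network failure, not on which attempt it was.

```c
#include <stdbool.h>

typedef enum { ACTION_REENQUEUE, ACTION_CACHE, ACTION_DISCARD } retry_action_t;

typedef struct {
    retry_action_t action;
    bool warn_max_retries; /* true only for a failure on the final attempt */
} retry_decision_t;

/* count is the number of attempts already recorded (an assumed convention). */
static retry_decision_t
decide(bool network_failure, int count, int max_attempts, bool cache_enabled)
{
    retry_decision_t d = { ACTION_DISCARD, false };

    /* Re-enqueue first: a failed send that still has attempts left. */
    if (network_failure && count + 1 < max_attempts) {
        d.action = ACTION_REENQUEUE;
        return d;
    }

    /* Either the send succeeded or the attempts are used up. Warn only in
     * the second case, then pick cache or discard independently. */
    d.warn_max_retries = network_failure;
    d.action = cache_enabled ? ACTION_CACHE : ACTION_DISCARD;
    return d;
}
```

With this shape, a successful delivery on the last attempt yields no warning, while a failure on the last attempt warns and then falls through to the same cache/discard choice.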
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the `can_retry` bool on the transport with a `retry_func` callback, and expose `sentry_transport_retry()` as an experimental public API for explicitly retrying all pending envelopes, e.g. when coming back online. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move retry envelopes from a separate retry/ directory into cache/ so that sentry__cleanup_cache() enforces disk limits for both file formats out of the box. The two formats are distinguishable by length: retry files use <ts>-<count>-<uuid>.envelope (49+ chars) while cache files use <uuid>.envelope (45 chars). Default http_retries to 0 (opt-in). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
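To illustrate the length-based distinction, here is a small sketch. The enum and `classify_envelope_name` are hypothetical helpers, not functions from the PR; the lengths come from the commit message (36-character UUID plus `.envelope` is 45, and the shortest `<ts>-<count>-` prefix adds 4).

```c
#include <string.h>

#define CACHE_NAME_LEN 45     /* "<uuid>.envelope" */
#define RETRY_NAME_MIN_LEN 49 /* "<ts>-<count>-<uuid>.envelope" */

typedef enum { NAME_OTHER, NAME_CACHE, NAME_RETRY } name_kind_t;

/* Classify a file name in <database>/cache by suffix and length only.
 * A real parser would still need to validate the individual fields. */
static name_kind_t
classify_envelope_name(const char *name)
{
    static const char suffix[] = ".envelope";
    size_t suffix_len = sizeof(suffix) - 1;
    size_t len = strlen(name);

    if (len < suffix_len
        || strcmp(name + len - suffix_len, suffix) != 0) {
        return NAME_OTHER;
    }
    if (len == CACHE_NAME_LEN) {
        return NAME_CACHE;
    }
    if (len >= RETRY_NAME_MIN_LEN) {
        return NAME_RETRY;
    }
    return NAME_OTHER;
}
```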
When bgworker is detached during shutdown timeout, retry_poll_task can access retry->run->cache_path after sentry_options_free frees the run. Clone the path so it outlives options and is freed with the bgworker. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The bgworker_flush in sentry__retry_flush would delay its flush_task by min(delayed_task_time, timeout) when a 15-minute delayed retry_poll_task existed. This consumed the entire shutdown timeout, leaving 0ms for bgworker_shutdown, which then detached the worker thread. On Windows, winhttp_client_shutdown would close handles still in use by the detached thread, causing a crash. The flush is unnecessary because retry_flush_task is an immediate task and bgworker_shutdown already processes all immediate tasks before the shutdown_task runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit removed bgworker_flush from retry_flush, which caused a race between WinHTTP connect timeout (~2s) and bgworker shutdown (2s). Restore the flush and pass the full timeout to both flush and shutdown — after flush drains in-flight work, shutdown completes near-instantly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make retry count an internal constant (SENTRY_RETRY_ATTEMPTS = 5) and expose only a boolean toggle. Enabled by default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0 means infinite, not default. Pass 30000ms to match WinHTTP defaults. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a 'scheduled' flag with atomic compare-and-swap to ensure at most one retry_poll_task is queued at a time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
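The pattern can be sketched with C11 `<stdatomic.h>`; this is a stand-in for illustration, since the SDK may well use its own atomic helpers. `poll_state_t`, `try_schedule_poll`, and `poll_started` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool scheduled; /* true while a poll task sits in the queue */
} poll_state_t;

/* Returns true for exactly one caller until the flag is cleared again.
 * Only that caller should submit the delayed poll task. */
static bool
try_schedule_poll(poll_state_t *state)
{
    bool expected = false;
    return atomic_compare_exchange_strong(&state->scheduled, &expected, true);
}

/* Called when the poll task starts running, so a follow-up poll can be
 * queued again if envelopes remain. */
static void
poll_started(poll_state_t *state)
{
    atomic_store(&state->scheduled, false);
}
```

Because the compare-and-swap only succeeds on a `false` to `true` transition, concurrent callers cannot both enqueue a task.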
Move `sealed = 1` before `foreach_matching` in `retry_dump_queue` to prevent the detached worker from writing duplicate envelopes via `retry_enqueue` while the main thread is dumping the queue. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop the delayed retry_poll_task before bgworker_flush to prevent it from delaying the flush_task by min(retry_interval, timeout). Subtract elapsed flush time from the shutdown timeout so the total is bounded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
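The timeout arithmetic described here amounts to a saturating subtraction. A hedged sketch, with `remaining_timeout` as a hypothetical helper and all times in milliseconds from a monotonic clock:

```c
#include <stdint.h>

/* Time left for shutdown after the flush has used part of the budget.
 * Never underflows: returns 0 once the budget is spent, and treats a
 * clock that appears to run backwards as zero elapsed time. */
static uint64_t
remaining_timeout(uint64_t timeout_ms, uint64_t started_ms, uint64_t now_ms)
{
    uint64_t elapsed = now_ms >= started_ms ? now_ms - started_ms : 0;
    return elapsed >= timeout_ms ? 0 : timeout_ms - elapsed;
}
```

Passing the result to the shutdown step keeps flush plus shutdown within the single caller-supplied timeout.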
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the bgworker is detached after shutdown timeout, retry_dump_queue writes retry files and sets sealed=1. The detached thread could then run retry_flush_task and re-send those files, causing duplicates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The retry system writes cache files directly via its own paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
retry_trigger_task recursively re-triggered itself on network failure, bypassing exponential backoff (UINT64_MAX skips the backoff check) and burning through all 5 retry attempts in milliseconds. Since sentry__retry_send already processes all cached envelopes in a single call, the re-trigger is only ever reached on network failure — exactly the case where it's harmful. Make the trigger one-shot; failed items are left for the regular poll task which respects backoff. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cleanup_cache was gated on sentry__transport_can_retry, which checks for retry_func. Since retry_func is unconditionally set for all HTTP transports, this ran cleanup_cache even with http_retry disabled. Check the option directly instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reject negative counts in parse_filename (a corrupted filename like 123--01-<uuid>.envelope parses count=-1 via strtol). Also clamp the count in sentry__retry_backoff to prevent left-shift by a negative amount. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
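Both guards can be illustrated as follows. `parse_count`, `backoff_ms`, and `MAX_SHIFT` are hypothetical; the 15-minute base comes from the PR description, and the upper clamp of 5 is an assumption chosen so the result tops out at 8 hours.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BASE_INTERVAL_MS (15ULL * 60 * 1000) /* 15 minutes */
#define MAX_SHIFT 5 /* assumed cap: 15min << 5 == 8h */

/* Parse the "<count>-" part of a retry file name. strtol accepts a leading
 * minus sign, so "-01-<uuid>" would yield -1 unless rejected explicitly. */
static bool
parse_count(const char *s, int *count_out)
{
    char *end = NULL;
    errno = 0;
    long value = strtol(s, &end, 10);
    if (end == s || *end != '-' || errno == ERANGE || value < 0
        || value > INT_MAX) {
        return false;
    }
    *count_out = (int)value;
    return true;
}

/* Clamp before shifting: a negative or oversized shift count is undefined
 * behavior in C. */
static uint64_t
backoff_ms(int count)
{
    if (count < 0) {
        count = 0;
    }
    if (count > MAX_SHIFT) {
        count = MAX_SHIFT;
    }
    return BASE_INTERVAL_MS << count;
}
```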
…ters Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add HTTP retry with exponential backoff for network failures, modeled after Crashpad's upload retry behavior.
Failed envelopes are stored as `<db>/cache/<ts>-<n>-<uuid>.envelope` and retried on startup after a 100ms throttle, and then with exponential backoff (15min, 30min, 1h, 2h, 8h). When retries are exhausted and offline caching is enabled, envelopes are stored as `<db>/cache/<uuid>.envelope` instead of being discarded.

```mermaid
flowchart TD
    startup --> R{retry?}
    R -->|yes| throttle
    R -->|no| C{cache?}
    throttle -. 100ms .-> resend
    resend -->|success| C
    resend -->|fail| C2[<db>/cache/<br/><ts>-<n>-<uuid>.envelope]
    C2 --> backoff
    backoff -. 2ⁿ×15min .-> resend
    C -->|yes| CACHE[<db>/cache/<br/><uuid>.envelope]
    C -->|no| discard
```

See also: https://develop.sentry.dev/sdk/expected-features/#buffer-to-disk
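For reference, the two file layouts named above can be produced with plain `snprintf`. This is only a sketch: `retry_path` and `cache_path` are hypothetical helpers, the `/` separator ignores platform differences, and the unit of `ts` is left to the caller.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* <db>/cache/<ts>-<n>-<uuid>.envelope: an envelope waiting for a retry.
 * Returns the snprintf result (length needed, excluding the NUL). */
static int
retry_path(char *buf, size_t len, const char *db, uint64_t ts, int count,
    const char *uuid)
{
    return snprintf(
        buf, len, "%s/cache/%" PRIu64 "-%d-%s.envelope", db, ts, count, uuid);
}

/* <db>/cache/<uuid>.envelope: an envelope kept by offline caching. */
static int
cache_path(char *buf, size_t len, const char *db, const char *uuid)
{
    return snprintf(buf, len, "%s/cache/%s.envelope", db, uuid);
}
```

Moving an envelope from the first layout to the second then reduces to a rename within the same directory.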
Depends on:
See also: