Make jobs cancelled due to a soft stop immediately available#1290
Make jobs cancelled due to a soft stop immediately available#1290brandur wants to merge 1 commit into
Conversation
00b244b to
3c09064
Compare
|
@bgentry Do you agree with the rationale for this one? |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c090646e1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| if softStopped { | ||
| params := riverdriver.JobSetStateErrorAvailable(jobRow.ID, now, ptrutil.Ptr(max(jobRow.Attempt-1, 0)), errData, metadataUpdates) |
There was a problem hiding this comment.
Keep soft-stop records out of retry backoff
For jobs that are soft-stopped and then later fail for a real reason, this path resets attempt but still appends errData to errors. The default retry policy uses len(job.Errors)+1 rather than attempt, so a deployment stop still changes future retry behavior: the first genuine failure after one soft stop is treated like the second error and backs off for 16s instead of 1s, with repeated stops inflating it further. This means soft-stop cancellations still count for retry scheduling even though the attempt count is restored.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed! Changed this to be more like snooze and not persist the cancellation error at all.
While working on #1289, I realized that jobs which are "soft stopped" via context cancellation are still prone to the same side effects as if they errored in any other way: * Their number of attempts is incremented. * They may be discarded if reaching max attempts. * They'll have to wait to be retried according to retry policy. This doesn't really seem right because these jobs didn't actually misbehave in any way, but were rather just slow-to-run jobs that couldn't finish cleanly inside the default stop allowance while a client was restarting or being deployed. The proper behavior should probably be more like a snooze. i.e. The soft timeout cancellation doesn't count and the jobs get a chance to be retried immediately. Here, make that change.
3c09064 to
2e49010
Compare
While working on #1289, I realized that jobs which are "soft stopped"
via context cancellation are still prone to the same side effects as if
they errored in any other way:
This doesn't really seem right because these jobs didn't actually
misbehave in any way, but were rather just slow-to-run jobs that
couldn't finish cleanly inside the default stop allowance while a client
was restarting or being deployed.
The proper behavior should probably be more like a snooze. i.e. The soft
timeout cancellation doesn't count and the jobs get a chance to be
retried immediately. Here, make that change.