Skip to content

feat(sync-service): hibernate before suspend to enable GC#4284

Draft
alco wants to merge 9 commits intomainfrom
alco/hibernate-then-suspend
Draft

feat(sync-service): hibernate before suspend to enable GC#4284
alco wants to merge 9 commits intomainfrom
alco/hibernate-then-suspend

Conversation

@alco
Copy link
Copy Markdown
Member

@alco alco commented May 6, 2026

Summary

When consumer suspend is enabled, consumers now hibernate first (triggering GC) before eventually suspending (terminating). Previously, consumers went directly to suspend without hibernating, and due to a large timeout for suspend, consumer processes could hold onto garbage memory for longer than necessary.

  • Add shape_suspend_after config (default 60s) - delay between hibernation and suspension
  • Consumer hibernates on hibernate_after timeout, schedules suspend timer
  • Suspend timer fires after suspend_after ms, terminates process if still idle
  • Any activity cancels the pending suspend timer, restarting the cycle
  • Update ConsumerRegistry.enable_suspend/4 to include suspend_after parameter

Test Plan

  • New test: consumer hibernates first, then suspends after suspend_after ms
  • New test: activity during hibernation cancels pending suspend timer
  • Updated existing enable_suspend test for new 4-arg function
  • All 36 consumer tests pass

🤖 Generated with Claude Code

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.82%. Comparing base (6399147) to head (6029612).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4284      +/-   ##
==========================================
+ Coverage   54.87%   58.82%   +3.95%     
==========================================
  Files         193      218      +25     
  Lines       19567    22112    +2545     
  Branches     5062     5704     +642     
==========================================
+ Hits        10737    13008    +2271     
- Misses       8828     9100     +272     
- Partials        2        4       +2     
Flag Coverage Δ
packages/agents 58.94% <ø> (ø)
packages/agents-runtime 80.22% <ø> (ø)
packages/agents-server 66.05% <ø> (ø)
packages/agents-server-ui 6.04% <ø> (ø)
packages/electric-ax 38.59% <ø> (ø)
packages/experimental 87.73% <ø> (?)
packages/react-hooks 86.48% <ø> (?)
packages/start 82.83% <ø> (?)
packages/typescript-client 94.27% <ø> (?)
packages/y-electric 56.05% <ø> (?)
typescript 58.82% <ø> (+3.95%) ⬆️
unit-tests 58.82% <ø> (+3.95%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@alco alco added the claude label May 6, 2026
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Claude Code Review

Summary

Iteration 2 review. The author addressed every actionable item from iteration 1 (changeset added, docstring/comment updates, nil guard on schedule_suspend_timer/1, integration test env var). However, on a closer second read, I found a real race in the "any activity cancels the pending suspend timer" mechanism that I missed (and arguably mis-praised) in iteration 1 — flagging it now.

What's Working Well (Carry-overs and New)

  • Backwards-compatible config plumbing. shape_suspend_after is wired through env var, runtime.exs, Config.@defaults, StackConfig, StackSupervisor NimbleOptions schema, and Application config — fully consistent with shape_hibernate_after / shape_enable_suspend?.
  • Generation-counter pattern (idea). Using a suspend_generation counter to make stale :erlang.send_after messages a no-op when they arrive is a clean, lock-free way to avoid the cancel-timer-and-flush-mailbox dance — provided the generation gets bumped on the events that should invalidate the timer (see Issue Working with Postgres WAL format from Elixir #1 below).
  • Integration test updated. shape-suspension-resumption.lux correctly threads the new ELECTRIC_SHAPE_SUSPEND_AFTER=200ms so the lux test still terminates promptly.
  • Iteration 1 follow-throughs. Changeset added (@core/sync-service: minor), enable_suspend/4 docstring rewritten to explain the hibernate-then-suspend chain, :configure_suspend comment updated, defensive schedule_suspend_timer(%{suspend_after: nil} = state) clause added.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

1. Activity does not actually cancel the pending suspend timer when it arrives in the last hibernate_after ms before the timer fires

File: packages/sync-service/lib/electric/shapes/consumer.ex (handlers throughout) and :411-420 (suspend handler)

Self-correction from iteration 1: My previous review described a reply_with_timeout/1 helper and cancel_suspend_timer/1 function — those don't exist in the code. The timer is not actually cancelled on activity; it is only invalidated via a generation bump, which only happens inside handle_info(:timeout, state) (line 397). Apologies for the confusion that praise may have caused.

Issue: Every activity handler (e.g. handle_call({:handle_event, ...}) at :201, handle_call({:subscribe_materializer, ...}) at :209, handle_info({Storage, :flushed, ...}) at :290, etc.) returns with state.hibernate_after as the next after-timeout but does not touch state.suspend_generation and does not cancel the pending :erlang.send_after. The generation only ticks forward when the GenServer's after-timeout fires next. That means:

  • Suspend timer scheduled at t=T to fire at T+suspend_after (gen=1).
  • Activity arrives at t=A (T < A < T+suspend_after). Handler returns with state.hibernate_after → next :timeout will fire at A+hibernate_after.
  • If A+hibernate_after < T+suspend_after: the next :timeout fires first, bumps gen to 2, original timer becomes stale. ✓
  • If A+hibernate_after ≥ T+suspend_after (i.e. A is in the last hibernate_after ms of the window): the suspend timer fires before the next :timeout. gen=1 still matches state.suspend_generation=1, consumer_suspend_enabled?/consumer_can_suspend? are still true, and the consumer is terminated despite the recent activity. ✗

Concrete bug zone with current defaults (hibernate_after=30s, suspend_after=60s): a transaction arriving anywhere in the last 30s of the 60s suspend window — i.e. 50% of the window — terminates the consumer even though it just did work. The enable_suspend/4 docstring example (60_000, 4 * 60_000, 60_000 * 20) gives a 25% bug zone.

Why it's not Critical: Suspended consumers are restartable on the next request, so this is an unnecessary tear-down/rebuild rather than data loss or a crash. But it directly contradicts the PR description ("Any activity cancels the pending suspend timer, restarting the cycle") and the docstring on handle_info({:suspend_timeout, generation}, state) — and erodes the value of having suspend_after at all (you wanted "X ms of real idleness before terminating," not "X ms after hibernation regardless of activity").

Why the new test passes anyway: consumer_test.exs:1564 uses hibernate_after=10ms, suspend_after=200ms, with activity at ~t=80ms. The next :timeout after activity fires at ~t=90ms — well before the original suspend timer at ~t=210ms — so gen gets bumped and the test passes. The bug zone here is only the last 10ms of 200ms, and the test never lands activity inside it.

Suggested fix (least invasive): bump the generation from any handler that resets state.hibernate_after. A small helper avoids touching every call site:

# Reply/no-reply with the standard hibernate-after timeout, marking any
# pending suspend timer as stale.
defp reply_with_hibernate(state, reply) do
  state = invalidate_suspend_timer(state)
  {:reply, reply, state, state.hibernate_after}
end

defp noreply_with_hibernate(state) do
  state = invalidate_suspend_timer(state)
  {:noreply, state, state.hibernate_after}
end

defp invalidate_suspend_timer(%{suspend_generation: gen} = state) do
  %{state | suspend_generation: gen + 1}
end

Then route the activity handlers through these helpers. (If you'd rather keep call sites explicit, just bump state.suspend_generation in each one.) Either way, an additional regression test that lands activity in the bug zone would be valuable, e.g. hibernate_after: 50, shape_suspend_after: 100 with activity at t≈90ms and an assertion that the process is still alive at t≈150ms.

Alternative: actually cancel the pending Erlang timer (:erlang.cancel_timer/1 + receive after 0), but the generation-bump approach is simpler and doesn't risk the cancel-vs-deliver race.

2. :configure_suspend doesn't invalidate a pending suspend timer either

File: packages/sync-service/lib/electric/shapes/consumer.ex:392-395

def handle_info({:configure_suspend, hibernate_after, suspend_after, jitter_period}, state) do
  state = %{state | hibernate_after: hibernate_after, suspend_after: suspend_after}
  {:noreply, state, Enum.random(hibernate_after..jitter_period)}
end

Same root cause as #1. If suspend was already enabled and a timer is pending (gen=1), then someone calls enable_suspend/4 again with new params, the new params are stored but the old gen=1 timer fires at its original time and terminates the consumer using the old suspend_after value. Bumping the generation here too (e.g. via the same invalidate_suspend_timer/1 helper) makes the call idempotent and predictable.

Suggestions (Nice to Have)

1. Add an explicit assertion in the existing test that the suspend timer was actually cancelled, not just delayed

File: packages/sync-service/test/electric/shapes/consumer_test.exs:1562-1625

The test currently waits 160ms past the original suspend window and then refute_receives. Once #1 is fixed, consider also asserting that state.suspend_generation > 1 after activity — that proves the cancellation mechanism fired, not that timing happened to favor us. Optional, but documents the contract.

2. suspend_generation overflow theoretical, but worth a sentence

File: packages/sync-service/lib/electric/shapes/consumer/state.ex:48

Bumping the generation on every activity will eventually wrap a BigNum… in many years. Not worth a code change, but a one-liner comment that the integer is unbounded by design and Erlang handles arbitrary integers natively will save a future reader the worry.

Issue Conformance

No linked issue (carry-over from iteration 1). PR description and code comments now claim "Any activity cancels the pending suspend timer, restarting the cycle"; once Issue #1 is fixed, that claim will actually be true. Until then, the description is more aspirational than accurate.

Previous Review Status

Iteration 1 item Status
Important #1 — Missing changeset ✅ Added (hibernate-then-suspend.md, minor bump)
Suggestion #1 — Stale enable_suspend/4 docstring ✅ Rewritten with hibernate-then-suspend explanation
Suggestion #2 — Stale :configure_suspend comment ✅ Updated
Suggestion #3 — Tight test timing margins ➖ Left as-is (in line with file conventions)
Suggestion #4suspend_after: nil guard schedule_suspend_timer/1 first clause matches nil
Suggestion #5suspend_after field placement ➖ Left at bottom (now grouped with suspend_generation, which is reasonable)

The two iteration-1 items that praised non-existent helpers (reply_with_timeout/1, cancel_suspend_timer/1) were inaccurate on my part; the code never had them. That misreading is the reason Issue #1 above wasn't flagged in iteration 1.


Review iteration: 2 | 2026-05-07

alco and others added 9 commits May 7, 2026 12:14
Add new config option to control the delay between hibernation and
suspension. Default is 60 seconds. This prepares for hibernate-then-suspend
behavior where consumers first hibernate (to trigger GC) before suspending.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…tate

Add state fields to track the scheduled suspend timer and the configured
delay between hibernation and suspension.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When suspend is enabled, consumers now:
1. Hibernate first on timeout (triggering GC)
2. Schedule a suspend timer for suspend_after ms later
3. Suspend (terminate) when the suspend timer fires

Any activity cancels the pending suspend timer, restarting the cycle.
This ensures GC runs before eventual process termination.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change enable_suspend/3 to enable_suspend/4, adding the suspend_after
parameter to configure the delay between hibernation and suspension.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests verifying:
- Consumer hibernates first, then suspends after suspend_after ms
- Activity during hibernation cancels the pending suspend timer
- Update enable_suspend test for new 4-arity function
- Update existing suspend test to use explicit suspend_after tag

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Clarify enable_suspend/4 docstring: jitter controls hibernation timing,
  suspension happens suspend_after ms after hibernation
- Update :configure_suspend handler comment to describe hibernate-then-suspend
- Add defensive guard for suspend_after: nil in schedule_suspend_timer/1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…n test

With hibernate-then-suspend, consumers hibernate first then wait
suspend_after before terminating. Set ELECTRIC_SHAPE_SUSPEND_AFTER=200ms
alongside ELECTRIC_SHAPE_HIBERNATE_AFTER=200ms so the test completes
within the expected timeframe.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@alco alco force-pushed the alco/hibernate-then-suspend branch from 3d7ade0 to 6029612 Compare May 7, 2026 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant