
fix(gstack-gbrain-sync): restore startLockKeepalive to prevent stale-lock takeover #1372

Closed
marko-durasic wants to merge 1 commit into garrytan:main from marko-durasic:fix/gbrain-sync-restore-lock-keepalive


@marko-durasic marko-durasic commented May 8, 2026

Problem

bin/gstack-gbrain-sync.ts keeps the 5-minute stale-lock takeover in acquireLock (file mtime older than STALE_LOCK_MS = 5 * 60 * 1000 → unlinkSync(LOCK_PATH) and proceed), but the keepalive that refreshed the lock file's mtime every 60 s while a long sync was running has been removed.

--full mode is documented in the file's own header as "honest ~25-35 min for big Macs (ED2)." Any concurrent /sync-gbrain invocation arriving more than 5 minutes after a --full run started will:

  1. Stat the existing lock, see mtime is "stale," unlinkSync it.
  2. Write its own LOCK_PATH and proceed.
  3. Both processes then run runMemoryIngest / runBrainSyncPush and saveSyncState against ~/.gstack/.gbrain-sync-state.json simultaneously — the atomic tmp + rename write protects each individual save, but two writers still race; the last writer wins, the other one's last_stages is lost, and partial federated state can leak between runs.
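
The takeover sequence above can be reproduced in isolation. The following is a minimal sketch, not the code from bin/gstack-gbrain-sync.ts: acquireLockSketch, the temp-directory lock path, and the PIDs 111 and 222 are hypothetical, and the lock file is assumed to hold the owner's PID as plain text. Only STALE_LOCK_MS, the mtime check, unlinkSync, and the "wx" write come from the description.

```typescript
import { mkdtempSync, readFileSync, statSync, unlinkSync, utimesSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const STALE_LOCK_MS = 5 * 60 * 1000;

// Hypothetical stand-in for acquireLock: returns true if the caller now holds the lock.
function acquireLockSketch(lockPath: string, pid: number): boolean {
  try {
    const ageMs = Date.now() - statSync(lockPath).mtimeMs;
    if (ageMs > STALE_LOCK_MS) unlinkSync(lockPath); // lock looks abandoned: take it over
  } catch {
    // No lock file yet, so there is nothing to take over.
  }
  try {
    // "wx" creates the file exclusively and fails with EEXIST if it already exists.
    writeFileSync(lockPath, String(pid), { flag: "wx" });
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false;
    throw err;
  }
}

const raceDir = mkdtempSync(join(tmpdir(), "lock-race-sketch-"));
const raceLock = join(raceDir, "sync.lock");

// Process A (PID 111) starts a long run and takes the lock.
const firstAcquired = acquireLockSketch(raceLock, 111);

// A concurrent caller inside the 5-minute window is correctly refused.
const earlySecond = acquireLockSketch(raceLock, 222);

// Simulate six minutes passing with no keepalive by ageing the mtime.
const sixMinutesAgo = new Date(Date.now() - 6 * 60 * 1000);
utimesSync(raceLock, sixMinutesAgo, sixMinutesAgo);

// Process B (PID 222) now sees a "stale" lock and takes it while A is still running.
const lateSecond = acquireLockSketch(raceLock, 222);
const ownerAfterTakeover = readFileSync(raceLock, "utf8");
```

With the mtime left untouched for longer than STALE_LOCK_MS, the second caller ends up owning a lock that the first caller still believes it holds, which is the window in which both processes write the state JSON.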

Fix

Re-introduces three small things:

  1. utimesSync import in the fs import line.
  2. startLockKeepalive() function right after releaseLock() — refreshes lock mtime every 60 s, no-op if the lock file is gone or owned by a different PID, errors are swallowed (best-effort).
  3. lockKeepalive handle in main() — started right after acquireLock() returns true; cleared in the existing cleanup closure (which is also run on SIGINT/SIGTERM).
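
As a rough illustration of item 2, a keepalive of this shape could look like the sketch below. It is not the restored function from the diff: refreshLockMtime, startLockKeepaliveSketch, the KEEPALIVE_MS constant name, and the assumption that the lock file stores the owner's PID as plain text are all hypothetical. The 60 s period, the PID ownership check, the no-op on a missing file, and the swallowed errors follow the description.

```typescript
import { mkdtempSync, readFileSync, statSync, utimesSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const KEEPALIVE_MS = 60 * 1000; // hypothetical name for the 60 s refresh period

// One keepalive tick: bump the lock's mtime only if ownerPid still owns the file.
// Returns true when the mtime was refreshed, false on any no-op path.
function refreshLockMtime(lockPath: string, ownerPid: number): boolean {
  try {
    const holder = readFileSync(lockPath, "utf8").trim();
    if (holder !== String(ownerPid)) return false; // owned by a different PID
    const now = new Date();
    utimesSync(lockPath, now, now);
    return true;
  } catch {
    return false; // best-effort: a missing lock file or an I/O error is ignored
  }
}

// Hypothetical counterpart of startLockKeepalive: returns the interval handle so
// the caller can clear it during cleanup.
function startLockKeepaliveSketch(
  lockPath: string,
  ownerPid: number,
  intervalMs: number = KEEPALIVE_MS,
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    refreshLockMtime(lockPath, ownerPid);
  }, intervalMs);
}

const keepaliveDir = mkdtempSync(join(tmpdir(), "keepalive-sketch-"));
const keepaliveLock = join(keepaliveDir, "sync.lock");
writeFileSync(keepaliveLock, "111", { flag: "wx" });

// Age the lock to four minutes, then run a single tick as owner 111.
const fourMinutesAgo = new Date(Date.now() - 4 * 60 * 1000);
utimesSync(keepaliveLock, fourMinutesAgo, fourMinutesAgo);
const refreshedByOwner = refreshLockMtime(keepaliveLock, 111);
const ageAfterRefreshMs = Date.now() - statSync(keepaliveLock).mtimeMs;

// A tick from a non-owner and a tick on a missing file are both no-ops.
const refreshedByOther = refreshLockMtime(keepaliveLock, 222);
const refreshedMissing = refreshLockMtime(join(keepaliveDir, "absent.lock"), 111);

// Start and immediately stop the timer so this demo exits promptly.
const keepaliveHandle = startLockKeepaliveSketch(keepaliveLock, 111);
clearInterval(keepaliveHandle);
```

Because the refresh period (60 s) is well below STALE_LOCK_MS (5 minutes), a live owner would have to miss several consecutive ticks before a second caller could treat the lock as stale.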

Diff is +24/-1 lines, fully scoped to bin/gstack-gbrain-sync.ts.

Behavior matrix

| Mode | Before this PR | After this PR |
| --- | --- | --- |
| --dry-run | no lock taken | no lock taken (unchanged) |
| --incremental (~50 ms) | lock taken, released; keepalive never fires before release | unchanged in practice |
| --full (25-35 min) | concurrent invocation after 5 min races on state JSON | concurrent invocation correctly sees a fresh mtime, gets EEXIST on writeFileSync(... { flag: "wx" }), exits with code 2 and the documented "another /sync-gbrain is running" error |

Test plan

  • bun build bin/gstack-gbrain-sync.ts --target=bun — clean bundle, 18.20 KB entry point, no type errors.
  • Hand-traced the three call sites against the SIGINT/SIGTERM cleanup closure (interval is cleared before lock is released; both ordering paths are safe).
  • Live --full smoke test (would require a populated gbrain corpus; not executed here).
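
The cleanup ordering traced in the second item can be sketched as follows. This is an illustration under assumptions rather than the closure in main(): makeCleanup, the events log, the temp-directory lock path, and the idempotence guard are hypothetical, and the real releaseLock may perform checks that this plain unlinkSync does not.

```typescript
import { existsSync, mkdtempSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Records the order of cleanup steps so it can be inspected afterwards.
const cleanupEvents: string[] = [];

// Hypothetical cleanup factory: stop the keepalive first, then release the lock,
// so no timer tick can touch the lock file after it has been removed.
function makeCleanup(lockPath: string, keepalive: ReturnType<typeof setInterval>): () => void {
  let done = false;
  return () => {
    if (done) return; // safe to call from both the normal exit path and a signal handler
    done = true;
    clearInterval(keepalive);
    cleanupEvents.push("keepalive-cleared");
    try {
      unlinkSync(lockPath);
    } catch {
      // The lock is already gone; nothing to release.
    }
    cleanupEvents.push("lock-released");
  };
}

const cleanupDir = mkdtempSync(join(tmpdir(), "cleanup-sketch-"));
const cleanupLock = join(cleanupDir, "sync.lock");
writeFileSync(cleanupLock, "111", { flag: "wx" });

// Placeholder timer standing in for the keepalive interval.
const cleanupKeepalive = setInterval(() => {}, 60 * 1000);
const cleanup = makeCleanup(cleanupLock, cleanupKeepalive);

// In a real script the same closure would also be registered for SIGINT/SIGTERM.
cleanup();
cleanup(); // a second call is a no-op

const lockStillExists = existsSync(cleanupLock);
```

Clearing the interval before unlinking means a late tick cannot run against a lock path that another process may have just re-created.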

Discovery context

Surfaced while verifying behavior against the DuReef vendored mirror of gstack (we keep an inspiration/gstack/ tree pinned to a known-good SHA). The mirror at db9447c3 had this function; comparing against main HEAD (443bde05) showed it had been removed while the 5-min staleness check still implicitly depended on it. Filing here so DuReef and any other downstream consumers doing long --full runs don't hit the race after the next mirror sync.

Companion DuReef-side bookkeeping: DuReef/workspace#49 — read-only watch entry in our PORT-CANDIDATES.md so reviewers know to refuse this regression on next chore(inspiration) sync if this upstream PR isn't merged first.

Out of scope

  • No VERSION / CHANGELOG.md bump (left to maintainer's release cadence).
  • No tests added — the lock keepalive is a 60 s timer and effectively untestable without a long-running fixture; visual review of the closure + the existing dry-run smoke is the sane bar here.
  • No related changes to gstack-gbrain-detect's Tier 1 $mtype case (filed separately).

Made with Cursor



…lock takeover

`acquireLock` still classifies any lock file with `mtime` older than
`STALE_LOCK_MS` (5 minutes) as stale and takes it over. The keepalive
that refreshed the lock file's mtime every 60 s while a long sync ran
was removed at some point. With `--full` documented at "honest ~25-35
min for big Macs (ED2)," any second invocation arriving after 5 minutes
will incorrectly take the lock — two sync processes then race on the
shared state JSON (`~/.gstack/.gbrain-sync-state.json`).

This commit re-introduces:

  - `utimesSync` import from `fs`
  - `startLockKeepalive()` — refreshes lock mtime every 60 s while the
    PID matches the lock owner, best-effort error handling.
  - `lockKeepalive` interval handle in `main()`, started right after
    `acquireLock()` succeeds and cleared in the `cleanup` closure
    (also covered by SIGINT/SIGTERM cleanup).

No behavior change for `--dry-run` (still skips lock entirely) or for
short `--incremental` runs (steady-state ~50 ms — the keepalive simply
never fires).

Verified locally with `bun build bin/gstack-gbrain-sync.ts` (clean
bundle, 18.20 KB entry point).

Co-authored-by: Cursor <cursoragent@cursor.com>
@marko-durasic
Author

Sibling PR: #1373 (Tier 1 $mtype includes url for Streamable HTTP) — both surfaced together while reconciling DuReef's vendored gstack mirror. Independently mergeable; no shared files.
