Problem Statement
main can break when a pull request passes CI on a stale branch head and GitHub then creates a final merge commit against a newer main. The PR checks never tested that merged result.
We saw this with PR #1870 and PR #1577. PR #1870 passed Branch Checks on stale head 29d57cc, whose merge-base did not include #1577. The final merge commit ff028ce0 combined #1870 TLS reload shutdown handling with #1577 compute watcher shutdown handling, introduced a duplicate shutdown_tx binding, and caused main Rust lint to fail in https://github.com/NVIDIA/OpenShell/actions/runs/27656754843/job/81792472668.
PR #1945 fixes that immediate break, but the integration gap remains.
Proposed Design
Enable GitHub merge queue for the protected main branch and require queued merge groups to pass the same gates required for normal PRs.
Implementation outline:
- Enable "Require merge queue" for the main branch protection/ruleset.
- Add the merge_group trigger to workflows that publish required PR gate inputs, especially:
  - .github/workflows/branch-checks.yml
  - .github/workflows/branch-e2e.yml
  - .github/workflows/helm-lint.yml
- Confirm .github/workflows/required-ci-gates.yml can evaluate and publish required gate statuses for merge-group runs, or update it so the required contexts are reported for merge queue validation.
- Keep the required contexts aligned with the existing PR gate contexts:
  - OpenShell / Branch Checks
  - OpenShell / E2E
  - OpenShell / GPU E2E
  - OpenShell / Helm Lint
- Document the expected maintainer workflow for adding a PR to the merge queue instead of merging directly.
GitHub documentation notes that merge queues validate PR changes applied to the latest target branch and any earlier queued changes, and that GitHub Actions workflows used as required checks must include the merge_group event: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue
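A minimal sketch of the trigger change, assuming branch-checks.yml currently runs on pull_request. The workflow name, job id, runner, and steps below are placeholders, not the real contents of the file; only the merge_group entry is the point.

```yaml
# Hypothetical excerpt of .github/workflows/branch-checks.yml.
# Everything except the merge_group trigger is a placeholder.
name: Branch Checks

on:
  pull_request:
  merge_group:
    # Activity type GitHub emits when a queued merge group needs its checks run.
    types: [checks_requested]

jobs:
  checks:                    # placeholder job id
    runs-on: ubuntu-latest   # placeholder runner
    steps:
      # On merge_group events the default checkout ref is the temporary
      # merge-group commit: the PR applied on top of the latest main plus
      # any earlier queued changes.
      - uses: actions/checkout@v4
      - name: Show commit under test
        run: echo "Testing ${{ github.sha }} for event ${{ github.event_name }}"
```

The same merge_group entry would go into branch-e2e.yml and helm-lint.yml. Whether required-ci-gates.yml needs further changes depends on how it derives the required contexts, which this sketch does not cover.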
Alternatives Considered
Require PR branches to be up to date before merging.
That would likely have caught the #1870/#1577 interaction too, because #1870 would have had to rerun checks after updating to include #1577. However, it pushes more manual branch-update work onto contributors and maintainers. Merge queue is a better fit for a busy main branch because it validates the final integration state without forcing every PR author to repeatedly rebase or merge main by hand.
Rely only on push CI after merge.
This detects breakage after main is already broken, which is what happened here.
Agent Investigation
- Used watch-github-actions to inspect failing run 27656754843 / job 81792472668.
- The failure was cargo clippy --workspace --all-targets -- -D warnings rejecting unused variable shutdown_tx in crates/openshell-server/src/lib.rs.
- git blame and history showed the production duplicate channel came from the interaction of:
  - e73745f1 adding compute watcher shutdown handling.
  - ff028ce0 adding TLS reload shutdown handling.
- PR #1870 passed its checks on stale head 29d57cc, which did not include #1577 (feat(gateway): add reconciler lease for HA multi-replica deployments). The broken code only existed in the final merge commit onto newer main.