Fix Hosting-1 Windows runner crash by disabling test parallelization #15834
Open
Contributor
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 15834
Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 15834"
Contributor
Pull request overview
This PR updates the reusable run-tests GitHub Actions workflow to reduce CPU/memory pressure on constrained Windows runners by preventing Roslyn’s compiler server (VBCSCompiler) from staying resident during test execution, addressing the “runner lost communication” failure pattern described in #15832.
Changes:
- Add `dotnet build-server shutdown` steps after building test projects to terminate any leftover compiler server processes.
- Set `UseSharedCompilation=false` in test execution step environments to prevent `dotnet run`-triggered builds (via DCP) from starting VBCSCompiler during tests.
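A rough sketch of how these two changes could look as workflow steps. This is not copied from run-tests.yml: the step names, the `TEST_PROJECT` variable, and the `dotnet build` / `dotnet test` invocations are placeholders assumed for illustration. Only `dotnet build-server shutdown` and the `UseSharedCompilation=false` setting come from the change list above.

```yaml
# Illustrative fragment only; names and commands other than
# `dotnet build-server shutdown` and UseSharedCompilation are assumptions.
steps:
  - name: Build test project
    # TEST_PROJECT is a hypothetical variable defined elsewhere in the workflow.
    run: dotnet build ${{ env.TEST_PROJECT }}

  - name: Shut down build servers
    # Stops any compiler server (VBCSCompiler) left over from the build.
    run: dotnet build-server shutdown

  - name: Run tests
    env:
      # MSBuild reads environment variables as properties, so builds started
      # by child processes during the tests inherit this setting.
      UseSharedCompilation: 'false'
    run: dotnet test ${{ env.TEST_PROJECT }} --no-build
```

Setting the value through `env` rather than a command-line property matters here because the builds in question are started indirectly by the tests, so the workflow has no command line of its own to pass a property on.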
38ec5be to 36ad62f
f9dd947 to d811a47
Root cause: On 2-core windows-latest runners, xUnit parallelizes test classes, causing multiple DCP instances + test services to run concurrently. Each test that starts a DistributedApplication spawns DCP processes and dotnet child processes. With parallel execution, CPU is exhausted, starving the runner agent (Runner.Worker) and causing it to lose communication with GitHub Actions.

Fix (3 changes):
1. Disable test parallelization (XunitAttributes.cs): Force Hosting tests to run sequentially with DisableTestParallelization=true + MaxParallelThreads=1 so only one DCP instance + test services are active at a time. This is the primary fix.
2. UseSharedCompilation=false + dotnet build-server shutdown (run-tests.yml): Prevents VBCSCompiler (Roslyn compiler server) from staying resident during tests. DCP launches projects via 'dotnet run' without --no-build, triggering MSBuild which starts VBCSCompiler consuming ~150% CPU / 700MB RAM on 2-core runners.
3. Heartbeat interval 5s -> 30s on Windows (run-tests.yml): Reduces PowerShell process spawn overhead (5 pwsh invocations per cycle, each with ~300-500ms startup cost).

Validation:
- reproduce-flaky-tests.yml on windows-latest (2-core):
  Before fix: 0/9 passed (all runners hung for 44+ min)
  After fix: 8/8 passed in 5-7 min each
- No larger runner needed; fix works on standard 2-core windows-latest

Fixes #15832

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
d811a47 to 05ad89b
Replace three conditional shutdown steps (after each build variant) with one Windows-only step placed after all builds but before test execution. The command is fast and idempotent, so a single call covers all build paths.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Create tests/github-actions/xunit.runner.json instead of reusing the Helix config, so either can evolve independently.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Ensures the no-parallelization config is only used when actually running on GitHub Actions, not when DisableTestParallelization is passed locally.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
5c92be4 to a0f64ab
Pass --parallel none directly to xUnit in the Windows test steps instead of swapping xunit.runner.json at build time. This is simpler: the entire fix is now contained in run-tests.yml with no build-time configuration changes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

CLI-only --parallel none proved insufficient (5/16 hung). Restore the build-time DisableTestParallelization property that swaps xunit.runner.json, which was validated at 16/16. Keep --parallel none as belt-and-suspenders.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The --parallel none CLI arg caused 3/16 hangs when combined with the xunit.runner.json swap. The xunit.runner.json-only approach was validated at 16/16 passing. Removing the CLI arg to avoid the conflict.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Validation data shows heartbeat contributes to runner hangs on 2-core Windows runners due to 5 PowerShell processes spawned per cycle:
- No heartbeat: 16/16 pass
- 60s heartbeat: 14/16 pass
- 30s heartbeat: 13/16 pass
60s is a better balance between monitoring and CPU overhead.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…unit config swap IsGitHubActionsRunner (from eng/Testing.props) is more specific — it only applies on GitHub Actions runners, not all CI environments. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
I also opened #15887 to improve
Contributor
🎬 CLI E2E Test Recordings — 56 recordings uploaded (commit View recordings
📹 Recordings uploaded automatically from CI run #23969088750
Description
Fixes the Hosting-1 (windows-latest) CI job repeatedly hanging and crashing the GitHub Actions runner ("lost communication" errors). See #15832.
Root Cause
On 2-core `windows-latest` runners, multiple CPU-intensive processes compete for limited resources during tests:
Total CPU demand reaches 400-600% on a 200% capacity machine. The Runner.Worker agent starves, can't send heartbeats to GitHub, and the job is killed.
Changes
- `/p:DisableTestParallelization=true` on Windows build steps, which triggers an xunit.runner.json swap (via `tests/Directory.Build.targets`) to a config with `parallelizeTestCollections: false`
- `UseSharedCompilation=false` env var on Windows test steps and add `dotnet build-server shutdown` before tests
- … `powershell.exe` processes for metrics collection)
Validation
Tested with `reproduce-flaky-tests.yml` workflow (Hosting partition 1, 8 runners × windows-latest):
The remaining ~12% with heartbeat is from PowerShell overhead; filed #15887 to replace it with .NET APIs.
Fixes #15832
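To illustrate the build-time switch listed under Changes, here is a minimal sketch of passing the property only on Windows. The step names and the `TEST_PROJECT` variable are hypothetical, and the actual steps in run-tests.yml may be structured differently. Only the `/p:DisableTestParallelization=true` property is taken from the description.

```yaml
# Illustrative fragment only; step names and TEST_PROJECT are assumptions.
steps:
  - name: Build test project (Windows, test parallelization disabled)
    if: runner.os == 'Windows'
    # Per the description, the property triggers the xunit.runner.json swap
    # at build time.
    run: dotnet build ${{ env.TEST_PROJECT }} /p:DisableTestParallelization=true

  - name: Build test project (other platforms)
    if: runner.os != 'Windows'
    run: dotnet build ${{ env.TEST_PROJECT }}
```

Keeping the condition on the build step leaves the other platforms on their default xUnit parallelization settings.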
Checklist