Fix Hosting-1 Windows runner crash by disabling test parallelization #15834

Open
radical wants to merge 9 commits into main from ankj/fix-hosting1-windows-hang

radical commented Apr 3, 2026

Description

Fixes the Hosting-1 (windows-latest) CI job repeatedly hanging and crashing the GitHub Actions runner ("lost communication" errors). See #15832.

Root Cause

On 2-core windows-latest runners, multiple CPU-intensive processes compete for limited resources during tests:

  1. xUnit test parallelization — multiple DCP instances run concurrently within a single test partition, each spawning child processes
  2. VBCSCompiler (Roslyn compiler server) — stays resident after build, consuming ~150% CPU / 700MB RAM
  3. Heartbeat monitor — spawns 5 PowerShell processes per cycle at the default 5s interval

Total CPU demand reaches 400-600% on a 200% capacity machine. The Runner.Worker agent starves, can't send heartbeats to GitHub, and the job is killed.
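
To put those figures in perspective, the oversubscription ratio can be worked out directly. The sketch below uses only the numbers quoted above (400-600% demand against 200% capacity); the function and constant names are illustrative and not part of this PR.

```python
def oversubscription(demand_pct: float, capacity_pct: float) -> float:
    """Return how many times CPU demand exceeds the available capacity."""
    return demand_pct / capacity_pct


CAPACITY_PCT = 200              # 2-core runner, as stated above
DEMAND_RANGE_PCT = (400, 600)   # estimated total demand, as stated above

low, high = (oversubscription(d, CAPACITY_PCT) for d in DEMAND_RANGE_PCT)
print(f"CPU oversubscribed {low:.0f}x to {high:.0f}x")
# prints: CPU oversubscribed 2x to 3x
```

At 2x to 3x oversubscription, every process, including Runner.Worker, gets on average only a third to a half of the CPU time it asks for.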

Changes

  1. Disable test parallelization on Windows CI — Pass /p:DisableTestParallelization=true on Windows build steps, which triggers an xunit.runner.json swap (via tests/Directory.Build.targets) to a config with parallelizeTestCollections: false
  2. Suppress VBCSCompiler — Set UseSharedCompilation=false env var on Windows test steps and add dotnet build-server shutdown before tests
  3. Increase heartbeat interval on Windows — From default 5s to 60s to reduce PowerShell process spawn overhead (each cycle spawns 5 powershell.exe processes for metrics collection)
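
As a rough illustration of why the heartbeat interval matters, the sketch below computes how many powershell.exe processes the monitor spawns per minute at the old and new intervals. It assumes exactly 5 spawns per cycle, as described above, and ignores per-process startup cost; the names are hypothetical.

```python
PROCESSES_PER_CYCLE = 5  # powershell.exe spawns per heartbeat cycle (from the description)


def spawns_per_minute(interval_s: int) -> float:
    """Process spawns per minute for a given heartbeat interval in seconds."""
    return PROCESSES_PER_CYCLE * 60 / interval_s


for interval_s in (5, 60):
    print(f"{interval_s:>2}s interval: {spawns_per_minute(interval_s):.0f} spawns/min")
# prints:
#  5s interval: 60 spawns/min
# 60s interval: 5 spawns/min
```

Moving from 5s to 60s cuts the spawn rate by a factor of 12 while still producing a heartbeat once a minute.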

Validation

Tested with reproduce-flaky-tests.yml workflow (Hosting partition 1, 8 runners × windows-latest):

| Configuration | Heartbeat | Result (passed/total) |
|---|---|---|
| Main branch (no fix) | 5s (default) | ~0/9 |
| All fixes applied | No heartbeat | 16/16 |
| All fixes applied | 60s interval | 14/16 |

The remaining ~12% failure rate with the 60s heartbeat (2 of 16 runs) is attributed to PowerShell overhead; #15887 was filed to replace the PowerShell calls with .NET APIs.
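
The ~12% figure follows from the table above. A minimal sketch of the arithmetic, using only the pass counts reported there (the dictionary keys and function name are illustrative):

```python
from fractions import Fraction

# (passed, total) pairs taken from the validation table above
results = {
    "All fixes, no heartbeat": (16, 16),
    "All fixes, 60s heartbeat": (14, 16),
}


def failure_rate(passed: int, total: int) -> float:
    """Fraction of runs that did not pass."""
    return float(1 - Fraction(passed, total))


for name, (passed, total) in results.items():
    print(f"{name}: {failure_rate(passed, total):.1%} failed")
# prints:
# All fixes, no heartbeat: 0.0% failed
# All fixes, 60s heartbeat: 12.5% failed
```

With only 16 runs per configuration, these rates are coarse estimates rather than precise failure probabilities.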

Fixes #15832

Copilot AI review requested due to automatic review settings April 3, 2026 04:37

github-actions bot commented Apr 3, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 15834

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 15834"

radical marked this pull request as draft April 3, 2026 04:38

Copilot AI left a comment

Pull request overview

This PR updates the reusable run-tests GitHub Actions workflow to reduce CPU/memory pressure on constrained Windows runners by preventing Roslyn’s compiler server (VBCSCompiler) from staying resident during test execution, addressing the “runner lost communication” failure pattern described in #15832.

Changes:

  • Add dotnet build-server shutdown steps after building test projects to terminate any leftover compiler server processes.
  • Set UseSharedCompilation=false in test execution step environments to prevent dotnet run-triggered builds (via DCP) from starting VBCSCompiler during tests.

radical force-pushed the ankj/fix-hosting1-windows-hang branch 2 times, most recently from 38ec5be to 36ad62f, April 3, 2026 07:02
radical changed the title from "Fix Hosting-1 Windows runner crash by preventing VBCSCompiler during tests" to "Fix Hosting-1 Windows runner crash by disabling test parallelization" Apr 3, 2026
radical force-pushed the ankj/fix-hosting1-windows-hang branch 2 times, most recently from f9dd947 to d811a47, April 3, 2026 17:49
Root cause: On 2-core windows-latest runners, xUnit parallelizes test
classes, causing multiple DCP instances + test services to run
concurrently. Each test that starts a DistributedApplication spawns DCP
processes and dotnet child processes. With parallel execution, CPU is
exhausted, starving the runner agent (Runner.Worker) and causing it to
lose communication with GitHub Actions.

Fix (3 changes):

1. Disable test parallelization (XunitAttributes.cs):
   Force Hosting tests to run sequentially with
   DisableTestParallelization=true + MaxParallelThreads=1 so only one
   DCP instance + test services are active at a time. This is the
   primary fix.

2. UseSharedCompilation=false + dotnet build-server shutdown (run-tests.yml):
   Prevents VBCSCompiler (Roslyn compiler server) from staying resident
   during tests. DCP launches projects via 'dotnet run' without
   --no-build, triggering MSBuild which starts VBCSCompiler consuming
   ~150% CPU / 700MB RAM on 2-core runners.

3. Heartbeat interval 5s -> 30s on Windows (run-tests.yml):
   Reduces PowerShell process spawn overhead (5 pwsh invocations per
   cycle, each with ~300-500ms startup cost).

Validation:
- reproduce-flaky-tests.yml on windows-latest (2-core):
  Before fix: 0/9 passed (all runners hung for 44+ min)
  After fix: 8/8 passed in 5-7 min each
- No larger runner needed; fix works on standard 2-core windows-latest

Fixes #15832

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
radical force-pushed the ankj/fix-hosting1-windows-hang branch from d811a47 to 05ad89b, April 3, 2026 18:25
radical and others added 3 commits April 3, 2026 17:05
Replace three conditional shutdown steps (after each build variant) with one
Windows-only step placed after all builds but before test execution. The command
is fast and idempotent, so a single call covers all build paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Create tests/github-actions/xunit.runner.json instead of reusing the Helix
config, so either can evolve independently.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Ensures the no-parallelization config is only used when actually running on
GitHub Actions, not when DisableTestParallelization is passed locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
radical force-pushed the ankj/fix-hosting1-windows-hang branch from 5c92be4 to a0f64ab, April 3, 2026 22:20
radical and others added 4 commits April 3, 2026 18:29
Pass --parallel none directly to xUnit in the Windows test steps instead of
swapping xunit.runner.json at build time. This is simpler: the entire fix is
now contained in run-tests.yml with no build-time configuration changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

CLI-only --parallel none proved insufficient (5/16 hung). Restore the build-time
DisableTestParallelization property that swaps xunit.runner.json, which was
validated at 16/16. Keep --parallel none as belt-and-suspenders.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The --parallel none CLI arg caused 3/16 hangs when combined with the
xunit.runner.json swap. The xunit.runner.json-only approach was validated
at 16/16 passing. Removing the CLI arg to avoid the conflict.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Validation data shows heartbeat contributes to runner hangs on 2-core
Windows runners due to 5 PowerShell processes spawned per cycle:
- No heartbeat: 16/16 pass
- 60s heartbeat: 14/16 pass
- 30s heartbeat: 13/16 pass

60s is a better balance between monitoring and CPU overhead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…unit config swap

IsGitHubActionsRunner (from eng/Testing.props) is more specific: it only
applies on GitHub Actions runners, not all CI environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
radical marked this pull request as ready for review April 4, 2026 02:10
radical requested review from davidfowl and joperezr April 4, 2026 02:13

radical commented Apr 4, 2026

I also opened #15887 to improve heartbeat.cs, which should reduce the overhead from that.

github-actions bot commented Apr 4, 2026

🎬 CLI E2E Test Recordings — 56 recordings uploaded (commit 2793a97)

View recordings
Test Recording
AddPackageInteractiveWhileAppHostRunningDetached ▶️ View Recording
AddPackageWhileAppHostRunningDetached ▶️ View Recording
AgentCommands_AllHelpOutputs_AreCorrect ▶️ View Recording
AgentInitCommand_DefaultSelection_InstallsSkillOnly ▶️ View Recording
AgentInitCommand_MigratesDeprecatedConfig ▶️ View Recording
AllPublishMethodsBuildDockerImages ▶️ View Recording
AspireAddPackageVersionToDirectoryPackagesProps ▶️ View Recording
AspireUpdateRemovesAppHostPackageVersionFromDirectoryPackagesProps ▶️ View Recording
Banner_DisplayedOnFirstRun ▶️ View Recording
Banner_DisplayedWithExplicitFlag ▶️ View Recording
Banner_NotDisplayedWithNoLogoFlag ▶️ View Recording
CertificatesClean_RemovesCertificates ▶️ View Recording
CertificatesTrust_WithNoCert_CreatesAndTrustsCertificate ▶️ View Recording
CertificatesTrust_WithUntrustedCert_TrustsCertificate ▶️ View Recording
ConfigSetGet_CreatesNestedJsonFormat ▶️ View Recording
CreateAndRunAspireStarterProject ▶️ View Recording
CreateAndRunAspireStarterProjectWithBundle ▶️ View Recording
CreateAndRunEmptyAppHostProject ▶️ View Recording
CreateAndRunJavaEmptyAppHostProject ▶️ View Recording
CreateAndRunJsReactProject ▶️ View Recording
CreateAndRunPythonReactProject ▶️ View Recording
CreateAndRunTypeScriptEmptyAppHostProject ▶️ View Recording
CreateAndRunTypeScriptStarterProject ▶️ View Recording
CreateJavaAppHostWithViteApp ▶️ View Recording
CreateStartAndStopAspireProject ▶️ View Recording
CreateTypeScriptAppHostWithViteApp ▶️ View Recording
DashboardRunWithOtelTracesReturnsNoTraces ▶️ View Recording
DescribeCommandResolvesReplicaNames ▶️ View Recording
DescribeCommandShowsRunningResources ▶️ View Recording
DetachFormatJsonProducesValidJson ▶️ View Recording
DoctorCommand_DetectsDeprecatedAgentConfig ▶️ View Recording
DoctorCommand_WithSslCertDir_ShowsTrusted ▶️ View Recording
DoctorCommand_WithoutSslCertDir_ShowsPartiallyTrusted ▶️ View Recording
GlobalMigration_HandlesCommentsAndTrailingCommas ▶️ View Recording
GlobalMigration_HandlesMalformedLegacyJson ▶️ View Recording
GlobalMigration_PreservesAllValueTypes ▶️ View Recording
GlobalMigration_SkipsWhenNewConfigExists ▶️ View Recording
GlobalSettings_MigratedFromLegacyFormat ▶️ View Recording
InvalidAppHostPathWithComments_IsHealedOnRun ▶️ View Recording
LegacySettingsMigration_AdjustsRelativeAppHostPath ▶️ View Recording
LogsCommandShowsResourceLogs ▶️ View Recording
PsCommandListsRunningAppHost ▶️ View Recording
PsFormatJsonOutputsOnlyJsonToStdout ▶️ View Recording
PublishWithDockerComposeServiceCallbackSucceeds ▶️ View Recording
RestoreGeneratesSdkFiles ▶️ View Recording
RestoreSupportsConfigOnlyHelperPackageAndCrossPackageTypes ▶️ View Recording
RunFromParentDirectory_UsesExistingConfigNearAppHost ▶️ View Recording
RunWithMissingAwaitShowsHelpfulError ▶️ View Recording
SecretCrudOnDotNetAppHost ▶️ View Recording
SecretCrudOnTypeScriptAppHost ▶️ View Recording
StagingChannel_ConfigureAndVerifySettings_ThenSwitchChannels ▶️ View Recording
StopAllAppHostsFromAppHostDirectory ▶️ View Recording
StopAllAppHostsFromUnrelatedDirectory ▶️ View Recording
StopNonInteractiveMultipleAppHostsShowsError ▶️ View Recording
StopNonInteractiveSingleAppHost ▶️ View Recording
StopWithNoRunningAppHostExitsSuccessfully ▶️ View Recording

📹 Recordings uploaded automatically from CI run #23969088750

Development

Successfully merging this pull request may close these issues.

Hosting-1 (windows-latest) test job hangs and crashes runner on main branch CI