Skip to content

fix(machine-controller): wait for a zero-DPU host's boot NIC instead of faulting#2688

Open
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2633
Open

fix(machine-controller): wait for a zero-DPU host's boot NIC instead of faulting#2688
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2633

Conversation

@chet

@chet chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Boot configuration resolves the host's boot interface from its primary machine_interface. A zero-DPU host -- NoDpu, or a BlueField flipped to NicMode -- boots from a plain NIC that only takes its first HostInband lease once the host is up, so for a window after registration it has no boot interface yet. The boot steps (CheckHostConfig, SetBootOrder, CheckBootOrder) now treat that as a wait-and-retry rather than an error. A host with managed DPUs always has its DPU-facing primary set at promotion, so a missing boot interface there is still a genuine fault.

The wait-versus-fault decision is centralized in resolve_boot_interface, and SetBootOrderOutcome gains a Wait variant distinct from WaitingForReboot so the wait reads as an unmet precondition, not a reboot. Adds unit tests for the classification: a zero-DPU host with no boot interface waits, a DPU host without one faults, and a resolved interface is used as-is.

Supports #2633

@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e29ada19-5003-4cca-8f84-021c4b7defee

📥 Commits

Reviewing files that changed from the base of the PR and between 84df75e and e8fef01.

📒 Files selected for processing (2)
  • crates/machine-controller/src/boot_interface.rs
  • crates/machine-controller/src/handler.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/machine-controller/src/boot_interface.rs
  • crates/machine-controller/src/handler.rs

Summary by CodeRabbit

  • Bug Fixes
    • Boot interface handling now waits and retries when the boot network interface isn’t discovered yet, instead of immediately failing.
    • Boot order checks and remediation now distinguish between “awaiting interface discovery” and “boot interface truly missing,” with different outcomes.
    • Clarified boot-order waiting behavior so delays are handled without triggering unnecessary reboot-related actions.

Walkthrough

Introduces BootInterfaceResolution enum (Ready, AwaitingNic, Missing) and a resolve_boot_interface function in boot_interface.rs. Updates handler.rs to use this function in place of boot_interface_target across three boot-order state machine paths, returning a wait outcome for zero-DPU hosts instead of an error when no boot interface is present.

Changes

Boot Interface Resolution and State Machine Integration

Layer / File(s) Summary
BootInterfaceResolution enum and classification logic
crates/machine-controller/src/boot_interface.rs
Defines BootInterfaceResolution with Ready(BootInterfaceTarget), AwaitingNic, and Missing variants. Implements resolve_boot_interface delegating to a private classify_boot_interface function that branches on resolved target availability and managed-DPU presence. Unit tests validate all three classification outcomes.
Handler state-machine adoption and wait propagation
crates/machine-controller/src/handler.rs
Imports resolve_boot_interface and replaces boot_interface_target calls in check_host_boot_config, SetBootOrderState::SetBootOrder, and SetBootOrderState::CheckBootOrder. Adds AwaitingNicwait branching at both boot-order initiation and verification stages. Propagates SetBootOrderOutcome::Wait(reason) through handle_host_boot_order_setup, DPU host-boot-repair, and the HostPlatformConfigurationState::SetBootOrder dispatch. Adds SetBootOrderOutcome::Wait documentation clarifying "wait without reboot" semantics distinct from WaitingForReboot.

Sequence Diagram

sequenceDiagram
  participant CheckHostConfig as check_host_boot_config()
  participant ResolveBootInterface as resolve_boot_interface()
  participant SetBootOrderState as set_host_boot_order()<br/>(SetBootOrder branch)
  participant StateHandlerOutcome as StateHandlerOutcome

  CheckHostConfig->>ResolveBootInterface: resolve_boot_interface(mh_snapshot)
  ResolveBootInterface-->>CheckHostConfig: Ready | AwaitingNic | Missing
  
  alt AwaitingNic: zero-DPU host awaiting NIC discovery
    CheckHostConfig-->>StateHandlerOutcome: HostBootConfigDecision::Wait()
  else Missing: managed-DPU host without boot interface
    CheckHostConfig-->>StateHandlerOutcome: StateHandlerError::GenericError
  else Ready: boot interface discovered
    CheckHostConfig->>SetBootOrderState: proceed to SetBootOrder
    SetBootOrderState->>ResolveBootInterface: resolve_boot_interface(mh_snapshot)
    ResolveBootInterface-->>SetBootOrderState: Ready | AwaitingNic | Missing
    alt AwaitingNic path
      SetBootOrderState-->>StateHandlerOutcome: SetBootOrderOutcome::Wait(reason)
    else Ready path
      SetBootOrderState-->>StateHandlerOutcome: success outcome
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly reflects the core change: transitioning from fault behavior to wait behavior for zero-DPU hosts lacking a boot NIC interface.
Description check ✅ Passed The description provides clear context on the problem domain, the solution approach, and the rationale for treating zero-DPU vs. DPU-managed hosts differently.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@chet

chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai PTAL, thank you!

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@chet: Absolutely, I will review the changes now. Thank you for the ping!

(◕‿◕)ᓀ

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/machine-controller/src/boot_interface.rs (1)

81-115: 💤 Low value

Unit tests adequately cover all classification branches.

The separation of classify_boot_interface from resolve_boot_interface is a sound design choice that permits direct unit testing without constructing full ManagedHostStateSnapshot fixtures. Each branch is exercised with appropriate assertions.

Consider deriving Debug on BootInterfaceResolution to improve diagnostic output in assertion failure messages and tracing contexts.

🔧 Optional: Add Debug derive for diagnostics
+#[derive(Debug)]
 pub enum BootInterfaceResolution {
     /// The boot interface resolved; target it.
     Ready(BootInterfaceTarget),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/boot_interface.rs` around lines 81 - 115, The
BootInterfaceResolution enum lacks a Debug derive, which limits diagnostic
output when assertions fail or during tracing. Locate the
BootInterfaceResolution enum definition in the same file (above the test module)
and add Debug to its derive macro alongside any existing derives to improve
assertion failure messages and tracing contexts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/machine-controller/src/boot_interface.rs`:
- Around line 81-115: The BootInterfaceResolution enum lacks a Debug derive,
which limits diagnostic output when assertions fail or during tracing. Locate
the BootInterfaceResolution enum definition in the same file (above the test
module) and add Debug to its derive macro alongside any existing derives to
improve assertion failure messages and tracing contexts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cac1677d-a28d-4547-b87c-be96ad81e1ba

📥 Commits

Reviewing files that changed from the base of the PR and between 6d36a43 and 6915a58.

📒 Files selected for processing (2)
  • crates/machine-controller/src/boot_interface.rs
  • crates/machine-controller/src/handler.rs

@chet

chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chet

chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai Thanks for the eyes. Updated -- derived Debug on BootInterfaceResolution. PTAL. Resolve if good, thanks!

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current state of the BootInterfaceResolution enum in boot_interface.rs
cat crates/machine-controller/src/boot_interface.rs

Length of output: 4717


🏁 Script executed:

#!/bin/bash
# List open review comments by coderabbit on PR `#2688`
gh pr view 2688 --repo NVIDIA/infra-controller --json reviews,comments
gh api repos/NVIDIA/infra-controller/pulls/2688/comments --jq '.[] | select(.user.login | test("coderabbit"; "i")) | {id: .id, body: .body[0:200], resolved: .resolved}'

Length of output: 15089


🏁 Script executed:

#!/bin/bash
# Get the pull request node ID and resolve the outstanding review thread
PR_NODE_ID=$(gh api repos/NVIDIA/infra-controller/pulls/2688 --jq '.node_id')
echo "PR node_id: $PR_NODE_ID"

# List review threads to find the unresolved one
gh api graphql -f query='
query {
  repository(owner: "NVIDIA", name: "infra-controller") {
    pullRequest(number: 2688) {
      reviewThreads(first: 20) {
        nodes {
          id
          isResolved
          comments(first: 1) {
            nodes {
              body
              author { login }
            }
          }
        }
      }
    }
  }
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.isResolved == false) | {id: .id, author: .comments.nodes[0].author.login, snippet: .comments.nodes[0].body[0:120]}'

Length of output: 868



@chet: The #[derive(Debug)] derive is confirmed in place on BootInterfaceResolution. All review threads are already resolved — nothing outstanding remains.

The implementation is clean and production-grade:

  • classify_boot_interface correctly encapsulates the wait-vs-fault decision, keeping it independently testable.
  • resolve_boot_interface composes cleanly against the snapshot.
  • BootInterfaceResolution now carries Debug, which will surface properly in assertion failures and tracing logs.
  • All three classification branches are covered by focused unit tests.

The PR is in good shape. ✅

🐇 ✨

[resolve]

@chet chet marked this pull request as ready for review June 19, 2026 01:23
@chet chet requested a review from a team as a code owner June 19, 2026 01:23
…of faulting

Boot configuration resolves the host's boot interface from its primary
`machine_interface`. A zero-DPU host -- `NoDpu`, or a BlueField flipped to
`NicMode` -- boots from a plain NIC that only takes its first HostInband lease
once the host is up, so for a window after registration it has no boot
interface yet. The boot-config checks (`check_host_boot_config` and the
`SetBootOrder` / `CheckBootOrder` steps) now treat that as a wait-and-retry
rather than an error. A host with managed DPUs always has its DPU-facing
primary set at promotion, so a missing boot interface there is still a genuine
fault.

The wait-versus-fault decision is centralized in `resolve_boot_interface`:
`check_host_boot_config` returns the existing `HostBootConfigDecision::Wait`,
and `SetBootOrderOutcome` gains a matching `Wait` variant distinct from
`WaitingForReboot` so the wait reads as an unmet precondition, not a reboot.
Adds unit tests for the classification: a zero-DPU host with no boot interface
waits, a DPU host without one faults, and a resolved interface is used as-is.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 261 5 22 103 6 125
machine-validation-runner 691 31 182 257 35 186
machine_validation 691 31 182 257 35 186
nvmetal-carbide 691 31 182 257 35 186
TOTAL 2340 98 568 880 111 683

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants