fix(machine-controller): wait for a zero-DPU host's boot NIC instead of faulting#2688
fix(machine-controller): wait for a zero-DPU host's boot NIC instead of faulting#2688chet wants to merge 1 commit into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
Summary by CodeRabbit
WalkthroughIntroduces ChangesBoot Interface Resolution and State Machine Integration
Sequence DiagramsequenceDiagram
participant CheckHostConfig as check_host_boot_config()
participant ResolveBootInterface as resolve_boot_interface()
participant SetBootOrderState as set_host_boot_order()<br/>(SetBootOrder branch)
participant StateHandlerOutcome as StateHandlerOutcome
CheckHostConfig->>ResolveBootInterface: resolve_boot_interface(mh_snapshot)
ResolveBootInterface-->>CheckHostConfig: Ready | AwaitingNic | Missing
alt AwaitingNic: zero-DPU host awaiting NIC discovery
CheckHostConfig-->>StateHandlerOutcome: HostBootConfigDecision::Wait()
else Missing: managed-DPU host without boot interface
CheckHostConfig-->>StateHandlerOutcome: StateHandlerError::GenericError
else Ready: boot interface discovered
CheckHostConfig->>SetBootOrderState: proceed to SetBootOrder
SetBootOrderState->>ResolveBootInterface: resolve_boot_interface(mh_snapshot)
ResolveBootInterface-->>SetBootOrderState: Ready | AwaitingNic | Missing
alt AwaitingNic path
SetBootOrderState-->>StateHandlerOutcome: SetBootOrderOutcome::Wait(reason)
else Ready path
SetBootOrderState-->>StateHandlerOutcome: success outcome
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai PTAL, thank you! |
|
(◕‿◕)ᓀ ✅ Action performedReview finished.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/machine-controller/src/boot_interface.rs (1)
81-115: 💤 Low valueUnit tests adequately cover all classification branches.
The separation of
classify_boot_interfacefromresolve_boot_interfaceis a sound design choice that permits direct unit testing without constructing fullManagedHostStateSnapshotfixtures. Each branch is exercised with appropriate assertions.Consider deriving
DebugonBootInterfaceResolutionto improve diagnostic output in assertion failure messages and tracing contexts.🔧 Optional: Add Debug derive for diagnostics
+#[derive(Debug)] pub enum BootInterfaceResolution { /// The boot interface resolved; target it. Ready(BootInterfaceTarget),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/boot_interface.rs` around lines 81 - 115, The BootInterfaceResolution enum lacks a Debug derive, which limits diagnostic output when assertions fail or during tracing. Locate the BootInterfaceResolution enum definition in the same file (above the test module) and add Debug to its derive macro alongside any existing derives to improve assertion failure messages and tracing contexts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/machine-controller/src/boot_interface.rs`:
- Around line 81-115: The BootInterfaceResolution enum lacks a Debug derive,
which limits diagnostic output when assertions fail or during tracing. Locate
the BootInterfaceResolution enum definition in the same file (above the test
module) and add Debug to its derive macro alongside any existing derives to
improve assertion failure messages and tracing contexts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cac1677d-a28d-4547-b87c-be96ad81e1ba
📒 Files selected for processing (2)
crates/machine-controller/src/boot_interface.rscrates/machine-controller/src/handler.rs
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai Thanks for the eyes. Updated -- derived |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
# Check the current state of the BootInterfaceResolution enum in boot_interface.rs
cat crates/machine-controller/src/boot_interface.rsLength of output: 4717 🏁 Script executed: #!/bin/bash
# List open review comments by coderabbit on PR `#2688`
gh pr view 2688 --repo NVIDIA/infra-controller --json reviews,comments
gh api repos/NVIDIA/infra-controller/pulls/2688/comments --jq '.[] | select(.user.login | test("coderabbit"; "i")) | {id: .id, body: .body[0:200], resolved: .resolved}'Length of output: 15089 🏁 Script executed: #!/bin/bash
# Get the pull request node ID and resolve the outstanding review thread
PR_NODE_ID=$(gh api repos/NVIDIA/infra-controller/pulls/2688 --jq '.node_id')
echo "PR node_id: $PR_NODE_ID"
# List review threads to find the unresolved one
gh api graphql -f query='
query {
repository(owner: "NVIDIA", name: "infra-controller") {
pullRequest(number: 2688) {
reviewThreads(first: 20) {
nodes {
id
isResolved
comments(first: 1) {
nodes {
body
author { login }
}
}
}
}
}
}
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.isResolved == false) | {id: .id, author: .comments.nodes[0].author.login, snippet: .comments.nodes[0].body[0:120]}'Length of output: 868
The implementation is clean and production-grade:
The PR is in good shape. ✅ 🐇 ✨ [resolve] |
…of faulting Boot configuration resolves the host's boot interface from its primary `machine_interface`. A zero-DPU host -- `NoDpu`, or a BlueField flipped to `NicMode` -- boots from a plain NIC that only takes its first HostInband lease once the host is up, so for a window after registration it has no boot interface yet. The boot-config checks (`check_host_boot_config` and the `SetBootOrder` / `CheckBootOrder` steps) now treat that as a wait-and-retry rather than an error. A host with managed DPUs always has its DPU-facing primary set at promotion, so a missing boot interface there is still a genuine fault. The wait-versus-fault decision is centralized in `resolve_boot_interface`: `check_host_boot_config` returns the existing `HostBootConfigDecision::Wait`, and `SetBootOrderOutcome` gains a matching `Wait` variant distinct from `WaitingForReboot` so the wait reads as an unmet precondition, not a reboot. Adds unit tests for the classification: a zero-DPU host with no boot interface waits, a DPU host without one faults, and a resolved interface is used as-is. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2688.docs.buildwithfern.com/infra-controller |
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Boot configuration resolves the host's boot interface from its primary
machine_interface. A zero-DPU host --NoDpu, or a BlueField flipped toNicMode-- boots from a plain NIC that only takes its first HostInband lease once the host is up, so for a window after registration it has no boot interface yet. The boot steps (CheckHostConfig,SetBootOrder,CheckBootOrder) now treat that as a wait-and-retry rather than an error. A host with managed DPUs always has its DPU-facing primary set at promotion, so a missing boot interface there is still a genuine fault.The wait-versus-fault decision is centralized in
resolve_boot_interface, andSetBootOrderOutcomegains aWaitvariant distinct fromWaitingForRebootso the wait reads as an unmet precondition, not a reboot. Adds unit tests for the classification: a zero-DPU host with no boot interface waits, a DPU host without one faults, and a resolved interface is used as-is.Supports #2633