fix(site-explorer): enforce the declared DPU mode on fallback-serial matches#2687
fix(site-explorer): enforce the declared DPU mode on fallback-serial matches#2687chet wants to merge 1 commit into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Walkthrough
ChangesNIC-mode enforcement on fallback DPU serial match
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai PTAL, thank you! |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
REPO="NVIDIA/infra-controller"
PR_NUMBER="2687"
echo "=== PR Info ==="
gh pr view "$PR_NUMBER" --repo "$REPO" --json title,body,files,additions,deletions,headRefName,baseRefName
echo ""
echo "=== Changed Files ==="
gh pr diff "$PR_NUMBER" --repo "$REPO"Length of output: 15464
Summary — No blocking issues. The minor logging enhancement in point 3 is worth considering, but the fix is sound and the regression test is thorough. Ready to merge at your discretion. |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/site-explorer/src/lib.rs (1)
1163-1213: 💤 Low valueConsider adding
dpu_snas a structured field in theModeCheckFailedwarning.The warning at lines 1203–1208 logs
dpu = %dpu_ep.addressanderror = %err. Adding the serial number (dpu_sn) as a structured field would help operators correlate with thefallback_dpu_serial_numbersconfigured in the expected-machine manifest, simplifying triage when multiple DPUs are in play.♻️ Suggested enhancement
DiscoveredDpu::ModeCheckFailed(err) => { tracing::warn!( dpu = %dpu_ep.address, + dpu_sn = %dpu_sn, error = %err, "failed to check fallback-matched DPU mode; skipping this device this pass", ); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/site-explorer/src/lib.rs` around lines 1163 - 1213, The warning log in the DiscoveredDpu::ModeCheckFailed match arm is missing the serial number context which would help operators correlate failures with the fallback_dpu_serial_numbers from the expected-machine manifest. Add dpu_sn as a structured field to the tracing::warn! call alongside the existing dpu and error fields so operators can better trace issues when multiple DPUs are involved.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/site-explorer/src/lib.rs`:
- Around line 1163-1213: The warning log in the DiscoveredDpu::ModeCheckFailed
match arm is missing the serial number context which would help operators
correlate failures with the fallback_dpu_serial_numbers from the
expected-machine manifest. Add dpu_sn as a structured field to the
tracing::warn! call alongside the existing dpu and error fields so operators can
better trace issues when multiple DPUs are involved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 430ba90d-f0af-4fc3-86de-4c551c147465
📒 Files selected for processing (2)
crates/site-explorer/src/lib.rscrates/site-explorer/tests/site_explorer.rs
…matches The per-host matching loop now runs the same NIC/DPU mode check on a DPU paired through `fallback_dpu_serial_numbers` as it does on one the host reports over PCIe: a BlueField in the wrong mode gets `set_nic_mode` and the host is power-cycled to apply it, instead of being trusted as already configured. That reset now fires even when the host BMC never enumerated the DPU over PCIe -- the usual reason we end up on the fallback path -- so the queued flip can actually take effect. Until now the fallback path attached a matched DPU with no mode check. On a host the operator declared `nic_mode`, that DPU would then be dropped (a NIC-mode host has no managed DPUs), so the host registered as zero-DPU with the flip never issued -- the database read "NIC-mode host" while the BlueField stayed in DPU mode. Incomplete PCIe enumeration (a GB200 dropping a DPU from its inventory, say) is exactly what pushes a DPU-to-NIC migration onto this path, so this was the common case, not an edge. Adds a regression test -- a NIC-mode host whose DPU is paired only by fallback serial and still reporting DPU mode -- that fails on the old code (the host registers zero-DPU with no `set_nic_mode`) and passes now. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
|
@coderabbitai Thanks for the eyes. Updated -- added |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
REPO="NVIDIA/infra-controller"
PR_NUMBER="2687"
gh pr diff "$PR_NUMBER" --repo "$REPO" | grep -A 20 "ModeCheckFailed\|dpu_sn\|warn.*fallback"Length of output: 3123
tracing::warn!(
dpu = %dpu_ep.address,
dpu_sn = %dpu_sn,
error = %err,
"failed to check fallback-matched DPU mode; skipping this device this pass",
);Both the device address and the serial number are present as structured fields alongside the error — exactly what was requested. The fix is sound, the regression test is thorough, and the logging is production-ready. Resolving. 🐇 ✅ [resolve] |
The per-host matching loop now runs the same NIC/DPU mode check on a DPU paired through
fallback_dpu_serial_numbersas it does on one the host reports over PCIe: a BlueField in the wrong mode getsset_nic_modeand the host is power-cycled to apply it, instead of being trusted as already configured. That reset now fires even when the host BMC never enumerated the DPU over PCIe -- the usual reason we end up on the fallback path -- so the queued flip can actually take effect.Until now the fallback path attached a matched DPU with no mode check. On a host the operator declared
nic_mode, that DPU would then be dropped (a NIC-mode host has no managed DPUs), so the host registered as zero-DPU with the flip never issued -- the database read "NIC-mode host" while the BlueField stayed in DPU mode. Incomplete PCIe enumeration (a GB200 dropping a DPU from its inventory, say) is exactly what pushes a DPU-to-NIC migration onto this path, so this was the common case, not an edge.Adds a regression test -- a NIC-mode host whose DPU is paired only by fallback serial and still reporting DPU mode -- that fails on the old code (the host registers zero-DPU with no
set_nic_mode) and passes now.Supports #2631