fix(site-explorer): enforce the declared DPU mode on fallback-serial matches by chet · Pull Request #2687 · NVIDIA/infra-controller

chet · 2026-06-18T06:20:54Z

The per-host matching loop now runs the same NIC/DPU mode check on a DPU paired through fallback_dpu_serial_numbers as it does on one the host reports over PCIe: a BlueField in the wrong mode gets set_nic_mode and the host is power-cycled to apply it, instead of being trusted as already configured. That reset now fires even when the host BMC never enumerated the DPU over PCIe -- the usual reason we end up on the fallback path -- so the queued flip can actually take effect.

Until now the fallback path attached a matched DPU with no mode check. On a host the operator declared nic_mode, that DPU would then be dropped (a NIC-mode host has no managed DPUs), so the host registered as zero-DPU with the flip never issued -- the database read "NIC-mode host" while the BlueField stayed in DPU mode. Incomplete PCIe enumeration (a GB200 dropping a DPU from its inventory, say) is exactly what pushes a DPU-to-NIC migration onto this path, so this was the common case, not an edge.

Adds a regression test -- a NIC-mode host whose DPU is paired only by fallback serial and still reporting DPU mode -- that fails on the old code (the host registers zero-DPU with no set_nic_mode) and passes now.

Supports #2631

copy-pr-bot · 2026-06-18T06:20:58Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-18T06:21:03Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1f8a695-a3ae-4d40-b783-6bcddcd01b79

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Walkthrough

identify_managed_hosts is extended to enforce DPU mode during fallback-serial matching by routing each match through check_and_configure_dpu_mode and classify_matched_dpu. The host reset/wait gate is expanded to also fire when a reconfiguration was issued. A new integration regression test validates the complete flow end-to-end.

Changes

NIC-mode enforcement on fallback DPU serial match

Layer / File(s)	Summary
Fallback-serial DPU mode enforcement and host reset gate `crates/site-explorer/src/lib.rs`	`all_configured` is made mutable to track reconfiguration state; each fallback match is routed through `check_and_configure_dpu_mode` and `classify_matched_dpu`; PCIe-scan candidates are discarded on first confirmed fallback match; `all_dpus_configured_properly_in_host` is flipped to `false` when `set_nic_mode` is required; the cannot-identify-managed-host branch now gates on both missing DPUs and pending reconfiguration, with the warning emitted only for the missing-DPU case.
Integration regression test `crates/site-explorer/tests/site_explorer.rs`	`test_site_explorer_enforces_nic_mode_on_fallback_serial_match` seeds a `NicMode` `ExpectedMachine` with `fallback_dpu_serial_numbers`, executes two site-explorer iterations, asserts `set_nic_mode(Nic)` is issued, verifies no `ExploredManagedHost` is settled as zero-DPU in the same pass, and confirms a host `PowerCycle` is enqueued to apply the mode flip.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Fix NIC-mode enforcement for fallback DPU serial matches #2631: This PR implements the exact fix described in that issue — enforcing NIC-mode on fallback DPU serial matches by routing them through check_and_configure_dpu_mode, differentiating reconfiguration states via classify_matched_dpu, expanding the host-reset path when a mode flip is required, and adding the matching regression test.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and accurately summarizes the primary fix: enforcing DPU mode validation on fallback-serial matches, which is the core objective of the changeset.
Description check	✅ Passed	The description provides comprehensive context for the bug, root cause, impact, and solution, clearly explaining the issue with fallback-matched DPUs and the mode enforcement fix.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chet · 2026-06-18T06:22:21Z

@coderabbitai PTAL, thank you!

coderabbitai · 2026-06-18T06:23:52Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR_NUMBER="2687"

echo "=== PR Info ==="
gh pr view "$PR_NUMBER" --repo "$REPO" --json title,body,files,additions,deletions,headRefName,baseRefName

echo ""
echo "=== Changed Files ==="
gh pr diff "$PR_NUMBER" --repo "$REPO"

Length of output: 15464

@chet — reviewed. The fix is well-targeted and the implementation is clean. A few observations:

crates/site-explorer/src/lib.rs

Core correctness — The substitution of the old is_dpu_in_nic_mode guard with a full check_and_configure_dpu_mode + classify_matched_dpu dispatch is the right approach. The old guard had a subtle latent issue: it gated the skip on host_reported_dpus_total.saturating_sub(host_reported_dpus_nic_mode_total) > 0, which is always false when the BMC reports no PCIe DPUs at all — exactly the fallback scenario. The new path is unambiguous.
NeedsReconfig does not break the loop — When multiple serials are listed in fallback_dpu_serial_numbers and more than one is in the wrong mode, each will fire set_nic_mode and each will set all_dpus_configured_properly_in_host = false. Since set_nic_mode is expected to be idempotent this is harmless, but worth being aware of if the list can grow large in practice.
ModeCheckFailed loses the DPU for this pass — The entry is already removed from dpu_sn_to_endpoint before the check, so a transient failure silently drops it until the next iteration. The tracing::warn! is appropriate, and "skip and retry" is the right policy here, but a structured field for the serial number (e.g., dpu_sn = %dpu_sn) alongside dpu = %dpu_ep.address would aid triage:
```
tracing::warn!(
    dpu_sn = %dpu_sn,
    dpu = %dpu_ep.address,
    error = %err,
    "failed to check fallback-matched DPU mode; skipping this device this pass",
);
```
Outer gate fix — expected_managed_dpus_total > 0 || !all_dpus_configured_properly_in_host is the crux of the bug fix. The comment block explains it well. The inner if expected_managed_dpus_total > 0 guard on the warning log is a nice touch — avoids a spurious "cannot identify managed host" in the pure-reconfig case.

crates/site-explorer/tests/site_explorer.rs

Regression test — The three-assertion structure (mode flip issued, host doesn't settle as zero-DPU, power-cycle fires) maps exactly to the three failure modes described in the PR body. The doc comment is production-quality. The two-iteration design (first for endpoint exploration, second for per-host matching) correctly mirrors the real execution order. No gaps found.

Summary — No blocking issues. The minor logging enhancement in point 3 is worth considering, but the fix is sound and the regression test is thorough. Ready to merge at your discretion.

chet · 2026-06-18T06:35:46Z

@coderabbitai review

coderabbitai · 2026-06-18T06:35:52Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

🧹 Nitpick comments (1)

crates/site-explorer/src/lib.rs (1)
1163-1213: 💤 Low value

Consider adding dpu_sn as a structured field in the ModeCheckFailed warning.

The warning at lines 1203–1208 logs dpu = %dpu_ep.address and error = %err. Adding the serial number (dpu_sn) as a structured field would help operators correlate with the fallback_dpu_serial_numbers configured in the expected-machine manifest, simplifying triage when multiple DPUs are in play.
♻️ Suggested enhancement
 DiscoveredDpu::ModeCheckFailed(err) => {
     tracing::warn!(
         dpu = %dpu_ep.address,
+        dpu_sn = %dpu_sn,
         error = %err,
         "failed to check fallback-matched DPU mode; skipping this device this pass",
     );
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/lib.rs` around lines 1163 - 1213, The warning log in
the DiscoveredDpu::ModeCheckFailed match arm is missing the serial number
context which would help operators correlate failures with the
fallback_dpu_serial_numbers from the expected-machine manifest. Add dpu_sn as a
structured field to the tracing::warn! call alongside the existing dpu and error
fields so operators can better trace issues when multiple DPUs are involved.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/site-explorer/src/lib.rs`:
- Around line 1163-1213: The warning log in the DiscoveredDpu::ModeCheckFailed
match arm is missing the serial number context which would help operators
correlate failures with the fallback_dpu_serial_numbers from the
expected-machine manifest. Add dpu_sn as a structured field to the
tracing::warn! call alongside the existing dpu and error fields so operators can
better trace issues when multiple DPUs are involved.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 430ba90d-f0af-4fc3-86de-4c551c147465

📥 Commits

Reviewing files that changed from the base of the PR and between 6d36a43 and 1eb01e2.

📒 Files selected for processing (2)

crates/site-explorer/src/lib.rs
crates/site-explorer/tests/site_explorer.rs

…matches The per-host matching loop now runs the same NIC/DPU mode check on a DPU paired through `fallback_dpu_serial_numbers` as it does on one the host reports over PCIe: a BlueField in the wrong mode gets `set_nic_mode` and the host is power-cycled to apply it, instead of being trusted as already configured. That reset now fires even when the host BMC never enumerated the DPU over PCIe -- the usual reason we end up on the fallback path -- so the queued flip can actually take effect. Until now the fallback path attached a matched DPU with no mode check. On a host the operator declared `nic_mode`, that DPU would then be dropped (a NIC-mode host has no managed DPUs), so the host registered as zero-DPU with the flip never issued -- the database read "NIC-mode host" while the BlueField stayed in DPU mode. Incomplete PCIe enumeration (a GB200 dropping a DPU from its inventory, say) is exactly what pushes a DPU-to-NIC migration onto this path, so this was the common case, not an edge. Adds a regression test -- a NIC-mode host whose DPU is paired only by fallback serial and still reporting DPU mode -- that fails on the old code (the host registers zero-DPU with no `set_nic_mode`) and passes now. Signed-off-by: Chet Nichols III <chetn@nvidia.com>

chet · 2026-06-18T06:53:05Z

@coderabbitai Thanks for the eyes. Updated -- added dpu_sn as a structured field on the fallback mode-check-failed warn!. PTAL. Resolve if good, thanks!

coderabbitai · 2026-06-18T06:53:38Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR_NUMBER="2687"

gh pr diff "$PR_NUMBER" --repo "$REPO" | grep -A 20 "ModeCheckFailed\|dpu_sn\|warn.*fallback"

Length of output: 3123

@chet — confirmed. The ModeCheckFailed branch now emits:

tracing::warn!(
    dpu = %dpu_ep.address,
    dpu_sn = %dpu_sn,
    error = %err,
    "failed to check fallback-matched DPU mode; skipping this device this pass",
);

Both the device address and the serial number are present as structured fields alongside the error — exactly what was requested. The fix is sound, the regression test is thorough, and the logging is production-ready. Resolving.

🐇 ✅

[resolve]

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

chet force-pushed the gh-issue-2631 branch from 1eb01e2 to 7537187 Compare June 18, 2026 06:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(site-explorer): enforce the declared DPU mode on fallback-serial matches#2687

fix(site-explorer): enforce the declared DPU mode on fallback-serial matches#2687
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2631

chet commented Jun 18, 2026

Uh oh!

copy-pr-bot Bot commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

Review skipped

Uh oh!

chet commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026

Uh oh!

chet commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

chet commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chet commented Jun 18, 2026

Uh oh!

copy-pr-bot Bot commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Uh oh!

chet commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026

Uh oh!

chet commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

chet commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading