Skip to content

fix(site-explorer): enforce the declared DPU mode on fallback-serial matches#2687

Draft
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2631
Draft

fix(site-explorer): enforce the declared DPU mode on fallback-serial matches#2687
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2631

Conversation

@chet

@chet chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

The per-host matching loop now runs the same NIC/DPU mode check on a DPU paired through fallback_dpu_serial_numbers as it does on one the host reports over PCIe: a BlueField in the wrong mode gets set_nic_mode and the host is power-cycled to apply it, instead of being trusted as already configured. That reset now fires even when the host BMC never enumerated the DPU over PCIe -- the usual reason we end up on the fallback path -- so the queued flip can actually take effect.

Until now the fallback path attached a matched DPU with no mode check. On a host the operator declared nic_mode, that DPU would then be dropped (a NIC-mode host has no managed DPUs), so the host registered as zero-DPU with the flip never issued -- the database read "NIC-mode host" while the BlueField stayed in DPU mode. Incomplete PCIe enumeration (a GB200 dropping a DPU from its inventory, say) is exactly what pushes a DPU-to-NIC migration onto this path, so this was the common case, not an edge.

Adds a regression test -- a NIC-mode host whose DPU is paired only by fallback serial and still reporting DPU mode -- that fails on the old code (the host registers zero-DPU with no set_nic_mode) and passes now.

Supports #2631

@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1f8a695-a3ae-4d40-b783-6bcddcd01b79

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

identify_managed_hosts is extended to enforce DPU mode during fallback-serial matching by routing each match through check_and_configure_dpu_mode and classify_matched_dpu. The host reset/wait gate is expanded to also fire when a reconfiguration was issued. A new integration regression test validates the complete flow end-to-end.

Changes

NIC-mode enforcement on fallback DPU serial match

Layer / File(s) Summary
Fallback-serial DPU mode enforcement and host reset gate
crates/site-explorer/src/lib.rs
all_configured is made mutable to track reconfiguration state; each fallback match is routed through check_and_configure_dpu_mode and classify_matched_dpu; PCIe-scan candidates are discarded on first confirmed fallback match; all_dpus_configured_properly_in_host is flipped to false when set_nic_mode is required; the cannot-identify-managed-host branch now gates on both missing DPUs and pending reconfiguration, with the warning emitted only for the missing-DPU case.
Integration regression test
crates/site-explorer/tests/site_explorer.rs
test_site_explorer_enforces_nic_mode_on_fallback_serial_match seeds a NicMode ExpectedMachine with fallback_dpu_serial_numbers, executes two site-explorer iterations, asserts set_nic_mode(Nic) is issued, verifies no ExploredManagedHost is settled as zero-DPU in the same pass, and confirms a host PowerCycle is enqueued to apply the mode flip.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • Fix NIC-mode enforcement for fallback DPU serial matches #2631: This PR implements the exact fix described in that issue — enforcing NIC-mode on fallback DPU serial matches by routing them through check_and_configure_dpu_mode, differentiating reconfiguration states via classify_matched_dpu, expanding the host-reset path when a mode flip is required, and adding the matching regression test.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and accurately summarizes the primary fix: enforcing DPU mode validation on fallback-serial matches, which is the core objective of the changeset.
Description check ✅ Passed The description provides comprehensive context for the bug, root cause, impact, and solution, clearly explaining the issue with fallback-matched DPUs and the mode enforcement fix.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@chet

chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai PTAL, thank you!

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR_NUMBER="2687"

echo "=== PR Info ==="
gh pr view "$PR_NUMBER" --repo "$REPO" --json title,body,files,additions,deletions,headRefName,baseRefName

echo ""
echo "=== Changed Files ==="
gh pr diff "$PR_NUMBER" --repo "$REPO"

Length of output: 15464



@chet — reviewed. The fix is well-targeted and the implementation is clean. A few observations:


crates/site-explorer/src/lib.rs

  1. Core correctness — The substitution of the old is_dpu_in_nic_mode guard with a full check_and_configure_dpu_mode + classify_matched_dpu dispatch is the right approach. The old guard had a subtle latent issue: it gated the skip on host_reported_dpus_total.saturating_sub(host_reported_dpus_nic_mode_total) > 0, which is always false when the BMC reports no PCIe DPUs at all — exactly the fallback scenario. The new path is unambiguous.

  2. NeedsReconfig does not break the loop — When multiple serials are listed in fallback_dpu_serial_numbers and more than one is in the wrong mode, each will fire set_nic_mode and each will set all_dpus_configured_properly_in_host = false. Since set_nic_mode is expected to be idempotent this is harmless, but worth being aware of if the list can grow large in practice.

  3. ModeCheckFailed loses the DPU for this pass — The entry is already removed from dpu_sn_to_endpoint before the check, so a transient failure silently drops it until the next iteration. The tracing::warn! is appropriate, and "skip and retry" is the right policy here, but a structured field for the serial number (e.g., dpu_sn = %dpu_sn) alongside dpu = %dpu_ep.address would aid triage:

    tracing::warn!(
        dpu_sn = %dpu_sn,
        dpu = %dpu_ep.address,
        error = %err,
        "failed to check fallback-matched DPU mode; skipping this device this pass",
    );
  4. Outer gate fixexpected_managed_dpus_total > 0 || !all_dpus_configured_properly_in_host is the crux of the bug fix. The comment block explains it well. The inner if expected_managed_dpus_total > 0 guard on the warning log is a nice touch — avoids a spurious "cannot identify managed host" in the pure-reconfig case.


crates/site-explorer/tests/site_explorer.rs

  1. Regression test — The three-assertion structure (mode flip issued, host doesn't settle as zero-DPU, power-cycle fires) maps exactly to the three failure modes described in the PR body. The doc comment is production-quality. The two-iteration design (first for endpoint exploration, second for per-host matching) correctly mirrors the real execution order. No gaps found.

Summary — No blocking issues. The minor logging enhancement in point 3 is worth considering, but the fix is sound and the regression test is thorough. Ready to merge at your discretion.

@chet

chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/site-explorer/src/lib.rs (1)

1163-1213: 💤 Low value

Consider adding dpu_sn as a structured field in the ModeCheckFailed warning.

The warning at lines 1203–1208 logs dpu = %dpu_ep.address and error = %err. Adding the serial number (dpu_sn) as a structured field would help operators correlate with the fallback_dpu_serial_numbers configured in the expected-machine manifest, simplifying triage when multiple DPUs are in play.

♻️ Suggested enhancement
 DiscoveredDpu::ModeCheckFailed(err) => {
     tracing::warn!(
         dpu = %dpu_ep.address,
+        dpu_sn = %dpu_sn,
         error = %err,
         "failed to check fallback-matched DPU mode; skipping this device this pass",
     );
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/lib.rs` around lines 1163 - 1213, The warning log in
the DiscoveredDpu::ModeCheckFailed match arm is missing the serial number
context which would help operators correlate failures with the
fallback_dpu_serial_numbers from the expected-machine manifest. Add dpu_sn as a
structured field to the tracing::warn! call alongside the existing dpu and error
fields so operators can better trace issues when multiple DPUs are involved.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/site-explorer/src/lib.rs`:
- Around line 1163-1213: The warning log in the DiscoveredDpu::ModeCheckFailed
match arm is missing the serial number context which would help operators
correlate failures with the fallback_dpu_serial_numbers from the
expected-machine manifest. Add dpu_sn as a structured field to the
tracing::warn! call alongside the existing dpu and error fields so operators can
better trace issues when multiple DPUs are involved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 430ba90d-f0af-4fc3-86de-4c551c147465

📥 Commits

Reviewing files that changed from the base of the PR and between 6d36a43 and 1eb01e2.

📒 Files selected for processing (2)
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/tests/site_explorer.rs

…matches

The per-host matching loop now runs the same NIC/DPU mode check on a DPU
paired through `fallback_dpu_serial_numbers` as it does on one the host
reports over PCIe: a BlueField in the wrong mode gets `set_nic_mode` and the
host is power-cycled to apply it, instead of being trusted as already
configured. That reset now fires even when the host BMC never enumerated the
DPU over PCIe -- the usual reason we end up on the fallback path -- so the
queued flip can actually take effect.

Until now the fallback path attached a matched DPU with no mode check. On a
host the operator declared `nic_mode`, that DPU would then be dropped (a
NIC-mode host has no managed DPUs), so the host registered as zero-DPU with
the flip never issued -- the database read "NIC-mode host" while the
BlueField stayed in DPU mode. Incomplete PCIe enumeration (a GB200 dropping a
DPU from its inventory, say) is exactly what pushes a DPU-to-NIC migration
onto this path, so this was the common case, not an edge.

Adds a regression test -- a NIC-mode host whose DPU is paired only by
fallback serial and still reporting DPU mode -- that fails on the old code
(the host registers zero-DPU with no `set_nic_mode`) and passes now.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet

chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai Thanks for the eyes. Updated -- added dpu_sn as a structured field on the fallback mode-check-failed warn!. PTAL. Resolve if good, thanks!

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR_NUMBER="2687"

gh pr diff "$PR_NUMBER" --repo "$REPO" | grep -A 20 "ModeCheckFailed\|dpu_sn\|warn.*fallback"

Length of output: 3123



@chet — confirmed. The ModeCheckFailed branch now emits:

tracing::warn!(
    dpu = %dpu_ep.address,
    dpu_sn = %dpu_sn,
    error = %err,
    "failed to check fallback-matched DPU mode; skipping this device this pass",
);

Both the device address and the serial number are present as structured fields alongside the error — exactly what was requested. The fix is sound, the regression test is thorough, and the logging is production-ready. Resolving.

🐇 ✅

[resolve]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant