Skip to content

feat: honor a declared primary host NIC when selecting the boot device#2690

Draft
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:honor-declared-primary-interface
Draft

feat: honor a declared primary host NIC when selecting the boot device#2690
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:honor-declared-primary-interface

Conversation

@chet

@chet chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

What

Makes a host's declared ExpectedHostNic.primary authoritative when selecting its boot device, for the paths that own a NIC today (zero-DPU / NIC-mode hosts).

primary is meant to declare the boot interface, but ingestion ignored it -- predicted interfaces always promoted as non-primary, so a declared NIC never stuck and the host booted via the lowest-MAC fallback until an operator ran set-primary-interface. This records the declared primary on the prediction so it survives promotion, resolves it through one ExpectedMachineData::declared_primary_mac() the writers share, and honors it in the DHCP find-or-create path.

It also settles the multi-NIC pre-ingestion window: several NICs leasing before ingestion each defaulted to primary, and adopting the second one tripped the one_primary_interface_per_machine index and failed the host's ingestion (OnePrimaryInterface). Ingestion now reconciles the adopted rows to exactly one primary.

Nothing declared -> today's automation stands (DPU takeover during ingestion, else the pick_boot_interface lowest-MAC fallback).

Scope

Part of #2657 (epic #2660). The genuine-DPU-mode case -- booting a declared integrated NIC while the DPU stays managed -- is split into #2668: there, demoting the DPU host interface must come together with owning the integrated NIC as the replacement primary admin interface (coupled by the admin-primary invariant), so it can't ride along here.

Test plan

  • cargo clippy --all-features --tests clean over the touched crates; cargo +nightly fmt.
  • New tests: a declared_primary_mac resolver unit test; zero-DPU integration tests for declared-primary-via-prediction and multi-NIC no-collision adoption; an api-core DHCP arrival-order test. Existing seed + DPU ingestion tests still pass.

@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c57b410-bfd7-4053-8a0b-b2f3624a8ac2

📥 Commits

Reviewing files that changed from the base of the PR and between 5c25a50 and 0d1d8e3.

📒 Files selected for processing (10)
  • crates/api-core/src/dhcp/discover.rs
  • crates/api-core/src/handlers/bmc_endpoint_explorer.rs
  • crates/api-core/src/tests/expected_machine.rs
  • crates/api-db/migrations/20260616215917_predicted_machine_interface_primary.sql
  • crates/api-db/src/machine_interface.rs
  • crates/api-db/src/predicted_machine_interface.rs
  • crates/api-model/src/expected_machine.rs
  • crates/api-model/src/predicted_machine_interface.rs
  • crates/site-explorer/src/machine_creator.rs
  • crates/site-explorer/tests/zero_dpu.rs
🚧 Files skipped from review as they are similar to previous changes (10)
  • crates/api-core/src/handlers/bmc_endpoint_explorer.rs
  • crates/api-db/migrations/20260616215917_predicted_machine_interface_primary.sql
  • crates/api-core/src/tests/expected_machine.rs
  • crates/api-core/src/dhcp/discover.rs
  • crates/api-model/src/expected_machine.rs
  • crates/api-db/src/predicted_machine_interface.rs
  • crates/site-explorer/src/machine_creator.rs
  • crates/site-explorer/tests/zero_dpu.rs
  • crates/api-model/src/predicted_machine_interface.rs
  • crates/api-db/src/machine_interface.rs

Summary by CodeRabbit

  • Bug Fixes

    • Fixed primary network interface identification to persist correctly regardless of DHCP discovery order on multi-NIC machines.
  • Performance

    • Added database indexes to improve network interface lookup performance.

Walkthrough

A new primary_interface boolean field is added to PredictedMachineInterface (backed by a DB migration) and a declared_primary_mac() method is introduced on ExpectedMachineData. These primitives are wired through DHCP discovery, predicted-interface promotion (move_predicted_machine_interface_to_machine), and zero-DPU machine ingestion so that ExpectedHostNic.primary is authoritative regardless of DHCP arrival order.

Changes

Declared Primary NIC propagation through DHCP and ingestion

Layer / File(s) Summary
Schema migration and model contracts
crates/api-db/migrations/20260616215917_predicted_machine_interface_primary.sql, crates/api-model/src/predicted_machine_interface.rs, crates/api-model/src/expected_machine.rs
Adds primary_interface column (default false) and machine_id index to predicted_machine_interfaces; adds primary_interface: bool to PredictedMachineInterface and NewPredictedMachineInterface; introduces declared_primary_mac() on ExpectedMachineData to extract the declared primary MAC with unit tests.
DB persistence: predicted interface create and machine_interface reconciliation
crates/api-db/src/predicted_machine_interface.rs, crates/api-db/src/machine_interface.rs
Extends create to persist primary_interface; upgrades find_or_create_machine_interface with bidirectional primary-flag reconciliation; reworks move_predicted_machine_interface_to_machine to carry the prediction's declared primary value through creation and reconcile existing interfaces before association.
DHCP discovery: use declared_primary_mac()
crates/api-core/src/dhcp/discover.rs
Refactors discover_dhcp expected-host path to use declared_primary_mac() for the primary flag and iter().find() for NIC selection, eliminating the dual-purpose loop.
Zero-DPU ingestion: declared primary propagation
crates/site-explorer/src/machine_creator.rs
Derives declared_primary MAC from ExpectedMachineData, computes is_declared_primary per observed NIC, reconciles existing unowned interface rows to the declared primary before adoption, and stamps primary_interface on newly created predicted entries.
Integration tests, regression tests, and resolver documentation
crates/api-core/src/tests/expected_machine.rs, crates/api-core/src/handlers/bmc_endpoint_explorer.rs, crates/site-explorer/tests/zero_dpu.rs
Adds DHCP ordering regression test and two zero-DPU multi-NIC integration tests (no-declaration adoption and declared-primary promotion); updates BMC endpoint explorer documentation to clarify that primary_interface is carried but not consulted for boot-interface disambiguation, and updates test helper with explicit primary_interface: false.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(173, 216, 230, 0.5)
    Note over DHCPClient,machine_interfaces: DHCP Discovery Path
    DHCPClient->>discover_dhcp: DHCP packet (mac_address)
    discover_dhcp->>ExpectedMachineData: declared_primary_mac()
    ExpectedMachineData-->>discover_dhcp: Option~MacAddress~
    discover_dhcp->>find_or_create_machine_interface: is_primary = Some(mac == declared)
    find_or_create_machine_interface->>set_primary_interface: reconcile if differs
    find_or_create_machine_interface-->>discover_dhcp: machine_interface
  end
  rect rgba(144, 238, 144, 0.5)
    Note over SiteExplorer,machine_interfaces: Zero-DPU Ingestion Path
    SiteExplorer->>create_zero_dpu_machine: ExpectedMachineData + observed MACs
    create_zero_dpu_machine->>ExpectedMachineData: declared_primary_mac()
    ExpectedMachineData-->>create_zero_dpu_machine: Option~MacAddress~
    create_zero_dpu_machine->>set_primary_interface: reconcile existing unowned interface
    create_zero_dpu_machine->>predicted_machine_interface_create: primary_interface = is_declared_primary
    predicted_machine_interface_create-->>create_zero_dpu_machine: PredictedMachineInterface
    create_zero_dpu_machine->>move_predicted_machine_interface_to_machine: promote prediction
    move_predicted_machine_interface_to_machine->>set_primary_interface: reconcile if current_primary differs
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/infra-controller#2528: Updates bmc_endpoint_explorer.rs and predicted_machine_interface.rs in the same boot-interface resolver area; this PR adds primary_interface tracking to predictions and refines resolver documentation, which is adjacent to the boot-interface resolution refactor in the retrieved PR.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the primary objective: honoring declared primary host NIC when selecting boot device, which aligns with the main changes across the codebase.
Description check ✅ Passed The description comprehensively explains the changeset's objectives, scope, and addresses the multi-NIC pre-ingestion collision issue, with clear rationale for the deferred DPU-mode case.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@chet

chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai PTAL, thank you!

@chet

chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/api-core/src/handlers/bmc_endpoint_explorer.rs (1)

1320-1355: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update outdated comment about predictions and primary flag.

The comment at lines 1322-1323 states "Predictions hold no primary flag," which is no longer accurate after this PR introduces the primary_interface field on PredictedMachineInterface. Predictions now carry the declared primary intent. The comment should be revised to reflect that these specific test predictions are non-primary (as created by the predicted() helper with primary_interface: false), not that predictions categorically lack a primary flag.

📝 Suggested comment revision
-    // Predictions hold no primary flag, so with several (a report listing
-    // SuperNICs alongside the boot NIC) the declared intent is unknowable:
+    // When no prediction is marked primary, with several predictions (a report listing
+    // SuperNICs alongside the boot NIC) the declared intent is unknowable:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/handlers/bmc_endpoint_explorer.rs` around lines 1320 -
1355, Update the outdated comment in the test function
no_mac_multiple_predictions_refuse_to_guess_a_boot_device. The current comment
incorrectly states that predictions hold no primary flag, but the
PredictedMachineInterface struct now includes a primary_interface field. Revise
the comment to clarify that predictions do support a primary flag, but the
specific test predictions created by the predicted() helper function are marked
as non-primary, and the test verifies that resolution refuses to guess the boot
device when facing multiple non-primary predictions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/api-core/src/handlers/bmc_endpoint_explorer.rs`:
- Around line 1320-1355: Update the outdated comment in the test function
no_mac_multiple_predictions_refuse_to_guess_a_boot_device. The current comment
incorrectly states that predictions hold no primary flag, but the
PredictedMachineInterface struct now includes a primary_interface field. Revise
the comment to clarify that predictions do support a primary flag, but the
specific test predictions created by the predicted() helper function are marked
as non-primary, and the test verifies that resolution refuses to guess the boot
device when facing multiple non-primary predictions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 68671db3-5b89-481d-a88b-dd9d806611be

📥 Commits

Reviewing files that changed from the base of the PR and between 6d36a43 and 5c25a50.

📒 Files selected for processing (10)
  • crates/api-core/src/dhcp/discover.rs
  • crates/api-core/src/handlers/bmc_endpoint_explorer.rs
  • crates/api-core/src/tests/expected_machine.rs
  • crates/api-db/migrations/20260616215917_predicted_machine_interface_primary.sql
  • crates/api-db/src/machine_interface.rs
  • crates/api-db/src/predicted_machine_interface.rs
  • crates/api-model/src/expected_machine.rs
  • crates/api-model/src/predicted_machine_interface.rs
  • crates/site-explorer/src/machine_creator.rs
  • crates/site-explorer/tests/zero_dpu.rs

ExpectedHostNic.primary is meant to say "this NIC is the host's boot
interface," but ingestion ignored it: a zero-DPU or NIC-mode host's
predicted interfaces always promoted as non-primary, so the declared NIC
never stuck and the host booted via the lowest-MAC fallback until an
operator ran set-primary-interface. The declared primary is now
authoritative across the writers that own a NIC -- recorded on the
prediction so it survives promotion, and honored by the DHCP path.

This also settles the pre-ingestion window for multi-NIC hosts: several
NICs that lease before ingestion each default to primary, and adopting the
second one tripped the one_primary_interface_per_machine index and failed
the host's ingestion. Ingestion now reconciles the adopted rows so exactly
one interface is primary.

- Record the declared primary on predicted_machine_interfaces and resolve
  it through one ExpectedMachineData::declared_primary_mac() the writers
  share; promotion and the DHCP find-or-create path both honor it.
- Reconcile the adopted interfaces at zero-DPU ingestion to exactly one
  primary, ending the OnePrimaryInterface adoption failure.
- Nothing declared -> today's automation stands: DPU takeover during
  ingestion, else the pick_boot_interface lowest-MAC fallback.

Tests cover the declared primary surviving DHCP arrival order, promotion
landing it on the real row, multi-NIC adoption no longer colliding, and a
resolver unit test.

Part of NVIDIA#2657 (epic NVIDIA#2660). The genuine-DPU-mode case -- a declared
integrated NIC while the DPU stays managed -- is tracked separately in

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
NVIDIA#2668.
@chet chet force-pushed the honor-declared-primary-interface branch from 5c25a50 to 0d1d8e3 Compare June 18, 2026 07:02
@chet

chet commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai Thanks for the eyes. Updated -- revised the stale "predictions hold no primary flag" note in three spots (the resolve_admin_boot_interface_target doc + inline, and the no_mac_multiple_predictions_refuse_to_guess_a_boot_device test) to reflect that predictions now carry primary_interface but this resolver doesn't consult it yet. PTAL, thanks!

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@chet Acknowledged — the documentation corrections are the right call; keeping inline comments and test commentary accurate with the actual data model avoids misleading future contributors. Triggering a review of the updated changes now.

(ᵔᴥᵔ)🐇

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant