[bug/5614743] fix: scout repeatedly fails machine discovery with 'AttestKeyInfo is not populated' error #2670

Draft
prbinu-nvidia wants to merge 1 commit into NVIDIA:main from prbinu-nvidia:bug/5614743

@prbinu-nvidia

Description

forge-scout repeatedly fails machine discovery against carbide-api. Hosts remain stuck in the bootingwithdiscoveryimage state, and scout reports the error:

AttestKeyInfo is not populated

Manual workaround: Run tpm2_clear (or equivalent TPM clear) and reboot the host. Discovery then succeeds.

Root cause: In register.rs, scout decided whether it was running on a DPU vs. a managed host using TPM EK certificate presence:

let is_dpu = hardware_info.tpm_ek_certificate.is_none();

On affected DGX hosts, hardware enumeration can leave tpm_ek_certificate unset even though the machine is a normal x86 host (not a BlueField DPU). Scout then treated the host as a DPU, skipped attestation key setup (create_attest_key_info), and sent registration data without AttestKeyInfo. carbide-api correctly rejected the request.
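The misclassification can be reproduced in isolation. The following is a minimal sketch, not the code in register.rs: `HardwareInfo` here is a hypothetical, trimmed-down stand-in for scout's hardware info type, and `would_send_attest_key_info` is an illustrative helper that only models the consequence described above.

```rust
/// Hypothetical stand-in for scout's hardware info; only the fields
/// relevant to this bug are modeled.
#[derive(Debug, Default)]
pub struct HardwareInfo {
    pub tpm_ek_certificate: Option<Vec<u8>>,
    pub product_name: String,
}

/// The old inference from the description: no EK certificate means "DPU".
pub fn is_dpu_by_ek_cert(info: &HardwareInfo) -> bool {
    info.tpm_ek_certificate.is_none()
}

/// Illustrative only: under the old rule, attestation key setup was skipped
/// for anything classified as a DPU, so AttestKeyInfo was never populated.
pub fn would_send_attest_key_info(info: &HardwareInfo) -> bool {
    !is_dpu_by_ek_cert(info)
}
```

With an x86 host whose EK certificate failed to enumerate (`tpm_ek_certificate: None`), `is_dpu_by_ek_cert` returns `true` and `would_send_attest_key_info` returns `false`, which matches the rejected registration.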

Impact: Host discovery cannot complete without operator intervention (TPM clear + reboot). Affected platforms include DGX H100/GB200 class systems where EK cert enumeration is missing or incomplete.

Fixes Applied

  1. Correct host vs. DPU detection (register.rs): Replaced TPM-based DPU inference with SMBIOS platform detection. Hosts without an EK certificate are no longer misclassified as DPUs. Attestation key creation runs on managed hosts as intended, so AttestKeyInfo is populated before registration.
  2. Shared platform helper (platform.rs, new): Added platform::is_host() that reads SMBIOS system information and returns false when the product name contains "bluefield" (DPU), true otherwise.
  3. Automated TPM self-heal (tpm.rs, register.rs): For genuine TPM corruption, added optional one-shot recovery when local TPM setup fails. Recovery (tpm2_clear and reboot) is invoked from register.rs when create_context_from_path or create_attest_key_info fails with a recoverable TPM error.
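The product-name rule in item 2 can be sketched as follows. This is an assumption-laden illustration rather than the contents of platform.rs: the sysfs path, the case-insensitive match, the helper name `is_host_from_product_name`, and the choice to treat an unreadable product name as a host are all assumptions made for the example.

```rust
use std::fs;

/// Assumption: the SMBIOS product name is read from Linux sysfs.
const PRODUCT_NAME_PATH: &str = "/sys/class/dmi/id/product_name";

/// Hypothetical pure helper: returns false when the product name contains
/// "bluefield" (matched case-insensitively here), true otherwise.
pub fn is_host_from_product_name(product_name: &str) -> bool {
    !product_name.to_ascii_lowercase().contains("bluefield")
}

/// Sketch of an is_host() wrapper around the pure helper.
pub fn is_host() -> bool {
    match fs::read_to_string(PRODUCT_NAME_PATH) {
        Ok(name) => is_host_from_product_name(name.trim()),
        // Assumption: if the product name cannot be read, treat the
        // machine as a host so attestation setup still runs.
        Err(_) => true,
    }
}
```

Keeping the string check separate from the file read makes the classification testable without real hardware, for example `is_host_from_product_name("BlueField-3 DPU")` versus `is_host_from_product_name("DGX H100")`.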

Files changed

| File | Change |
| --- | --- |
| crates/scout/src/register.rs | Fix is_dpu logic; wire TPM recovery on setup failure |
| crates/scout/src/platform.rs | New shared SMBIOS-based is_host() |
| crates/scout/src/tpm.rs | Add recovery helpers and tests |
| crates/scout/src/deprovision/scrabbing.rs | Use shared platform::is_host() |
| crates/scout/src/main.rs | Add mod platform |

Expected outcome

  • DGX H100/GB200 hosts with missing EK cert material proceed through attestation and discovery without manual TPM clear.
  • If TPM state is genuinely corrupted and local setup fails, scout attempts one automated TPM2_Clear + reboot before giving up.
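The "one attempt, then give up" behavior can be expressed as a small decision function. This is a sketch under stated assumptions, not the helpers added in tpm.rs: `SetupError`, `NextStep`, `next_step`, and the `recovery_already_attempted` flag are hypothetical names, and how the real code remembers across a reboot that recovery already ran is not described in this PR.

```rust
/// Hypothetical classification of a local TPM setup failure.
#[derive(Debug, PartialEq, Eq)]
pub enum SetupError {
    /// A failure that a TPM clear is expected to fix.
    Recoverable(String),
    /// Any other failure; clearing the TPM would not help.
    Fatal(String),
}

/// What the caller should do after attempting TPM setup.
#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    Proceed,
    ClearTpmAndReboot,
    GiveUp,
}

/// One-shot policy: clear and reboot only for a recoverable error, and only
/// if recovery has not already been attempted.
pub fn next_step(
    setup_result: &Result<(), SetupError>,
    recovery_already_attempted: bool,
) -> NextStep {
    match setup_result {
        Ok(()) => NextStep::Proceed,
        Err(SetupError::Recoverable(_)) if !recovery_already_attempted => {
            NextStep::ClearTpmAndReboot
        }
        Err(_) => NextStep::GiveUp,
    }
}
```

Modeling the policy as a pure function keeps the destructive actions (the TPM clear and the reboot) out of the decision logic, so the policy can be unit-tested without touching a TPM.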

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 53b85890-8b3c-42d0-bf32-f8afe8eb6501

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.
