docs: add 8 new FAQ entries covering GPU virtualization, scheduling, and ecosystem integration (#416) #426

Open
mesutoezdil wants to merge 1 commit into Project-HAMi:master from mesutoezdil:docs/faq-entries-416

@mesutoezdil
Contributor

@mesutoezdil mesutoezdil commented May 29, 2026

Adds 8 new FAQ entries to docs/faq/faq.md covering the three topic areas defined in the issue. All questions were sourced from the research compiled in #415.

New entries

GPU virtualization model

  • How does HAMi enforce GPU memory and compute limits? Explains the libvgpu.so CUDA API interception mechanism, what it covers, and what it does not (DinD, direct driver API calls). Links to GPU Virtualization.
  • HAMi vGPU vs NVIDIA MIG. Side-by-side comparison table covering hardware requirements, isolation mechanism, enforcement strength, granularity, and dynamic reconfiguration. Guidance on when to use each.
  • Why does nvidia-smi inside a container show less memory than the host? Explains that this is intentional - libvgpu.so intercepts memory query calls and returns the allocated limit.
  • Why is my gpumem limit not enforced? Covers the four root causes: CUDA_DISABLE_CONTROL, Docker-in-Docker, direct NVML/driver API calls, and misconfigured container runtime.
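
The first of the four root causes listed above can be checked mechanically. The following is a minimal sketch, not part of HAMi: it assumes the pod manifest is available as a dict (for example, parsed from `kubectl get pod -o json`), and it assumes that the values `true` or `1` for `CUDA_DISABLE_CONTROL` disable enforcement. The function name is hypothetical.

```python
def containers_with_control_disabled(pod: dict) -> list[str]:
    """Return names of containers whose env sets CUDA_DISABLE_CONTROL to a truthy value."""
    flagged = []
    for container in pod.get("spec", {}).get("containers", []):
        # "env" may be absent or null in a manifest, so fall back to an empty list.
        for env in container.get("env") or []:
            if env.get("name") != "CUDA_DISABLE_CONTROL":
                continue
            value = str(env.get("value", "")).strip().lower()
            # Assumption: "true" or "1" switches limit enforcement off.
            if value in {"true", "1"}:
                flagged.append(container["name"])
    return flagged


pod = {
    "spec": {
        "containers": [
            {"name": "trainer",
             "env": [{"name": "CUDA_DISABLE_CONTROL", "value": "true"}]},
            {"name": "sidecar"},
        ]
    }
}
print(containers_with_control_disabled(pod))  # → ['trainer']
```

The other three causes (Docker-in-Docker, direct NVML/driver API calls, and a misconfigured container runtime) depend on the node and the workload and cannot be read off the pod spec alone.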

Scheduling interaction

  • Does HAMi replace kube-scheduler or run alongside it? Explains the extender model, the MutatingWebhook schedulerName assignment, and the impact on non-HAMi pods (none). Includes a note on multi-replica leader election.
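
The schedulerName assignment described above can be illustrated with a small model of the webhook decision. This is a hedged sketch rather than HAMi's actual webhook code: the scheduler name `hami-scheduler` and the resource keys are assumptions based on common defaults, and the function names are hypothetical. It shows why pods that request no HAMi resource are left untouched and stay with the default scheduler.

```python
import copy

# Assumptions: default scheduler name and resource keys; both are configurable in practice.
HAMI_SCHEDULER_NAME = "hami-scheduler"
HAMI_RESOURCES = {"nvidia.com/gpu", "nvidia.com/gpumem", "nvidia.com/gpucores"}


def requests_hami_resource(pod: dict) -> bool:
    """True if any container lists a HAMi-managed resource in its limits."""
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        if any(key in HAMI_RESOURCES for key in limits):
            return True
    return False


def assign_scheduler(pod: dict) -> dict:
    """Return a copy of the pod; set schedulerName only when a HAMi resource is requested."""
    mutated = copy.deepcopy(pod)
    if requests_hami_resource(mutated):
        mutated["spec"]["schedulerName"] = HAMI_SCHEDULER_NAME
    return mutated


gpu_pod = {"spec": {"containers": [
    {"name": "infer", "resources": {"limits": {"nvidia.com/gpumem": "4000"}}}]}}
plain_pod = {"spec": {"containers": [{"name": "web"}]}}

print(assign_scheduler(gpu_pod)["spec"]["schedulerName"])  # → hami-scheduler
print("schedulerName" in assign_scheduler(plain_pod)["spec"])  # → False
```

A real admission webhook returns a JSON patch inside an AdmissionReview response and may apply further rules; the sketch only models the routing decision.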

Ecosystem integration

  • HAMi with vLLM multi-GPU tensor parallelism. Documents the NCCL segfault issue (CUDA_DEVICE_MEMORY_SHARED_CACHE per-container, fixed in v2.7.0), single-GPU usage, and Volcano multi-pod setup. Links to issues #1764 and #1853.
  • HAMi with NVIDIA GPU Operator and DCGM. Explains the device plugin conflict and how to disable GPU Operator's device plugin. Notes that DCGM Exporter is unaffected.
  • Prometheus and Grafana monitoring. Covers the metrics endpoint, key metric names, scrape config, and importing the bundled static/grafana/gpu-dashboard.json dashboard.

Closes #416.
Refs #415.

@hami-robot
Contributor

hami-robot Bot commented May 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mesutoezdil
Once this PR has been reviewed and has the lgtm label, please assign windsonsea for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify Bot commented May 29, 2026

Deploy Preview for project-hami ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 337648d |
| 🔍 Latest deploy log | https://app.netlify.com/projects/project-hami/deploys/6a2135240832dd00082fe628 |
| 😎 Deploy Preview | https://deploy-preview-426--project-hami.netlify.app |

@hami-robot hami-robot Bot added the size/L label May 29, 2026
@mesutoezdil
Contributor Author

done @rootsongjc

@mesutoezdil mesutoezdil force-pushed the docs/faq-entries-416 branch from 24a8fb2 to 359c2cc Compare June 4, 2026 07:13
@rootsongjc
Contributor

I think this article as an FAQ might be too long.

@rootsongjc
Contributor

Some of the FAQs could also be moved into the Concept document or other documents, or replaced with references to existing pages on the website, instead of putting everything in the FAQ, which would make it difficult to maintain later on.

@mesutoezdil mesutoezdil force-pushed the docs/faq-entries-416 branch from 359c2cc to c03cd3c Compare June 4, 2026 08:09
@hami-robot hami-robot Bot added size/M and removed size/L labels Jun 4, 2026
@mesutoezdil
Contributor Author

> Some of the FAQs could also be moved into the Concept document or other documents, or replaced with references to existing pages on the website, instead of putting everything in the FAQ, which would make it difficult to maintain later on.

ok now?

… pages

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
@mesutoezdil mesutoezdil force-pushed the docs/faq-entries-416 branch from c03cd3c to 337648d Compare June 4, 2026 08:19
Development

Successfully merging this pull request may close these issues.

[docs/faq] Write new FAQ entries covering GPU virtualization, scheduling, and ecosystem integration
