From 9804dd6ec50dfe4c3ae5e5855ac73234ce75e238 Mon Sep 17 00:00:00 2001
From: mesutoezdil
Date: Fri, 12 Jun 2026 12:05:56 +0200
Subject: [PATCH] blog: add kagent and HAMi GPU virtualization article (EN + ZH)

Signed-off-by: mesutoezdil
---
 blog/authors.yml                                   |   5 +
 .../index.md                                       | 305 ++++++++++++++++++
 .../authors.yml                                    |   5 +
 .../index.md                                       | 305 ++++++++++++++++++
 4 files changed, 620 insertions(+)
 create mode 100644 blog/kagent-hami-ai-agent-gpu-virtualization/index.md
 create mode 100644 i18n/zh/docusaurus-plugin-content-blog/kagent-hami-ai-agent-gpu-virtualization/index.md

diff --git a/blog/authors.yml b/blog/authors.yml
index 4d63a2ba..1321f0af 100644
--- a/blog/authors.yml
+++ b/blog/authors.yml
@@ -5,3 +5,8 @@ elrond_wang:
 
 hami_community:
   name: HAMi Community
+
+mesut_oezdil:
+  name: Mesut Oezdil
+  title: Author
+  url: https://www.linkedin.com/in/mesut-oezdil/
diff --git a/blog/kagent-hami-ai-agent-gpu-virtualization/index.md b/blog/kagent-hami-ai-agent-gpu-virtualization/index.md
new file mode 100644
index 00000000..d93d5b8d
--- /dev/null
+++ b/blog/kagent-hami-ai-agent-gpu-virtualization/index.md
@@ -0,0 +1,305 @@
+---
+title: "Validating AI Agent-Driven GPU Management on Kubernetes with HAMi and kagent"
+date: "2026-05-28"
+description: "A real-world test of kagent and HAMi: one physical GPU virtualized into 10 vGPUs, an AI Agent managing Kubernetes workloads via CRDs, and Agent-to-Agent collaboration - all running on open-source models."
+authors: [mesut_oezdil]
+tags: ["HAMi", "kagent", "GPU Virtualization", "AI Agent", "Kubernetes", "vGPU", "Cloud Native"]
+---
+
+- **Source:** [mesutoezdil.substack.com](https://mesutoezdil.substack.com/p/kagent-hami-on-nebius-2-cncf-projects)
+- **GitHub Repo:** [kagentWithHami](https://github.com/mesutoezdil/kagentWithHami)
+- **Chinese translation:** Jimmy Song (originally published on [WeChat](https://mp.weixin.qq.com/s/WNzZh02_1CbMbVBfi4eRGw))
+
+---
+
+One physical NVIDIA L40S virtualized into 10 vGPUs with HAMi.
+An AI Agent deployed as a Kubernetes CRD via kagent. Agent-to-Agent delegation, GPU pod creation, and overcommit protection, all driven by Llama 3.3 70B with no closed-source dependencies.
+
+
+
+## Before We Start
+
+This is not a documentation summary.
+
+I ran every command below myself on a Nebius VM, and every output comes from that machine.
+
+When things failed, I debugged them. When things worked, I explain why they worked. The errors in this article are real errors, and every fix is one I verified myself.
+
+If you run these commands in the same environment, you should get the same results.
+
+Complete repository (all manifests and the setup script):
+
+https://github.com/mesutoezdil/kagentWithHami
+
+Scope note: this article covers the core parts. The full installation flow, all manifests, the complete troubleshooting guide, and the setup script are in the GitHub repository. If you want to reproduce this, start there.
+
+If you haven't worked with HAMi before:
+
+https://medium.com/@mesutoezdil/hami-in-a-real-kubernetes-environment-e8eaa872f388
+
+If you want to see tests of GPU observability tooling:
+
+https://mesutoezdil.substack.com/p/i-tested-every-feature-of-ingero
+
+## What This Article Is Actually About
+
+kagent turns AI Agents into Kubernetes resources.
+
+An Agent's system prompt, tools, and model config are all declared as CRDs.
+
+That means you can:
+
+- Version-control them with Git
+- Deploy them with Helm
+- Inspect them with kubectl
+
+HAMi implements GPU virtualization at the Kubernetes scheduler layer.
+
+One physical NVIDIA L40S becomes 10 virtual GPUs in Kubernetes, with strict VRAM limits enforced at the CUDA driver level.
+
+Nebius Token Factory is an OpenAI-compatible inference service.
+
+All tests in this article use Llama 3.3 70B served through it.
+
+The question I wanted to answer:
+
+> "Can an AI Agent, running inside a Kubernetes cluster, using only open-source models, manage GPU-virtualized workloads?"
+
+The answer is yes.
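To make the "one L40S becomes 10 vGPUs" accounting concrete, here is a toy Python model of it. It assumes a split count of 10 vGPU slots per card and uses the 46,068 MiB that the test machine's L40S reports in `nvidia-smi`. The names `PhysicalGPU`, `try_allocate`, and `fits_node` are hypothetical and do not come from HAMi's code. The real HAMi scheduler also accounts for compute share (`nvidia.com/gpucores`) and other placement rules, so treat this as a sketch of the idea, not of HAMi's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalGPU:
    """Toy model of one physical card shared through HAMi-style vGPU slots."""

    name: str
    total_mem_mib: int
    split_count: int = 10  # vGPU slots advertised for this card (assumed 10)
    allocations: dict[str, int] = field(default_factory=dict)  # pod name -> MiB

    def free_slots(self) -> int:
        return self.split_count - len(self.allocations)

    def free_mem_mib(self) -> int:
        return self.total_mem_mib - sum(self.allocations.values())

    def try_allocate(self, pod: str, mem_mib: int) -> bool:
        """Reserve one slot and mem_mib of VRAM; refuse if either would run out."""
        if self.free_slots() < 1 or mem_mib > self.free_mem_mib():
            return False
        self.allocations[pod] = mem_mib
        return True


def fits_node(requested_vgpus: int, advertised_vgpus: int) -> bool:
    """Node-level capacity check: asking for more nvidia.com/gpu than the node
    advertises can never succeed, so the pod stays Pending."""
    return requested_vgpus <= advertised_vgpus


if __name__ == "__main__":
    l40s = PhysicalGPU("NVIDIA L40S", total_mem_mib=46068)
    print(l40s.try_allocate("pod-a", 20000))  # True
    print(l40s.try_allocate("pod-b", 15000))  # True: 35000 MiB now reserved
    print(l40s.try_allocate("pod-c", 15000))  # False: only 11068 MiB left
    print(fits_node(11, l40s.split_count))    # False, like the 11-vGPU request
```

The first two requests mirror the 20,000 MiB and 15,000 MiB pods from section 9. The third request (`pod-c`) is hypothetical and shows the point where the memory check starts refusing work. The last line mirrors the 11-vGPU request from section 10.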
+
+## Test Machine
+
+```
+GPU: 1x NVIDIA L40S (46GB VRAM)
+CPU: 8 vCPUs
+RAM: 32GB
+OS:  Ubuntu 24.04 LTS for NVIDIA GPUs (CUDA 13)
+```
+
+```
+nvidia-smi
+| NVIDIA-SMI 580.126.09 CUDA Version: 13.0 |
+| 0 NVIDIA L40S 0MiB / 46068MiB 0% |
+```
+
+46GB of VRAM, completely idle.
+
+By the end of this article, Kubernetes sees it as 10 virtual GPUs.
+
+## 1. Install k3s and Helm
+
+k3s is a good fit for a single-node environment.
+
+```bash
+curl -sfL https://get.k3s.io | sh -
+```
+
+The remaining commands, including the Helm installation, are in the full walkthrough in the GitHub repository.
+
+## 2. Install kagent
+
+kagent ships two Helm charts.
+
+Install the CRDs first, then the main chart.
+
+This lets you upgrade the CRDs independently without affecting running Agents.
+
+```bash
+helm install kagent-crds \
+  oci://ghcr.io/kagent-dev/kagent/helm/kagent-crds \
+  --namespace kagent --create-namespace
+```
+
+Then install the main chart, configured to use the Nebius Token Factory endpoint as its model provider.
+
+## 3. Install HAMi
+
+Without HAMi, Kubernetes sees no GPU at all:
+
+```json
+{"cpu": "8", "memory": "32865164Ki", "pods": "110"}
+```
+
+There is no `nvidia.com/gpu` resource.
+
+After installing HAMi:
+
+```json
+{
+  "cpu": "8",
+  "memory": "32865164Ki",
+  "nvidia.com/gpu": "10",
+  "pods": "110"
+}
+```
+
+One physical GPU, virtualized into 10.
+
+## 4. First Agent Call
+
+The LLM automatically:
+
+- Calls the Kubernetes API
+- Fetches the resources
+- Summarizes the result
+
+Final output:
+
+> "The cluster has 25 running pods across different namespaces, including kagent and kube-system."
+
+## 5. GPU Check
+
+Before HAMi:
+
+> "The node does not have any GPUs available."
+
+After HAMi:
+
+> "The node nebius-tarantula has 10 GPUs available, type NVIDIA L40S."
+
+The Agent reads and understands HAMi's Kubernetes annotations.
+
+## 6. Self-Inspection Test
+
+The Agent describes itself using the Kubernetes API.
+
+It:
+
+- Finds its own CRD
+- Reads its own system prompt
+- Reads its own tool list
+- Explains its own architecture
+
+This is an Agent reading and explaining its own definition through live API calls.
+
+## 7. Create a Custom Agent
+
+I created an SRE orchestrator that delegates metrics queries to a `promql-agent`.
+
+The key mechanism is a tool entry with this type in the orchestrator's spec:
+
+```yaml
+type: Agent
+```
+
+This enables Agent-to-Agent (A2A) delegation: the orchestrator can call `promql-agent` as one of its tools.
+
+## 8. Agent Talks to Agent
+
+The two Agents have:
+
+- Independent sessions
+- Independent context windows
+- Independent PostgreSQL storage
+
+The orchestrator sees only the sub-agent's final result, not its internal reasoning.
+
+## 9. Agent Creates a HAMi GPU Pod
+
+The Agent creates a pod that requests vGPU memory through HAMi:
+
+```yaml
+resources:
+  limits:
+    nvidia.com/gpumem: "20000"
+```
+
+Then:
+
+- The first pod was allocated 20,000 MiB
+- A second pod was allocated 15,000 MiB
+
+Both pods were scheduled onto the same physical GPU, together using 35,000 MiB of its 46,068 MiB.
+
+HAMi handles the GPU sharing correctly.
+
+## 10. Overcommit Protection
+
+When a pod requests 11 vGPUs:
+
+```yaml
+resources:
+  limits:
+    nvidia.com/gpu: 11
+```
+
+but the cluster only has 10 virtual GPUs, the scheduler reports:
+
+```
+Warning FailedScheduling hami-scheduler
+```
+
+The pod stays Pending.
+
+HAMi does not schedule requests it cannot satisfy.
+
+## 11. HAMi Metrics
+
+HAMi exposes metrics in Prometheus format, including:
+
+- `HostCoreUtilization`
+- `HostGPUMemoryUsage`
+- `hami_build_info`
+
+They plug directly into an existing Prometheus-based monitoring stack.
+
+## 12. kagent CLI
+
+The kagent CLI shows:
+
+- Agents
+- Sessions
+- A2A sub-sessions
+- Delegation latency
+
+All of this state is stored in PostgreSQL.
+
+**A2A Agent Card**
+
+Every Agent exposes an Agent Card at:
+
+```
+/.well-known/agent-card.json
+```
+
+Other agents use it for capability discovery in multi-agent systems.
+
+## What Did Not Work
+
+**Memory CRD** - only Pinecone is supported as a backend right now.
+
+**kmcp init** - not available in v0.8.6.
+
+**Ubuntu + HAMi + sleep** - if the image is missing CUDA libraries, even a container that only runs `sleep` fails to start.
+
+**HAMi WebUI** - requires a separate installation step.
+
+## Why This Combination Makes Sense
+
+Your deployment specs live in Git.
+
+Your network policies live in Git.
+
+Your RBAC rules live in Git.
+
+Why shouldn't your AI Agent's system prompts live there too?
+
+kagent makes that possible.
+
+HAMi reduces GPU resource waste without requiring changes to workloads.
+
+Together, they let an AI Agent observe, understand, and manage GPU-virtualized infrastructure from inside a Kubernetes cluster.
+
+And it does this:
+
+- Using open-source models
+- Without depending on closed-source AI providers
+- Running entirely inside Kubernetes
+
+---
+
+## Summary
+
+One NVIDIA L40S, split into 10 virtual GPUs by HAMi. An AI Agent deployed as a Kubernetes CRD via kagent. A2A delegation across independent sessions. All running on an open-source model with no closed-source dependencies.
+
+The combination works end to end: the Agent reads HAMi's annotations, creates GPU pods, detects overcommit, and queries Prometheus metrics, entirely from inside the cluster.
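The A2A delegation summarized above depends on each Agent advertising its capabilities at `/.well-known/agent-card.json`. Below is a minimal Python sketch of how an orchestrator could pick a delegate from those cards. It rests on several assumptions. The card fields it reads (`name`, `skills`, `tags`) follow my reading of the A2A AgentCard format. The cards and the base URL are made up. The code parses JSON strings so it runs offline, whereas a real client would fetch each card over HTTP first. In this article's setup, kagent's delegation is configured declaratively through the `type: Agent` tool entry, not through code like this.

```python
import json
from typing import Optional

# Well-known discovery path named in the article; the base URL shape is an assumption.
CARD_PATH = "/.well-known/agent-card.json"


def card_url(agent_base_url: str) -> str:
    """Append the well-known card path to an Agent's (hypothetical) base URL."""
    return agent_base_url.rstrip("/") + CARD_PATH


def find_delegate(cards_json: list[str], wanted_tag: str) -> Optional[str]:
    """Return the name of the first Agent with a skill tagged wanted_tag, else None."""
    for raw in cards_json:
        card = json.loads(raw)
        for skill in card.get("skills", []):
            if wanted_tag in skill.get("tags", []):
                return card.get("name")
    return None


if __name__ == "__main__":
    # Made-up cards for illustration; not captured from the test cluster.
    cards = [
        json.dumps({"name": "k8s-agent",
                    "skills": [{"id": "k8s-ops", "tags": ["kubernetes"]}]}),
        json.dumps({"name": "promql-agent",
                    "skills": [{"id": "promql", "tags": ["prometheus", "metrics"]}]}),
    ]
    print(card_url("http://agents.example/promql-agent/"))
    print(find_delegate(cards, "metrics"))  # -> promql-agent
```

Matching on skill tags keeps the orchestrator decoupled from specific agent names. A metrics request reaches whichever Agent advertises a `metrics` skill, which is the role `promql-agent` plays in the test above.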
+
+Full manifests and setup script: [github.com/mesutoezdil/kagentWithHami](https://github.com/mesutoezdil/kagentWithHami)
diff --git a/i18n/zh/docusaurus-plugin-content-blog/authors.yml b/i18n/zh/docusaurus-plugin-content-blog/authors.yml
index 59e8e85d..e8d9fb9a 100644
--- a/i18n/zh/docusaurus-plugin-content-blog/authors.yml
+++ b/i18n/zh/docusaurus-plugin-content-blog/authors.yml
@@ -5,3 +5,8 @@ elrond_wang:
 
 hami_community:
   name: HAMi 社区
+
+mesut_oezdil:
+  name: Mesut Oezdil
+  title: Author
+  url: https://www.linkedin.com/in/mesut-oezdil/
diff --git a/i18n/zh/docusaurus-plugin-content-blog/kagent-hami-ai-agent-gpu-virtualization/index.md b/i18n/zh/docusaurus-plugin-content-blog/kagent-hami-ai-agent-gpu-virtualization/index.md
new file mode 100644
index 00000000..8481e2a9
--- /dev/null
+++ b/i18n/zh/docusaurus-plugin-content-blog/kagent-hami-ai-agent-gpu-virtualization/index.md
@@ -0,0 +1,305 @@
+---
+title: "Validating AI Agent-Driven GPU Management on Kubernetes: A Hands-On Study with HAMi and kagent"
+date: "2026-05-28"
+description: "AI Agents have started managing Kubernetes GPU resources directly: a hands-on test of how kagent + HAMi deliver GPU virtualization, Agent collaboration, and AI infrastructure driven by open-source models."
+authors: [mesut_oezdil]
+tags: ["HAMi", "kagent", "GPU Virtualization", "AI Agent", "Kubernetes", "vGPU", "Cloud Native"]
+---
+
+- **Author:** [Mesut Oezdil](https://www.linkedin.com/in/mesut-oezdil/) / [GitHub](https://github.com/mesutoezdil)
+- **Original:** [mesutoezdil.substack.com](https://mesutoezdil.substack.com/p/kagent-hami-on-nebius-2-cncf-projects)
+- **GitHub Repo:** [kagentWithHami](https://github.com/mesutoezdil/kagentWithHami)
+- **Chinese translation:** Jimmy Song (originally published on the [WeChat official account](https://mp.weixin.qq.com/s/WNzZh02_1CbMbVBfi4eRGw))
+
+---
+
+
+
+## Before We Start
+
+This is not a "documentation summary."
+
+I ran every command you see below myself on a Nebius VM, and every output also comes from that machine.
+
+When something failed, I debugged it. When something worked, I explain why it worked. The errors in this article are errors I actually ran into, and the fixes are solutions I verified myself.
+
+If you run these commands in the same environment, you should get the same results.
+
+The complete repository (including all manifests and the setup script) is here:
+
+https://github.com/mesutoezdil/kagentWithHami
+
+About the scope: this article covers only the core parts. The full installation flow, all manifests, the complete troubleshooting guide, and the setup script are in the GitHub repository. If you want to run it yourself, start with the repository.
+
+If you haven't worked with HAMi before:
+
+https://medium.com/@mesutoezdil/hami-in-a-real-kubernetes-environment-e8eaa872f388
+
+If you want to see tests of GPU observability tooling:
+
+https://mesutoezdil.substack.com/p/i-tested-every-feature-of-ingero
+
+## What This Article Is Actually About
+
+kagent turns AI Agents into Kubernetes resources.
+
+Your system prompt, tools, and model config all exist as CRDs.
+
+You can:
+
+- Manage their versions with Git
+- Deploy them with Helm
+- Inspect them with kubectl
+
+HAMi, in turn, implements GPU virtualization at the Kubernetes scheduler layer.
+
+One physical NVIDIA L40S can become 10 virtual GPUs in Kubernetes, with strict VRAM limits enforced at the CUDA driver level.
+
+Nebius Token Factory is an inference service compatible with the OpenAI API.
+
+All tests in this article use Llama 3.3 70B.
+
+The question I wanted to verify:
+
+> "Can an AI Agent, inside a Kubernetes cluster and using only open-source models, manage GPU-virtualized workloads?"
+
+The answer:
+
+Yes.
+
+## Test Machine
+
+```
+GPU: 1x NVIDIA L40S (46GB VRAM)
+CPU: 8 vCPUs
+RAM: 32GB
+OS:  Ubuntu 24.04 LTS for NVIDIA GPUs (CUDA 13)
+```
+
+```
+nvidia-smi
+| NVIDIA-SMI 580.126.09 CUDA Version: 13.0 |
+| 0 NVIDIA L40S 0MiB / 46068MiB 0% |
+```
+
+46GB of VRAM, completely idle.
+
+By the end of this article, it will have been turned into 10 virtual GPUs.
+
+## 1. Install k3s and Helm
+
+k3s is an ideal choice for a single-node environment.
+
+```bash
+curl -sfL https://get.k3s.io | sh -
+```
+
+(The remaining commands are the same as in the original article.)
+
+## 2. Install kagent
+
+kagent provides two Helm charts.
+
+Install the CRDs first, then the main chart.
+
+That way you can upgrade the CRDs independently without affecting running Agents.
+
+```bash
+helm install kagent-crds \
+  oci://ghcr.io/kagent-dev/kagent/helm/kagent-crds \
+  --namespace kagent --create-namespace
+```
+
+After that, install the main chart and connect it to Nebius Token Factory.
+
+## 3. Install HAMi
+
+Without HAMi, Kubernetes cannot see the GPU at all:
+
+```json
+{"cpu": "8", "memory": "32865164Ki", "pods": "110"}
+```
+
+There is no `nvidia.com/gpu`.
+
+After HAMi is installed:
+
+```json
+{
+  "cpu": "8",
+  "memory": "32865164Ki",
+  "nvidia.com/gpu": "10",
+  "pods": "110"
+}
+```
+
+One physical GPU has been virtualized into 10 GPUs.
+
+## 4. First Agent Call
+
+The LLM automatically:
+
+- Calls the Kubernetes API
+- Fetches the resources
+- Summarizes the results
+
+Final output:
+
+> "The cluster has 25 running pods across different namespaces, including kagent and kube-system."
+
+## 5. GPU Check
+
+Before installing HAMi:
+
+> "The node does not have any GPUs available."
+
+After installing HAMi:
+
+> "The node nebius-tarantula has 10 GPUs available, type NVIDIA L40S."
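+
+The Agent can read and understand HAMi's Kubernetes annotations.
+
+## 6. Self-Inspection Test
+
+The Agent uses the Kubernetes API to describe itself.
+
+It:
+
+- Finds its own CRD
+- Reads its own system prompt
+- Reads its own tool list
+- Explains its own architecture
+
+An Agent reads and explains its own definition through live API calls.
+
+## 7. Create a Custom Agent
+
+I created an SRE orchestrator and delegated metrics queries to `promql-agent`.
+
+The core mechanism:
+
+```yaml
+type: Agent
+```
+
+This enables Agent-to-Agent (A2A) delegation.
+
+## 8. Agent Talks to Agent
+
+The two separate Agents have:
+
+- Independent sessions
+- Independent context windows
+- Independent PostgreSQL storage
+
+The orchestrator can see only the sub-agent's final result, not its internal reasoning.
+
+## 9. Agent Creates a HAMi GPU Pod
+
+The Agent creates a pod that requests vGPU memory through HAMi:
+
+```yaml
+resources:
+  limits:
+    nvidia.com/gpumem: "20000"
+```
+
+Then:
+
+- The first pod is allocated 20,000 MiB
+- The second pod is allocated 15,000 MiB
+
+The two pods are scheduled onto the same physical GPU and share it.
+
+HAMi handles GPU sharing correctly.
+
+## 10. Overcommit Protection
+
+When a pod requests:
+
+```yaml
+resources:
+  limits:
+    nvidia.com/gpu: 11
+```
+
+but the cluster has only 10 virtual GPUs:
+
+```
+Warning FailedScheduling hami-scheduler
+```
+
+The pod stays Pending.
+
+HAMi does not schedule resource requests it cannot satisfy.
+
+## 11. HAMi Metrics
+
+HAMi exposes metrics in Prometheus format:
+
+- `HostCoreUtilization`
+- `HostGPUMemoryUsage`
+- `hami_build_info`
+
+They can be plugged directly into an existing monitoring system.
+
+## 12. kagent CLI
+
+The kagent CLI can show:
+
+- Agents
+- Sessions
+- A2A sub-sessions
+- Delegation latency
+
+All state is stored in PostgreSQL.
+
+**A2A Agent Card**
+
+Every Agent exposes:
+
+```
+/.well-known/agent-card.json
+```
+
+It is used for capability discovery in multi-agent systems.
+
+## What Did Not Work
+
+**Memory CRD** - currently only Pinecone is supported.
+
+**kmcp init** - not available in v0.8.6.
+
+**Ubuntu + HAMi + sleep** - if the image is missing CUDA libraries, even a container running only `sleep` cannot start.
+
+**HAMi WebUI** - requires a separate installation.
+
+## Why This Combination Makes Sense
+
+Your deployment specs are in Git.
+
+Your network policies are in Git.
+
+Your RBAC rules are in Git too.
+
+So why can't your AI Agent's system prompts be in Git as well?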
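+
+kagent makes this possible.
+
+HAMi, for its part, addresses GPU resource waste without modifying workloads.
+
+Combined, the two projects mean that:
+
+An AI Agent can observe, understand, and manage GPU-virtualized infrastructure directly from inside a Kubernetes cluster.
+
+And it does so:
+
+- Using open-source models
+- Without depending on closed-source AI providers
+- Running entirely inside Kubernetes
+
+---
+
+One NVIDIA L40S, split into 10 virtual GPUs by HAMi; an AI Agent deployed as a Kubernetes CRD via kagent; A2A (Agent-to-Agent) delegation across independent sessions; and the whole system running on an open-source model without depending on any closed-source components.
+
+The combination closes the loop end to end: the Agent can read HAMi's annotations, create GPU pods, detect resource overcommit, and query Prometheus metrics, all entirely inside the Kubernetes cluster.
+
+For the full deployment manifests and the setup script, see:
+[github.com/mesutoezdil/kagentWithHami](https://github.com/mesutoezdil/kagentWithHami)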