[FEATURE] Voice interaction with agents via Gemini Multimodal Live API #1563
Description
Prerequisites
- I have searched the existing issues to make sure this feature has not been requested before
- I agree to follow the Code of Conduct
📝 Feature Summary
Add voice interaction support to kagent agents via the Gemini Multimodal Live API, allowing users to speak to agents and receive spoken responses through native audio processing with function calling bridging voice to the A2A protocol.
❓ Problem Statement / Motivation
Currently, kagent agents can only be interacted with via text (CLI, kagent UI, Slack integration). Voice interaction would:
- Enable hands-free operation for SREs during incident response (e.g., "What pods are crashing in production?")
- Make the platform more accessible to non-technical stakeholders who prefer natural conversation over typing commands
- Reduce friction for quick queries — speaking is faster than typing for many operations
The Gemini Multimodal Live API makes this feasible without building a separate STT/TTS pipeline, since it handles audio input/output natively while supporting function calling to bridge into the A2A protocol.
💡 Proposed Solution
A lightweight voice interface that connects Gemini Live API to any kagent agent via A2A:
Architecture:
```
Browser (mic/speaker)
 │
 ├── WebSocket ──→ Gemini Live API (native audio in/out)
 │                      │
 │                      └── tool_call: "send_to_agent"
 │                              │
 └── JS handler ──→ Backend proxy ──→ kagent A2A endpoint
```
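The mic leg of the diagram amounts to framing browser audio as base64 PCM chunks on the WebSocket. The frontend would do this in JavaScript; the sketch below shows the framing in Python for clarity. The `realtimeInput`/`mediaChunks` field names follow our reading of the Live API message format and should be verified against the docs.

```python
import base64
import struct

def pcm16_chunk(samples, rate=16000):
    """Convert float samples in [-1, 1] (as the Web Audio API produces)
    into a base64-encoded 16-bit little-endian PCM chunk, wrapped in a
    Live-API-style realtimeInput message (field names are an assumption)."""
    clamped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack("<%dh" % len(clamped),
                      *(int(s * 32767) for s in clamped))
    return {
        "realtimeInput": {
            "mediaChunks": [{
                "mimeType": f"audio/pcm;rate={rate}",
                "data": base64.b64encode(pcm).decode("ascii"),
            }]
        }
    }
```

Output audio from Gemini arrives the same way in reverse (24kHz PCM chunks to decode and queue into the speaker).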
Components:
- Frontend — Browser-based voice UI using Web Audio API for mic/speaker, WebSocket to Gemini Live API
- Backend — Lightweight proxy (FastAPI/Go) that forwards tool call requests from the browser to the kagent A2A endpoint, avoiding CORS and keeping API keys server-side
- Gemini Live session — Configured with a `send_to_agent` function declaration so Gemini routes platform/K8s requests through A2A while handling greetings and simple questions directly
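The session setup could look roughly like the following. This is a minimal sketch: the `send_to_agent` declaration and its parameter schema are our own design, and the setup-message field names should be checked against the Live API docs.

```python
# Sketch of the WebSocket setup message for the Live session.
# The send_to_agent declaration is hypothetical (our A2A bridge).
SETUP_MESSAGE = {
    "setup": {
        "model": "models/gemini-2.5-flash-live-preview",
        "generationConfig": {"responseModalities": ["AUDIO"]},
        "systemInstruction": {
            "parts": [{"text": "Answer greetings and simple questions "
                               "yourself; route Kubernetes and platform "
                               "requests through send_to_agent."}]
        },
        "tools": [{
            "functionDeclarations": [{
                "name": "send_to_agent",
                "description": "Forward a request to a kagent agent over "
                               "A2A and return its reply.",
                "parameters": {
                    "type": "OBJECT",
                    "properties": {
                        "message": {
                            "type": "STRING",
                            "description": "The user's request, as text.",
                        }
                    },
                    "required": ["message"],
                },
            }]
        }],
    }
}
```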
Key details:
- Audio: 16kHz PCM input, 24kHz PCM output, with input/output transcription for visual display
- Function calling: Gemini Live supports synchronous tool use — when the user asks for a K8s action, Gemini calls the tool, waits for the A2A response, and speaks the result naturally
- Model: `gemini-2.5-flash-live-preview` (low latency, supports audio + tools)
- Could be deployed standalone or integrated into the kagent UI as a voice mode toggle
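The function-calling bridge boils down to two small translations in the backend proxy: a Gemini tool call becomes an A2A message, and the agent's reply is wrapped back into a tool response Gemini can speak. A hedged sketch, assuming the A2A `message/send` JSON-RPC shape and Live API `toolResponse` field names (verify both against the specs):

```python
import uuid

def a2a_request_from_tool_call(call):
    """Build an A2A JSON-RPC 'message/send' request from a Gemini Live
    toolCall entry ({"id": ..., "name": "send_to_agent", "args": {...}})."""
    return {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": call["args"]["message"]}],
                "messageId": str(uuid.uuid4()),
            }
        },
    }

def tool_response(call, agent_reply):
    """Wrap the agent's text reply as a Live API toolResponse so Gemini
    can speak the result back to the user."""
    return {
        "toolResponse": {
            "functionResponses": [{
                "id": call["id"],
                "name": call["name"],
                "response": {"output": agent_reply},
            }]
        }
    }
```

Because Gemini Live tool use is synchronous, the proxy can POST the first payload to the agent's A2A endpoint, wait for the reply, and send the second back over the same WebSocket.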
🔄 Alternatives Considered
- Browser Web Speech API — Free but lower quality STT/TTS, requires a separate text-to-A2A bridge, no native function calling
- Google Cloud Speech + TTS — High quality but adds two separate services to manage, more latency from STT→text→LLM→text→TTS pipeline
- OpenAI Realtime API — Good quality but adds an OpenAI dependency when kagent already uses Gemini
Gemini Live is the best fit because it eliminates the STT/TTS pipeline entirely and natively supports function calling to bridge into A2A.
🎯 Affected Service(s)
UI Service
📚 Additional Context
- Gemini Multimodal Live API docs
- Live API Tool Use
- google-gemini/live-api-web-console — React starter from Google
- We have a working prototype in the Krateo autopilot-kagent repo (`autopilot/voice-ui/`) that demonstrates this approach with a FastAPI backend + vanilla JS frontend
🙋 Are you willing to contribute?
- I would be willing to submit a PR to implement this feature