[FEATURE] Voice interaction with agents via Gemini Multimodal Live API #1563
Description
Prerequisites
- I have searched the existing issues to make sure this feature has not been requested before
- I agree to follow the Code of Conduct
📝 Feature Summary
Add voice interaction support to kagent agents via the Gemini Multimodal Live API, allowing users to speak to agents and receive spoken responses through native audio processing with function calling bridging voice to the A2A protocol.
❓ Problem Statement / Motivation
Currently, kagent agents can only be interacted with via text (CLI, kagent UI, Slack integration). Voice interaction would:
- Enable hands-free operation for SREs during incident response (e.g., "What pods are crashing in production?")
- Make the platform more accessible to non-technical stakeholders who prefer natural conversation over typing commands
- Reduce friction for quick queries — speaking is faster than typing for many operations
The Gemini Multimodal Live API makes this feasible without building a separate STT/TTS pipeline, since it handles audio input/output natively while supporting function calling to bridge into the A2A protocol.
💡 Proposed Solution
A lightweight voice interface that connects Gemini Live API to any kagent agent via A2A:
Architecture:
```
Browser (mic/speaker)
 │
 ├── WebSocket ──→ Gemini Live API (native audio in/out)
 │                      │
 │                      └── tool_call: "send_to_agent"
 │                              │
 └── JS handler ──→ Backend proxy ──→ kagent A2A endpoint
```
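The mic leg of the diagram amounts to framing browser audio as base64 PCM chunks on the WebSocket. The frontend would do this in JavaScript; the sketch below shows the framing in Python for clarity. The `realtimeInput`/`mediaChunks` field names follow our reading of the Live API message format and should be verified against the docs.

```python
import base64
import struct

def pcm16_chunk(samples, rate=16000):
    """Convert float samples in [-1, 1] (as the Web Audio API produces)
    into a base64-encoded 16-bit little-endian PCM chunk, wrapped in a
    Live-API-style realtimeInput message (field names are an assumption)."""
    clamped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack("<%dh" % len(clamped),
                      *(int(s * 32767) for s in clamped))
    return {
        "realtimeInput": {
            "mediaChunks": [{
                "mimeType": f"audio/pcm;rate={rate}",
                "data": base64.b64encode(pcm).decode("ascii"),
            }]
        }
    }
```

Output audio from Gemini arrives the same way in reverse (24kHz PCM chunks to decode and queue into the speaker).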
Components:
- Frontend — Browser-based voice UI using Web Audio API for mic/speaker, WebSocket to Gemini Live API
- Backend — Lightweight proxy (FastAPI/Go) that forwards tool call requests from the browser to the kagent A2A endpoint, avoiding CORS and keeping API keys server-side
- Gemini Live session — Configured with a `send_to_agent` function declaration so Gemini routes platform/K8s requests through A2A while handling greetings and simple questions directly
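The session setup could look roughly like the following. This is a minimal sketch: the `send_to_agent` declaration and its parameter schema are our own design, and the setup-message field names should be checked against the Live API docs.

```python
# Sketch of the WebSocket setup message for the Live session.
# The send_to_agent declaration is hypothetical (our A2A bridge).
SETUP_MESSAGE = {
    "setup": {
        "model": "models/gemini-2.5-flash-live-preview",
        "generationConfig": {"responseModalities": ["AUDIO"]},
        "systemInstruction": {
            "parts": [{"text": "Answer greetings and simple questions "
                               "yourself; route Kubernetes and platform "
                               "requests through send_to_agent."}]
        },
        "tools": [{
            "functionDeclarations": [{
                "name": "send_to_agent",
                "description": "Forward a request to a kagent agent over "
                               "A2A and return its reply.",
                "parameters": {
                    "type": "OBJECT",
                    "properties": {
                        "message": {
                            "type": "STRING",
                            "description": "The user's request, as text.",
                        }
                    },
                    "required": ["message"],
                },
            }]
        }],
    }
}
```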
Key details:
- Audio: 16kHz PCM input, 24kHz PCM output, with input/output transcription for visual display
- Function calling: Gemini Live supports synchronous tool use — when the user asks for a K8s action, Gemini calls the tool, waits for the A2A response, and speaks the result naturally
- Model: `gemini-2.5-flash-live-preview` (low latency, supports audio + tools)
- Could be deployed standalone or integrated into the kagent UI as a voice mode toggle
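The function-calling bridge boils down to two small translations in the backend proxy: a Gemini tool call becomes an A2A message, and the agent's reply is wrapped back into a tool response Gemini can speak. A hedged sketch, assuming the A2A `message/send` JSON-RPC shape and Live API `toolResponse` field names (verify both against the specs):

```python
import uuid

def a2a_request_from_tool_call(call):
    """Build an A2A JSON-RPC 'message/send' request from a Gemini Live
    toolCall entry ({"id": ..., "name": "send_to_agent", "args": {...}})."""
    return {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": call["args"]["message"]}],
                "messageId": str(uuid.uuid4()),
            }
        },
    }

def tool_response(call, agent_reply):
    """Wrap the agent's text reply as a Live API toolResponse so Gemini
    can speak the result back to the user."""
    return {
        "toolResponse": {
            "functionResponses": [{
                "id": call["id"],
                "name": call["name"],
                "response": {"output": agent_reply},
            }]
        }
    }
```

Because Gemini Live tool use is synchronous, the proxy can POST the first payload to the agent's A2A endpoint, wait for the reply, and send the second back over the same WebSocket.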
🔄 Alternatives Considered
- Browser Web Speech API — Free but lower quality STT/TTS, requires a separate text-to-A2A bridge, no native function calling
- Google Cloud Speech + TTS — High quality but adds two separate services to manage, more latency from STT→text→LLM→text→TTS pipeline
- OpenAI Realtime API — Good quality but adds an OpenAI dependency when kagent already uses Gemini
Gemini Live is the best fit because it eliminates the STT/TTS pipeline entirely and natively supports function calling to bridge into A2A.
🎯 Affected Service(s)
UI Service
📚 Additional Context
- Gemini Multimodal Live API docs
- Live API Tool Use
- google-gemini/live-api-web-console — React starter from Google
- We have a working prototype in the Krateo autopilot-kagent repo (`autopilot/voice-ui/`) that demonstrates this approach with a FastAPI backend + vanilla JS frontend
🙋 Are you willing to contribute?
- I would be willing to submit a PR to implement this feature