███████╗ ██████╗██╗ ██╗ ██████╗ ███████╗███████╗
██╔════╝██╔════╝██║ ██║██╔═══██╗██╔════╝██╔════╝
█████╗ ██║ ███████║██║ ██║█████╗ ███████╗
██╔══╝ ██║ ██╔══██║██║ ██║██╔══╝ ╚════██║
███████╗╚██████╗██║ ██║╚██████╔╝███████╗███████║
╚══════╝ ╚═════╝╚═╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝
Raw call recordings → on-device Android TTS · No cloud · No subscription · Forever
I had a thought one day: if I were to die right now, my loved ones would never hear my voice again.
Not a voicemail. Not a shaky video clip. Nothing intentional — just silence.
So I set out to fix that. I wanted to clone my voice from something I already had — a call recording, thirty seconds of me talking — and turn it into a model my family could carry on their phone forever. No internet. No company that might shut down. No subscription that lapses. Just my voice, still there, whenever they need it.
I found OmniVoice (zero-shot cloning from a short clip), then Piper (a real TTS model), then Sherpa-ONNX (fully on-device Android inference), then a TTS engine app on F-Droid that ties it all together. The pieces existed — they just weren't connected, and most of the official notebooks were broken.
So I connected them. Fixed the bugs. And built this.
A 7-notebook Google Colab pipeline that takes raw call recordings and produces a voice model that runs fully on-device on Android — no internet required after export.
All you need is ~30 seconds of someone's voice from any call recording.
Built on Piper VITS + Sherpa-ONNX. Every stage runs on the free Colab T4 GPU.
┌──────────────────────────────────────────────────────────────────────────────────┐
│ │
│ [Raw Recordings — any call, any format] │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 00 · Diarize│ pyannote.audio — speaker segmentation, resume-safe │
│ └──────┬──────┘ │
│ │ per-speaker WAV clips │
│ ▼ │
│ ┌─────────────┐ │
│ │ 01 · Clone │ OmniVoice zero-shot synthesis · faster-whisper transcription │
│ └──────┬──────┘ │
│ │ LJSpeech dataset (wavs/ + metadata.csv) │
│ ▼ │
│ ┌─────────────┐ │
│ │ 02 · Train │ Piper VITS fine-tune on hi_IN-rohan-medium │
│ └──────┬──────┘ │
│ │ last.ckpt + config.json → Google Drive │
│ ▼ │
│ ┌─────────────┐ │
│ │ 03 · Check │ Interactive PyTorch inference — validate before exporting │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 04 · Export │ .ckpt → .onnx (all upstream bugs fixed — see below) │
│ └──────┬──────┘ │
│ │ voice-package.tar.gz │
│ ▼ │
│ ┌─────────────┐ │
│ │ 05 · Verify │ ONNX Runtime inference — standard + streaming models │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 06 · Sherpa │ Metadata injection · tokens.txt · sherpa_model.tar.gz │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ [Android TTS — fully on-device, no internet, forever] │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
| # | Notebook | What it does | Key libraries |
|---|---|---|---|
| 0 | Diarize & Clip | Speaker diarization on raw recordings. Segments audio per speaker. Resume-safe progress tracking. | pyannote.audio 3.1, ffmpeg |
| 1 | Voice Clone → Dataset | Zero-shot synthesis of 100+ sentences in the cloned voice. Auto-transcription. LJSpeech output format. | OmniVoice, faster-whisper large-v3 |
| 2 | Dataset → Piper CKPT | Fine-tunes hi_IN-rohan-medium Piper base model. TensorBoard logging. Checkpoint export to Drive. | Piper VITS, PyTorch Lightning, piper-phonemize-fix |
| 3 | Check the CKPT | Interactive inference on the raw PyTorch checkpoint. Validates voice quality before ONNX export. | Piper VITS, espeak-ng, ipywidgets |
| 4 | Export to ONNX | Exports .ckpt → .onnx. Rewrites broken upstream export scripts with all critical fixes applied. | torch.onnx (opset 15), onnxscript |
| 5 | Check ONNX | Validates the exported model. Full encoder→decoder pipeline wired for streaming models. | onnxruntime, piper-phonemize-fix |
| 6 | Piper → Sherpa-ONNX | Injects Sherpa-ONNX metadata into .onnx. Generates tokens.txt. Packages for Android deployment. | sherpa-onnx, onnx metadata API |
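As a rough illustration of the tokens.txt step in notebook 6, here is a minimal sketch, not the notebook's actual code. It assumes the Piper config.json carries a `phoneme_id_map` of the form `{"symbol": [id]}` and that tokens.txt holds one `symbol id` pair per line; the helper name `write_tokens` and the tiny demo config are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_tokens(config_path: Path, tokens_path: Path) -> int:
    """Write one `symbol id` line per entry of the config's phoneme_id_map.

    Assumption: each map value is a list whose first element is the token id.
    Returns the number of tokens written.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    id_map = config["phoneme_id_map"]
    lines = [f"{symbol} {ids[0]}" for symbol, ids in id_map.items()]
    tokens_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Demo on a made-up four-symbol config (a real Piper config has many more).
with tempfile.TemporaryDirectory() as tmp:
    demo_config = Path(tmp) / "config.json"
    demo_config.write_text(
        json.dumps({"phoneme_id_map": {"_": [0], "^": [1], "$": [2], "a": [14]}}),
        encoding="utf-8",
    )
    demo_tokens = Path(tmp) / "tokens.txt"
    token_count = write_tokens(demo_config, demo_tokens)
    tokens_text = demo_tokens.read_text(encoding="utf-8")
```

In practice you would point `config_path` at the config.json saved next to the exported model and ship the resulting tokens.txt inside the Sherpa package.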
All of the following were broken in the official Piper notebooks and diagnosed independently:
| # | Bug | Root cause | Fix |
|---|---|---|---|
| 1 | torch.onnx.export crash | PyTorch 2.x defaults dynamo=True, which breaks dynamic_axes and None sid inputs | Explicitly pass dynamo=False to force the legacy TorchScript exporter |
| 2 | CPU/CUDA device mismatch | Model loads to CUDA by default; dummy inputs are created on CPU — tracing fails | model_g = model_g.cpu() before export |
| 3 | Opset version wrong for Sherpa | Official notebook used opset_version=15 everywhere; Sherpa-ONNX requires opset 11 | opset_version=11 on the Sherpa export path |
| 4 | PyTorch 2.6 checkpoint loading | Default changed to weights_only=True, breaking Lightning's complex checkpoint objects | weights_only=False, strict=False in VitsModel.load_from_checkpoint |
| 5 | Colab dependency conflict | piper-phonemize has broken Colab deps | Replaced with piper-phonemize-fix throughout |
| 6 | Streaming inference not wired | Inference script detected encoder/decoder pairs but threw "not yet supported" | Full encoder → decoder inference pipeline implemented |
| 7 | Config file not detected | ONNX inference only matched config.json, missing Piper's *.onnx.json naming | Added *.onnx.json glob to detect_onnx_models |
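To make fix 7 concrete, here is a small sketch of config detection that accepts both naming schemes. It is an illustration only, not the body of `detect_onnx_models`; the helper name `find_config`, the lookup order, and the demo file names are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_config(model_path: Path) -> Optional[Path]:
    """Locate the JSON config that belongs to an exported .onnx model.

    Lookup order (an assumption for this sketch):
      1. Piper naming: <model>.onnx.json next to the model
      2. plain config.json in the same folder
      3. any other *.onnx.json in the same folder
    """
    candidates = [
        model_path.with_name(model_path.name + ".json"),
        model_path.with_name("config.json"),
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    matches = sorted(model_path.parent.glob("*.onnx.json"))
    return matches[0] if matches else None


def _name(path: Optional[Path]) -> Optional[str]:
    return path.name if path is not None else None


# Demo: three folders covering Piper naming, plain naming, and no config.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder in ("piper", "plain", "empty"):
        (root / folder).mkdir()
        (root / folder / "voice.onnx").touch()
    (root / "piper" / "voice.onnx.json").write_text("{}", encoding="utf-8")
    (root / "plain" / "config.json").write_text("{}", encoding="utf-8")

    found_piper = _name(find_config(root / "piper" / "voice.onnx"))
    found_plain = _name(find_config(root / "plain" / "voice.onnx"))
    found_none = _name(find_config(root / "empty" / "voice.onnx"))
```

Checking the exact `<model>.onnx.json` name before globbing avoids picking up the config of a different model that happens to sit in the same folder.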
My Drive/
└── Voicecloning/
├── raw_calls/ ← put your MP3 / WAV recordings here
├── clipped_audio/ ← Stage 0 output: per-speaker clips
└── training/
├── colab/piper/ ← checkpoints and config.json
└── piper-voice-packages/ ← exported .tar.gz voice packages
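If you prefer to create this layout from a notebook cell rather than by hand, a sketch along these lines should work. The helper `create_layout` is hypothetical; in Colab the root would typically be `/content/drive/MyDrive` after mounting Drive, which is an assumption about your setup, so the demo below uses a temporary folder instead.

```python
import tempfile
from pathlib import Path
from typing import List

# Folder names taken from the tree above, relative to Voicecloning/.
SUBDIRS = [
    "raw_calls",
    "clipped_audio",
    "training/colab/piper",
    "training/piper-voice-packages",
]


def create_layout(drive_root: Path) -> List[Path]:
    """Create the Voicecloning/ folder tree under drive_root (idempotent)."""
    base = drive_root / "Voicecloning"
    created = []
    for sub in SUBDIRS:
        path = base / sub
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demo in a temporary folder; swap in your mounted Drive root when running in Colab.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout_names = [p.relative_to(root).as_posix() for p in create_layout(root)]
    all_exist = all((root / name).is_dir() for name in layout_names)
```

Because `mkdir` is called with `exist_ok=True`, rerunning the cell after a session restart leaves existing recordings and checkpoints untouched.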
HuggingFace token — required for pyannote/speaker-diarization-3.1 (Stage 0). Accept the model licence on HuggingFace, then paste your token into the notebook's Colab form field. No .env file needed.
Colab GPU runtime — Stages 1 and 2 need a T4 or better. The free tier works; training runs 6–12 hours. Stages 3–6 can run on CPU.
Google Drive (~2 GB free) — checkpoints, datasets, and exports save to Drive so they survive session restarts.
The final sherpa_model.tar.gz can be used with any Sherpa-ONNX compatible Android application.
Recommended App for Testing:
- TTS Engine (F-Droid) — A lightweight, open-source TTS engine that supports Sherpa-ONNX.
Verified Devices:
- Redmi 9A
- Redmi K20 Pro
- Poco F1
| Layer | Tools |
|---|---|
| Diarization | pyannote.audio 3.1, ffmpeg |
| Voice synthesis | OmniVoice (k2-fsa), faster-whisper large-v3 |
| TTS architecture | VITS via Piper (rmcpantoja fork) |
| Training | PyTorch Lightning, piper-phonemize-fix, espeak-ng |
| Export | torch.onnx TorchScript path, opset 11/15, onnxscript |
| On-device inference | Sherpa-ONNX, onnxruntime |
| Platform | Google Colab (free T4 GPU) |
- rmcpantoja/piper — Piper training & inference notebooks (base, heavily modified)
- rhasspy/piper — Piper TTS core
- k2-fsa/OmniVoice — zero-shot voice synthesis
- k2-fsa/sherpa-onnx — on-device ONNX inference
Amit Basuri · github.com/PositiveMatician · New Delhi
Built because voices shouldn't disappear. All upstream bug fixes (PyTorch 2.6 compat, ONNX export, Sherpa-ONNX conversion) diagnosed and implemented independently.