PositiveMatician/Echoes


███████╗ ██████╗██╗  ██╗ ██████╗ ███████╗███████╗
██╔════╝██╔════╝██║  ██║██╔═══██╗██╔════╝██╔════╝
█████╗  ██║     ███████║██║   ██║█████╗  ███████╗
██╔══╝  ██║     ██╔══██║██║   ██║██╔══╝  ╚════██║
███████╗╚██████╗██║  ██║╚██████╔╝███████╗███████║
╚══════╝ ╚═════╝╚═╝  ╚═╝ ╚═════╝ ╚══════╝╚══════╝

Leave your voice behind — for the people who love you

Raw call recordings → on-device Android TTS · No cloud · No subscription · Forever



Why this exists

I had a thought one day: if I were to die right now, my loved ones would never hear my voice again.

Not a voicemail. Not a shaky video clip. Nothing intentional — just silence.

So I set out to fix that. I wanted to clone my voice from something I already had — a call recording, thirty seconds of me talking — and turn it into a model my family could carry on their phone forever. No internet. No company that might shut down. No subscription that lapses. Just my voice, still there, whenever they need it.

I found OmniVoice (zero-shot cloning from a short clip), then Piper (a real TTS model), then Sherpa-ONNX (fully on-device Android inference), then a TTS engine app on F-Droid that ties it all together. The pieces existed — they just weren't connected, and most of the official notebooks were broken.

So I connected them. Fixed the bugs. And built this.


What Echoes is

A 7-notebook Google Colab pipeline that takes raw call recordings and produces a voice model that runs fully on-device on Android — no internet required after export.

All you need is ~30 seconds of someone's voice from any call recording.

Built on Piper VITS + Sherpa-ONNX. Every stage runs on the free Colab T4 GPU.


Pipeline

┌──────────────────────────────────────────────────────────────────────────────────┐
│                                                                                  │
│   [Raw Recordings — any call, any format]                                        │
│         │                                                                        │
│         ▼                                                                        │
│   ┌─────────────┐                                                                │
│   │  00 · Diarize│  pyannote.audio — speaker segmentation, resume-safe           │
│   └──────┬──────┘                                                                │
│          │  per-speaker WAV clips                                                │
│          ▼                                                                       │
│   ┌─────────────┐                                                                │
│   │  01 · Clone  │  OmniVoice zero-shot synthesis · faster-whisper transcription │
│   └──────┬──────┘                                                                │
│          │  LJSpeech dataset (wavs/ + metadata.csv)                             │
│          ▼                                                                       │
│   ┌─────────────┐                                                                │
│   │  02 · Train  │  Piper VITS fine-tune on hi_IN-rohan-medium                  │
│   └──────┬──────┘                                                                │
│          │  last.ckpt + config.json → Google Drive                               │
│          ▼                                                                       │
│   ┌─────────────┐                                                                │
│   │  03 · Check  │  Interactive PyTorch inference — validate before exporting    │
│   └──────┬──────┘                                                                │
│          │                                                                       │
│          ▼                                                                       │
│   ┌─────────────┐                                                                │
│   │  04 · Export │  .ckpt → .onnx  (all upstream bugs fixed — see below)        │
│   └──────┬──────┘                                                                │
│          │  voice-package.tar.gz                                                 │
│          ▼                                                                       │
│   ┌─────────────┐                                                                │
│   │  05 · Verify │  ONNX Runtime inference — standard + streaming models         │
│   └──────┬──────┘                                                                │
│          │                                                                       │
│          ▼                                                                       │
│   ┌─────────────┐                                                                │
│   │  06 · Sherpa │  Metadata injection · tokens.txt · sherpa_model.tar.gz       │
│   └──────┬──────┘                                                                │
│          │                                                                       │
│          ▼                                                                       │
│   [Android TTS — fully on-device, no internet, forever]                         │
│                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────┘
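
The hand-off between stages 01 and 02 is an LJSpeech-style dataset: a wavs/ folder plus a metadata.csv. As a rough sketch of what writing that file could look like, the helper below emits one pipe-delimited `id|text` row per clip. The function name, the two-column layout, and the whitespace clean-up are assumptions for illustration; the notebook's own writer may differ.

```python
from pathlib import Path


def write_ljspeech_metadata(dataset_dir, transcripts):
    """Write dataset_dir/metadata.csv for clips stored in dataset_dir/wavs/.

    `transcripts` maps a clip id (WAV file name without extension) to its text.
    Returns the clip ids that were written, in sorted order.
    """
    dataset_dir = Path(dataset_dir)
    wav_dir = dataset_dir / "wavs"
    written = []
    with open(dataset_dir / "metadata.csv", "w", encoding="utf-8", newline="\n") as f:
        for clip_id, text in sorted(transcripts.items()):
            # Skip transcripts whose audio file is missing (assumed policy).
            if not (wav_dir / f"{clip_id}.wav").exists():
                continue
            # "|" is the column separator, so strip it from the text and
            # collapse runs of whitespace into single spaces.
            clean = " ".join(text.replace("|", " ").split())
            f.write(f"{clip_id}|{clean}\n")
            written.append(clip_id)
    return written
```

Skipping rows without a matching WAV keeps the CSV and the audio folder in sync, which matters when a synthesis run is interrupted halfway through.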

Notebooks

| # | Notebook | What it does | Key libraries |
|---|----------|--------------|---------------|
| 0 | Diarize & Clip | Speaker diarization on raw recordings. Segments audio per speaker. Resume-safe progress tracking. | pyannote.audio 3.1, ffmpeg |
| 1 | Voice Clone → Dataset | Zero-shot synthesis of 100+ sentences in the cloned voice. Auto-transcription. LJSpeech output format. | OmniVoice, faster-whisper large-v3 |
| 2 | Dataset → Piper CKPT | Fine-tunes the hi_IN-rohan-medium Piper base model. TensorBoard logging. Checkpoint export to Drive. | Piper VITS, PyTorch Lightning, piper-phonemize-fix |
| 3 | Check the CKPT | Interactive inference on the raw PyTorch checkpoint. Validates voice quality before ONNX export. | Piper VITS, espeak-ng, ipywidgets |
| 4 | Export to ONNX | Exports .ckpt → .onnx. Rewrites the broken upstream export scripts with all critical fixes applied. | torch.onnx (opset 15), onnxscript |
| 5 | Check ONNX | Validates the exported model. Full encoder → decoder pipeline wired for streaming models. | onnxruntime, piper-phonemize-fix |
| 6 | Piper → Sherpa-ONNX | Injects Sherpa-ONNX metadata into the .onnx file. Generates tokens.txt. Packages for Android deployment. | sherpa-onnx, onnx metadata API |
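
To make the tokens.txt step in notebook 6 more concrete, here is a minimal sketch of how such a file could be derived from a Piper config. It assumes the config JSON has a `phoneme_id_map` object mapping each symbol to a list of ids, and that each output line is the symbol, a space, and the first id. Both the function name and that layout are assumptions, not a copy of the notebook's code.

```python
import json
from pathlib import Path


def write_tokens(config_path, tokens_path):
    """Write one "symbol id" line per entry of the config's phoneme_id_map.

    Returns the number of tokens written.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # Assumed layout: {"phoneme_id_map": {"_": [0], "a": [14], ...}}
    id_map = config["phoneme_id_map"]
    lines = [f"{symbol} {ids[0]}\n" for symbol, ids in id_map.items()]
    Path(tokens_path).write_text("".join(lines), encoding="utf-8")
    return len(lines)
```

The entries are written in the order they appear in the config, so the output is reproducible for a given checkpoint.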

Bugs diagnosed & fixed

All of these were broken in the official Piper notebooks and diagnosed independently:

| # | Bug | Root cause | Fix |
|---|-----|------------|-----|
| 1 | torch.onnx.export crash | Newer PyTorch 2.x releases default to dynamo=True, which breaks dynamic_axes and None sid inputs | Explicitly pass dynamo=False to force the legacy TorchScript exporter |
| 2 | CPU/CUDA device mismatch | The model loads to CUDA by default while the dummy inputs are created on CPU, so tracing fails | model_g = model_g.cpu() before export |
| 3 | Opset version wrong for Sherpa | The official notebook used opset_version=15 everywhere; Sherpa-ONNX requires opset 11 | opset_version=11 on the Sherpa export path |
| 4 | PyTorch 2.6 checkpoint loading | The torch.load default changed to weights_only=True, breaking Lightning's complex checkpoint objects | weights_only=False, strict=False in VitsModel.load_from_checkpoint |
| 5 | Colab dependency conflict | piper-phonemize has broken Colab deps | Replaced with piper-phonemize-fix throughout |
| 6 | Streaming inference not wired | The inference script detected encoder/decoder pairs but threw "not yet supported" | Full encoder → decoder inference pipeline implemented |
| 7 | Config file not detected | ONNX inference only matched config.json, missing Piper's *.onnx.json naming | Added *.onnx.json glob to detect_onnx_models |
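
Fix 7 comes down to searching for more than one config file name. The sketch below illustrates the idea with a hypothetical helper, `find_model_config`; it is not the notebook's `detect_onnx_models`, and the lookup order (sibling `<model>.onnx.json`, then `config.json`, then any other `*.onnx.json` in the folder) is an assumption.

```python
from pathlib import Path


def find_model_config(onnx_path):
    """Return the config file that belongs to an ONNX model, or None."""
    onnx_path = Path(onnx_path)
    candidates = [
        # Piper naming: voice.onnx -> voice.onnx.json
        onnx_path.with_name(onnx_path.name + ".json"),
        # Training output naming: config.json next to the model
        onnx_path.with_name("config.json"),
    ]
    # Last resort: any other *.onnx.json in the same folder.
    candidates += sorted(onnx_path.parent.glob("*.onnx.json"))
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Checking the sibling file first avoids picking up the config of a different voice when several models share one folder.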

Google Drive setup

My Drive/
└── Voicecloning/
    ├── raw_calls/                 ← put your MP3 / WAV recordings here
    ├── clipped_audio/             ← Stage 0 output: per-speaker clips
    └── training/
        ├── colab/piper/           ← checkpoints and config.json
        └── piper-voice-packages/  ← exported .tar.gz voice packages
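
If you would rather create this layout from a notebook cell than by hand, a small sketch like the following should do it. The helper name is hypothetical, and the Drive root you pass in depends on where Drive is mounted; `/content/drive/MyDrive` is the usual Colab location, but treat it as an assumption.

```python
from pathlib import Path

# Folder names taken from the tree above, relative to Voicecloning/.
DRIVE_SUBDIRS = (
    "raw_calls",
    "clipped_audio",
    "training/colab/piper",
    "training/piper-voice-packages",
)


def ensure_drive_layout(drive_root):
    """Create the Voicecloning/ folder tree under drive_root if it is missing.

    Returns the list of folder paths, whether or not they already existed.
    """
    base = Path(drive_root) / "Voicecloning"
    folders = []
    for sub in DRIVE_SUBDIRS:
        folder = base / sub
        folder.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        folders.append(folder)
    return folders
```

In Colab this would typically be called as `ensure_drive_layout("/content/drive/MyDrive")` once Drive is mounted.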

Prerequisites

HuggingFace token — required for pyannote/speaker-diarization-3.1 (Stage 0). Accept the model licence on HuggingFace, then paste your token into the notebook's Colab form field. No .env file needed.

Colab GPU runtime — Stages 1 and 2 need a T4 or better. The free tier works; training runs 6–12 hours. Stages 3–6 can run on CPU.

Google Drive (~2 GB free) — checkpoints, datasets, and exports save to Drive so they survive session restarts.


Android Deployment & Testing

The final sherpa_model.tar.gz can be used with any Sherpa-ONNX compatible Android application.
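
Before copying the archive to a phone, it can help to confirm that it actually contains a model and a token table. The check below is only a sketch: the helper name is hypothetical, and the assumption that both an .onnx file and tokens.txt sit inside the archive comes from the stage 06 description, not from inspecting a real package.

```python
import tarfile


def check_voice_package(archive_path):
    """List the files in a .tar.gz package and flag the expected pieces."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
    return {
        # Any exported model file
        "model": any(name.endswith(".onnx") for name in names),
        # tokens.txt at any depth inside the archive
        "tokens": any(name.rsplit("/", 1)[-1] == "tokens.txt" for name in names),
        "files": names,
    }
```

A result with `model` or `tokens` set to False suggests that stage 06 should be re-run before the package is used.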

Recommended App for Testing:

Verified Devices:

  • Redmi 9A
  • Redmi K20 Pro
  • Poco F1

Tech stack

| Layer | Tools |
|-------|-------|
| Diarization | pyannote.audio 3.1, ffmpeg |
| Voice synthesis | OmniVoice (k2-fsa), faster-whisper large-v3 |
| TTS architecture | VITS via Piper (rmcpantoja fork) |
| Training | PyTorch Lightning, piper-phonemize-fix, espeak-ng |
| Export | torch.onnx TorchScript path, opset 11/15, onnxscript |
| On-device inference | Sherpa-ONNX, onnxruntime |
| Platform | Google Colab (free T4 GPU) |

Credits


Amit Basuri · github.com/PositiveMatician · New Delhi

Built because voices shouldn't disappear. All upstream bug fixes (PyTorch 2.6 compat, ONNX export, Sherpa-ONNX conversion) diagnosed and implemented independently.
