GigaAM: the family of open-source acoustic models for speech processing

Latest News

2025/11 — GigaAM-v3: 30% WER reduction on new data domains; GigaAM-v3-e2e: end-to-end transcription support (70:30 win in Side-by-Side vs Whisper-large-v3)
2025/06 — Our research paper on GigaAM was accepted to InterSpeech 2025!
2024/12 — MIT License, GigaAM-v2 (-15% and -12% WER Reduction for CTC and RNN-T models, respectively), ONNX export support
2024/05 — GigaAM-RNNT (-19% WER Reduction), long-form inference using external Voice Activity Detection
2024/04 — GigaAM Release: GigaAM-CTC (SoTA Speech Recognition model for the Russian language), GigaAM-Emo

Setup

Requirements

Python ≥ 3.10
ffmpeg installed and added to your system's PATH

Install the GigaAM Package

# Clone the repository
git clone https://github.com/salute-developers/GigaAM.git
cd GigaAM

# Install the package requirements
pip install -e .[torch]

# (optionally) Verify the installation:
pip install -e ".[tests]"
pytest -v tests/test_loading.py -m partial  # or `-m full` to test all models

GigaAM overview

GigaAM is a Conformer-based foundational model (220-240M parameters) pre-trained on diverse Russian speech data. It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition. More information about GigaAM-v1 can be found in our post on Habr. We fine-tuned the GigaAM encoder for ASR using CTC and RNNT decoders. GigaAM family includes three lines of models

	Pretrain Method	Pretrain (hours)	ASR (hours)	Available Versions
v1	Wav2vec 2.0	50,000	2,000	`v1_ssl`, `emo`, `v1_ctc`, `v1_rnnt`
v2	HuBERT–CTC	50,000	2,000	`v2_ssl`, `v2_ctc`, `v2_rnnt`
v3	HuBERT–CTC	700,000	4,000	`v3_ssl`, `v3_ctc`, `v3_rnnt`, `v3_e2e_ctc`, `v3_e2e_rnnt`

Where v3_e2e_ctc and v3_e2e_rnnt support punctuation and text normalization.

Model Performance

GigaAM-v3 training incorporates new internal datasets: callcenter, music, speech with atypical characteristics, and voice messages. As a result, the models perform on average 30% better on these new domains while maintaining the same quality as GigaAM-v2 on public benchmarks. In end-to-end ASR comparisons of e2e_ctc and e2e_rnnt against Whisper (judged via independent LLM-as-a-Judge side-by-side) GigaAM models win by an average margin of 70:30. Our emotion recognition model GigaAM-Emo outperforms existing models by 15% Macro F1-Score.

For detailed results, see here.

Usage

Model inference

Note: ASR with .transcribe function is applicable for audio only up to 25 seconds. To enable .transcribe_longform install the additional pyannote.audio dependencies

Longform setup instruction

Generate Hugging Face API token
Accept the conditions to access pyannote/segmentation-3.0 files and content

pip install -e ".[longform]"
# optionally run longform testing
pip install -e ".[tests]"
HF_TOKEN=<your hf token> pytest -v tests/test_longform.py

import gigaam

# Load test audio
audio_path = gigaam.utils.download_short_audio()
long_audio_path = gigaam.utils.download_long_audio()

# Audio embeddings
model_name = "v3_ssl"       # Options: `v1_ssl`, `v2_ssl`, `v3_ssl`
model = gigaam.load_model(model_name)
embedding, _ = model.embed_audio(audio_path)
print(embedding)

# ASR
model_name = "v3_e2e_rnnt"  # Options: any model version with suffix `_ctc` or `_rnnt`
model = gigaam.load_model(model_name)
transcription = model.transcribe(audio_path)
print(transcription)

# ASR with word-level timestamps
result = model.transcribe(audio_path, word_timestamps=True)
for word in result.words:
    print(f"  [{word.start:.2f} - {word.end:.2f}] {word.text}")

# and long-form ASR
import os
os.environ["HF_TOKEN"] = <HF_TOKEN with read access to "pyannote/segmentation-3.0">
result = model.transcribe_longform(long_audio_path)
for segment in result:
   print(f"[{gigaam.format_time(segment.start)} - {gigaam.format_time(segment.end)}]: {segment.text}")

# Emotion recognition
model = gigaam.load_model("emo")
emotion2prob = model.get_probs(audio_path)
print(", ".join([f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()]))

Loading from Hugging Face

Note: Install requirements from the example.

from transformers import AutoModel

model = AutoModel.from_pretrained("ai-sage/GigaAM-v3", revision="e2e_rnnt", trust_remote_code=True)

ONNX Export and Inference

Note: GPU support can be enabled with pip install onnxruntime-gpu==1.23.* if applicable.

Export the model to ONNX using the model.to_onnx method:

onnx_dir = "onnx"
model_version = "v3_ctc"  # Options: any version

model = gigaam.load_model(model_version)
model.to_onnx(dir_path=onnx_dir)

Run ONNX inference:

from gigaam.onnx_utils import load_onnx, infer_onnx

sessions, model_cfg = load_onnx(onnx_dir, model_version)
result = infer_onnx(audio_path, model_cfg, sessions)
print(result)  # string for ctc / rnnt, np.ndarray for ssl / emo

These and more advanced (e.g. custom audio loading, batching) examples can be found in the Colab notebook.

Triton Inference Server and TensorRT

All speech recognition models can also be used in a server environment in ONNX/TRT format through Triton Inference Server. For setup instructions, model conversion, and deployment details, see the Triton Inference Server documentation.

Citation

If you use GigaAM in your research, please cite our paper:

@inproceedings{kutsakov25_interspeech,
  title     = {{GigaAM: Efficient Self-Supervised Learner for Speech Recognition}},
  author    = {Aleksandr Kutsakov and Alexandr Maximenko and Georgii Gospodinov and Pavel Bogomolov and Fyodor Minkin},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {1213--1217},
  doi       = {10.21437/Interspeech.2025-1616},
  issn      = {2958-1796},
}

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
.github/workflows		.github/workflows
assets		assets
gigaam		gigaam
tests		tests
triton_scripts		triton_scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_ru.md		README_ru.md
colab_example.ipynb		colab_example.ipynb
evaluation.md		evaluation.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GigaAM: the family of open-source acoustic models for speech processing

Latest News

Setup

Requirements

Install the GigaAM Package

GigaAM overview

Model Performance

Usage

Model inference

Loading from Hugging Face

ONNX Export and Inference

Triton Inference Server and TensorRT

Citation

Links

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GigaAM: the family of open-source acoustic models for speech processing

Latest News

Setup

Requirements

Install the GigaAM Package

GigaAM overview

Model Performance

Usage

Model inference

Loading from Hugging Face

ONNX Export and Inference

Triton Inference Server and TensorRT

Citation

Links

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages