65 changes: 43 additions & 22 deletions Documentation/ASR/NemotronMultilingual.md
@@ -10,7 +10,7 @@ FluidAudio supports NVIDIA's `nemotron-asr-streaming-multilingual-0.6b` for real
| Architecture | FastConformer Cache-Aware RNNT **with Prompt** |
| Parameters | 0.6B |
| Languages | ~40 (en, es, de, fr, it, pt, ar, ja, ko, zh-CN, ru, hi, vi, …) |
| Default Latency Modes | 320 ms · 560 ms · 1120 ms (each is a separate CoreML build) |
| Default Latency Modes | 560 ms · 1120 ms · 2240 ms (each is a separate CoreML build) |
| Mel Features | 128 bins, 16 kHz |
| Vocab Size | 13,087 + 1 blank |
| Hardware | Apple Silicon only (int8 encoder is ANE-targeted) |
@@ -95,33 +95,54 @@ Scoring follows the [HF Open ASR Leaderboard](https://github.com/huggingface/ope
- **Non-English Latin** (fr, de, es, it, pt, …) → `BasicTextNormalizer(remove_diacritics=False)` plus an inverse text normalization (ITN) pass: digit runs in the reference are spelled out via `NumberFormatter.spellOut` for the language's locale before WER computation. This is required because the model emits "mille neuf cent soixante-seize" while FLEURS keeps "1976" in the reference. Thousands separators are handled across all five Unicode space variants FLEURS actually uses (U+0020/00A0/2007/2009/202F). Implemented in our `TextNormalizer.basicNormalize(_, spellOutLocale:)`.
- **CJK** (ja, ko, zh, th) → character-level edit rate after whitespace stripping (segmentation-free). Reported in the "WER" column by community convention.
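
The CJK bullet can be illustrated with a minimal sketch of a character-level edit rate computed after whitespace stripping. The function names are hypothetical, and the sketch omits whatever additional normalization the real `TextNormalizer` pipeline applies:

```swift
/// Levenshtein distance between two character sequences (two-row dynamic programming).
func editDistance(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)
    var curr = [Int](repeating: 0, count: b.count + 1)
    for i in 1...a.count {
        curr[0] = i
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            // deletion, insertion, substitution (or match)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        }
        swap(&prev, &curr)
    }
    return prev[b.count]
}

/// Character error rate with all whitespace removed first, so the score
/// does not depend on word segmentation. Hypothetical helper, not the project's API.
func characterErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = Array(reference.filter { !$0.isWhitespace })
    let hyp = Array(hypothesis.filter { !$0.isWhitespace })
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}

// Whitespace differences alone do not count as errors.
print(characterErrorRate(reference: "今日は 晴れ", hypothesis: "今日は晴れ"))
```

Because the rate is computed per character, a single substituted character in a four-character reference scores 25%, regardless of how the text was spaced.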

### Chunk size sweep (FLEURS test split, full data)
### Chunk size sweep (FLEURS full test split)

Re-measured 2026-06-28 with the **native-Swift mel front-end** (`NemotronMelExtractor`;
no CoreML preprocessor — see issue #739) over the full `google/fleurs` test splits. All
builds use `att_context_size=[56,0]`; they differ only in `chunk_mel_frames` → processing
chunk size. The shipped tiers are now **560 / 1120 / 2240 ms** (the earlier 320 ms tier was
dropped, 2240 ms added). The per-language vocab-pruned ship and the full multilingual ship
score identically (en_us @ 2240 ms = 8.72 % on both), so the table uses the full ship.

| Language | 560 ms | 1120 ms | 2240 ms | NVIDIA ([56,0]) | n |
|----------|-------:|--------:|--------:|----------------:|----:|
| en_us | 9.05 | 8.73 | 8.72 | 11.35 | 647 |
| fr_fr | 9.80 | 9.44 | 9.36 | 13.44 | 676 |
| de_de | 10.61 | 10.01 | 9.96 | — | 862 |
| es_419 | 4.85 | 4.75 | 4.73 | 8.69 | 908 |
| ja_jp | 14.27 | 13.79 | 13.78 | — | 650 |
| it_it | 5.40 | 5.43 | 5.39 | 7.33 | 865 |
| pt_br | 6.38 | 6.16 | 6.19 | 8.99 | 919 |
| **AVG** |**8.62**|**8.33** |**8.30** | | |
| agg RTFx | 40.5x | 66.0x | 73.1x | | |

WER% for spaced scripts, CER% for ja_jp (segmentation-free, whitespace-stripped). Same
normalizer pipeline as described above (HF Open-ASR-Leaderboard convention). Aggregate RTFx
is total audio ÷ total processing time across all 7 languages, end-to-end single-stream on Apple
Silicon (machine/load-dependent; treat the relative ordering, not the absolute values, as meaningful).

**Average accuracy improves monotonically with chunk size** (8.62 → 8.33 → 8.30). Per
language the trend holds except for sub-0.05 pp wobble on it_it (5.40 → 5.43 at 1120 ms) and
pt_br (6.16 → 6.19 at 2240 ms). Every tier beats NVIDIA's published `[56,0]` numbers on all
five published languages (at 2240 ms: en −2.6, fr −4.1, es −4.0, it −1.9, pt −2.8 pp).
These numbers are ~2–4 pp better than the prior version of this table; the gain comes from
model / decode-path / normalizer updates since it was written, **not** from the Swift mel
port, which is at numerical parity with the removed CoreML preprocessor
(max |Δ| ≈ 9e-3 vs NeMo PyTorch, confirmed at conversion time). Cross-comparison to NVIDIA is
sensitive to normalization and should be read as indicative.
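
As a quick arithmetic check on the claims above, the following sketch recomputes the AVG row and the per-language deltas from the 2240 ms column. The values are copied from the table; the `Row` type and helper name are illustrative only:

```swift
// One table row: our 2240 ms score and NVIDIA's published [56,0] score
// (nil where NVIDIA did not publish a number).
struct Row {
    let lang: String
    let ours: Double
    let nvidia: Double?
}

let rows = [
    Row(lang: "en_us", ours: 8.72, nvidia: 11.35),
    Row(lang: "fr_fr", ours: 9.36, nvidia: 13.44),
    Row(lang: "de_de", ours: 9.96, nvidia: nil),
    Row(lang: "es_419", ours: 4.73, nvidia: 8.69),
    Row(lang: "ja_jp", ours: 13.78, nvidia: nil),
    Row(lang: "it_it", ours: 5.39, nvidia: 7.33),
    Row(lang: "pt_br", ours: 6.19, nvidia: 8.99),
]

// Unweighted (macro) average over the seven languages.
let macroAverage = rows.map { $0.ours }.reduce(0, +) / Double(rows.count)

/// Difference in percentage points (ours minus reference) for rows that have a reference.
func deltasVersusReference(_ rows: [Row]) -> [String: Double] {
    var result: [String: Double] = [:]
    for row in rows {
        if let reference = row.nvidia {
            result[row.lang] = row.ours - reference
        }
    }
    return result
}

let deltas = deltasVersusReference(rows)
print("macro average:", (macroAverage * 100).rounded() / 100)
for lang in deltas.keys.sorted() {
    print(lang, ((deltas[lang] ?? 0) * 10).rounded() / 10)
}
```

The AVG row is an unweighted mean of per-language scores, so it is not the same quantity as a corpus-level aggregate over all utterances.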

Reproduce (one run per tier):

All three builds use `att_context_size=[56,0]` (NVIDIA's lowest-latency mode); they differ only in `chunk_mel_frames` (32 / 56 / 112 → 320 / 560 / 1120 ms processing chunks). NVIDIA's published FLEURS numbers are also at `[56,0]`, so the comparison is architecturally apples-to-apples.

| Language | 320 ms | 560 ms | 1120 ms | NVIDIA ([56,0]) | Δ (1120 vs NVIDIA) | n |
|----------|-------:|-------:|--------:|----------------:|-------------------:|----:|
| en_us | 17.5 | 12.1 | 12.0 | 11.35 | +0.65 | 647 |
| fr_fr | 16.4 | 13.9 | 13.8 | 13.44 | +0.36 | 676 |
| de_de | 17.8 | 14.9 | 13.6 | — | — | 862 |
| es_419 | 8.6 | 7.4 | 7.4 | 8.69 | −1.29 | 908 |
| ja_jp | 21.9 | 18.4 | 17.4 | — | — | 650 |
| it_it | 9.8 | 7.9 | 7.4 | 7.33 | +0.07 | 865 |
| pt_br | 13.4 | 10.0 | 8.4 | 8.99 | −0.59 | 919 |
| **AVG** |**15.0**|**12.1**|**11.4** | | | |
| RTFx | 8.6 | 16.8 | 22.0 | | | |

WER% for spaced scripts, CER% for ja_jp (segmentation-free). Full `google/fleurs` test splits (en=647, fr=676, de=862, es=908, ja=650, it=865, pt=919). The "Δ (1120 vs NVIDIA)" column compares our highest-accuracy build against NVIDIA's published number for the same `[56,0]` attention mode.

**All 5 published languages are within ~0.7 pp of NVIDIA at 1120 ms.** es-419 and pt-br actually beat the reference (−1.29 and −0.59 pp respectively); en, fr, it are +0.65 / +0.36 / +0.07. At 560 ms (the recommended low-latency build) all 5 are within ~1 pp; es-419 still beats NVIDIA by −1.29 pp.

**320 ms shows boundary effects on English and accent-heavy languages.** en_us jumps from 12.0 → 17.5 (+5.5 pp) and pt_br from 8.4 → 13.4 (+5.0 pp) when dropping from 1120 ms to 320 ms. 560 ms recovers most of the loss (<1.6 pp from 1120 ms on every language). If you need low latency, ship 560 ms; only use 320 ms if you absolutely need sub-half-second response and can tolerate the English regression.
```bash
swift run -c release fluidaudiocli nemotron-multilingual-benchmark \
--model-dir <multilingual ship dir> \
--languages en_us,fr_fr,de_de,es_419,ja_jp,it_it,pt_br \
--samples all --chunk-ms <560|1120|2240> --output results.json
```

### Caveats

- **`MLComputeUnits` matters a lot.** Default `.all` routes the int8 encoder to GPU and runs ~10× slower than ANE. The manager pins `.cpuAndNeuralEngine` automatically; do not override unless you have a reason.
- **int8 vs fp16 is a wash.** Average WER is identical at all three chunk sizes; per-language drift is within ±1 pp. Ship int8 for the 50% size win and ANE residency.
- **Two independent latency axes.** NVIDIA's published modes (`att_context_size = [56,0] / [56,3] / [56,6] / [56,13]` → ~80 / 320 / 560 / 1120 ms architectural lookahead) control right-context inside the encoder. Our `320 / 560 / 1120 ms` build labels refer to `chunk_mel_frames` (processing chunk size), not lookahead. All FluidAudio builds currently ship `[56,0]` (no lookahead).
- **Two independent latency axes.** NVIDIA's published modes (`att_context_size = [56,0] / [56,3] / [56,6] / [56,13]` → ~80 / 320 / 560 / 1120 ms architectural lookahead) control right-context inside the encoder. Our `560 / 1120 / 2240 ms` build labels refer to `chunk_mel_frames` (processing chunk size), not lookahead. All FluidAudio builds currently ship `[56,0]` (no lookahead).
- **CJK languages** use character-level edit rate as the "WER" field by convention; whitespace tokenization is meaningless for ja/ko/zh/th.
- **Punctuation density drops at small chunk sizes** ([#687](https://github.com/FluidInference/FluidAudio/issues/687)). On long continuous speech the 560 ms build starts punctuating normally, then commas/periods become increasingly sparse as the session continues; 1120 ms and 2240 ms retain noticeably more punctuation on the same audio, and a session reset restores it. The words themselves are unaffected (WER-neutral) — only punctuation marks thin out. Cause is model-side: shorter chunks give the encoder less right context at sentence boundaries than the published builds' `att_context_size` assumes, and greedy RNN-T decoding compounds the miss over the session. If punctuation matters for your use case, ship 1120 ms or larger, or segment long streams (e.g. reset on VAD silence).
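
To make the "processing chunk size" axis concrete, here is a minimal sketch of the arithmetic linking `chunk_mel_frames` to the build labels. It assumes the 10 ms mel stride (160 samples at 16 kHz) used by the mel front-end; the 224-frame value for the 2240 ms tier is inferred from that stride rather than stated in the text, and the function names are hypothetical:

```swift
let sampleRate = 16_000   // Hz
let hopLength = 160       // samples per mel frame (10 ms stride)

/// Processing chunk duration in milliseconds for a given number of mel frames.
func chunkMilliseconds(chunkMelFrames: Int) -> Int {
    chunkMelFrames * hopLength * 1000 / sampleRate
}

/// Number of audio samples consumed per processing chunk.
func chunkSamples(chunkMelFrames: Int) -> Int {
    chunkMelFrames * hopLength
}

for frames in [56, 112, 224] {
    print(frames, "frames ->", chunkMilliseconds(chunkMelFrames: frames), "ms,",
          chunkSamples(chunkMelFrames: frames), "samples")
}
```

This only describes how much audio is batched per encoder call. The lookahead axis (`att_context_size`) is separate and is `[56,0]` for every build listed here.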

32 changes: 32 additions & 0 deletions Documentation/Benchmarks.md
@@ -117,6 +117,38 @@ swift run -c release fluidaudiocli unified-benchmark --mode streaming --max-file
swift run -c release fluidaudiocli unified-benchmark --mode batch --precision fp16
```

## Nemotron Speech Streaming 0.6B (English)

Cache-aware FastConformer-RNNT streaming, English. Mel features are computed **natively in
Swift** (`NemotronMelExtractor` → `AudioMelSpectrogram`, NeMo `normalize: NA` raw log-mel) —
there is no CoreML preprocessor stage. It was removed in the issue #739 fix: the preprocessor's
flexible `RangeDim` audio input was the source of the `ios17.slice_by_index: zero shape error`
("Skipped adding default_function to entry point: main") ANE warning behind the iPadOS
cold-start empty-transcript failure. Encoder int8 on ANE (`.cpuAndNeuralEngine`).

Model: [FluidInference/nemotron-speech-streaming-en-0.6b-coreml](https://huggingface.co/FluidInference/nemotron-speech-streaming-en-0.6b-coreml)

### LibriSpeech test-clean (2620 files, 53,120 words, ~5.4h audio)

| Chunk tier | Aggregate WER | RTFx | Errors / words |
|------------|---------------|------|----------------|
| 560 ms (lowest latency) | 2.71% | 40.7x | 1442 / 53120 |
| 1120 ms (trained chunk) | **2.58%** | 24.3x | 1369 / 53120 |
| 2240 ms (default) | 2.64% | 87.4x | 1403 / 53120 |

- **WER** is aggregate (total errors ÷ total words across all 2620 files).
- **RTFx** is end-to-end single-stream (Swift mel + int8 ANE encode + greedy RNN-T), release
build, Apple Silicon; absolute RTFx is machine/load-dependent, relative ordering is stable.
- Accuracy is essentially flat across tiers (2.58–2.71%). 1120 ms has the best WER but lowest
throughput; 2240 ms (default) is the throughput sweet spot, within ~0.06 pp of the best WER.
- Parity: `NemotronMelExtractor` matches NeMo PyTorch raw log-mel to max |Δ| ≈ 9e-3 — the WER
here confirms end-to-end correctness (a wrong mel front-end would collapse WER).
- Multilingual FLEURS results: see [NemotronMultilingual.md](ASR/NemotronMultilingual.md).
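
The aggregate WER definition above (total errors ÷ total words) can be checked against the table with a short sketch. The `TierScore` type and function name are illustrative; the error and word counts are the ones reported in the table:

```swift
// Error and word totals for one chunk tier, summed over all files.
struct TierScore {
    let tier: String
    let errors: Int
    let words: Int
}

/// Aggregate WER in percent: total errors divided by total words.
func aggregateWER(errors: Int, words: Int) -> Double {
    guard words > 0 else { return 0 }
    return 100 * Double(errors) / Double(words)
}

let tiers = [
    TierScore(tier: "560 ms", errors: 1442, words: 53120),
    TierScore(tier: "1120 ms", errors: 1369, words: 53120),
    TierScore(tier: "2240 ms", errors: 1403, words: 53120),
]

for t in tiers {
    let wer = aggregateWER(errors: t.errors, words: t.words)
    print(t.tier, (wer * 100).rounded() / 100)
}
```

Aggregating this way weights long utterances more heavily than averaging per-file WER would, which is why the two figures generally differ.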

```bash
swift run -c release fluidaudiocli nemotron-benchmark --subset test-clean --chunk <560|1120|2240>
```

## Transcription with Keyword Boosting

CTC-based custom vocabulary boosting system, which enables accurate recognition of domain-specific terms (company names, technical jargon, proper nouns) without retraining the ASR model.
3 changes: 3 additions & 0 deletions Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift
@@ -227,6 +227,7 @@ public enum ASRError: Error, LocalizedError {
case unsupportedPlatform(String)
case streamingConversionFailed(Error)
case fileAccessFailed(URL, Error)
case encoderInstantiationFailed(String)

public var errorDescription: String? {
switch self {
@@ -246,6 +247,8 @@ public enum ASRError: Error, LocalizedError {
return "Streaming audio conversion failed: \(error.localizedDescription)"
case .fileAccessFailed(let url, let error):
return "Failed to access audio file at \(url.path): \(error.localizedDescription)"
case .encoderInstantiationFailed(let message):
return "Encoder ANE program failed to instantiate: \(message)"
}
}
}
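
For orientation, here is a minimal sketch of how a load path might turn an all-zero encoder output into the new `encoderInstantiationFailed` case. Everything in it is a stand-in: `SketchASRError` mirrors only the added case, and `validateEncoderOutput` is a hypothetical helper operating on a plain `[Float]` rather than an `MLMultiArray`. The actual wiring in `loadModels` is not shown in this diff excerpt.

```swift
// Local stand-in for the single case added to ASRError in this diff.
enum SketchASRError: Error, Equatable {
    case encoderInstantiationFailed(String)
}

/// True if at least one element differs from zero.
/// Note: NaN compares unequal to zero, so a NaN element also counts as non-zero here.
func hasNonZero(_ values: [Float]) -> Bool {
    values.contains { $0 != 0 }
}

/// Hypothetical load-time check: throw instead of letting an all-zero
/// encoder output silently decode to an empty transcript.
func validateEncoderOutput(_ encoded: [Float]) throws {
    guard hasNonZero(encoded) else {
        throw SketchASRError.encoderInstantiationFailed(
            "encoder returned an all-zero `encoded` buffer for a non-zero probe")
    }
}

do {
    try validateEncoderOutput([0, 0, 0])
} catch {
    print("load failed:", error)
}
```

Failing at load time gives the caller a chance to retry or fall back, which an empty transcript returned without an error does not.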
@@ -0,0 +1,68 @@
@preconcurrency import CoreML
import Foundation

/// Native-Swift log-mel features for Nemotron streaming, a drop-in replacement
/// for the CoreML `preprocessor` model.
///
/// Reproduces NeMo's `AudioToMelSpectrogramPreprocessor` as configured in
/// `nvidia/nemotron-speech-streaming-en-0.6b`: `n_fft=512`,
/// `window_size=0.025` (400), `window_stride=0.01` (160), `features=128`,
/// `window=hann` (symmetric), `preemph=0.97`, `log` (`2^-24` additive guard),
/// and crucially **`normalize: NA`** — i.e. *no* per-feature standardization.
/// That is exactly `AudioMelSpectrogram`'s default front-end with the
/// normalization step omitted, unlike `UnifiedMelExtractor` whose model uses
/// `normalize: per_feature`.
///
/// Removing the CoreML preprocessor avoids its flexible-shape `RangeDim` audio
/// input, whose ANE `default_function` was built against a 1-sample lower bound
/// and raised `ios17.slice_by_index: zero shape error` on iPadOS cold starts
/// (issue #739).
struct NemotronMelExtractor {
private let mel: AudioMelSpectrogram
private let nMels: Int
private let hopLength = 160

init(nMels: Int = 128) {
self.nMels = nMels
self.mel = AudioMelSpectrogram(
sampleRate: 16000,
nMels: nMels,
nFFT: 512,
hopLength: 160,
winLength: 400,
preemph: 0.97,
padTo: 0,
windowPeriodic: false
)
}

/// Raw (unnormalized) log-mel for one chunk of audio, shaped
/// `[1, nMels, T]` with `T = floor((count + n_fft - win) / hop) + 1` (NeMo
/// center padding) — the same `mel` tensor the CoreML preprocessor produced,
/// frame for frame. `mel_length` was ignored by the pipeline (it sets the
/// encoder's `mel_length` to `config.totalMelFrames`), so it is not returned.
func melSpectrogram(samples: [Float]) throws -> MLMultiArray {
let result = mel.computeFlatTransposed(
audio: samples,
lastAudioSample: 0,
paddingMode: .center,
expectedFrameCount: nil
)
let flat = result.mel
let totalFrames = result.numFrames

let melArray = try MLMultiArray(
shape: [1, NSNumber(value: nMels), NSNumber(value: totalFrames)], dataType: .float32)
melArray.withUnsafeMutableBufferPointer(ofType: Float.self) { ptr, _ in
// Contiguous [1, nMels, T]: element [0, m, t] is at offset m*T + t,
// sourced from the time-major flat buffer at t*nMels + m.
for t in 0..<totalFrames {
let base = t * nMels
for m in 0..<nMels {
ptr[m * totalFrames + t] = flat[base + m]
}
}
}
return melArray
}
}
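
The index mapping in `melSpectrogram` (time-major flat buffer to contiguous `[1, nMels, T]`) can be seen on a toy input. This is a standalone sketch using plain arrays instead of `MLMultiArray`; the function name is hypothetical:

```swift
/// Reorders a time-major buffer (index t * nMels + m) into mel-major layout
/// (index m * frames + t), matching a contiguous [1, nMels, T] tensor.
func transposeToMelMajor(flat: [Float], nMels: Int, frames: Int) -> [Float] {
    precondition(flat.count == nMels * frames, "buffer size must equal nMels * frames")
    var out = [Float](repeating: 0, count: flat.count)
    for t in 0..<frames {
        let base = t * nMels
        for m in 0..<nMels {
            out[m * frames + t] = flat[base + m]
        }
    }
    return out
}

// Two mel bins, three frames: each frame stores (bin0, bin1) back to back.
let timeMajor: [Float] = [1, 10, 2, 20, 3, 30]
let melMajor = transposeToMelMajor(flat: timeMajor, nMels: 2, frames: 3)
print(melMajor)  // [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
```

After the reorder, all frames of one mel bin are contiguous, which is the layout the encoder's `mel` input expects according to the comment in the extractor.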
@@ -8,7 +8,7 @@ extension StreamingNemotronAsrManager {

/// Process a single audio chunk through the full pipeline
internal func processChunk(_ samples: [Float]) async throws {
guard let preprocessor = preprocessor,
guard let melExtractor = melExtractor,
let encoder = encoder,
let decoder = decoder,
let joint = joint,
@@ -24,20 +24,9 @@ extension StreamingNemotronAsrManager {
// Track decoder state locally to ensure atomicity
var currentToken = lastToken

// 1. Preprocessor: audio -> mel spectrogram
let audioArray = try createAudioArray(samples)
let audioLen = try MLMultiArray(shape: [1], dataType: .int32)
audioLen[0] = NSNumber(value: samples.count)

let preprocInput = try MLDictionaryFeatureProvider(dictionary: [
"audio": MLFeatureValue(multiArray: audioArray),
"audio_length": MLFeatureValue(multiArray: audioLen),
])

let preprocOutput = try await preprocessor.prediction(from: preprocInput)
guard let chunkMel = preprocOutput.featureValue(for: "mel")?.multiArrayValue else {
throw ASRError.processingFailed("Preprocessor failed to produce mel output")
}
// 1. Native-Swift log-mel front-end (replaces the CoreML preprocessor):
// audio -> raw (unnormalized) log-mel [1, melFeatures, T].
let chunkMel = try melExtractor.melSpectrogram(samples: samples)

// 2. Build encoder input: prepend mel_cache (9 frames) + current chunk mel
let inputMel = try prependMelCache(to: chunkMel)
@@ -198,15 +187,70 @@ extension StreamingNemotronAsrManager {
processedChunks += 1
}

// MARK: - Tensor Utilities
// MARK: - Encoder Health Probe

/// Run one encoder prediction with a non-zero mel probe and report whether
/// the encoder produced any non-zero output.
///
/// On iPadOS cold starts the int8 encoder's ANE `main` entry point can fail
/// to instantiate (logged by CoreML as
/// `ANEProgramProcessRequestDirect() Failed with status=0x12`). When that
/// happens `prediction` does not throw — it silently returns an all-zero
/// `encoded` buffer, so the RNN-T loop only ever sees blanks and the final
/// transcript is empty with no error surfaced (issue #739). A single
/// non-zero probe distinguishes a working encoder (LayerNorm/bias guarantee
/// non-zero output for non-zero input) from a stillborn ANE program, letting
/// `loadModels` fail loudly instead of returning empty transcripts.
///
/// Uses throwaway local inputs and does not write the encoder's updated
/// caches back, so the freshly reset session state is left untouched. The
/// probe doubles as a model warm-up.
internal func encoderProducesNonZeroOutput() async throws -> Bool {
guard let encoder = encoder,
let cacheChannel = cacheChannel,
let cacheTime = cacheTime,
let cacheLen = cacheLen
else {
throw ASRError.notInitialized
}

// Non-zero mel input ([1, melFeatures, totalMelFrames]) so a healthy
// encoder is guaranteed to emit non-zero output. A small ramp avoids a
// degenerate constant that could in theory cancel out.
let mel = try MLMultiArray(
shape: [1, NSNumber(value: config.melFeatures), NSNumber(value: config.totalMelFrames)],
dataType: .float32
)
let melPtr = mel.dataPointer.bindMemory(to: Float.self, capacity: mel.count)
for i in 0..<mel.count {
melPtr[i] = Float(i % 17) * 0.01 + 0.1
}

internal func createAudioArray(_ samples: [Float]) throws -> MLMultiArray {
let array = try MLMultiArray(shape: [1, NSNumber(value: samples.count)], dataType: .float32)
let ptr = array.dataPointer.bindMemory(to: Float.self, capacity: samples.count)
ptr.update(from: samples, count: samples.count)
return array
let melLen = try MLMultiArray(shape: [1], dataType: .int32)
melLen[0] = NSNumber(value: config.totalMelFrames)

let encoderInput = try MLDictionaryFeatureProvider(dictionary: [
"mel": MLFeatureValue(multiArray: mel),
"mel_length": MLFeatureValue(multiArray: melLen),
"cache_channel": MLFeatureValue(multiArray: cacheChannel),
"cache_time": MLFeatureValue(multiArray: cacheTime),
"cache_len": MLFeatureValue(multiArray: cacheLen),
])

let encoderOutput = try await encoder.prediction(from: encoderInput)
guard let encoded = encoderOutput.featureValue(for: "encoded")?.multiArrayValue else {
throw ASRError.processingFailed("Encoder probe produced no `encoded` output")
}

let outPtr = encoded.dataPointer.bindMemory(to: Float.self, capacity: encoded.count)
for i in 0..<encoded.count where outPtr[i] != 0 {
return true
}
return false
}

// MARK: - Tensor Utilities

internal func prependMelCache(to chunkMel: MLMultiArray) throws -> MLMultiArray {
// Prepend cached mel frames (9) to current chunk mel (112) → [1, 128, 121]
// Input: chunkMel [1, 128, ~112]