FluidInference · Alex-Wengg · Jun 28, 2026 · Jun 28, 2026
diff --git a/Documentation/ASR/NemotronMultilingual.md b/Documentation/ASR/NemotronMultilingual.md
@@ -10,7 +10,7 @@ FluidAudio supports NVIDIA's `nemotron-asr-streaming-multilingual-0.6b` for real
 | Architecture | FastConformer Cache-Aware RNNT **with Prompt** |
 | Parameters | 0.6B |
 | Languages | ~40 (en, es, de, fr, it, pt, ar, ja, ko, zh-CN, ru, hi, vi, …) |
-| Default Latency Modes | 320 ms · 560 ms · 1120 ms (each is a separate CoreML build) |
+| Default Latency Modes | 560 ms · 1120 ms · 2240 ms (each is a separate CoreML build) |
 | Mel Features | 128 bins, 16 kHz |
 | Vocab Size | 13,087 + 1 blank |
 | Hardware | Apple Silicon only (int8 encoder is ANE-targeted) |
@@ -95,33 +95,54 @@ Scoring follows the [HF Open ASR Leaderboard](https://github.com/huggingface/ope
 - **Non-English Latin** (fr, de, es, it, pt, …) → `BasicTextNormalizer(remove_diacritics=False)` plus an inverse text normalization (ITN) pass: digit runs in the reference are spelled out via `NumberFormatter.spellOut` for the language's locale before WER computation. Required because the model emits "mille neuf cent soixante-seize" while FLEURS keeps "1976" in the reference. Thousands separators handled across all five Unicode space variants FLEURS actually uses (U+0020/00A0/2007/2009/202F). Our `TextNormalizer.basicNormalize(_, spellOutLocale:)`.
 - **CJK** (ja, ko, zh, th) → character-level edit rate after whitespace stripping (segmentation-free). Reported in the "WER" column by community convention.
 
-### Chunk size sweep (FLEURS test split, full data)
+### Chunk size sweep (FLEURS full test split)
+
+Re-measured 2026-06-28 with the **native-Swift mel front-end** (`NemotronMelExtractor`;
+no CoreML preprocessor — see issue #739) over the full `google/fleurs` test splits. All
+builds use `att_context_size=[56,0]`; they differ only in `chunk_mel_frames` → processing
+chunk size. The shipped tiers are now **560 / 1120 / 2240 ms** (the earlier 320 ms tier was
+dropped, 2240 ms added). The per-language vocab-pruned ship and the full multilingual ship
+score identically (en_us @ 2240 ms = 8.72 % on both), so the table uses the full ship.
+
+| Language | 560 ms | 1120 ms | 2240 ms | NVIDIA ([56,0]) | n   |
+|----------|-------:|--------:|--------:|----------------:|----:|
+| en_us    |  9.05  |   8.73  |   8.72  |         11.35   | 647 |
+| fr_fr    |  9.80  |   9.44  |   9.36  |         13.44   | 676 |
+| de_de    | 10.61  |  10.01  |   9.96  |           —     | 862 |
+| es_419   |  4.85  |   4.75  |   4.73  |          8.69   | 908 |
+| ja_jp    | 14.27  |  13.79  |  13.78  |           —     | 650 |
+| it_it    |  5.40  |   5.43  |   5.39  |          7.33   | 865 |
+| pt_br    |  6.38  |   6.16  |   6.19  |          8.99   | 919 |
+| **AVG**  |**8.62**|**8.33** |**8.30** |                 |     |
+| agg RTFx | 40.5x  | 66.0x   |  73.1x  |                 |     |
+
+WER% for spaced scripts, CER% for ja_jp (segmentation-free, whitespace-stripped). Same
+normalizer pipeline as described above (HF Open-ASR-Leaderboard convention). Aggregate RTFx
+is total audio ÷ total processing time across all 7 languages, end-to-end single-stream on Apple
+Silicon (machine/load-dependent; treat the relative ordering, not the absolute values, as meaningful).
+
+**Average accuracy improves monotonically with chunk size** (per language, it_it and pt_br
+each regress by 0.03 pp at one step) and beats NVIDIA's published `[56,0]` numbers on all
+five published languages (at 2240 ms: en −2.6, fr −4.1, es −4.0, it −1.9, pt −2.8 pp).
+These numbers are ~2–4 pp better than the prior version of this table; the gain comes from
+model / decode-path / normalizer updates since it was written, **not** from the Swift mel
+port, which is at numerical parity with the removed CoreML preprocessor (max |Δ| ≈ 9e-3 vs NeMo
+PyTorch, confirmed at conversion time). Cross-comparison to NVIDIA is sensitive to normalization and should be read as indicative.
+
+Reproduce (one run per tier):
 
-All three builds use `att_context_size=[56,0]` (NVIDIA's lowest-latency mode); they differ only in `chunk_mel_frames` (32 / 56 / 112 → 320 / 560 / 1120 ms processing chunks). NVIDIA's published FLEURS numbers are also at `[56,0]`, so the comparison is architecturally apples-to-apples.
-
-| Language | 320 ms | 560 ms | 1120 ms | NVIDIA ([56,0]) | Δ (1120 vs NVIDIA) | n   |
-|----------|-------:|-------:|--------:|----------------:|-------------------:|----:|
-| en_us    |  17.5  |  12.1  |   12.0  |         11.35   |             +0.65  | 647 |
-| fr_fr    |  16.4  |  13.9  |   13.8  |         13.44   |             +0.36  | 676 |
-| de_de    |  17.8  |  14.9  |   13.6  |           —     |               —    | 862 |
-| es_419   |   8.6  |   7.4  |    7.4  |          8.69   |             −1.29  | 908 |
-| ja_jp    |  21.9  |  18.4  |   17.4  |           —     |               —    | 650 |
-| it_it    |   9.8  |   7.9  |    7.4  |          7.33   |             +0.07  | 865 |
-| pt_br    |  13.4  |  10.0  |    8.4  |          8.99   |             −0.59  | 919 |
-| **AVG**  |**15.0**|**12.1**|**11.4** |                 |                    |     |
-| RTFx     |   8.6  |  16.8  |   22.0  |                 |                    |     |
-
-WER% for spaced scripts, CER% for ja_jp (segmentation-free). Full `google/fleurs` test splits (en=647, fr=676, de=862, es=908, ja=650, it=865, pt=919). The "Δ (1120 vs NVIDIA)" column compares our highest-accuracy build against NVIDIA's published number for the same `[56,0]` attention mode.
-
-**All 5 published languages are within ~0.7 pp of NVIDIA at 1120 ms.** es-419 and pt-br actually beat the reference (−1.29 and −0.59 pp respectively); en, fr, it are +0.65 / +0.36 / +0.07. At 560 ms (the recommended low-latency build) all 5 are within ~1 pp; es-419 still beats NVIDIA by −1.29 pp.
-
-**320 ms shows boundary effects on English and accent-heavy languages.** en_us jumps from 12.0 → 17.5 (+5.5 pp) and pt_br from 8.4 → 13.4 (+5.0 pp) when dropping from 1120 ms to 320 ms. 560 ms recovers most of the loss (<1.6 pp from 1120 ms on every language). If you need low latency, ship 560 ms; only use 320 ms if you absolutely need sub-half-second response and can tolerate the English regression.
+```bash
+swift run -c release fluidaudiocli nemotron-multilingual-benchmark \
+    --model-dir <multilingual ship dir> \
+    --languages en_us,fr_fr,de_de,es_419,ja_jp,it_it,pt_br \
+    --samples all --chunk-ms <560|1120|2240> --output results.json
+```
 
 ### Caveats
 
 - **`MLComputeUnits` matters a lot.** Default `.all` routes the int8 encoder to GPU and runs ~10× slower than ANE. The manager pins `.cpuAndNeuralEngine` automatically; do not override unless you have a reason.
 - **int8 vs fp16 is a wash.** Average WER is identical at all three chunk sizes; per-language drift is within ±1 pp. Ship int8 for the 50% size win and ANE residency.
-- **Two independent latency axes.** NVIDIA's published modes (`att_context_size = [56,0] / [56,3] / [56,6] / [56,13]` → ~80 / 320 / 560 / 1120 ms architectural lookahead) control right-context inside the encoder. Our `320 / 560 / 1120 ms` build labels refer to `chunk_mel_frames` (processing chunk size), not lookahead. All FluidAudio builds currently ship `[56,0]` (no lookahead).
+- **Two independent latency axes.** NVIDIA's published modes (`att_context_size = [56,0] / [56,3] / [56,6] / [56,13]` → ~80 / 320 / 560 / 1120 ms architectural lookahead) control right-context inside the encoder. Our `560 / 1120 / 2240 ms` build labels refer to `chunk_mel_frames` (processing chunk size), not lookahead. All FluidAudio builds currently ship `[56,0]` (no lookahead).
 - **CJK languages** use character-level edit rate as the "WER" field by convention; whitespace tokenization is meaningless for ja/ko/zh/th.
 - **Punctuation density drops at small chunk sizes** ([#687](https://github.com/FluidInference/FluidAudio/issues/687)). On long continuous speech the 560 ms build starts punctuating normally, then commas/periods become increasingly sparse as the session continues; 1120 ms and 2240 ms retain noticeably more punctuation on the same audio, and a session reset restores it. The words themselves are unaffected (WER-neutral) — only punctuation marks thin out. Cause is model-side: shorter chunks give the encoder less right context at sentence boundaries than the published builds' `att_context_size` assumes, and greedy RNN-T decoding compounds the miss over the session. If punctuation matters for your use case, ship 1120 ms or larger, or segment long streams (e.g. reset on VAD silence).
 

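The tier labels in the doc change above map to `chunk_mel_frames` through the mel hop size. The following is a minimal sketch of that arithmetic, assuming the 16 kHz sample rate and 160-sample hop quoted elsewhere in this PR. `ChunkTier` is a hypothetical helper and not part of FluidAudio, and the 224-frame figure for the 2240 ms tier is derived here rather than stated in the diff.

```swift
import Foundation

// Assumed front-end constants, taken from the mel settings quoted in this PR:
// 16 kHz audio and a 160-sample hop, i.e. one mel frame per 10 ms.
let sampleRate = 16_000
let hopLength = 160

/// Hypothetical helper (not part of FluidAudio): sizes implied by a chunk tier.
struct ChunkTier {
    let chunkMs: Int

    /// Mel frames per processing chunk (`chunk_mel_frames`).
    var melFrames: Int { chunkMs * sampleRate / (1000 * hopLength) }

    /// Audio samples consumed per processing chunk.
    var samples: Int { chunkMs * sampleRate / 1000 }
}

for ms in [560, 1120, 2240] {
    let tier = ChunkTier(chunkMs: ms)
    print("\(ms) ms -> \(tier.melFrames) mel frames, \(tier.samples) samples")
}
```

The 560 ms and 1120 ms results match the 56 and 112 frame counts given in the removed table notes.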
diff --git a/Documentation/Benchmarks.md b/Documentation/Benchmarks.md
@@ -117,6 +117,38 @@ swift run -c release fluidaudiocli unified-benchmark --mode streaming --max-file
 swift run -c release fluidaudiocli unified-benchmark --mode batch --precision fp16
 ```
 
+## Nemotron Speech Streaming 0.6B (English)
+
+Cache-aware FastConformer-RNNT streaming, English. Mel features are computed **natively in
+Swift** (`NemotronMelExtractor` → `AudioMelSpectrogram`, NeMo `normalize: NA` raw log-mel) —
+there is no CoreML preprocessor stage. It was removed in the issue #739 fix: the preprocessor's
+flexible `RangeDim` audio input was the source of the `ios17.slice_by_index: zero shape error`
+("Skipped adding default_function to entry point: main") ANE warning behind the iPadOS
+cold-start empty-transcript failure. Encoder int8 on ANE (`.cpuAndNeuralEngine`).
+
+Model: [FluidInference/nemotron-speech-streaming-en-0.6b-coreml](https://huggingface.co/FluidInference/nemotron-speech-streaming-en-0.6b-coreml)
+
+### LibriSpeech test-clean (2620 files, 53,120 words, ~5.4h audio)
+
+| Chunk tier | Aggregate WER | RTFx | Errors / words |
+|------------|---------------|------|----------------|
+| 560 ms (lowest latency) | 2.71%     | 40.7x | 1442 / 53120 |
+| 1120 ms (trained chunk) | **2.58%** | 24.3x | 1369 / 53120 |
+| 2240 ms (default)       | 2.64%     | 87.4x | 1403 / 53120 |
+
+- **WER** is aggregate (total errors ÷ total words across all 2620 files).
+- **RTFx** is end-to-end single-stream (Swift mel + int8 ANE encode + greedy RNN-T), release
+  build, Apple Silicon; absolute RTFx is machine/load-dependent, relative ordering is stable.
+- Accuracy is essentially flat across tiers (2.58–2.71%). 1120 ms has the best WER but lowest
+  throughput; 2240 ms (default) is the throughput sweet spot, within ~0.06 pp of the best WER.
+- Parity: `NemotronMelExtractor` matches NeMo PyTorch raw log-mel to max |Δ| ≈ 9e-3 — the WER
+  here confirms end-to-end correctness (a wrong mel front-end would collapse WER).
+- Multilingual FLEURS results: see [NemotronMultilingual.md](ASR/NemotronMultilingual.md).
+
+```bash
+swift run -c release fluidaudiocli nemotron-benchmark --subset test-clean --chunk <560|1120|2240>
+```
+
 ## Transcription with Keyword Boosting
 
 CTC-based custom vocabulary boosting system, which enables accurate recognition of domain-specific terms (company names, technical jargon, proper nouns) without retraining the ASR model.

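The LibriSpeech table above reports aggregate WER (total errors over total words) and RTFx. As a quick check of the definitions, here is a small sketch that recomputes the WER column from the error and word counts in the table. The function names are illustrative and are not the benchmark CLI's internals.

```swift
import Foundation

/// Aggregate WER in percent: total errors over total reference words.
func aggregateWER(errors: Int, words: Int) -> Double {
    precondition(words > 0, "reference must contain at least one word")
    return 100.0 * Double(errors) / Double(words)
}

/// RTFx: seconds of audio processed per second of wall-clock time.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

// Error and word counts copied from the LibriSpeech test-clean table.
let tiers: [(label: String, errors: Int)] = [
    ("560 ms", 1442), ("1120 ms", 1369), ("2240 ms", 1403),
]
for tier in tiers {
    let wer = aggregateWER(errors: tier.errors, words: 53_120)
    print(tier.label, String(format: "%.2f%%", wer))
}
```

Aggregate WER weights every word equally, so long utterances count for more than they would in a mean of per-file WERs.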
diff --git a/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift b/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift
@@ -227,6 +227,7 @@ public enum ASRError: Error, LocalizedError {
     case unsupportedPlatform(String)
     case streamingConversionFailed(Error)
     case fileAccessFailed(URL, Error)
+    case encoderInstantiationFailed(String)
 
     public var errorDescription: String? {
         switch self {
@@ -246,6 +247,8 @@ public enum ASRError: Error, LocalizedError {
             return "Streaming audio conversion failed: \(error.localizedDescription)"
         case .fileAccessFailed(let url, let error):
             return "Failed to access audio file at \(url.path): \(error.localizedDescription)"
+        case .encoderInstantiationFailed(let message):
+            return "Encoder ANE program failed to instantiate: \(message)"
         }
     }
 }
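The new error case appears intended for the encoder health probe added later in this PR, which detects an encoder that returns all zeros without throwing. The sketch below shows one way a load path could turn that probe result into a thrown error. It is an assumption about the call site, which this diff does not show. `DemoASRError`, `hasNonZero`, and `verifyEncoder` are hypothetical stand-ins so the example compiles without FluidAudio or CoreML.

```swift
import Foundation

/// Local stand-in for the new `ASRError.encoderInstantiationFailed` case, so
/// the sketch compiles without FluidAudio.
enum DemoASRError: Error, LocalizedError {
    case encoderInstantiationFailed(String)

    var errorDescription: String? {
        switch self {
        case .encoderInstantiationFailed(let message):
            return "Encoder ANE program failed to instantiate: \(message)"
        }
    }
}

/// True if any element of the encoder output is non-zero.
func hasNonZero(_ values: [Float]) -> Bool {
    values.contains { $0 != 0 }
}

/// Hypothetical load-time check: throw instead of continuing with an encoder
/// that returns all zeros for a non-zero probe input.
func verifyEncoder(probeOutput: [Float]) throws {
    guard hasNonZero(probeOutput) else {
        throw DemoASRError.encoderInstantiationFailed(
            "encoder returned all-zero output for a non-zero probe")
    }
}

do {
    try verifyEncoder(probeOutput: [0, 0, 0])
} catch {
    print(error.localizedDescription)
}
```

Failing at load time gives the caller a chance to retry or report the problem, rather than receiving an empty transcript with no error.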
diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/NemotronMelExtractor.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/NemotronMelExtractor.swift
@@ -0,0 +1,68 @@
+@preconcurrency import CoreML
+import Foundation
+
+/// Native-Swift log-mel features for Nemotron streaming, a drop-in replacement
+/// for the CoreML `preprocessor` model.
+///
+/// Reproduces NeMo's `AudioToMelSpectrogramPreprocessor` as configured in
+/// `nvidia/nemotron-speech-streaming-en-0.6b`: `n_fft=512`,
+/// `window_size=0.025` (400), `window_stride=0.01` (160), `features=128`,
+/// `window=hann` (symmetric), `preemph=0.97`, `log` (`2^-24` additive guard),
+/// and crucially **`normalize: NA`** — i.e. *no* per-feature standardization.
+/// That is exactly `AudioMelSpectrogram`'s default front-end with the
+/// normalization step omitted, unlike `UnifiedMelExtractor` whose model uses
+/// `normalize: per_feature`.
+///
+/// Removing the CoreML preprocessor avoids its flexible-shape `RangeDim` audio
+/// input, whose ANE `default_function` was built against a 1-sample lower bound
+/// and raised `ios17.slice_by_index: zero shape error` on iPadOS cold starts
+/// (issue #739).
+struct NemotronMelExtractor {
+    private let mel: AudioMelSpectrogram
+    private let nMels: Int
+    private let hopLength = 160
+
+    init(nMels: Int = 128) {
+        self.nMels = nMels
+        self.mel = AudioMelSpectrogram(
+            sampleRate: 16000,
+            nMels: nMels,
+            nFFT: 512,
+            hopLength: 160,
+            winLength: 400,
+            preemph: 0.97,
+            padTo: 0,
+            windowPeriodic: false
+        )
+    }
+
+    /// Raw (unnormalized) log-mel for one chunk of audio, shaped
+    /// `[1, nMels, T]` with `T = floor((count + n_fft - win) / hop) + 1` (NeMo
+    /// center padding) — the same `mel` tensor the CoreML preprocessor produced,
+    /// frame for frame. `mel_length` was ignored by the pipeline (it sets the
+    /// encoder's `mel_length` to `config.totalMelFrames`), so it is not returned.
+    func melSpectrogram(samples: [Float]) throws -> MLMultiArray {
+        let result = mel.computeFlatTransposed(
+            audio: samples,
+            lastAudioSample: 0,
+            paddingMode: .center,
+            expectedFrameCount: nil
+        )
+        let flat = result.mel
+        let totalFrames = result.numFrames
+
+        let melArray = try MLMultiArray(
+            shape: [1, NSNumber(value: nMels), NSNumber(value: totalFrames)], dataType: .float32)
+        melArray.withUnsafeMutableBufferPointer(ofType: Float.self) { ptr, _ in
+            // Contiguous [1, nMels, T]: element [0, m, t] is at offset m*T + t,
+            // sourced from the time-major flat buffer at t*nMels + m.
+            for t in 0..<totalFrames {
+                let base = t * nMels
+                for m in 0..<nMels {
+                    ptr[m * totalFrames + t] = flat[base + m]
+                }
+            }
+        }
+        return melArray
+    }
+}
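The copy loop in `melSpectrogram` converts a time-major buffer into the mel-major layout of a contiguous `[1, nMels, T]` tensor. The index mapping is easy to get backwards, so here is a stand-alone sketch of the same transposition on plain arrays. `melMajor(fromTimeMajor:nMels:frames:)` is a hypothetical name used only for illustration.

```swift
import Foundation

/// Hypothetical stand-alone version of the copy loop: converts a time-major
/// buffer (index t * nMels + m) into mel-major order (index m * frames + t),
/// which is the row-major layout of a contiguous [1, nMels, frames] tensor.
func melMajor(fromTimeMajor flat: [Float], nMels: Int, frames: Int) -> [Float] {
    precondition(flat.count == nMels * frames, "buffer size must match nMels * frames")
    var out = [Float](repeating: 0, count: flat.count)
    for t in 0..<frames {
        let base = t * nMels
        for m in 0..<nMels {
            out[m * frames + t] = flat[base + m]
        }
    }
    return out
}

// Two mel bins, three frames. Values encode (frame, bin) as frame * 10 + bin.
let timeMajor: [Float] = [0, 1, 10, 11, 20, 21]
let result = melMajor(fromTimeMajor: timeMajor, nMels: 2, frames: 3)
print(result)  // [0.0, 10.0, 20.0, 1.0, 11.0, 21.0]
```

After the transposition, all frames of mel bin 0 come first, followed by all frames of bin 1, which is what the encoder's `[1, nMels, T]` input expects.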
diff --git a/...ces/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager+Pipeline.swift b/...ces/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager+Pipeline.swift
@@ -8,7 +8,7 @@ extension StreamingNemotronAsrManager {
 
     /// Process a single audio chunk through the full pipeline
     internal func processChunk(_ samples: [Float]) async throws {
-        guard let preprocessor = preprocessor,
+        guard let melExtractor = melExtractor,
             let encoder = encoder,
             let decoder = decoder,
             let joint = joint,
@@ -24,20 +24,9 @@ extension StreamingNemotronAsrManager {
         // Track decoder state locally to ensure atomicity
         var currentToken = lastToken
 
-        // 1. Preprocessor: audio -> mel spectrogram
-        let audioArray = try createAudioArray(samples)
-        let audioLen = try MLMultiArray(shape: [1], dataType: .int32)
-        audioLen[0] = NSNumber(value: samples.count)
-
-        let preprocInput = try MLDictionaryFeatureProvider(dictionary: [
-            "audio": MLFeatureValue(multiArray: audioArray),
-            "audio_length": MLFeatureValue(multiArray: audioLen),
-        ])
-
-        let preprocOutput = try await preprocessor.prediction(from: preprocInput)
-        guard let chunkMel = preprocOutput.featureValue(for: "mel")?.multiArrayValue else {
-            throw ASRError.processingFailed("Preprocessor failed to produce mel output")
-        }
+        // 1. Native-Swift log-mel front-end (replaces the CoreML preprocessor):
+        //    audio -> raw (unnormalized) log-mel [1, melFeatures, T].
+        let chunkMel = try melExtractor.melSpectrogram(samples: samples)
 
         // 2. Build encoder input: prepend mel_cache (9 frames) + current chunk mel
         let inputMel = try prependMelCache(to: chunkMel)
@@ -198,15 +187,70 @@ extension StreamingNemotronAsrManager {
         processedChunks += 1
     }
 
-    // MARK: - Tensor Utilities
+    // MARK: - Encoder Health Probe
+
+    /// Run one encoder prediction with a non-zero mel probe and report whether
+    /// the encoder produced any non-zero output.
+    ///
+    /// On iPadOS cold starts the int8 encoder's ANE `main` entry point can fail
+    /// to instantiate (logged by CoreML as
+    /// `ANEProgramProcessRequestDirect() Failed with status=0x12`). When that
+    /// happens `prediction` does not throw — it silently returns an all-zero
+    /// `encoded` buffer, so the RNN-T loop only ever sees blanks and the final
+    /// transcript is empty with no error surfaced (issue #739). A single
+    /// non-zero probe distinguishes a working encoder (LayerNorm/bias guarantee
+    /// non-zero output for non-zero input) from a stillborn ANE program, letting
+    /// `loadModels` fail loudly instead of returning empty transcripts.
+    ///
+    /// Uses throwaway local inputs and does not write the encoder's updated
+    /// caches back, so the freshly reset session state is left untouched. The
+    /// probe doubles as a model warm-up.
+    internal func encoderProducesNonZeroOutput() async throws -> Bool {
+        guard let encoder = encoder,
+            let cacheChannel = cacheChannel,
+            let cacheTime = cacheTime,
+            let cacheLen = cacheLen
+        else {
+            throw ASRError.notInitialized
+        }
+
+        // Non-zero mel input ([1, melFeatures, totalMelFrames]) so a healthy
+        // encoder is guaranteed to emit non-zero output. A small ramp avoids a
+        // degenerate constant that could in theory cancel out.
+        let mel = try MLMultiArray(
+            shape: [1, NSNumber(value: config.melFeatures), NSNumber(value: config.totalMelFrames)],
+            dataType: .float32
+        )
+        let melPtr = mel.dataPointer.bindMemory(to: Float.self, capacity: mel.count)
+        for i in 0..<mel.count {
+            melPtr[i] = Float(i % 17) * 0.01 + 0.1
+        }
 
-    internal func createAudioArray(_ samples: [Float]) throws -> MLMultiArray {
-        let array = try MLMultiArray(shape: [1, NSNumber(value: samples.count)], dataType: .float32)
-        let ptr = array.dataPointer.bindMemory(to: Float.self, capacity: samples.count)
-        ptr.update(from: samples, count: samples.count)
-        return array
+        let melLen = try MLMultiArray(shape: [1], dataType: .int32)
+        melLen[0] = NSNumber(value: config.totalMelFrames)
+
+        let encoderInput = try MLDictionaryFeatureProvider(dictionary: [
+            "mel": MLFeatureValue(multiArray: mel),
+            "mel_length": MLFeatureValue(multiArray: melLen),
+            "cache_channel": MLFeatureValue(multiArray: cacheChannel),
+            "cache_time": MLFeatureValue(multiArray: cacheTime),
+            "cache_len": MLFeatureValue(multiArray: cacheLen),
+        ])
+
+        let encoderOutput = try await encoder.prediction(from: encoderInput)
+        guard let encoded = encoderOutput.featureValue(for: "encoded")?.multiArrayValue else {
+            throw ASRError.processingFailed("Encoder probe produced no `encoded` output")
+        }
+
+        let outPtr = encoded.dataPointer.bindMemory(to: Float.self, capacity: encoded.count)
+        for i in 0..<encoded.count where outPtr[i] != 0 {
+            return true
+        }
+        return false
     }
 
+    // MARK: - Tensor Utilities
+
     internal func prependMelCache(to chunkMel: MLMultiArray) throws -> MLMultiArray {
         // Prepend cached mel frames (9) to current chunk mel (112) → [1, 128, 121]
         // Input: chunkMel [1, 128, ~112]