diff --git a/Documentation/ASR/NemotronMultilingual.md b/Documentation/ASR/NemotronMultilingual.md
index c925c34c..832a9379 100644
--- a/Documentation/ASR/NemotronMultilingual.md
+++ b/Documentation/ASR/NemotronMultilingual.md
@@ -10,7 +10,7 @@ FluidAudio supports NVIDIA's `nemotron-asr-streaming-multilingual-0.6b` for real
 | Architecture | FastConformer Cache-Aware RNNT **with Prompt** |
 | Parameters | 0.6B |
 | Languages | ~40 (en, es, de, fr, it, pt, ar, ja, ko, zh-CN, ru, hi, vi, …) |
-| Default Latency Modes | 320 ms · 560 ms · 1120 ms (each is a separate CoreML build) |
+| Default Latency Modes | 560 ms · 1120 ms · 2240 ms (each is a separate CoreML build) |
 | Mel Features | 128 bins, 16 kHz |
 | Vocab Size | 13,087 + 1 blank |
 | Hardware | Apple Silicon only (int8 encoder is ANE-targeted) |
@@ -95,33 +95,54 @@ Scoring follows the [HF Open ASR Leaderboard](https://github.com/huggingface/ope
 - **Non-English Latin** (fr, de, es, it, pt, …) → `BasicTextNormalizer(remove_diacritics=False)` plus an inverse text normalization (ITN) pass: digit runs in the reference are spelled out via `NumberFormatter.spellOut` for the language's locale before WER computation. Required because the model emits "mille neuf cent soixante-seize" while FLEURS keeps "1976" in the reference. Thousands separators handled across all five Unicode space variants FLEURS actually uses (U+0020/00A0/2007/2009/202F). Our `TextNormalizer.basicNormalize(_, spellOutLocale:)`.
 - **CJK** (ja, ko, zh, th) → character-level edit rate after whitespace stripping (segmentation-free). Reported in the "WER" column by community convention.
 
-### Chunk size sweep (FLEURS test split, full data)
+### Chunk size sweep (FLEURS full test split)
+
+Re-measured 2026-06-28 with the **native-Swift mel front-end** (`NemotronMelExtractor`;
+no CoreML preprocessor — see issue #739) over the full `google/fleurs` test splits. All
+builds use `att_context_size=[56,0]`; they differ only in `chunk_mel_frames` → processing
+chunk size. The shipped tiers are now **560 / 1120 / 2240 ms** (the earlier 320 ms tier was
+dropped, 2240 ms added). The per-language vocab-pruned ship and the full multilingual ship
+score identically (en_us @ 2240 ms = 8.72 % on both), so the table uses the full ship.
+
+| Language | 560 ms | 1120 ms | 2240 ms | NVIDIA ([56,0]) | n   |
+|----------|-------:|--------:|--------:|----------------:|----:|
+| en_us    |  9.05  |   8.73  |   8.72  |         11.35   | 647 |
+| fr_fr    |  9.80  |   9.44  |   9.36  |         13.44   | 676 |
+| de_de    | 10.61  |  10.01  |   9.96  |           —     | 862 |
+| es_419   |  4.85  |   4.75  |   4.73  |          8.69   | 908 |
+| ja_jp    | 14.27  |  13.79  |  13.78  |           —     | 650 |
+| it_it    |  5.40  |   5.43  |   5.39  |          7.33   | 865 |
+| pt_br    |  6.38  |   6.16  |   6.19  |          8.99   | 919 |
+| **AVG**  |**8.62**|**8.33** |**8.30** |                 |     |
+| agg RTFx | 40.5x  | 66.0x   |  73.1x  |                 |     |
+
+WER% for spaced scripts, CER% for ja_jp (segmentation-free, whitespace-stripped). Same
+normalizer pipeline as described above (HF Open-ASR-Leaderboard convention). Aggregate RTFx
+is total audio time ÷ total processing time across all 7 languages, end-to-end single-stream on
+Apple Silicon (machine/load-dependent; treat the relative ordering, not the absolute values, as meaningful).
+
+**Average accuracy improves monotonically with chunk size** (it_it and pt_br each regress by
+0.03 pp between one pair of tiers) and beats NVIDIA's published `[56,0]` numbers on all five
+published languages (at 2240 ms: en −2.6, fr −4.1, es −4.0, it −1.9, pt −2.8 pp). These numbers
+are ~2–4 pp better than the prior version of this table; the gain comes from model / decode-path /
+normalizer updates since it was written, **not** the Swift mel port, which is at numerical parity
+with the removed CoreML preprocessor (max |Δ| ≈ 9e-3 vs NeMo PyTorch, confirmed at conversion
+time). Cross-comparison to NVIDIA is sensitive to normalization and should be read as indicative.
+
+Reproduce (one run per tier):
 
-All three builds use `att_context_size=[56,0]` (NVIDIA's lowest-latency mode); they differ only in `chunk_mel_frames` (32 / 56 / 112 → 320 / 560 / 1120 ms processing chunks). NVIDIA's published FLEURS numbers are also at `[56,0]`, so the comparison is architecturally apples-to-apples.
-
-| Language | 320 ms | 560 ms | 1120 ms | NVIDIA ([56,0]) | Δ (1120 vs NVIDIA) | n   |
-|----------|-------:|-------:|--------:|----------------:|-------------------:|----:|
-| en_us    |  17.5  |  12.1  |   12.0  |         11.35   |             +0.65  | 647 |
-| fr_fr    |  16.4  |  13.9  |   13.8  |         13.44   |             +0.36  | 676 |
-| de_de    |  17.8  |  14.9  |   13.6  |           —     |               —    | 862 |
-| es_419   |   8.6  |   7.4  |    7.4  |          8.69   |             −1.29  | 908 |
-| ja_jp    |  21.9  |  18.4  |   17.4  |           —     |               —    | 650 |
-| it_it    |   9.8  |   7.9  |    7.4  |          7.33   |             +0.07  | 865 |
-| pt_br    |  13.4  |  10.0  |    8.4  |          8.99   |             −0.59  | 919 |
-| **AVG**  |**15.0**|**12.1**|**11.4** |                 |                    |     |
-| RTFx     |   8.6  |  16.8  |   22.0  |                 |                    |     |
-
-WER% for spaced scripts, CER% for ja_jp (segmentation-free). Full `google/fleurs` test splits (en=647, fr=676, de=862, es=908, ja=650, it=865, pt=919). The "Δ (1120 vs NVIDIA)" column compares our highest-accuracy build against NVIDIA's published number for the same `[56,0]` attention mode.
-
-**All 5 published languages are within ~0.7 pp of NVIDIA at 1120 ms.** es-419 and pt-br actually beat the reference (−1.29 and −0.59 pp respectively); en, fr, it are +0.65 / +0.36 / +0.07. At 560 ms (the recommended low-latency build) all 5 are within ~1 pp; es-419 still beats NVIDIA by −1.29 pp.
-
-**320 ms shows boundary effects on English and accent-heavy languages.** en_us jumps from 12.0 → 17.5 (+5.5 pp) and pt_br from 8.4 → 13.4 (+5.0 pp) when dropping from 1120 ms to 320 ms. 560 ms recovers most of the loss (<1.6 pp from 1120 ms on every language). If you need low latency, ship 560 ms; only use 320 ms if you absolutely need sub-half-second response and can tolerate the English regression.
+```bash
+swift run -c release fluidaudiocli nemotron-multilingual-benchmark \
+    --model-dir <multilingual ship dir> \
+    --languages en_us,fr_fr,de_de,es_419,ja_jp,it_it,pt_br \
+    --samples all --chunk-ms <560|1120|2240> --output results.json
+```
 
 ### Caveats
 
 - **`MLComputeUnits` matters a lot.** Default `.all` routes the int8 encoder to GPU and runs ~10× slower than ANE. The manager pins `.cpuAndNeuralEngine` automatically; do not override unless you have a reason.
 - **int8 vs fp16 is a wash.** Average WER is identical at all three chunk sizes; per-language drift is within ±1 pp. Ship int8 for the 50% size win and ANE residency.
-- **Two independent latency axes.** NVIDIA's published modes (`att_context_size = [56,0] / [56,3] / [56,6] / [56,13]` → ~80 / 320 / 560 / 1120 ms architectural lookahead) control right-context inside the encoder. Our `320 / 560 / 1120 ms` build labels refer to `chunk_mel_frames` (processing chunk size), not lookahead. All FluidAudio builds currently ship `[56,0]` (no lookahead).
+- **Two independent latency axes.** NVIDIA's published modes (`att_context_size = [56,0] / [56,3] / [56,6] / [56,13]` → ~80 / 320 / 560 / 1120 ms architectural lookahead) control right-context inside the encoder. Our `560 / 1120 / 2240 ms` build labels refer to `chunk_mel_frames` (processing chunk size), not lookahead. All FluidAudio builds currently ship `[56,0]` (no lookahead).
 - **CJK languages** use character-level edit rate as the "WER" field by convention; whitespace tokenization is meaningless for ja/ko/zh/th.
 - **Punctuation density drops at small chunk sizes** ([#687](https://github.com/FluidInference/FluidAudio/issues/687)). On long continuous speech the 560 ms build starts punctuating normally, then commas/periods become increasingly sparse as the session continues; 1120 ms and 2240 ms retain noticeably more punctuation on the same audio, and a session reset restores it. The words themselves are unaffected (WER-neutral) — only punctuation marks thin out. Cause is model-side: shorter chunks give the encoder less right context at sentence boundaries than the published builds' `att_context_size` assumes, and greedy RNN-T decoding compounds the miss over the session. If punctuation matters for your use case, ship 1120 ms or larger, or segment long streams (e.g. reset on VAD silence).
 
diff --git a/Documentation/Benchmarks.md b/Documentation/Benchmarks.md
index 2a3d07ed..8383ff09 100644
--- a/Documentation/Benchmarks.md
+++ b/Documentation/Benchmarks.md
@@ -117,6 +117,38 @@ swift run -c release fluidaudiocli unified-benchmark --mode streaming --max-file
 swift run -c release fluidaudiocli unified-benchmark --mode batch --precision fp16
 ```
 
+## Nemotron Speech Streaming 0.6B (English)
+
+Cache-aware FastConformer-RNNT streaming, English. Mel features are computed **natively in
+Swift** (`NemotronMelExtractor` → `AudioMelSpectrogram`, NeMo `normalize: NA` raw log-mel) —
+there is no CoreML preprocessor stage. It was removed in the issue #739 fix: the preprocessor's
+flexible `RangeDim` audio input was the source of the `ios17.slice_by_index: zero shape error`
+("Skipped adding default_function to entry point: main") ANE warning behind the iPadOS
+cold-start empty-transcript failure. Encoder int8 on ANE (`.cpuAndNeuralEngine`).
+
+Model: [FluidInference/nemotron-speech-streaming-en-0.6b-coreml](https://huggingface.co/FluidInference/nemotron-speech-streaming-en-0.6b-coreml)
+
+### LibriSpeech test-clean (2620 files, 53,120 words, ~5.4h audio)
+
+| Chunk tier | Aggregate WER | RTFx | Errors / words |
+|------------|---------------|------|----------------|
+| 560 ms (lowest latency) | 2.71%     | 40.7x | 1442 / 53120 |
+| 1120 ms (trained chunk) | **2.58%** | 24.3x | 1369 / 53120 |
+| 2240 ms (default)       | 2.64%     | 87.4x | 1403 / 53120 |
+
+- **WER** is aggregate (total errors ÷ total words across all 2620 files).
+- **RTFx** is end-to-end single-stream (Swift mel + int8 ANE encode + greedy RNN-T), release
+  build, Apple Silicon; absolute RTFx is machine/load-dependent, relative ordering is stable.
+- Accuracy is essentially flat across tiers (2.58–2.71%). 1120 ms has the best WER but lowest
+  throughput; 2240 ms (default) is the throughput sweet spot, within ~0.06 pp of the best WER.
+- Parity: `NemotronMelExtractor` matches NeMo PyTorch raw log-mel to max |Δ| ≈ 9e-3 — the WER
+  here confirms end-to-end correctness (a wrong mel front-end would collapse WER).
+- Multilingual FLEURS results: see [NemotronMultilingual.md](ASR/NemotronMultilingual.md).
+
+```bash
+swift run -c release fluidaudiocli nemotron-benchmark --subset test-clean --chunk <560|1120|2240>
+```
+
 ## Transcription with Keyword Boosting
 
 CTC-based custom vocabulary boosting system, which enables accurate recognition of domain-specific terms (company names, technical jargon, proper nouns) without retraining the ASR model.
diff --git a/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift b/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift
index b33b2385..d0d99c7f 100644
--- a/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift
+++ b/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift
@@ -227,6 +227,7 @@ public enum ASRError: Error, LocalizedError {
     case unsupportedPlatform(String)
     case streamingConversionFailed(Error)
     case fileAccessFailed(URL, Error)
+    case encoderInstantiationFailed(String)
 
     public var errorDescription: String? {
         switch self {
@@ -246,6 +247,8 @@ public enum ASRError: Error, LocalizedError {
             return "Streaming audio conversion failed: \(error.localizedDescription)"
         case .fileAccessFailed(let url, let error):
             return "Failed to access audio file at \(url.path): \(error.localizedDescription)"
+        case .encoderInstantiationFailed(let message):
+            return "Encoder ANE program failed to instantiate: \(message)"
         }
     }
 }
diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/NemotronMelExtractor.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/NemotronMelExtractor.swift
new file mode 100644
index 00000000..0aae3c2f
--- /dev/null
+++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/NemotronMelExtractor.swift
@@ -0,0 +1,68 @@
+@preconcurrency import CoreML
+import Foundation
+
+/// Native-Swift log-mel features for Nemotron streaming, a drop-in replacement
+/// for the CoreML `preprocessor` model.
+///
+/// Reproduces NeMo's `AudioToMelSpectrogramPreprocessor` as configured in
+/// `nvidia/nemotron-speech-streaming-en-0.6b`: `n_fft=512`,
+/// `window_size=0.025` (400), `window_stride=0.01` (160), `features=128`,
+/// `window=hann` (symmetric), `preemph=0.97`, `log` (`2^-24` additive guard),
+/// and crucially **`normalize: NA`** — i.e. *no* per-feature standardization.
+/// That is exactly `AudioMelSpectrogram`'s default front-end with the
+/// normalization step omitted, unlike `UnifiedMelExtractor` whose model uses
+/// `normalize: per_feature`.
+///
+/// Removing the CoreML preprocessor avoids its flexible-shape `RangeDim` audio
+/// input, whose ANE `default_function` was built against a 1-sample lower bound
+/// and raised `ios17.slice_by_index: zero shape error` on iPadOS cold starts
+/// (issue #739).
+struct NemotronMelExtractor {
+    private let mel: AudioMelSpectrogram
+    private let nMels: Int
+    private let hopLength = 160
+
+    init(nMels: Int = 128) {
+        self.nMels = nMels
+        self.mel = AudioMelSpectrogram(
+            sampleRate: 16000,
+            nMels: nMels,
+            nFFT: 512,
+            hopLength: 160,
+            winLength: 400,
+            preemph: 0.97,
+            padTo: 0,
+            windowPeriodic: false
+        )
+    }
+
+    /// Raw (unnormalized) log-mel for one chunk of audio, shaped
+    /// `[1, nMels, T]` with `T = floor((count + n_fft - win) / hop) + 1` (NeMo
+    /// center padding) — the same `mel` tensor the CoreML preprocessor produced,
+    /// frame for frame. `mel_length` was ignored by the pipeline (it sets the
+    /// encoder's `mel_length` to `config.totalMelFrames`), so it is not returned.
+    func melSpectrogram(samples: [Float]) throws -> MLMultiArray {
+        let result = mel.computeFlatTransposed(
+            audio: samples,
+            lastAudioSample: 0,
+            paddingMode: .center,
+            expectedFrameCount: nil
+        )
+        let flat = result.mel
+        let totalFrames = result.numFrames
+
+        let melArray = try MLMultiArray(
+            shape: [1, NSNumber(value: nMels), NSNumber(value: totalFrames)], dataType: .float32)
+        melArray.withUnsafeMutableBufferPointer(ofType: Float.self) { ptr, _ in
+            // Contiguous [1, nMels, T]: element [0, m, t] is at offset m*T + t,
+            // sourced from the time-major flat buffer at t*nMels + m.
+            for t in 0..<totalFrames {
+                let base = t * nMels
+                for m in 0..<nMels {
+                    ptr[m * totalFrames + t] = flat[base + m]
+                }
+            }
+        }
+        return melArray
+    }
+}
diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager+Pipeline.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager+Pipeline.swift
index e7cae415..340f00aa 100644
--- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager+Pipeline.swift
+++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager+Pipeline.swift
@@ -8,7 +8,7 @@ extension StreamingNemotronAsrManager {
 
     /// Process a single audio chunk through the full pipeline
     internal func processChunk(_ samples: [Float]) async throws {
-        guard let preprocessor = preprocessor,
+        guard let melExtractor = melExtractor,
             let encoder = encoder,
             let decoder = decoder,
             let joint = joint,
@@ -24,20 +24,9 @@ extension StreamingNemotronAsrManager {
         // Track decoder state locally to ensure atomicity
         var currentToken = lastToken
 
-        // 1. Preprocessor: audio -> mel spectrogram
-        let audioArray = try createAudioArray(samples)
-        let audioLen = try MLMultiArray(shape: [1], dataType: .int32)
-        audioLen[0] = NSNumber(value: samples.count)
-
-        let preprocInput = try MLDictionaryFeatureProvider(dictionary: [
-            "audio": MLFeatureValue(multiArray: audioArray),
-            "audio_length": MLFeatureValue(multiArray: audioLen),
-        ])
-
-        let preprocOutput = try await preprocessor.prediction(from: preprocInput)
-        guard let chunkMel = preprocOutput.featureValue(for: "mel")?.multiArrayValue else {
-            throw ASRError.processingFailed("Preprocessor failed to produce mel output")
-        }
+        // 1. Native-Swift log-mel front-end (replaces the CoreML preprocessor):
+        //    audio -> raw (unnormalized) log-mel [1, melFeatures, T].
+        let chunkMel = try melExtractor.melSpectrogram(samples: samples)
 
         // 2. Build encoder input: prepend mel_cache (9 frames) + current chunk mel
         let inputMel = try prependMelCache(to: chunkMel)
@@ -198,15 +187,70 @@ extension StreamingNemotronAsrManager {
         processedChunks += 1
     }
 
-    // MARK: - Tensor Utilities
+    // MARK: - Encoder Health Probe
+
+    /// Run one encoder prediction with a non-zero mel probe and report whether
+    /// the encoder produced any non-zero output.
+    ///
+    /// On iPadOS cold starts the int8 encoder's ANE `main` entry point can fail
+    /// to instantiate (logged by CoreML as
+    /// `ANEProgramProcessRequestDirect() Failed with status=0x12`). When that
+    /// happens `prediction` does not throw — it silently returns an all-zero
+    /// `encoded` buffer, so the RNN-T loop only ever sees blanks and the final
+    /// transcript is empty with no error surfaced (issue #739). A single
+    /// non-zero probe distinguishes a working encoder (LayerNorm/bias guarantee
+    /// non-zero output for non-zero input) from a stillborn ANE program, letting
+    /// `loadModels` fail loudly instead of returning empty transcripts.
+    ///
+    /// Uses throwaway local inputs and does not write the encoder's updated
+    /// caches back, so the freshly reset session state is left untouched. The
+    /// probe doubles as a model warm-up.
+    internal func encoderProducesNonZeroOutput() async throws -> Bool {
+        guard let encoder = encoder,
+            let cacheChannel = cacheChannel,
+            let cacheTime = cacheTime,
+            let cacheLen = cacheLen
+        else {
+            throw ASRError.notInitialized
+        }
+
+        // Non-zero mel input ([1, melFeatures, totalMelFrames]) so a healthy
+        // encoder is guaranteed to emit non-zero output. A small ramp avoids a
+        // degenerate constant that could in theory cancel out.
+        let mel = try MLMultiArray(
+            shape: [1, NSNumber(value: config.melFeatures), NSNumber(value: config.totalMelFrames)],
+            dataType: .float32
+        )
+        let melPtr = mel.dataPointer.bindMemory(to: Float.self, capacity: mel.count)
+        for i in 0..<mel.count {
+            melPtr[i] = Float(i % 17) * 0.01 + 0.1
+        }
 
-    internal func createAudioArray(_ samples: [Float]) throws -> MLMultiArray {
-        let array = try MLMultiArray(shape: [1, NSNumber(value: samples.count)], dataType: .float32)
-        let ptr = array.dataPointer.bindMemory(to: Float.self, capacity: samples.count)
-        ptr.update(from: samples, count: samples.count)
-        return array
+        let melLen = try MLMultiArray(shape: [1], dataType: .int32)
+        melLen[0] = NSNumber(value: config.totalMelFrames)
+
+        let encoderInput = try MLDictionaryFeatureProvider(dictionary: [
+            "mel": MLFeatureValue(multiArray: mel),
+            "mel_length": MLFeatureValue(multiArray: melLen),
+            "cache_channel": MLFeatureValue(multiArray: cacheChannel),
+            "cache_time": MLFeatureValue(multiArray: cacheTime),
+            "cache_len": MLFeatureValue(multiArray: cacheLen),
+        ])
+
+        let encoderOutput = try await encoder.prediction(from: encoderInput)
+        guard let encoded = encoderOutput.featureValue(for: "encoded")?.multiArrayValue else {
+            throw ASRError.processingFailed("Encoder probe produced no `encoded` output")
+        }
+
+        let outPtr = encoded.dataPointer.bindMemory(to: Float.self, capacity: encoded.count)
+        for i in 0..<encoded.count where outPtr[i] != 0 {
+            return true
+        }
+        return false
     }
 
+    // MARK: - Tensor Utilities
+
     internal func prependMelCache(to chunkMel: MLMultiArray) throws -> MLMultiArray {
         // Prepend cached mel frames (9) to current chunk mel (112) → [1, 128, 121]
         // Input: chunkMel [1, 128, ~112]
diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager.swift
index 9aee93bb..6d02f972 100644
--- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager.swift
+++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager.swift
@@ -11,7 +11,10 @@ public actor StreamingNemotronAsrManager {
     private let logger = AppLogger(category: "NemotronStreaming")
 
     // Models
-    internal var preprocessor: MLModel?
+    /// Native-Swift log-mel front-end, replacing the CoreML `preprocessor`
+    /// model. See `NemotronMelExtractor` (and issue #739) for why the CoreML
+    /// preprocessor was removed.
+    internal var melExtractor: NemotronMelExtractor?
     internal var encoder: MLModel?
     internal var decoder: MLModel?
     internal var joint: MLModel?
@@ -89,7 +92,8 @@ public actor StreamingNemotronAsrManager {
         self.partialCallback = callback
     }
 
-    /// Load models from a directory containing preprocessor, encoder, decoder, joint, and tokenizer
+    /// Load models from a directory containing encoder, decoder, joint, and tokenizer
+    /// (the mel front-end is computed natively in Swift; no CoreML preprocessor)
     /// - Parameter directory: Directory containing the model files
     public func loadModels(from directory: URL) async throws {
         guard SystemInfo.isAppleSilicon else {
@@ -106,9 +110,8 @@ public actor StreamingNemotronAsrManager {
             logger.info("Loaded config: \(config.chunkMs)ms chunks, \(config.chunkMelFrames) mel frames")
         }
 
-        // Load preprocessor
-        let preprocessorPath = directory.appendingPathComponent(ModelNames.NemotronStreaming.preprocessorFile)
-        self.preprocessor = try await MLModel.load(contentsOf: preprocessorPath, configuration: mlConfiguration)
+        // Native-Swift log-mel front-end (replaces the CoreML preprocessor).
+        self.melExtractor = NemotronMelExtractor(nMels: config.melFeatures)
 
         // Load encoder (int8 quantized)
         let encoderPath = directory.appendingPathComponent("encoder").appendingPathComponent(NemotronEncoder.fileName)
@@ -137,6 +140,23 @@ public actor StreamingNemotronAsrManager {
         // Initialize states
         try resetStates()
 
+        // Fail loudly if the encoder's ANE program failed to instantiate (issue
+        // #739): on iPadOS cold starts the int8 encoder can silently return an
+        // all-zero buffer, yielding an empty transcript with no error thrown.
+        // Probe once with non-zero input; throw a clear error instead of letting
+        // every transcript come back empty. The probe also warms the encoder.
+        let encoderHealthy = try await encoderProducesNonZeroOutput()
+        guard encoderHealthy else {
+            throw ASRError.encoderInstantiationFailed(
+                "Nemotron int8 encoder returned all-zero output on a load-time probe — the ANE "
+                    + "program did not instantiate (see CoreML ANEProgramProcessRequestDirect "
+                    + "status=0x12). This is the iPadOS cold-start failure in issue #739; "
+                    + "re-download or update the encoder model."
+            )
+        }
+        // The probe leaves session state untouched; reset again defensively before first use.
+        try resetStates()
+
         logger.info("Nemotron models loaded successfully (\(config.chunkMs)ms chunks).")
     }
 
@@ -198,7 +218,7 @@ public actor StreamingNemotronAsrManager {
 
     public func cleanup() async {
         await reset()
-        preprocessor = nil
+        melExtractor = nil
         encoder = nil
         decoder = nil
         joint = nil
@@ -254,7 +274,7 @@ public actor StreamingNemotronAsrManager {
     /// Process audio and return partial transcript
     public func process(audioBuffer: AVAudioPCMBuffer) async throws -> String {
         // Check if models are loaded
-        guard preprocessor != nil, encoder != nil, decoder != nil, joint != nil else {
+        guard melExtractor != nil, encoder != nil, decoder != nil, joint != nil else {
             throw ASRError.notInitialized
         }
 
@@ -277,7 +297,7 @@ public actor StreamingNemotronAsrManager {
     public func finish() async throws -> String {
         // Check if models are loaded
         guard let tokenizer = tokenizer,
-            preprocessor != nil,
+            melExtractor != nil,
             encoder != nil,
             decoder != nil,
             joint != nil
@@ -344,7 +364,7 @@ extension StreamingNemotronAsrManager: StreamingAsrManager {
     }
 
     public func processBufferedAudio() async throws {
-        guard preprocessor != nil, encoder != nil, decoder != nil, joint != nil else {
+        guard melExtractor != nil, encoder != nil, decoder != nil, joint != nil else {
             throw ASRError.notInitialized
         }
 
diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Buffers.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Buffers.swift
index 4c1ce436..ad964b50 100644
--- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Buffers.swift
+++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Buffers.swift
@@ -9,48 +9,16 @@ extension StreamingNemotronMultilingualAsrManager {
     // MARK: - Tensor Utilities (duplicated from the English pipeline so the
     // two managers stay independent; the math is small and self-contained).
 
-    internal func createAudioArray(_ samples: [Float]) throws -> MLMultiArray {
-        let array = try MLMultiArray(shape: [1, NSNumber(value: samples.count)], dataType: .float32)
-        let ptr = array.dataPointer.bindMemory(to: Float.self, capacity: samples.count)
-        ptr.update(from: samples, count: samples.count)
-        return array
-    }
-
-    /// Nonisolated helper for async pipelining — runs the preprocessor on a
-    /// chunk of samples without touching actor state. Sendable inputs only.
-    /// Reuses caller-provided `audioInputBuf` / `audioLenBuf` when supplied
-    /// and shape-compatible; otherwise falls back to fresh allocation.
+    /// Nonisolated helper for async pipelining — computes the native-Swift
+    /// log-mel for a chunk without touching actor state. The caller passes a
+    /// dedicated `melExtractor` instance (distinct from the on-actor one) so
+    /// the extractor's non-thread-safe FFT scratch buffers are never shared
+    /// across the concurrent prefetch boundary.
     nonisolated internal static func runPreprocessorPure(
         samples: [Float],
-        preprocessor: MLModel,
-        audioInputBuf: MLMultiArray? = nil,
-        audioLenBuf: MLMultiArray? = nil
-    ) async throws -> MLMultiArray? {
-        let array: MLMultiArray
-        let audioLen: MLMultiArray
-        if let buf = audioInputBuf,
-            buf.shape[1].intValue == samples.count,
-            let lenBuf = audioLenBuf
-        {
-            let ptr = buf.dataPointer.bindMemory(to: Float.self, capacity: samples.count)
-            ptr.update(from: samples, count: samples.count)
-            lenBuf[0] = NSNumber(value: samples.count)
-            array = buf
-            audioLen = lenBuf
-        } else {
-            array = try MLMultiArray(shape: [1, NSNumber(value: samples.count)], dataType: .float32)
-            let ptr = array.dataPointer.bindMemory(to: Float.self, capacity: samples.count)
-            ptr.update(from: samples, count: samples.count)
-            audioLen = try MLMultiArray(shape: [1], dataType: .int32)
-            audioLen[0] = NSNumber(value: samples.count)
-        }
-
-        let input = try MLDictionaryFeatureProvider(dictionary: [
-            "audio": MLFeatureValue(multiArray: array),
-            "audio_length": MLFeatureValue(multiArray: audioLen),
-        ])
-        let output = try await preprocessor.prediction(from: input)
-        return output.featureValue(for: "mel")?.multiArrayValue
+        melExtractor: NemotronMelExtractor
+    ) throws -> MLMultiArray? {
+        return try melExtractor.melSpectrogram(samples: samples)
     }
 
     /// Triple-stage pipeline helper: runs preprocessor[t+1] + encoder[t+1] in
@@ -104,10 +72,8 @@ extension StreamingNemotronMultilingualAsrManager {
         totalMelFrames: Int,
         melFeatures: Int,
         preEncodeCache: Int,
-        preprocessor: MLModel,
-        encoder: MLModel,
-        audioInputBuf: MLMultiArray? = nil,
-        audioLenBuf: MLMultiArray? = nil
+        melExtractor: NemotronMelExtractor,
+        encoder: MLModel
     ) async throws -> (
         encoded: MLMultiArray,
         encoderProj: MLMultiArray?,
@@ -125,11 +91,9 @@ extension StreamingNemotronMultilingualAsrManager {
             return nil
         }
         guard
-            let chunkMel = try await runPreprocessorPure(
+            let chunkMel = try runPreprocessorPure(
                 samples: samples,
-                preprocessor: preprocessor,
-                audioInputBuf: audioInputBuf,
-                audioLenBuf: audioLenBuf
+                melExtractor: melExtractor
             )
         else {
             return nil
diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Pipeline.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Pipeline.swift
index 1e15bf75..132a9887 100644
--- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Pipeline.swift
+++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Pipeline.swift
@@ -58,7 +58,7 @@ extension StreamingNemotronMultilingualAsrManager {
     internal func processChunk(_ samples: [Float], nextChunkSamples: [Float]? = nil) async throws {
         // decoder/joint are optional (lean B1 ships omit them) — bound
         // locally at the sites that need them (smart-spec, unfused fallback).
-        guard let preprocessor = preprocessor,
+        guard let melExtractor = melExtractor,
             let encoder = encoder,
             let cacheChannel = cacheChannel,
             let cacheTime = cacheTime,
@@ -130,35 +130,8 @@ extension StreamingNemotronMultilingualAsrManager {
                 chunkMel = prefetched
                 self.prefetchedMel = nil
             } else {
-                // Reuse pre-allocated audio buffer when sized correctly
-                // (chunkSamples == config.chunkSamples for normal chunks).
-                // Final chunk may be shorter (padded) so falls back to fresh
-                // alloc to match the actual sample count.
-                let audioArray: MLMultiArray
-                let audioLen: MLMultiArray
-                if let buf = audioInputBuf,
-                    buf.shape[1].intValue == samples.count,
-                    let lenBuf = audioLenBuf
-                {
-                    let ptr = buf.dataPointer.bindMemory(to: Float.self, capacity: samples.count)
-                    ptr.update(from: samples, count: samples.count)
-                    lenBuf[0] = NSNumber(value: samples.count)
-                    audioArray = buf
-                    audioLen = lenBuf
-                } else {
-                    audioArray = try createAudioArray(samples)
-                    audioLen = try MLMultiArray(shape: [1], dataType: .int32)
-                    audioLen[0] = NSNumber(value: samples.count)
-                }
-                let preprocInput = try MLDictionaryFeatureProvider(dictionary: [
-                    "audio": MLFeatureValue(multiArray: audioArray),
-                    "audio_length": MLFeatureValue(multiArray: audioLen),
-                ])
-                let preprocOutput = try await preprocessor.prediction(from: preprocInput)
-                guard let mel = preprocOutput.featureValue(for: "mel")?.multiArrayValue else {
-                    throw ASRError.processingFailed("Preprocessor failed to produce mel output")
-                }
-                chunkMel = mel
+                // Native-Swift log-mel (replaces the CoreML preprocessor).
+                chunkMel = try melExtractor.melSpectrogram(samples: samples)
             }
             self.prepNanos &+= DispatchTime.now().uptimeNanoseconds &- prepStart
 
@@ -243,8 +216,9 @@ extension StreamingNemotronMultilingualAsrManager {
         nonisolated(unsafe) let snapshotCacheTime = self.cacheTime
         nonisolated(unsafe) let snapshotCacheLen = self.cacheLen
         nonisolated(unsafe) let snapshotMelCache = self.melCache
-        nonisolated(unsafe) let snapshotAudioBuf = self.audioInputBuf
-        nonisolated(unsafe) let snapshotAudioLenBuf = self.audioLenBuf
+        // Dedicated prefetch extractor instance — never shared with the
+        // on-actor `melExtractor`, so its FFT scratch buffers can't race.
+        nonisolated(unsafe) let snapshotMelExtractor = self.prefetchMelExtractor
         let snapshotPromptId = currentPromptIdValue()
         let snapshotTotalMelFrames = config.totalMelFrames
         let snapshotMelFeatures = config.melFeatures
@@ -267,7 +241,8 @@ extension StreamingNemotronMultilingualAsrManager {
                     !tripleStageDisabled,
                     let ch = snapshotCacheChannel,
                     let ti = snapshotCacheTime,
-                    let ln = snapshotCacheLen
+                    let ln = snapshotCacheLen,
+                    let mex = snapshotMelExtractor
                 else { return nil }
                 return try await Self.runPrepAndEncoderPure(
                     samples: next,
@@ -279,10 +254,8 @@ extension StreamingNemotronMultilingualAsrManager {
                     totalMelFrames: snapshotTotalMelFrames,
                     melFeatures: snapshotMelFeatures,
                     preEncodeCache: snapshotPreEncodeCache,
-                    preprocessor: preprocessor,
-                    encoder: encoder,
-                    audioInputBuf: snapshotAudioBuf,
-                    audioLenBuf: snapshotAudioLenBuf
+                    melExtractor: mex,
+                    encoder: encoder
                 )
             }()
 
diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Shared.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Shared.swift
index ed946138..121d3cf7 100644
--- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Shared.swift
+++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Shared.swift
@@ -13,7 +13,8 @@ import Foundation
 /// melCache, prediction output backings) stays inside the manager
 /// actor.
 public struct SharedNemotronMultilingualModels: Sendable {
-    public let preprocessor: MLModel
+    // The mel front-end is computed natively in Swift per manager
+    // (`NemotronMelExtractor`); no CoreML preprocessor is shared.
     public let encoder: MLModel
     /// Bare prediction LSTM. Optional: a lean ship may omit it when B1
     /// (`decoderJoint`) covers the standard path and no smart-spec (K=4)
@@ -40,7 +41,6 @@ public struct SharedNemotronMultilingualModels: Sendable {
     public let mlConfiguration: MLModelConfiguration
 
     fileprivate init(
-        preprocessor: MLModel,
         encoder: MLModel,
         decoder: MLModel?,
         joint: MLModel?,
@@ -53,7 +53,6 @@ public struct SharedNemotronMultilingualModels: Sendable {
         tokenizer: NemotronMultilingualTokenizer,
         mlConfiguration: MLModelConfiguration
     ) {
-        self.preprocessor = preprocessor
         self.encoder = encoder
         self.decoder = decoder
         self.joint = joint
@@ -106,13 +105,6 @@ extension StreamingNemotronMultilingualAsrManager {
             "Loaded multilingual config: \(config.chunkMs)ms chunks, vocab=\(config.vocabSize), \(config.numPrompts) prompts"
         )
 
-        let preprocessor = try await Self.loadShared(
-            directory: directory,
-            compiledName: ModelNames.NemotronMultilingualStreaming.preprocessorFile,
-            packageName: ModelNames.NemotronMultilingualStreaming.preprocessorPackage,
-            configuration: mlConfiguration
-        )
-
         let encoder = try await Self.loadShared(
             directory: directory,
             compiledName: ModelNames.NemotronMultilingualStreaming.encoderFile,
@@ -225,7 +217,6 @@ extension StreamingNemotronMultilingualAsrManager {
         logger.info("Shared models preload complete — ready for N consumers")
 
         return SharedNemotronMultilingualModels(
-            preprocessor: preprocessor,
             encoder: encoder,
             decoder: decoder,
             joint: joint,
@@ -254,8 +245,12 @@ extension StreamingNemotronMultilingualAsrManager {
         self.lastToken = Int32(config.blankIdx)
         self.currentPromptId = Int32(config.defaultPromptId)
 
+        // Each manager builds its own (non-thread-safe) mel extractors; only
+        // the heavyweight MLModel handles are shared across streams.
+        self.melExtractor = NemotronMelExtractor(nMels: shared.config.melFeatures)
+        self.prefetchMelExtractor = NemotronMelExtractor(nMels: shared.config.melFeatures)
+
         // Adopt shared MLModel references
-        self.preprocessor = shared.preprocessor
         self.encoder = shared.encoder
         self.decoder = shared.decoder
         self.joint = shared.joint
diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager.swift
index c2871c76..54e3608e 100644
--- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager.swift
+++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager.swift
@@ -25,7 +25,12 @@ public actor StreamingNemotronMultilingualAsrManager {
     internal let logger = AppLogger(category: "NemotronMultilingualStreaming")
 
     // Models
-    internal var preprocessor: MLModel?
+    /// Native-Swift log-mel front-end (replaces the CoreML preprocessor; see
+    /// `NemotronMelExtractor` and issue #739). Two instances so the on-actor
+    /// inline path and the concurrent triple-stage prefetch task never share
+    /// the extractor's (non-thread-safe) FFT scratch buffers.
+    internal var melExtractor: NemotronMelExtractor?
+    internal var prefetchMelExtractor: NemotronMelExtractor?
     internal var encoder: MLModel?
     internal var decoder: MLModel?
     internal var joint: MLModel?
@@ -173,8 +178,6 @@ public actor StreamingNemotronMultilingualAsrManager {
     /// (~25-800 allocs/h depending on chunk size). Pre-allocated once,
     /// refilled in place. Triple-stage helpers are sequential (await before
     /// next dispatch) so a single shared buffer is safe.
-    internal var audioInputBuf: MLMultiArray?
-    internal var audioLenBuf: MLMultiArray?
 
     // Triple-stage pipelining: encoder[t+1] dispatched concurrent with
     // decode[t]. These are the prefetched encoder outputs (and the caches
@@ -331,17 +334,11 @@ public actor StreamingNemotronMultilingualAsrManager {
             "Loaded multilingual config: \(config.chunkMs)ms chunks, vocab=\(config.vocabSize), \(config.numPrompts) prompts, default=\(config.defaultPromptId)"
         )
 
-        // Load model bundles (prefer .mlmodelc, fall back to .mlpackage with on-demand compile)
-        let preprocessorURL = try await locateModelBundle(
-            in: directory,
-            compiled: ModelNames.NemotronMultilingualStreaming.preprocessorFile,
-            uncompiled: ModelNames.NemotronMultilingualStreaming.preprocessorPackage
-        )
-        self.preprocessor = try await MLModel.load(
-            contentsOf: preprocessorURL,
-            configuration: Self.computeUnitOverride(
-                name: "FLUIDAUDIO_PREPROCESSOR_CU", base: mlConfiguration, logger: logger)
-        )
+        // Native-Swift log-mel front-end replaces the CoreML preprocessor
+        // (Nemotron uses `normalize: NA` — raw log-mel — same as the English
+        // variant). Two instances: inline path + concurrent prefetch task.
+        self.melExtractor = NemotronMelExtractor(nMels: config.melFeatures)
+        self.prefetchMelExtractor = NemotronMelExtractor(nMels: config.melFeatures)
 
         let encoderURL = try await locateModelBundle(
             in: directory,
@@ -525,16 +522,6 @@ public actor StreamingNemotronMultilingualAsrManager {
             self.tokenLenBuf = tokLen
         }
 
-        // Reusable preprocessor input buffers — [1, chunkSamples] float32
-        // audio + [1] int32 length. Refilled by triple-stage helper.
-        if let audBuf = try? MLMultiArray(shape: [1, NSNumber(value: config.chunkSamples)], dataType: .float32) {
-            self.audioInputBuf = audBuf
-        }
-        if let audLen = try? MLMultiArray(shape: [1], dataType: .int32) {
-            audLen[0] = NSNumber(value: config.chunkSamples)
-            self.audioLenBuf = audLen
-        }
-
         // First-chunk warm-up: dispatch one zero-input prediction per model so
         // the ANE program is compiled + resident before the first real
         // chunk. Cuts ~10-20ms off every clip's first chunk (which can't
@@ -555,8 +542,7 @@ public actor StreamingNemotronMultilingualAsrManager {
         // start — warm them unconditionally. Bare decoder/joint are optional on
         // lean B1 ships; requiring them here skipped ALL warmup (incl. encoder)
         // on those ships. They're bound only in the unfused branch below.
-        guard let preprocessor = preprocessor,
-            let encoder = encoder,
+        guard let encoder = encoder,
             let cacheChannel = cacheChannel,
             let cacheTime = cacheTime,
             let cacheLen = cacheLen,
@@ -564,20 +550,7 @@ public actor StreamingNemotronMultilingualAsrManager {
             let cState = cState
         else { return }
 
-        // Preprocessor: 1s of silence
-        if let audio = try? MLMultiArray(shape: [1, 16000], dataType: .float32),
-            let audioLen = try? MLMultiArray(shape: [1], dataType: .int32)
-        {
-            audio.reset(to: 0)
-            audioLen[0] = 16000
-            let input = try? MLDictionaryFeatureProvider(dictionary: [
-                "audio": MLFeatureValue(multiArray: audio),
-                "audio_length": MLFeatureValue(multiArray: audioLen),
-            ])
-            if let input = input {
-                _ = try? await preprocessor.prediction(from: input)
-            }
-        }
+        // (No CoreML preprocessor to warm — mel is computed natively in Swift.)
 
         // Encoder: zero mel + zeros caches
         if let mel = try? MLMultiArray(
@@ -871,7 +844,8 @@ public actor StreamingNemotronMultilingualAsrManager {
 
     public func cleanup() async {
         await reset()
-        preprocessor = nil
+        melExtractor = nil
+        prefetchMelExtractor = nil
         encoder = nil
         decoder = nil
         joint = nil
@@ -954,7 +928,7 @@ public actor StreamingNemotronMultilingualAsrManager {
         let hasDecodePath =
             decoderJoint != nil || decoderJointNoEncProj != nil || decoderJointArgmax != nil
             || (decoder != nil && joint != nil)
-        guard preprocessor != nil, encoder != nil, hasDecodePath else {
+        guard melExtractor != nil, encoder != nil, hasDecodePath else {
             throw ASRError.notInitialized
         }
 
@@ -999,7 +973,7 @@ public actor StreamingNemotronMultilingualAsrManager {
             decoderJoint != nil || decoderJointNoEncProj != nil || decoderJointArgmax != nil
             || (decoder != nil && joint != nil)
         guard let tokenizer = tokenizer,
-            preprocessor != nil,
+            melExtractor != nil,
             encoder != nil,
             hasDecodePath
         else {
diff --git a/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift b/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift
index 296c0362..bbda2029 100644
--- a/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift
+++ b/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift
@@ -123,9 +123,10 @@ struct KMeansClustering {
                 best = result
             }
         }
-        return best ?? clusterWithCentroids(
-            embeddings: embeddings, numClusters: numClusters,
-            maxIterations: maxIterations, seed: baseSeed)
+        return best
+            ?? clusterWithCentroids(
+                embeddings: embeddings, numClusters: numClusters,
+                maxIterations: maxIterations, seed: baseSeed)
     }
 
     private static func normalizeEmbeddings(_ embeddings: [[Double]]) -> [[Double]] {
diff --git a/Tests/FluidAudioTests/ASR/Parakeet/Streaming/StreamingNemotronAsrManagerTests.swift b/Tests/FluidAudioTests/ASR/Parakeet/Streaming/StreamingNemotronAsrManagerTests.swift
index 2b3e77b4..1e58b488 100644
--- a/Tests/FluidAudioTests/ASR/Parakeet/Streaming/StreamingNemotronAsrManagerTests.swift
+++ b/Tests/FluidAudioTests/ASR/Parakeet/Streaming/StreamingNemotronAsrManagerTests.swift
@@ -189,6 +189,66 @@ final class StreamingNemotronAsrManagerTests: XCTestCase {
             })
     }
 
+    func testEncoderInstantiationFailedErrorDescription() {
+        // Issue #739: loadModels throws this when the encoder ANE program does
+        // not instantiate and the cold-start probe sees all-zero output.
+        let error = ASRError.encoderInstantiationFailed("probe returned zeros")
+        let description = error.errorDescription ?? ""
+        XCTAssertTrue(description.contains("ANE program failed to instantiate"))
+        XCTAssertTrue(description.contains("probe returned zeros"))
+    }
+
+    // MARK: - Native Swift Mel Extractor (replaces CoreML preprocessor)
+
+    func testNemotronMelExtractorShapeAndFrameCount() throws {
+        // One 1120 ms chunk @ 16 kHz. NeMo center padding, hop = 160 => floor(N/hop) + 1 frames.
+        let n = 17920
+        let extractor = NemotronMelExtractor(nMels: 128)
+        let samples = (0..<n).map { Float(sin(2.0 * Double.pi * 440.0 * Double($0) / 16000.0)) * 0.5 }
+
+        let mel = try extractor.melSpectrogram(samples: samples)
+        XCTAssertEqual(mel.shape.map { $0.intValue }, [1, 128, n / 160 + 1])  // [1, 128, 113]
+    }
+
+    func testNemotronMelExtractorIsDeterministic() throws {
+        let extractor = NemotronMelExtractor(nMels: 128)
+        let samples = (0..<8000).map { Float(sin(2.0 * Double.pi * 220.0 * Double($0) / 16000.0)) }
+
+        let a = try extractor.melSpectrogram(samples: samples)
+        let b = try extractor.melSpectrogram(samples: samples)
+        let pa = a.dataPointer.bindMemory(to: Float.self, capacity: a.count)
+        let pb = b.dataPointer.bindMemory(to: Float.self, capacity: b.count)
+        for i in 0..<a.count {
+            XCTAssertEqual(pa[i], pb[i])
+        }
+    }
+
+    func testNemotronMelExtractorIsNotPerFeatureNormalized() throws {
+        // Nemotron uses `normalize: NA` (raw log-mel), unlike the Unified model's
+        // per-feature standardization. Raw log-mel bins have large non-zero means
+        // (around -15), whereas per-feature normalization would force each bin's mean to ~0.
+        let frames = 113  // floor(17920 / 160) + 1, matching the [1, 128, 113] output shape
+        let nMels = 128
+        let extractor = NemotronMelExtractor(nMels: nMels)
+        let samples = (0..<17920).map { Float(sin(2.0 * Double.pi * 440.0 * Double($0) / 16000.0)) * 0.5 }
+
+        let mel = try extractor.melSpectrogram(samples: samples)
+        let ptr = mel.dataPointer.bindMemory(to: Float.self, capacity: mel.count)
+
+        var maxAbsBinMean: Float = 0
+        for m in 0..<nMels {
+            var sum: Float = 0
+            for t in 0..<frames { sum += ptr[m * frames + t] }
+            maxAbsBinMean = max(maxAbsBinMean, abs(sum / Float(frames)))
+        }
+        // A per-feature-normalized tensor would have every bin mean ~0.
+        XCTAssertGreaterThan(maxAbsBinMean, 1.0)
+        // Output must be finite.
+        for i in 0..<mel.count {
+            XCTAssertTrue(ptr[i].isFinite)
+        }
+    }
+
     // MARK: - P0: Stride Calculation Tests
 
     func testStrideCalculationWithContiguousArray() throws {