diff --git a/Documentation/ASR/NemotronMultilingual.md b/Documentation/ASR/NemotronMultilingual.md
index c925c34c..832a9379 100644
--- a/Documentation/ASR/NemotronMultilingual.md
+++ b/Documentation/ASR/NemotronMultilingual.md
@@ -10,7 +10,7 @@ FluidAudio supports NVIDIA's `nemotron-asr-streaming-multilingual-0.6b` for real
 | Architecture | FastConformer Cache-Aware RNNT **with Prompt** |
 | Parameters | 0.6B |
 | Languages | ~40 (en, es, de, fr, it, pt, ar, ja, ko, zh-CN, ru, hi, vi, …) |
-| Default Latency Modes | 320 ms · 560 ms · 1120 ms (each is a separate CoreML build) |
+| Default Latency Modes | 560 ms · 1120 ms · 2240 ms (each is a separate CoreML build) |
 | Mel Features | 128 bins, 16 kHz |
 | Vocab Size | 13,087 + 1 blank |
 | Hardware | Apple Silicon only (int8 encoder is ANE-targeted) |
@@ -95,33 +95,54 @@ Scoring follows the [HF Open ASR Leaderboard](https://github.com/huggingface/ope
 - **Non-English Latin** (fr, de, es, it, pt, …) → `BasicTextNormalizer(remove_diacritics=False)` plus an inverse text normalization (ITN) pass: digit runs in the reference are spelled out via `NumberFormatter.spellOut` for the language's locale before WER computation. Required because the model emits "mille neuf cent soixante-seize" while FLEURS keeps "1976" in the reference. Thousands separators handled across all five Unicode space variants FLEURS actually uses (U+0020/00A0/2007/2009/202F). Our `TextNormalizer.basicNormalize(_, spellOutLocale:)`.
 - **CJK** (ja, ko, zh, th) → character-level edit rate after whitespace stripping (segmentation-free). Reported in the "WER" column by community convention.
 
-### Chunk size sweep (FLEURS test split, full data)
+### Chunk size sweep (FLEURS full test split)
+
+Re-measured 2026-06-28 with the **native-Swift mel front-end** (`NemotronMelExtractor`;
+no CoreML preprocessor, see issue #739) over the full `google/fleurs` test splits.
All
+builds use `att_context_size=[56,0]`; they differ only in `chunk_mel_frames` → processing
+chunk size. The shipped tiers are now **560 / 1120 / 2240 ms** (the earlier 320 ms tier was
+dropped, 2240 ms added). The per-language vocab-pruned ship and the full multilingual ship
+score identically (en_us @ 2240 ms = 8.72 % on both), so the table uses the full ship.
+
+| Language | 560 ms | 1120 ms | 2240 ms | NVIDIA ([56,0]) | n |
+|----------|-------:|--------:|--------:|----------------:|----:|
+| en_us | 9.05 | 8.73 | 8.72 | 11.35 | 647 |
+| fr_fr | 9.80 | 9.44 | 9.36 | 13.44 | 676 |
+| de_de | 10.61 | 10.01 | 9.96 | — | 862 |
+| es_419 | 4.85 | 4.75 | 4.73 | 8.69 | 908 |
+| ja_jp | 14.27 | 13.79 | 13.78 | — | 650 |
+| it_it | 5.40 | 5.43 | 5.39 | 7.33 | 865 |
+| pt_br | 6.38 | 6.16 | 6.19 | 8.99 | 919 |
+| **AVG** | **8.62** | **8.33** | **8.30** | | |
+| agg RTFx | 40.5x | 66.0x | 73.1x | | |
+
+WER% for spaced scripts, CER% for ja_jp (segmentation-free, whitespace-stripped). Same
+normalizer pipeline as described above (HF Open-ASR-Leaderboard convention). Aggregate RTFx
+is total audio ÷ total processing across all 7 languages, end-to-end single-stream on Apple
+Silicon (machine/load-dependent; treat the relative ordering, not the absolute values, as meaningful).
+
+**Average accuracy improves monotonically with chunk size** (it_it and pt_br each regress by
+0.03 pp between one pair of adjacent tiers) and meets or beats NVIDIA's published
+`[56,0]` numbers on all five published languages (at 2240 ms: en −2.6, fr −4.1, es −4.0,
+it −1.9, pt −2.8 pp). These numbers are ~2–4 pp better than the prior version of this table;
+the gain comes from model / decode-path / normalizer updates since it was written, **not** from
+the Swift mel port, which is at numerical parity with the removed CoreML preprocessor
+(max |Δ| ≈ 9e-3 vs NeMo PyTorch, confirmed at conversion time). Cross-comparison to NVIDIA is
+sensitive to normalization and should be read as indicative.
+
+Reproduce (one run per tier):
 
-All three builds use `att_context_size=[56,0]` (NVIDIA's lowest-latency mode); they differ only in `chunk_mel_frames` (32 / 56 / 112 → 320 / 560 / 1120 ms processing chunks). NVIDIA's published FLEURS numbers are also at `[56,0]`, so the comparison is architecturally apples-to-apples.
-
-| Language | 320 ms | 560 ms | 1120 ms | NVIDIA ([56,0]) | Δ (1120 vs NVIDIA) | n |
-|----------|-------:|-------:|--------:|----------------:|-------------------:|----:|
-| en_us | 17.5 | 12.1 | 12.0 | 11.35 | +0.65 | 647 |
-| fr_fr | 16.4 | 13.9 | 13.8 | 13.44 | +0.36 | 676 |
-| de_de | 17.8 | 14.9 | 13.6 | — | — | 862 |
-| es_419 | 8.6 | 7.4 | 7.4 | 8.69 | −1.29 | 908 |
-| ja_jp | 21.9 | 18.4 | 17.4 | — | — | 650 |
-| it_it | 9.8 | 7.9 | 7.4 | 7.33 | +0.07 | 865 |
-| pt_br | 13.4 | 10.0 | 8.4 | 8.99 | −0.59 | 919 |
-| **AVG** |**15.0**|**12.1**|**11.4** | | | |
-| RTFx | 8.6 | 16.8 | 22.0 | | | |
-
-WER% for spaced scripts, CER% for ja_jp (segmentation-free). Full `google/fleurs` test splits (en=647, fr=676, de=862, es=908, ja=650, it=865, pt=919). The "Δ (1120 vs NVIDIA)" column compares our highest-accuracy build against NVIDIA's published number for the same `[56,0]` attention mode.
-
-**All 5 published languages are within ~0.7 pp of NVIDIA at 1120 ms.** es-419 and pt-br actually beat the reference (−1.29 and −0.59 pp respectively); en, fr, it are +0.65 / +0.36 / +0.07. At 560 ms (the recommended low-latency build) all 5 are within ~1 pp; es-419 still beats NVIDIA by −1.29 pp.
-
-**320 ms shows boundary effects on English and accent-heavy languages.** en_us jumps from 12.0 → 17.5 (+5.5 pp) and pt_br from 8.4 → 13.4 (+5.0 pp) when dropping from 1120 ms to 320 ms. 560 ms recovers most of the loss (<1.6 pp from 1120 ms on every language). If you need low latency, ship 560 ms; only use 320 ms if you absolutely need sub-half-second response and can tolerate the English regression.
+```bash
+swift run -c release fluidaudiocli nemotron-multilingual-benchmark \
+  --model-dir \
+  --languages en_us,fr_fr,de_de,es_419,ja_jp,it_it,pt_br \
+  --samples all --chunk-ms <560|1120|2240> --output results.json
+```
 
 ### Caveats
 
 - **`MLComputeUnits` matters a lot.** Default `.all` routes the int8 encoder to GPU and runs ~10× slower than ANE. The manager pins `.cpuAndNeuralEngine` automatically; do not override unless you have a reason.
 - **int8 vs fp16 is a wash.** Average WER is identical at all three chunk sizes; per-language drift is within ±1 pp. Ship int8 for the 50% size win and ANE residency.
-- **Two independent latency axes.** NVIDIA's published modes (`att_context_size = [56,0] / [56,3] / [56,6] / [56,13]` → ~80 / 320 / 560 / 1120 ms architectural lookahead) control right-context inside the encoder. Our `320 / 560 / 1120 ms` build labels refer to `chunk_mel_frames` (processing chunk size), not lookahead. All FluidAudio builds currently ship `[56,0]` (no lookahead).
+- **Two independent latency axes.** NVIDIA's published modes (`att_context_size = [56,0] / [56,3] / [56,6] / [56,13]` → ~80 / 320 / 560 / 1120 ms architectural lookahead) control right-context inside the encoder. Our `560 / 1120 / 2240 ms` build labels refer to `chunk_mel_frames` (processing chunk size), not lookahead. All FluidAudio builds currently ship `[56,0]` (no lookahead).
 - **CJK languages** use character-level edit rate as the "WER" field by convention; whitespace tokenization is meaningless for ja/ko/zh/th.
 - **Punctuation density drops at small chunk sizes** ([#687](https://github.com/FluidInference/FluidAudio/issues/687)). On long continuous speech the 560 ms build starts punctuating normally, then commas/periods become increasingly sparse as the session continues; 1120 ms and 2240 ms retain noticeably more punctuation on the same audio, and a session reset restores it. The words themselves are unaffected (WER-neutral) — only punctuation marks thin out.
Cause is model-side: shorter chunks give the encoder less right context at sentence boundaries than the published builds' `att_context_size` assumes, and greedy RNN-T decoding compounds the miss over the session. If punctuation matters for your use case, ship 1120 ms or larger, or segment long streams (e.g. reset on VAD silence).
diff --git a/Documentation/Benchmarks.md b/Documentation/Benchmarks.md
index 2a3d07ed..8383ff09 100644
--- a/Documentation/Benchmarks.md
+++ b/Documentation/Benchmarks.md
@@ -117,6 +117,38 @@ swift run -c release fluidaudiocli unified-benchmark --mode streaming --max-file
 swift run -c release fluidaudiocli unified-benchmark --mode batch --precision fp16
 ```
 
+## Nemotron Speech Streaming 0.6B (English)
+
+Cache-aware FastConformer-RNNT streaming, English. Mel features are computed **natively in
+Swift** (`NemotronMelExtractor` → `AudioMelSpectrogram`, NeMo `normalize: NA` raw log-mel);
+there is no CoreML preprocessor stage. It was removed in the issue #739 fix: the preprocessor's
+flexible `RangeDim` audio input was the source of the `ios17.slice_by_index: zero shape error`
+("Skipped adding default_function to entry point: main") ANE warning behind the iPadOS
+cold-start empty-transcript failure. Encoder int8 on ANE (`.cpuAndNeuralEngine`).
+
+Model: [FluidInference/nemotron-speech-streaming-en-0.6b-coreml](https://huggingface.co/FluidInference/nemotron-speech-streaming-en-0.6b-coreml)
+
+### LibriSpeech test-clean (2620 files, 53,120 words, ~5.4h audio)
+
+| Chunk tier | Aggregate WER | RTFx | Errors / words |
+|------------|---------------|------|----------------|
+| 560 ms (lowest latency) | 2.71% | 40.7x | 1442 / 53120 |
+| 1120 ms (trained chunk) | **2.58%** | 24.3x | 1369 / 53120 |
+| 2240 ms (default) | 2.64% | 87.4x | 1403 / 53120 |
+
+- **WER** is aggregate (total errors ÷ total words across all 2620 files).
+- **RTFx** is end-to-end single-stream (Swift mel + int8 ANE encode + greedy RNN-T), release
+  build, Apple Silicon; absolute RTFx is machine/load-dependent, relative ordering is stable.
+- Accuracy is essentially flat across tiers (2.58–2.71%). 1120 ms has the best WER but lowest
+  throughput; 2240 ms (default) is the throughput sweet spot, within ~0.06 pp of the best WER.
+- Parity: `NemotronMelExtractor` matches NeMo PyTorch raw log-mel to max |Δ| ≈ 9e-3; the WER
+  here confirms end-to-end correctness (a wrong mel front-end would collapse WER).
+- Multilingual FLEURS results: see [NemotronMultilingual.md](ASR/NemotronMultilingual.md).
+
+```bash
+swift run -c release fluidaudiocli nemotron-benchmark --subset test-clean --chunk <560|1120|2240>
+```
+
 ## Transcription with Keyword Boosting
 
 CTC-based custom vocabulary boosting system, which enables accurate recognition of domain-specific terms (company names, technical jargon, proper nouns) without retraining the ASR model.
diff --git a/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift b/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift
index b33b2385..d0d99c7f 100644
--- a/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift
+++ b/Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift
@@ -227,6 +227,7 @@ public enum ASRError: Error, LocalizedError {
     case unsupportedPlatform(String)
     case streamingConversionFailed(Error)
     case fileAccessFailed(URL, Error)
+    case encoderInstantiationFailed(String)
 
     public var errorDescription: String?
{ switch self { @@ -246,6 +247,8 @@ public enum ASRError: Error, LocalizedError { return "Streaming audio conversion failed: \(error.localizedDescription)" case .fileAccessFailed(let url, let error): return "Failed to access audio file at \(url.path): \(error.localizedDescription)" + case .encoderInstantiationFailed(let message): + return "Encoder ANE program failed to instantiate: \(message)" } } } diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/NemotronMelExtractor.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/NemotronMelExtractor.swift new file mode 100644 index 00000000..0aae3c2f --- /dev/null +++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/NemotronMelExtractor.swift @@ -0,0 +1,68 @@ +@preconcurrency import CoreML +import Foundation + +/// Native-Swift log-mel features for Nemotron streaming, a drop-in replacement +/// for the CoreML `preprocessor` model. +/// +/// Reproduces NeMo's `AudioToMelSpectrogramPreprocessor` as configured in +/// `nvidia/nemotron-speech-streaming-en-0.6b`: `n_fft=512`, +/// `window_size=0.025` (400), `window_stride=0.01` (160), `features=128`, +/// `window=hann` (symmetric), `preemph=0.97`, `log` (`2^-24` additive guard), +/// and crucially **`normalize: NA`** — i.e. *no* per-feature standardization. +/// That is exactly `AudioMelSpectrogram`'s default front-end with the +/// normalization step omitted, unlike `UnifiedMelExtractor` whose model uses +/// `normalize: per_feature`. +/// +/// Removing the CoreML preprocessor avoids its flexible-shape `RangeDim` audio +/// input, whose ANE `default_function` was built against a 1-sample lower bound +/// and raised `ios17.slice_by_index: zero shape error` on iPadOS cold starts +/// (issue #739). 
+struct NemotronMelExtractor { + private let mel: AudioMelSpectrogram + private let nMels: Int + private let hopLength = 160 + + init(nMels: Int = 128) { + self.nMels = nMels + self.mel = AudioMelSpectrogram( + sampleRate: 16000, + nMels: nMels, + nFFT: 512, + hopLength: 160, + winLength: 400, + preemph: 0.97, + padTo: 0, + windowPeriodic: false + ) + } + + /// Raw (unnormalized) log-mel for one chunk of audio, shaped + /// `[1, nMels, T]` with `T = floor((count + n_fft - win) / hop) + 1` (NeMo + /// center padding) — the same `mel` tensor the CoreML preprocessor produced, + /// frame for frame. `mel_length` was ignored by the pipeline (it sets the + /// encoder's `mel_length` to `config.totalMelFrames`), so it is not returned. + func melSpectrogram(samples: [Float]) throws -> MLMultiArray { + let result = mel.computeFlatTransposed( + audio: samples, + lastAudioSample: 0, + paddingMode: .center, + expectedFrameCount: nil + ) + let flat = result.mel + let totalFrames = result.numFrames + + let melArray = try MLMultiArray( + shape: [1, NSNumber(value: nMels), NSNumber(value: totalFrames)], dataType: .float32) + melArray.withUnsafeMutableBufferPointer(ofType: Float.self) { ptr, _ in + // Contiguous [1, nMels, T]: element [0, m, t] is at offset m*T + t, + // sourced from the time-major flat buffer at t*nMels + m. + for t in 0.. mel spectrogram - let audioArray = try createAudioArray(samples) - let audioLen = try MLMultiArray(shape: [1], dataType: .int32) - audioLen[0] = NSNumber(value: samples.count) - - let preprocInput = try MLDictionaryFeatureProvider(dictionary: [ - "audio": MLFeatureValue(multiArray: audioArray), - "audio_length": MLFeatureValue(multiArray: audioLen), - ]) - - let preprocOutput = try await preprocessor.prediction(from: preprocInput) - guard let chunkMel = preprocOutput.featureValue(for: "mel")?.multiArrayValue else { - throw ASRError.processingFailed("Preprocessor failed to produce mel output") - } + // 1. 
Native-Swift log-mel front-end (replaces the CoreML preprocessor): + // audio -> raw (unnormalized) log-mel [1, melFeatures, T]. + let chunkMel = try melExtractor.melSpectrogram(samples: samples) // 2. Build encoder input: prepend mel_cache (9 frames) + current chunk mel let inputMel = try prependMelCache(to: chunkMel) @@ -198,15 +187,70 @@ extension StreamingNemotronAsrManager { processedChunks += 1 } - // MARK: - Tensor Utilities + // MARK: - Encoder Health Probe + + /// Run one encoder prediction with a non-zero mel probe and report whether + /// the encoder produced any non-zero output. + /// + /// On iPadOS cold starts the int8 encoder's ANE `main` entry point can fail + /// to instantiate (logged by CoreML as + /// `ANEProgramProcessRequestDirect() Failed with status=0x12`). When that + /// happens `prediction` does not throw — it silently returns an all-zero + /// `encoded` buffer, so the RNN-T loop only ever sees blanks and the final + /// transcript is empty with no error surfaced (issue #739). A single + /// non-zero probe distinguishes a working encoder (LayerNorm/bias guarantee + /// non-zero output for non-zero input) from a stillborn ANE program, letting + /// `loadModels` fail loudly instead of returning empty transcripts. + /// + /// Uses throwaway local inputs and does not write the encoder's updated + /// caches back, so the freshly reset session state is left untouched. The + /// probe doubles as a model warm-up. + internal func encoderProducesNonZeroOutput() async throws -> Bool { + guard let encoder = encoder, + let cacheChannel = cacheChannel, + let cacheTime = cacheTime, + let cacheLen = cacheLen + else { + throw ASRError.notInitialized + } + + // Non-zero mel input ([1, melFeatures, totalMelFrames]) so a healthy + // encoder is guaranteed to emit non-zero output. A small ramp avoids a + // degenerate constant that could in theory cancel out. 
+ let mel = try MLMultiArray( + shape: [1, NSNumber(value: config.melFeatures), NSNumber(value: config.totalMelFrames)], + dataType: .float32 + ) + let melPtr = mel.dataPointer.bindMemory(to: Float.self, capacity: mel.count) + for i in 0.. MLMultiArray { - let array = try MLMultiArray(shape: [1, NSNumber(value: samples.count)], dataType: .float32) - let ptr = array.dataPointer.bindMemory(to: Float.self, capacity: samples.count) - ptr.update(from: samples, count: samples.count) - return array + let melLen = try MLMultiArray(shape: [1], dataType: .int32) + melLen[0] = NSNumber(value: config.totalMelFrames) + + let encoderInput = try MLDictionaryFeatureProvider(dictionary: [ + "mel": MLFeatureValue(multiArray: mel), + "mel_length": MLFeatureValue(multiArray: melLen), + "cache_channel": MLFeatureValue(multiArray: cacheChannel), + "cache_time": MLFeatureValue(multiArray: cacheTime), + "cache_len": MLFeatureValue(multiArray: cacheLen), + ]) + + let encoderOutput = try await encoder.prediction(from: encoderInput) + guard let encoded = encoderOutput.featureValue(for: "encoded")?.multiArrayValue else { + throw ASRError.processingFailed("Encoder probe produced no `encoded` output") + } + + let outPtr = encoded.dataPointer.bindMemory(to: Float.self, capacity: encoded.count) + for i in 0.. MLMultiArray { // Prepend cached mel frames (9) to current chunk mel (112) → [1, 128, 121] // Input: chunkMel [1, 128, ~112] diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager.swift index 9aee93bb..6d02f972 100644 --- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager.swift +++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronAsrManager.swift @@ -11,7 +11,10 @@ public actor StreamingNemotronAsrManager { private let logger = AppLogger(category: "NemotronStreaming") // Models - internal var preprocessor: MLModel? 
+ /// Native-Swift log-mel front-end, replacing the CoreML `preprocessor` + /// model. See `NemotronMelExtractor` (and issue #739) for why the CoreML + /// preprocessor was removed. + internal var melExtractor: NemotronMelExtractor? internal var encoder: MLModel? internal var decoder: MLModel? internal var joint: MLModel? @@ -89,7 +92,8 @@ public actor StreamingNemotronAsrManager { self.partialCallback = callback } - /// Load models from a directory containing preprocessor, encoder, decoder, joint, and tokenizer + /// Load models from a directory containing encoder, decoder, joint, and tokenizer + /// (the mel front-end is computed natively in Swift; no CoreML preprocessor) /// - Parameter directory: Directory containing the model files public func loadModels(from directory: URL) async throws { guard SystemInfo.isAppleSilicon else { @@ -106,9 +110,8 @@ public actor StreamingNemotronAsrManager { logger.info("Loaded config: \(config.chunkMs)ms chunks, \(config.chunkMelFrames) mel frames") } - // Load preprocessor - let preprocessorPath = directory.appendingPathComponent(ModelNames.NemotronStreaming.preprocessorFile) - self.preprocessor = try await MLModel.load(contentsOf: preprocessorPath, configuration: mlConfiguration) + // Native-Swift log-mel front-end (replaces the CoreML preprocessor). + self.melExtractor = NemotronMelExtractor(nMels: config.melFeatures) // Load encoder (int8 quantized) let encoderPath = directory.appendingPathComponent("encoder").appendingPathComponent(NemotronEncoder.fileName) @@ -137,6 +140,23 @@ public actor StreamingNemotronAsrManager { // Initialize states try resetStates() + // Fail loudly if the encoder's ANE program failed to instantiate (issue + // #739): on iPadOS cold starts the int8 encoder can silently return an + // all-zero buffer, yielding an empty transcript with no error thrown. + // Probe once with non-zero input; throw a clear error instead of letting + // every transcript come back empty. The probe also warms the encoder. 
+ let encoderHealthy = try await encoderProducesNonZeroOutput() + guard encoderHealthy else { + throw ASRError.encoderInstantiationFailed( + "Nemotron int8 encoder returned all-zero output on a load-time probe — the ANE " + + "program did not instantiate (see CoreML ANEProgramProcessRequestDirect " + + "status=0x12). This is the iPadOS cold-start failure in issue #739; " + + "re-download or update the encoder model." + ) + } + // Probe used throwaway inputs; restore clean session state. + try resetStates() + logger.info("Nemotron models loaded successfully (\(config.chunkMs)ms chunks).") } @@ -198,7 +218,7 @@ public actor StreamingNemotronAsrManager { public func cleanup() async { await reset() - preprocessor = nil + melExtractor = nil encoder = nil decoder = nil joint = nil @@ -254,7 +274,7 @@ public actor StreamingNemotronAsrManager { /// Process audio and return partial transcript public func process(audioBuffer: AVAudioPCMBuffer) async throws -> String { // Check if models are loaded - guard preprocessor != nil, encoder != nil, decoder != nil, joint != nil else { + guard melExtractor != nil, encoder != nil, decoder != nil, joint != nil else { throw ASRError.notInitialized } @@ -277,7 +297,7 @@ public actor StreamingNemotronAsrManager { public func finish() async throws -> String { // Check if models are loaded guard let tokenizer = tokenizer, - preprocessor != nil, + melExtractor != nil, encoder != nil, decoder != nil, joint != nil @@ -344,7 +364,7 @@ extension StreamingNemotronAsrManager: StreamingAsrManager { } public func processBufferedAudio() async throws { - guard preprocessor != nil, encoder != nil, decoder != nil, joint != nil else { + guard melExtractor != nil, encoder != nil, decoder != nil, joint != nil else { throw ASRError.notInitialized } diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Buffers.swift 
b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Buffers.swift index 4c1ce436..ad964b50 100644 --- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Buffers.swift +++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Buffers.swift @@ -9,48 +9,16 @@ extension StreamingNemotronMultilingualAsrManager { // MARK: - Tensor Utilities (duplicated from the English pipeline so the // two managers stay independent; the math is small and self-contained). - internal func createAudioArray(_ samples: [Float]) throws -> MLMultiArray { - let array = try MLMultiArray(shape: [1, NSNumber(value: samples.count)], dataType: .float32) - let ptr = array.dataPointer.bindMemory(to: Float.self, capacity: samples.count) - ptr.update(from: samples, count: samples.count) - return array - } - - /// Nonisolated helper for async pipelining — runs the preprocessor on a - /// chunk of samples without touching actor state. Sendable inputs only. - /// Reuses caller-provided `audioInputBuf` / `audioLenBuf` when supplied - /// and shape-compatible; otherwise falls back to fresh allocation. + /// Nonisolated helper for async pipelining — computes the native-Swift + /// log-mel for a chunk without touching actor state. The caller passes a + /// dedicated `melExtractor` instance (distinct from the on-actor one) so + /// the extractor's non-thread-safe FFT scratch buffers are never shared + /// across the concurrent prefetch boundary. nonisolated internal static func runPreprocessorPure( samples: [Float], - preprocessor: MLModel, - audioInputBuf: MLMultiArray? = nil, - audioLenBuf: MLMultiArray? = nil - ) async throws -> MLMultiArray? 
{ - let array: MLMultiArray - let audioLen: MLMultiArray - if let buf = audioInputBuf, - buf.shape[1].intValue == samples.count, - let lenBuf = audioLenBuf - { - let ptr = buf.dataPointer.bindMemory(to: Float.self, capacity: samples.count) - ptr.update(from: samples, count: samples.count) - lenBuf[0] = NSNumber(value: samples.count) - array = buf - audioLen = lenBuf - } else { - array = try MLMultiArray(shape: [1, NSNumber(value: samples.count)], dataType: .float32) - let ptr = array.dataPointer.bindMemory(to: Float.self, capacity: samples.count) - ptr.update(from: samples, count: samples.count) - audioLen = try MLMultiArray(shape: [1], dataType: .int32) - audioLen[0] = NSNumber(value: samples.count) - } - - let input = try MLDictionaryFeatureProvider(dictionary: [ - "audio": MLFeatureValue(multiArray: array), - "audio_length": MLFeatureValue(multiArray: audioLen), - ]) - let output = try await preprocessor.prediction(from: input) - return output.featureValue(for: "mel")?.multiArrayValue + melExtractor: NemotronMelExtractor + ) throws -> MLMultiArray? { + return try melExtractor.melSpectrogram(samples: samples) } /// Triple-stage pipeline helper: runs preprocessor[t+1] + encoder[t+1] in @@ -104,10 +72,8 @@ extension StreamingNemotronMultilingualAsrManager { totalMelFrames: Int, melFeatures: Int, preEncodeCache: Int, - preprocessor: MLModel, - encoder: MLModel, - audioInputBuf: MLMultiArray? = nil, - audioLenBuf: MLMultiArray? 
= nil + melExtractor: NemotronMelExtractor, + encoder: MLModel ) async throws -> ( encoded: MLMultiArray, encoderProj: MLMultiArray?, @@ -125,11 +91,9 @@ extension StreamingNemotronMultilingualAsrManager { return nil } guard - let chunkMel = try await runPreprocessorPure( + let chunkMel = try runPreprocessorPure( samples: samples, - preprocessor: preprocessor, - audioInputBuf: audioInputBuf, - audioLenBuf: audioLenBuf + melExtractor: melExtractor ) else { return nil diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Pipeline.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Pipeline.swift index 1e15bf75..132a9887 100644 --- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Pipeline.swift +++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Pipeline.swift @@ -58,7 +58,7 @@ extension StreamingNemotronMultilingualAsrManager { internal func processChunk(_ samples: [Float], nextChunkSamples: [Float]? = nil) async throws { // decoder/joint are optional (lean B1 ships omit them) — bound // locally at the sites that need them (smart-spec, unfused fallback). - guard let preprocessor = preprocessor, + guard let melExtractor = melExtractor, let encoder = encoder, let cacheChannel = cacheChannel, let cacheTime = cacheTime, @@ -130,35 +130,8 @@ extension StreamingNemotronMultilingualAsrManager { chunkMel = prefetched self.prefetchedMel = nil } else { - // Reuse pre-allocated audio buffer when sized correctly - // (chunkSamples == config.chunkSamples for normal chunks). - // Final chunk may be shorter (padded) so falls back to fresh - // alloc to match the actual sample count. 
- let audioArray: MLMultiArray - let audioLen: MLMultiArray - if let buf = audioInputBuf, - buf.shape[1].intValue == samples.count, - let lenBuf = audioLenBuf - { - let ptr = buf.dataPointer.bindMemory(to: Float.self, capacity: samples.count) - ptr.update(from: samples, count: samples.count) - lenBuf[0] = NSNumber(value: samples.count) - audioArray = buf - audioLen = lenBuf - } else { - audioArray = try createAudioArray(samples) - audioLen = try MLMultiArray(shape: [1], dataType: .int32) - audioLen[0] = NSNumber(value: samples.count) - } - let preprocInput = try MLDictionaryFeatureProvider(dictionary: [ - "audio": MLFeatureValue(multiArray: audioArray), - "audio_length": MLFeatureValue(multiArray: audioLen), - ]) - let preprocOutput = try await preprocessor.prediction(from: preprocInput) - guard let mel = preprocOutput.featureValue(for: "mel")?.multiArrayValue else { - throw ASRError.processingFailed("Preprocessor failed to produce mel output") - } - chunkMel = mel + // Native-Swift log-mel (replaces the CoreML preprocessor). + chunkMel = try melExtractor.melSpectrogram(samples: samples) } self.prepNanos &+= DispatchTime.now().uptimeNanoseconds &- prepStart @@ -243,8 +216,9 @@ extension StreamingNemotronMultilingualAsrManager { nonisolated(unsafe) let snapshotCacheTime = self.cacheTime nonisolated(unsafe) let snapshotCacheLen = self.cacheLen nonisolated(unsafe) let snapshotMelCache = self.melCache - nonisolated(unsafe) let snapshotAudioBuf = self.audioInputBuf - nonisolated(unsafe) let snapshotAudioLenBuf = self.audioLenBuf + // Dedicated prefetch extractor instance — never shared with the + // on-actor `melExtractor`, so its FFT scratch buffers can't race. 
+ nonisolated(unsafe) let snapshotMelExtractor = self.prefetchMelExtractor let snapshotPromptId = currentPromptIdValue() let snapshotTotalMelFrames = config.totalMelFrames let snapshotMelFeatures = config.melFeatures @@ -267,7 +241,8 @@ extension StreamingNemotronMultilingualAsrManager { !tripleStageDisabled, let ch = snapshotCacheChannel, let ti = snapshotCacheTime, - let ln = snapshotCacheLen + let ln = snapshotCacheLen, + let mex = snapshotMelExtractor else { return nil } return try await Self.runPrepAndEncoderPure( samples: next, @@ -279,10 +254,8 @@ extension StreamingNemotronMultilingualAsrManager { totalMelFrames: snapshotTotalMelFrames, melFeatures: snapshotMelFeatures, preEncodeCache: snapshotPreEncodeCache, - preprocessor: preprocessor, - encoder: encoder, - audioInputBuf: snapshotAudioBuf, - audioLenBuf: snapshotAudioLenBuf + melExtractor: mex, + encoder: encoder ) }() diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Shared.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Shared.swift index ed946138..121d3cf7 100644 --- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Shared.swift +++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager+Shared.swift @@ -13,7 +13,8 @@ import Foundation /// melCache, prediction output backings) stays inside the manager /// actor. public struct SharedNemotronMultilingualModels: Sendable { - public let preprocessor: MLModel + // The mel front-end is computed natively in Swift per manager + // (`NemotronMelExtractor`); no CoreML preprocessor is shared. public let encoder: MLModel /// Bare prediction LSTM. 
Optional: a lean ship may omit it when B1 /// (`decoderJoint`) covers the standard path and no smart-spec (K=4) @@ -40,7 +41,6 @@ public struct SharedNemotronMultilingualModels: Sendable { public let mlConfiguration: MLModelConfiguration fileprivate init( - preprocessor: MLModel, encoder: MLModel, decoder: MLModel?, joint: MLModel?, @@ -53,7 +53,6 @@ public struct SharedNemotronMultilingualModels: Sendable { tokenizer: NemotronMultilingualTokenizer, mlConfiguration: MLModelConfiguration ) { - self.preprocessor = preprocessor self.encoder = encoder self.decoder = decoder self.joint = joint @@ -106,13 +105,6 @@ extension StreamingNemotronMultilingualAsrManager { "Loaded multilingual config: \(config.chunkMs)ms chunks, vocab=\(config.vocabSize), \(config.numPrompts) prompts" ) - let preprocessor = try await Self.loadShared( - directory: directory, - compiledName: ModelNames.NemotronMultilingualStreaming.preprocessorFile, - packageName: ModelNames.NemotronMultilingualStreaming.preprocessorPackage, - configuration: mlConfiguration - ) - let encoder = try await Self.loadShared( directory: directory, compiledName: ModelNames.NemotronMultilingualStreaming.encoderFile, @@ -225,7 +217,6 @@ extension StreamingNemotronMultilingualAsrManager { logger.info("Shared models preload complete — ready for N consumers") return SharedNemotronMultilingualModels( - preprocessor: preprocessor, encoder: encoder, decoder: decoder, joint: joint, @@ -254,8 +245,12 @@ extension StreamingNemotronMultilingualAsrManager { self.lastToken = Int32(config.blankIdx) self.currentPromptId = Int32(config.defaultPromptId) + // Each manager builds its own (non-thread-safe) mel extractors; only + // the heavyweight MLModel handles are shared across streams. 
+ self.melExtractor = NemotronMelExtractor(nMels: shared.config.melFeatures) + self.prefetchMelExtractor = NemotronMelExtractor(nMels: shared.config.melFeatures) + // Adopt shared MLModel references - self.preprocessor = shared.preprocessor self.encoder = shared.encoder self.decoder = shared.decoder self.joint = shared.joint diff --git a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager.swift b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager.swift index c2871c76..54e3608e 100644 --- a/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager.swift +++ b/Sources/FluidAudio/ASR/Parakeet/Streaming/Nemotron/StreamingNemotronMultilingualAsrManager.swift @@ -25,7 +25,12 @@ public actor StreamingNemotronMultilingualAsrManager { internal let logger = AppLogger(category: "NemotronMultilingualStreaming") // Models - internal var preprocessor: MLModel? + /// Native-Swift log-mel front-end (replaces the CoreML preprocessor; see + /// `NemotronMelExtractor` and issue #739). Two instances so the on-actor + /// inline path and the concurrent triple-stage prefetch task never share + /// the extractor's (non-thread-safe) FFT scratch buffers. + internal var melExtractor: NemotronMelExtractor? + internal var prefetchMelExtractor: NemotronMelExtractor? internal var encoder: MLModel? internal var decoder: MLModel? internal var joint: MLModel? @@ -173,8 +178,6 @@ public actor StreamingNemotronMultilingualAsrManager { /// (~25-800 allocs/h depending on chunk size). Pre-allocated once, /// refilled in place. Triple-stage helpers are sequential (await before /// next dispatch) so a single shared buffer is safe. - internal var audioInputBuf: MLMultiArray? - internal var audioLenBuf: MLMultiArray? // Triple-stage pipelining: encoder[t+1] dispatched concurrent with // decode[t]. 
These are the prefetched encoder outputs (and the caches @@ -331,17 +334,11 @@ public actor StreamingNemotronMultilingualAsrManager { "Loaded multilingual config: \(config.chunkMs)ms chunks, vocab=\(config.vocabSize), \(config.numPrompts) prompts, default=\(config.defaultPromptId)" ) - // Load model bundles (prefer .mlmodelc, fall back to .mlpackage with on-demand compile) - let preprocessorURL = try await locateModelBundle( - in: directory, - compiled: ModelNames.NemotronMultilingualStreaming.preprocessorFile, - uncompiled: ModelNames.NemotronMultilingualStreaming.preprocessorPackage - ) - self.preprocessor = try await MLModel.load( - contentsOf: preprocessorURL, - configuration: Self.computeUnitOverride( - name: "FLUIDAUDIO_PREPROCESSOR_CU", base: mlConfiguration, logger: logger) - ) + // Native-Swift log-mel front-end replaces the CoreML preprocessor + // (Nemotron uses `normalize: NA` — raw log-mel — same as the English + // variant). Two instances: inline path + concurrent prefetch task. + self.melExtractor = NemotronMelExtractor(nMels: config.melFeatures) + self.prefetchMelExtractor = NemotronMelExtractor(nMels: config.melFeatures) let encoderURL = try await locateModelBundle( in: directory, @@ -525,16 +522,6 @@ public actor StreamingNemotronMultilingualAsrManager { self.tokenLenBuf = tokLen } - // Reusable preprocessor input buffers — [1, chunkSamples] float32 - // audio + [1] int32 length. Refilled by triple-stage helper. - if let audBuf = try? MLMultiArray(shape: [1, NSNumber(value: config.chunkSamples)], dataType: .float32) { - self.audioInputBuf = audBuf - } - if let audLen = try? MLMultiArray(shape: [1], dataType: .int32) { - audLen[0] = NSNumber(value: config.chunkSamples) - self.audioLenBuf = audLen - } - // First-chunk warm-up: dispatch one zero-input prediction per model so // the ANE program is compiled + resident before the first real // chunk. 
Cuts ~10-20ms off every clip's first chunk (which can't @@ -555,8 +542,7 @@ public actor StreamingNemotronMultilingualAsrManager { // start — warm them unconditionally. Bare decoder/joint are optional on // lean B1 ships; requiring them here skipped ALL warmup (incl. encoder) // on those ships. They're bound only in the unfused branch below. - guard let preprocessor = preprocessor, - let encoder = encoder, + guard let encoder = encoder, let cacheChannel = cacheChannel, let cacheTime = cacheTime, let cacheLen = cacheLen, @@ -564,20 +550,7 @@ public actor StreamingNemotronMultilingualAsrManager { let cState = cState else { return } - // Preprocessor: 1s of silence - if let audio = try? MLMultiArray(shape: [1, 16000], dataType: .float32), - let audioLen = try? MLMultiArray(shape: [1], dataType: .int32) - { - audio.reset(to: 0) - audioLen[0] = 16000 - let input = try? MLDictionaryFeatureProvider(dictionary: [ - "audio": MLFeatureValue(multiArray: audio), - "audio_length": MLFeatureValue(multiArray: audioLen), - ]) - if let input = input { - _ = try? await preprocessor.prediction(from: input) - } - } + // (No CoreML preprocessor to warm — mel is computed natively in Swift.) // Encoder: zero mel + zeros caches if let mel = try? 
MLMultiArray( @@ -871,7 +844,8 @@ public actor StreamingNemotronMultilingualAsrManager { public func cleanup() async { await reset() - preprocessor = nil + melExtractor = nil + prefetchMelExtractor = nil encoder = nil decoder = nil joint = nil @@ -954,7 +928,7 @@ public actor StreamingNemotronMultilingualAsrManager { let hasDecodePath = decoderJoint != nil || decoderJointNoEncProj != nil || decoderJointArgmax != nil || (decoder != nil && joint != nil) - guard preprocessor != nil, encoder != nil, hasDecodePath else { + guard melExtractor != nil, encoder != nil, hasDecodePath else { throw ASRError.notInitialized } @@ -999,7 +973,7 @@ public actor StreamingNemotronMultilingualAsrManager { decoderJoint != nil || decoderJointNoEncProj != nil || decoderJointArgmax != nil || (decoder != nil && joint != nil) guard let tokenizer = tokenizer, - preprocessor != nil, + melExtractor != nil, encoder != nil, hasDecodePath else { diff --git a/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift b/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift index 296c0362..bbda2029 100644 --- a/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift +++ b/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift @@ -123,9 +123,10 @@ struct KMeansClustering { best = result } } - return best ?? clusterWithCentroids( - embeddings: embeddings, numClusters: numClusters, - maxIterations: maxIterations, seed: baseSeed) + return best + ?? 
clusterWithCentroids( + embeddings: embeddings, numClusters: numClusters, + maxIterations: maxIterations, seed: baseSeed) } private static func normalizeEmbeddings(_ embeddings: [[Double]]) -> [[Double]] { diff --git a/Tests/FluidAudioTests/ASR/Parakeet/Streaming/StreamingNemotronAsrManagerTests.swift b/Tests/FluidAudioTests/ASR/Parakeet/Streaming/StreamingNemotronAsrManagerTests.swift index 2b3e77b4..1e58b488 100644 --- a/Tests/FluidAudioTests/ASR/Parakeet/Streaming/StreamingNemotronAsrManagerTests.swift +++ b/Tests/FluidAudioTests/ASR/Parakeet/Streaming/StreamingNemotronAsrManagerTests.swift @@ -189,6 +189,66 @@ final class StreamingNemotronAsrManagerTests: XCTestCase { }) } + func testEncoderInstantiationFailedErrorDescription() { + // Issue #739: loadModels throws this when the encoder ANE program does + // not instantiate and the cold-start probe sees all-zero output. + let error = ASRError.encoderInstantiationFailed("probe returned zeros") + let description = error.errorDescription ?? "" + XCTAssertTrue(description.contains("ANE program failed to instantiate")) + XCTAssertTrue(description.contains("probe returned zeros")) + } + + // MARK: - Native Swift Mel Extractor (replaces CoreML preprocessor) + + func testNemotronMelExtractorShapeAndFrameCount() throws { + // One 1120ms chunk @ 16kHz. NeMo center padding => floor(N/hop)+1 frames. + let n = 17920 + let extractor = NemotronMelExtractor(nMels: 128) + let samples = (0..