Summary
On iPadOS 26.5 (tested on Apple M1), StreamingNemotronAsrManager loads "successfully" and accepts audio but decodes no tokens (no partials, empty final transcript) on a cold process start. The identical model files run correctly on macOS. The only way found to get output on iPadOS is to load and run a different FluidAudio CoreML model first (e.g. the TDT SlidingWindowAsrManager), and that priming is consumed by a single session: each new Nemotron session needs it again.
Environment
- Fails (tested): iPad, Apple M1, 8 GB, iPadOS 26.5
- Works: MacBook Pro, Apple M1 Pro, 16 GB, macOS 26.5
- FluidAudio: 0.15.4 · Model: parakeet-nemotron-streaming-0.6b (int8, B1 fused), tiers 560/1120/2240 ms
- Compute units: default .cpuAndNeuralEngine (also repro'd with .all)
Scope (M1 vs M2+)
- Only M1 tested. The macOS box is M1 Pro — the same-generation 16-core ANE (~11 TOPS) as the failing M1 iPad — so an identical-gen ANE runs this model fine under macOS, pointing the failure at the iPadOS 26.5 CoreML/ANE runtime, not ANE silicon or RAM. Newer ANE (M2 ~15.8, M3 ~18, M4 / A17 Pro+ ~35–38 TOPS, a newer design) is untested — this is not an "all iOS" claim.
- No iOS validation documented upstream. The HF card's only benchmark is "Tested on Apple M2 with FluidAudio" (LibriSpeech WER/RTFx — the desktop/CLI path); no iOS/iPadOS run at any tier. The card is also stale: it lists 1120/560/160/80 ms but not the shipped 2240 ms tier (and still lists 160/80 ms, which 0.15.4 drops) — iOS is entirely outside the documented tested envelope.
- RAM differs (8 vs 16 GB), but the workaround (load an extra model first) uses more memory yet fixes it — arguing against OOM. A 16 GB M2/M4 iPad repro would retire both the RAM and ANE-generation variables at once.
- NOTE: the Neural Engine is the same on both devices. M1 and M1 Pro ship the identical 16-core ANE, ~11 TOPS, same microarchitecture. M1 Pro's advantages over M1 are all elsewhere: more CPU/GPU cores, much higher memory bandwidth (~200 vs ~68 GB/s), and higher max RAM. The ANE block is unchanged across M1 / M1 Pro / M1 Max (only M1 Ultra differs: two dies, 32 cores).
Repro
- Fresh launch → loadModels(from:) → Nemotron models loaded successfully.
- Feed audio (process / processBufferedAudio): no throw, no tokens, empty transcript.
- Same process, load and run a TDT SlidingWindowAsrManager first → next Nemotron session works.
- Nemotron-first or Nemotron-after-Nemotron → fails. macOS → always works.
What it is / isn't
Both platforms emit the same compile-time warning (so it's not the cause):
Skipped adding default_function to entry point: main ... PropagateInputTensorShapes failed
when propagating default shape ... ios17.slice_by_index: zero shape error
iOS-only, at ANE runtime (absent on macOS):
ANEProgramProcessRequestDirect() Failed with status=0x12 : statusType=0x9 ... Program Inference error
So the divergence is ANE program instantiation on a cold start, not shape inference or compute units (.cpuAndNeuralEngine/.all/.cpuAndGPU all yield zero output).
NOTE: MLComputePlan resolves the encoder to ANE/CPU identically on working and failing sessions (same device), so the divergence is runtime ANE program instantiation, not the compute plan.
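For anyone who wants to repeat the compute-plan comparison, here is a minimal sketch of how per-operation device preferences can be tallied with Core ML's MLComputePlan (iOS 17.4+ / macOS 14.4+). It is not taken from FluidAudio: modelURL is a hypothetical path to the compiled encoder (.mlmodelc), deviceLabel, tally and preferredDeviceCounts are illustrative helper names, and only the top-level operations of the main function are walked.

```swift
import CoreML
import Foundation

/// Maps a preferred compute device to a short label (hypothetical helper).
@available(macOS 14.4, iOS 17.4, *)
func deviceLabel(_ device: MLComputeDevice?) -> String {
    guard let device else { return "unassigned" }
    switch device {
    case .cpu: return "CPU"
    case .gpu: return "GPU"
    case .neuralEngine: return "ANE"
    default: return "other"
    }
}

/// Counts how often each label occurs.
func tally(_ labels: [String]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for label in labels { counts[label, default: 0] += 1 }
    return counts
}

/// Loads the compute plan for a compiled model and counts the preferred
/// device of each top-level operation in its `main` function.
/// `modelURL` is an assumption: point it at the encoder's .mlmodelc.
@available(macOS 14.4, iOS 17.4, *)
func preferredDeviceCounts(modelURL: URL,
                           computeUnits: MLComputeUnits) async throws -> [String: Int] {
    let configuration = MLModelConfiguration()
    configuration.computeUnits = computeUnits

    let plan = try await MLComputePlan.load(contentsOf: modelURL,
                                            configuration: configuration)
    // Only ML Program models expose functions and operations.
    guard case let .program(program) = plan.modelStructure,
          let main = program.functions["main"] else {
        return [:]
    }
    let labels = main.block.operations.map { operation in
        deviceLabel(plan.deviceUsage(for: operation)?.preferred)
    }
    return tally(labels)
}
```

Running this once in a primed session and once in a cold one, with the same computeUnits, should give identical counts if the plan really is not where the two sessions diverge.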
#609 does not fix this
#609 (cache_len = 1) claims to close #607, but its on-device verification box was left unchecked and the warning still appears in 0.15.4. It seeds cache_len at runtime, while the PropagateInputTensorShapes failure occurs during compile-time default-shape propagation against the model's baked-in shapes, which a runtime seed can't affect. The author flagged the real fix as conversion-side (re-trace the encoder with non-zero cache_len); this is that case, plus the iOS-only zero-output consequence that the macOS-benign warning hides.
Asks
- Conversion-side fix: re-trace so ios17.slice_by_index never has a zero-length default shape, so the ANE main entry point is always built.
- Fail loudly: throw from loadModels/process if the encoder's ANE program can't be instantiated, instead of returning an empty transcript.
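Until the library itself throws, an app can approximate the "fail loudly" behaviour on its side. The sketch below is a stopgap under stated assumptions, not FluidAudio API: EmptyTranscriptError, rmsLevel and requireTranscript are hypothetical names, the samples are assumed to be mono Float PCM in [-1, 1], and the 0.01 RMS silence threshold is an arbitrary placeholder that would need tuning.

```swift
import Foundation

/// Thrown when clearly audible input produced no text (hypothetical type).
struct EmptyTranscriptError: Error {
    let rms: Float
}

/// Root-mean-square level of the samples; 0 for an empty buffer.
func rmsLevel(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Returns the trimmed transcript, or throws if non-silent audio yielded nothing.
/// `silenceRMS` is an assumed threshold, not a value from the report.
func requireTranscript(_ transcript: String,
                       for samples: [Float],
                       silenceRMS: Float = 0.01) throws -> String {
    let trimmed = transcript.trimmingCharacters(in: .whitespacesAndNewlines)
    let level = rmsLevel(samples)
    if trimmed.isEmpty && level > silenceRMS {
        throw EmptyTranscriptError(rms: level)
    }
    return trimmed
}
```

Passing the final transcript and the audio that was fed to the session through requireTranscript turns the silent empty result into an error the app can log or react to, for example by priming and retrying. Loud non-speech audio will also trip it, so this is a diagnostic aid rather than a replacement for the requested library-side check.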
Related
slice_by_index warning; does not prevent this functional failure.