
Nemotron streaming (int8) produces zero output on a cold start on iPadOS 26.5 (Apple M1); works on macOS (M1 Pro) #739

Description

@SpiraMira

Summary

On iPadOS 26.5 (tested on Apple M1), StreamingNemotronAsrManager loads "successfully" and accepts audio, but on a cold process start it decodes no tokens (no partials, empty final transcript). The identical model files run correctly on macOS. The only way to get output on iOS is to first load and run a different FluidAudio CoreML model (e.g. TDT SlidingWindowAsrManager), and that prime only covers the next Nemotron session.

Environment

  • Fails (tested): iPad, Apple M1, 8 GB, iPadOS 26.5
  • Works: MacBook Pro, Apple M1 Pro, 16 GB, macOS 26.5
  • FluidAudio: 0.15.4 · Model: parakeet-nemotron-streaming-0.6b (int8, B1 fused), tiers 560/1120/2240 ms
  • Compute units: default .cpuAndNeuralEngine (also repro'd with .all)

Scope (M1 vs M2+)

  • Only M1 tested. The macOS box is an M1 Pro, which has the same-generation 16-core ANE (~11 TOPS) as the failing M1 iPad. Since an identical-generation ANE runs this model fine under macOS, the failure points at the iPadOS 26.5 CoreML/ANE runtime rather than at ANE silicon or RAM. Newer ANEs (M2 ~15.8 TOPS, M3 ~18 TOPS, and the newer-design M4 / A17 Pro and later at ~35–38 TOPS) are untested, so this is not an "all iOS" claim.
  • No iOS validation documented upstream. The HF card's only benchmark is "Tested on Apple M2 with FluidAudio" (LibriSpeech WER/RTFx — the desktop/CLI path); no iOS/iPadOS run at any tier. The card is also stale: it lists 1120/560/160/80 ms but not the shipped 2240 ms tier (and still lists 160/80 ms, which 0.15.4 drops) — iOS is entirely outside the documented tested envelope.
  • RAM differs (8 vs 16 GB), but the workaround (load an extra model first) uses more memory yet fixes it — arguing against OOM. A 16 GB M2/M4 iPad repro would retire both the RAM and ANE-generation variables at once.
  • NOTE: the Neural Engine is the same on both devices. M1 and M1 Pro ship the identical 16-core ANE (~11 TOPS, same microarchitecture). M1 Pro's advantages over M1 are all elsewhere: more CPU/GPU cores, much higher memory bandwidth (~200 vs ~68 GB/s), and higher max RAM. The ANE block is unchanged across M1 / M1 Pro / M1 Max (only M1 Ultra differs: two dies, 32 cores).

Repro

  1. Fresh launch → loadModels(from:) → logs "Nemotron models loaded successfully."
  2. Feed audio (process / processBufferedAudio) — no throw, no tokens, empty transcript.
  3. Same process, load+run a TDT SlidingWindowAsrManager first → next Nemotron session works.
  4. Nemotron-first or Nemotron-after-Nemotron → fails. macOS → always works.
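The ordering dependence in steps 1–4 is easiest to see when each session in a process is recorded separately. Below is a minimal harness sketch, assuming a hypothetical StreamingTranscriber adapter protocol: the real StreamingNemotronAsrManager and SlidingWindowAsrManager APIs are not reproduced here, so each would need to be wrapped in a conforming type.

```swift
/// Hypothetical adapter protocol (not FluidAudio API). The real managers
/// (StreamingNemotronAsrManager, SlidingWindowAsrManager) have their own
/// signatures; each would be wrapped in a type conforming to this.
protocol StreamingTranscriber {
    var name: String { get }
    func loadModels() throws
    /// Feeds one chunk of mono samples and returns any new partial text.
    func feed(_ samples: [Float]) throws -> String
    /// Flushes the stream and returns the final transcript.
    func finish() throws -> String
}

/// Outcome of one session: did anything decode at all?
struct SessionResult {
    let name: String
    let partialCount: Int
    let finalTranscript: String
    var producedOutput: Bool { partialCount > 0 || !finalTranscript.isEmpty }
}

/// Loads a model, streams the chunks through it, and records what came out.
func runSession(_ asr: any StreamingTranscriber, chunks: [[Float]]) throws -> SessionResult {
    try asr.loadModels()
    var partialCount = 0
    for chunk in chunks {
        let partial = try asr.feed(chunk)
        if !partial.isEmpty { partialCount += 1 }
    }
    let finalText = try asr.finish()
    return SessionResult(name: asr.name, partialCount: partialCount, finalTranscript: finalText)
}

/// Runs sessions back to back in the same process, in the given order,
/// so cold-start vs. primed behaviour shows up per session.
func runScenario(_ sessions: [any StreamingTranscriber], chunks: [[Float]]) throws -> [SessionResult] {
    try sessions.map { try runSession($0, chunks: chunks) }
}
```

On device, scenarios such as [nemotron] (cold), [tdt, nemotron] (primed) and [nemotron, nemotron] could cover steps 1–4, with producedOutput showing exactly which session went silent.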

What it is / isn't

Both platforms emit the same compile-time warning (so it's not the cause):

Skipped adding default_function to entry point: main ... PropagateInputTensorShapes failed
  when propagating default shape ... ios17.slice_by_index: zero shape error

iOS-only, at ANE runtime (absent on macOS):

ANEProgramProcessRequestDirect() Failed with status=0x12 : statusType=0x9 ... Program Inference error

So the divergence is ANE program instantiation on a cold start, not shape inference or compute units (.cpuAndNeuralEngine/.all/.cpuAndGPU all yield zero output).
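Because the compile-time warning shows up on both platforms while the ANE runtime error shows up only on iPadOS, a small log-triage helper can keep the two apart when collecting logs from more devices (e.g. an M2/M4 iPad). This is a minimal sketch; the enum and function names are hypothetical, and it only matches the message fragments quoted in this issue.

```swift
import Foundation

/// Triage categories for the CoreML/ANE log lines quoted in this issue.
enum ANELogFinding: Equatable {
    /// Compile-time default-shape warning; seen on both macOS and iPadOS.
    case defaultShapePropagationWarning
    /// Runtime ANE inference failure; seen only on iPadOS in this report.
    case aneProgramInferenceError(status: String?)
}

/// Scans captured log text line by line for the two quoted signatures.
func classifyANELog(_ log: String) -> [ANELogFinding] {
    var findings: [ANELogFinding] = []
    for line in log.split(separator: "\n") {
        if line.contains("PropagateInputTensorShapes failed") {
            findings.append(.defaultShapePropagationWarning)
        } else if line.contains("ANEProgramProcessRequestDirect() Failed") {
            // Extract the value of the "status=0x.." token, if present.
            let status = line.split(separator: " ")
                .first { $0.hasPrefix("status=") }
                .map { String($0.dropFirst("status=".count)) }
            findings.append(.aneProgramInferenceError(status: status))
        }
    }
    return findings
}
```

A log containing only .defaultShapePropagationWarning matches the macOS (working) pattern; any .aneProgramInferenceError matches the iPadOS failure described here.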

NOTE: MLComputePlan resolves the encoder to ANE/CPU identically on working and failing sessions (same device), so the divergence is runtime ANE program instantiation, not the compute plan.

#609 does not fix this

#609 (cache_len = 1) claims to close #607, but its on-device verification box was left unchecked and the warning still appears in 0.15.4. It seeds cache_len at runtime, whereas the failure is compile-time default-shape propagation against the model's baked-in shapes, which a runtime seed can't affect. The author flagged the real fix as conversion-side (re-trace the encoder with a non-zero cache_len). This issue is that case, plus an iOS-only consequence (zero output) that the warning, benign on macOS, hides.
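The compile-time vs. runtime distinction can be made concrete with simplified shape arithmetic: if a cache-length dimension is traced as 0, a slice that ends at that length has length 0 in the default shape, whatever value is seeded later. This is a minimal sketch under stated assumptions: the helper names, input names and sizes are hypothetical, and the slice formula ignores the masks and negative indices that ios17.slice_by_index also supports.

```swift
/// Simplified length of a 1-D slice [begin, end) with a positive stride.
/// (ios17.slice_by_index also supports masks and negative indices; not modelled.)
func sliceLength(begin: Int, end: Int, stride: Int = 1) -> Int {
    precondition(stride > 0, "only positive strides are modelled")
    return max(0, (end - begin + stride - 1) / stride)
}

/// Names of inputs whose default (traced) shape contains a zero-length dimension.
/// In an app the shapes could come from each input's
/// MLFeatureDescription.multiArrayConstraint?.shape; here they are plain arrays.
func inputsWithZeroLengthDefaultShape(_ defaultShapes: [String: [Int]]) -> [String] {
    defaultShapes
        .filter { $0.value.contains(0) }
        .map { $0.key }
        .sorted()
}
```

With a traced default cache length of 0, sliceLength(begin: 0, end: 0) is 0: the zero-length dimension already exists in the baked-in shape, so only re-tracing with a non-zero cache_len (ask 1 below) changes it.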

Asks

  1. Conversion-side fix: re-trace so ios17.slice_by_index never has a zero-length default shape, so the ANE main entry point is always built.
  2. Fail loudly: throw from loadModels/process if the encoder's ANE program can't be instantiated, instead of returning an empty transcript.
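For ask 2, one way to fail loudly is a warm-up self-check: decode a short clip known to contain speech and throw if zero tokens come back. This is a minimal sketch; ModelHealthError, verifyWarmupDecode and the bundled clip are hypothetical, not FluidAudio API. Because a working state does not carry over to the next Nemotron session in this report, the check may need to run per session rather than only once in loadModels.

```swift
/// Hypothetical error for a model that loaded but decodes nothing.
enum ModelHealthError: Error, Equatable {
    /// A warm-up clip known to contain speech decoded to zero tokens.
    case warmupDecodedNothing
}

/// Hypothetical self-check to run at the end of loading and/or at session start.
/// `decode` stands in for the real streaming decode path and returns token IDs;
/// `clip` is a short bundled sample that is known to contain speech.
func verifyWarmupDecode(clip: [Float], decode: ([Float]) throws -> [Int]) throws {
    let tokens = try decode(clip)
    guard !tokens.isEmpty else {
        // Zero tokens on known speech matches the symptom in this issue:
        // surface it as an error instead of an empty transcript.
        throw ModelHealthError.warmupDecodedNothing
    }
}
```

The trade-off is one extra short decode per check and a clip shipped with the app, in exchange for never returning a silent empty transcript.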
