Nemotron streaming (int8) produces zero output on a cold start on iPadOS 26.5 (Apple M1); works on macOS (M1 Pro) #739


## Summary

On **iPadOS 26.5 (tested on Apple M1)**, `StreamingNemotronAsrManager` reports a successful load and accepts audio, but decodes **no tokens** (no partials, empty final transcript) on a **cold process start**. The **identical model files run correctly on macOS**. The only way to get output on iPadOS is to first load and run a *different* FluidAudio CoreML model (e.g. TDT `SlidingWindowAsrManager`), and that priming lasts for only one Nemotron session: each later session needs a fresh prime.

## Environment

- **Fails (tested)**: iPad, Apple **M1, 8 GB**, **iPadOS 26.5**
- **Works**: MacBook Pro, Apple **M1 Pro, 16 GB**, **macOS 26.5**
- **FluidAudio**: 0.15.4 · **Model**: `parakeet-nemotron-streaming-0.6b` (int8, B1 fused), tiers 560/1120/2240 ms
- **Compute units**: default `.cpuAndNeuralEngine` (also reproduced with `.all` and `.cpuAndGPU`)

## Scope (M1 vs M2+)

- **Only M1 tested.** The macOS machine is an **M1 Pro, which has the same-generation 16-core ANE (~11 TOPS) as the failing M1 iPad**. An identical-generation ANE therefore runs this model fine under macOS, which points the failure at the **iPadOS 26.5 CoreML/ANE runtime** rather than ANE silicon or RAM. Newer ANEs (M2 ~15.8 TOPS, M3 ~18 TOPS, **M4 / A17 Pro and later ~35–38 TOPS**, a newer design) are **untested**, so this is not an "all iOS" claim.
- **No iOS validation documented upstream.** The HF card's only benchmark is *"Tested on Apple M2 with FluidAudio"* (LibriSpeech WER/RTFx, i.e. the desktop/CLI path); it reports no iOS/iPadOS run at any tier. The card is also **stale**: it lists the 1120/560/160/80 ms tiers, omitting the shipped **2240 ms** tier and still including 160/80 ms, which 0.15.4 drops. iOS is therefore entirely outside the documented tested envelope.
- **RAM differs (8 vs 16 GB)**, but the workaround (load an *extra* model first) uses *more* memory and still fixes the problem, which argues against OOM. A repro on a **16 GB M2/M4 iPad** would retire both the RAM and ANE-generation variables at once.
- NOTE: the Neural Engine is the same on both devices. M1 and M1 Pro ship the identical 16-core ANE (~11 TOPS, same microarchitecture). M1 Pro's advantages over M1 are all elsewhere: more CPU/GPU cores, much higher memory bandwidth (~200 vs ~68 GB/s), and higher max RAM. The ANE block is unchanged across M1 / M1 Pro / M1 Max; only M1 Ultra differs (two dies, 32-core).

## Repro

1. Fresh launch → `loadModels(from:)` → `Nemotron models loaded successfully`.
2. Feed audio (`process` / `processBufferedAudio`) — no throw, **no tokens, empty transcript**.
3. Same process, load+run a TDT `SlidingWindowAsrManager` first → next Nemotron session **works**.
4. Nemotron-first or Nemotron-after-Nemotron → fails. macOS → always works.
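
The workaround implied by steps 3 and 4 can be sketched as a wrapper that re-primes before every Nemotron session. This is only an illustration: `PrimedSessionRunner` and both closures are hypothetical stand-ins, not FluidAudio API, and the real load/run calls would go inside them.

```swift
/// Hypothetical wrapper: runs a priming step before every target session,
/// because the priming effect described above lasts for one session only.
struct PrimedSessionRunner {
    /// Loads and runs a *different* model once (e.g. a TDT session); output is ignored.
    let prime: () async throws -> Void
    /// Starts a fresh target (Nemotron) session and returns its final transcript.
    let runTarget: ([Float]) async throws -> String

    func transcribe(_ samples: [Float]) async throws -> String {
        try await prime()                      // repeated per session, not once per process
        return try await runTarget(samples)
    }
}
```

Priming once per process would not be enough under the behaviour reported here, since a second Nemotron session after a working one fails again.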

## What it is / isn't

Both platforms emit the same compile-time warning (so it's **not** the cause):

```text
Skipped adding default_function to entry point: main ... PropagateInputTensorShapes failed
  when propagating default shape ... ios17.slice_by_index: zero shape error
```

iOS-only, at ANE **runtime** (absent on macOS):

```text
ANEProgramProcessRequestDirect() Failed with status=0x12 : statusType=0x9 ... Program Inference error
```

So the divergence is **ANE program instantiation on a cold start**, not shape inference or compute units (`.cpuAndNeuralEngine`/`.all`/`.cpuAndGPU` all yield zero output).

NOTE: `MLComputePlan` resolves the encoder to `ANE/CPU` **identically on working and failing sessions (same device)**, so the divergence is **runtime ANE program instantiation, not the compute plan**.
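
For anyone repeating the compute-plan check, a minimal sketch using Core ML's `MLComputePlan` (iOS 17.4+ / macOS 14.4+) follows. The function name `preferredDeviceCounts` and the choice to inspect only the top-level operations of the `main` function are illustrative assumptions, not necessarily what was run for this report.

```swift
import CoreML
import Foundation

/// Counts the preferred compute device for each top-level operation of the
/// `main` function in a compiled ML Program (.mlmodelc). Operations nested in
/// inner blocks are not visited in this sketch.
@available(macOS 14.4, iOS 17.4, *)
func preferredDeviceCounts(
    forModelAt compiledModelURL: URL,
    computeUnits: MLComputeUnits = .cpuAndNeuralEngine
) async throws -> [String: Int] {
    let configuration = MLModelConfiguration()
    configuration.computeUnits = computeUnits

    let plan = try await MLComputePlan.load(contentsOf: compiledModelURL,
                                            configuration: configuration)
    // Only ML Program models expose per-operation structure.
    guard case let .program(program) = plan.modelStructure,
          let main = program.functions["main"] else { return [:] }

    var counts: [String: Int] = [:]
    for operation in main.block.operations {
        guard let usage = plan.deviceUsage(for: operation) else { continue }
        let name: String
        switch usage.preferred {
        case .cpu: name = "CPU"
        case .gpu: name = "GPU"
        case .neuralEngine: name = "ANE"
        @unknown default: name = "unknown"
        }
        counts[name, default: 0] += 1
    }
    return counts
}
```

Because the plan is identical on working and failing sessions, a summary like this can only rule the compute plan out; it does not surface the runtime ANE error.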

## #609 does not fix this

#609 (`cache_len = 1`) claims to close #607, but its on-device verification box was left unchecked and the warning **still appears in 0.15.4**. It seeds `cache_len` at **runtime**, while the failure is **compile-time default-shape** propagation against the model's baked-in shapes — a runtime seed can't affect it. The author flagged the real fix as conversion-side (re-trace the encoder with non-zero `cache_len`); this is that case, plus the iOS-only zero-output consequence the macOS-benign warning hides.

## Asks

1. **Conversion-side fix**: re-trace so `ios17.slice_by_index` never has a zero-length default shape, so the ANE `main` entry point is always built.
2. **Fail loudly**: throw from `loadModels`/`process` if the encoder's ANE program can't be instantiated, instead of returning an empty transcript.
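
Until something like ask 2 exists in the library, an app can approximate it on its own side. The sketch below throws when a transcript is empty even though the input was clearly not silent. All names (`TranscriptGuardError`, `validateTranscript`) and the RMS threshold of 0.01 are assumptions made for illustration; this is a caller-side stopgap, not the proposed library change.

```swift
import Foundation

enum TranscriptGuardError: Error, Equatable {
    case emptyTranscriptForAudibleInput
}

/// Root-mean-square level of the samples (0 for an empty buffer).
func rmsLevel(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Returns the trimmed transcript, or throws if it is empty while the audio
/// level is at or above `minimumRMS` (an assumed, tunable threshold).
func validateTranscript(
    _ transcript: String,
    for samples: [Float],
    minimumRMS: Float = 0.01
) throws -> String {
    let trimmed = transcript.trimmingCharacters(in: .whitespacesAndNewlines)
    if trimmed.isEmpty && rmsLevel(samples) >= minimumRMS {
        throw TranscriptGuardError.emptyTranscriptForAudibleInput
    }
    return trimmed
}
```

An energy check cannot distinguish speech from loud noise, so this only catches the "audible input, zero tokens" pattern described in this issue; the library-side check on ANE program instantiation would be the more reliable signal.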

## Related

- #607 / #609: same `slice_by_index` *warning*; #609's runtime `cache_len` seed does not prevent this functional failure.
- #641: onset warmup for *accuracy*, not zero output.

