investigate(tts): #83 residual — Whisper WER regression on high-intensity (I3+) Tier A clips

Spun out of PR #86 after the bisect that closed #83 turned out to be **only a partial fix**.

## Background

PR #86 reverted PR #70 (inter-word `<break time="50ms"/>` tags) and verified that the revert resolves the Whisper WER regression on `sp_neu_a_0001` (intensity arc `[1,1,1,2,1]`, almost all I1):

| variant | duration | WER | length-ratio |
|---|---:|---:|---:|
| sp_neu 04-15 ref | 121.0 s | 0.079 | 1.000 |
| sp_neu current main | 197.8 s | 0.286 | 0.753 |
| sp_neu PR #86 (revert #70+#71) | 122.9 s | **0.048** | 0.995 |

But verifying the same revert on a **high-intensity** scene (`sp_it_a_0001`, intensity arc `[1,2,3,4,5,4,3]`) shows the regression persists:

| variant | duration | WER | length-ratio |
|---|---:|---:|---:|
| sp_it 04-15 ref | 155.9 s | **0.056** | 1.009 |
| sp_it PR #86 (revert #70+#71) | 146.6 s | **0.322** | 0.709 |

The Hebrew text is byte-identical between the 04-15 reference and the revert branch (verified by `diff`, which returns empty). So the residual sp_it WER gap is **purely audio rendering**, not text content. A length-ratio of 0.709 (~29% of words missing) is the same Whisper-silence-detector fingerprint as in #83.
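As a quick check on the length-ratio reading, here is a minimal sketch that converts a length-ratio into an approximate share of missing words. It assumes length-ratio means hypothesis word count divided by reference word count; the issue does not define it, and the helper names are hypothetical.

```python
def length_ratio(reference: str, hypothesis: str) -> float:
    """Hypothesis word count over reference word count (assumed definition)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference is empty")
    return len(hypothesis.split()) / len(ref_words)


def missing_fraction(ratio: float) -> float:
    """Share of reference words unaccounted for, floored at zero."""
    return max(0.0, 1.0 - ratio)


# Ratios copied from the two tables above.
for label, ratio in [("sp_neu current main", 0.753), ("sp_it PR #86", 0.709)]:
    print(f"{label}: ~{missing_fraction(ratio):.1%} of words missing")
```

Under that assumption, both regressed renders lose roughly a quarter or more of their words, while the 04-15 references sit at a ratio of about 1.0.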

## What this means

There is a **second TTS-side change** in the post-2026-04-15 window that fires at I3+. PR #86's central claim of "sole material cause" was wrong; #86 lands as a partial fix and does not close any issue automatically. This issue tracks the actual residual cause.

## Suspects (sorted by prior, re-derived)

This list is broader than the original #83 list because the original list missed a candidate (#74).

1. **#74 — Lombard spectral tilt at I4–I5 (May 4, mixer.py).**  Modifies the spectral envelope of high-intensity audio.  Strong prior: it acts at the exact intensities where the residual regression manifests, and changes spectral content (which is what Whisper's encoder consumes).  This was missed in the original #83 suspect list.
2. **#51 (M15) — SSML prosody tuning at I3+.**  At I3/I4/I5 the AGG rate multipliers are 1.06×/1.10×/1.14×.  It was previously dismissed as "negligible" based on sp_neu's arc, which never exceeds I2; that dismissal is invalid for sp_it.
3. **#68 — pitch caps at I4/I5.**  Was dismissed for sp_neu (which never reaches I4); back in scope for sp_it.
4. **#75 — BARGE_IN crossfade.**  Lower prior; only relevant if sp_it has barge-ins.

## Suggested investigation path

1. **Render `sp_it_a_0001` on current main** (~17 Azure calls) to establish the actual "bad" baseline number we're missing.  PR #86 only has the partial-fix WER, not the un-reverted WER.
2. **Bisect by additionally reverting one suspect at a time** on top of the PR #86 branch (or its merged equivalent), in the order above.  Render `sp_it_a_0001` with the same `random_seed` at each step, run Whisper, and log WER + length-ratio.
3. **Identify the dominant contributor.**  This is the PR whose additional revert drops sp_it WER below ~0.10 (the 04-15 baseline range).
4. **Propose remediation to Shay before writing it.**  Same constraint as #83 — do not change SSML/prosody defaults from M15 listening-test calibration without sign-off.
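
The bookkeeping for steps 2 and 3 could be sketched as below. This is an illustration only: `BisectStep` and `dominant_contributor` are hypothetical names, the rendering and Whisper scoring are assumed to happen elsewhere, and the numbers in the example are placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Optional

BASELINE_WER_CEILING = 0.10  # the "below ~0.10" threshold from step 3


@dataclass
class BisectStep:
    reverted: str        # suspect reverted on top of the PR #86 branch
    wer: float           # Whisper WER for sp_it_a_0001 at this step
    length_ratio: float  # logged alongside WER, as in step 2


def dominant_contributor(
    steps: list[BisectStep], ceiling: float = BASELINE_WER_CEILING
) -> Optional[BisectStep]:
    """Return the first step, in suspect order, whose WER falls below the ceiling."""
    for step in steps:
        if step.wer < ceiling:
            return step
    return None


# Placeholder values for illustration only; replace with logged results.
example = [
    BisectStep("suspect-A", wer=0.30, length_ratio=0.72),
    BisectStep("suspect-B", wer=0.06, length_ratio=1.00),
]
hit = dominant_contributor(example)
print(hit.reverted if hit else "no single revert recovers the baseline")
```

If no single revert crosses the threshold, the function returns `None`, which would suggest the residual has more than one contributor and the reverts need to be combined.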

## Reproduction (same harness as #83)

```bash
.venv/bin/python -c "
from pathlib import Path
import soundfile as sf, torch, sys
from jiwer import wer
from transformers import pipeline
sys.path.insert(0, 'scripts')
from m17_phase_a_validation import normalize_for_wer

# Whisper large-v3 on MPS when available, otherwise CPU.
asr = pipeline('automatic-speech-recognition', model='openai/whisper-large-v3',
               device=torch.device('mps' if torch.backends.mps.is_available() else 'cpu'),
               torch_dtype=torch.float32, chunk_length_s=30)

# Add further (label, path) pairs here to score other renders.
for label, path in [
    ('04-15 ref', 'data/m2a_wettest/agg_m_30-45_001/sp_it_a_0001_00.wav'),
]:
    wav, sr = sf.read(path, dtype='float32')
    if wav.ndim > 1: wav = wav.mean(axis=1)  # downmix to mono
    # The reference transcript is the sibling .txt file.
    txt = Path(path).with_suffix('.txt').read_text(encoding='utf-8')
    # Drop blank lines and lines starting with [ before scoring.
    ref = '\n'.join(l for l in txt.splitlines() if l and not l.startswith('[')).strip()
    # Greedy decoding (one beam, no sampling), forced to Hebrew transcription.
    out = asr({'raw': wav.copy(), 'sampling_rate': sr},
              generate_kwargs={'language': 'he', 'task': 'transcribe',
                               'num_beams': 1, 'do_sample': False})
    print(f'{label}: WER={wer(normalize_for_wer(ref), normalize_for_wer(out[\"text\"])):.3f}')
"
```

The 04-15 reference WAV exists at `data/m2a_wettest/agg_m_30-45_001/sp_it_a_0001_00.wav` (155.9 s, WER 0.056).
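
To see where in a clip the words go missing, one option is to align the reference and the Whisper output at word level and list the reference runs that have no counterpart. The sketch below uses the standard-library `difflib` for this; `dropped_spans` is a hypothetical helper, not part of the harness. The sample strings are English placeholders; in practice the inputs would be the normalized reference and transcript. Mapping word indices back to intensity segments is not shown, since the issue does not give segment boundaries.

```python
from difflib import SequenceMatcher


def dropped_spans(reference: str, hypothesis: str) -> list[tuple[int, int, str]]:
    """Return (start, end, text) for reference word runs absent from the hypothesis.

    Only pure deletions are reported; substituted words show up as
    'replace' opcodes and are ignored here.
    """
    ref, hyp = reference.split(), hypothesis.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            spans.append((i1, i2, " ".join(ref[i1:i2])))
    return spans


# Placeholder strings for illustration only.
ref = "one two three four five six seven eight"
hyp = "one two three seven eight"
for start, end, text in dropped_spans(ref, hyp):
    print(f"words {start}-{end - 1} of {len(ref.split())} dropped: {text}")
```

If the dropped runs cluster in the part of the script rendered at I3+, that would be consistent with an intensity-gated cause; deletions spread evenly across the clip would point elsewhere.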

## Things NOT to do

- **Don't re-investigate #70 / per-word breaks** as the residual cause; that was already falsified by PR #86.  The test `test_no_per_word_breaks_in_default_ssml` in `tests/unit/test_tts.py` keeps per-word breaks out of the default SSML.
- **Don't re-investigate loudness** — falsified in #82's lever probe.
- **Don't change SSML / prosody defaults** without listening-test data and Shay's sign-off (#51's M15 values came from listening-test calibration).
- **Don't touch the new clamping / sanitization / break-merge invariants** restored in PR #86 (they're independent of the bisect).
- **Don't conflate this with #62** — Hebrew word merging (also reopened by PR #86) is a separate intelligibility problem.

## References

- PR #86 — partial fix that closed the low-intensity case.
- #83 — the original (overscoped) issue this finishes.
- #62 — Hebrew word merging, reopened separately by PR #86 with alternative-mitigation list.
- #82 — loudness contract, orthogonal to this work.
- #74 — Lombard spectral tilt at I4–I5 (top suspect).
