investigate(tts): #83 residual — Whisper WER regression on high-intensity (I3+) Tier A clips #87

@shaypal5

Spun out of PR #86 after the bisect that closed #83 turned out to be only a partial fix.

Background

PR #86 reverted PR #70 (inter-word `<break time="50ms"/>` tags) and verified the revert resolves the Whisper WER regression on `sp_neu_a_0001` (intensity arc `[1,1,1,2,1]` — almost all I1):

| variant | duration | WER | length-ratio |
|---|---|---|---|
| sp_neu 04-15 ref | 121.0 s | 0.079 | 1.000 |
| sp_neu current main | 197.8 s | 0.286 | 0.753 |
| sp_neu PR #86 (revert #70+#71) | 122.9 s | 0.048 | 0.995 |

But verifying the same revert on a high-intensity scene (`sp_it_a_0001`, intensity arc `[1,2,3,4,5,4,3]`) shows the regression persists:

| variant | duration | WER | length-ratio |
|---|---|---|---|
| sp_it 04-15 ref | 155.9 s | 0.056 | 1.009 |
| sp_it PR #86 (revert #70+#71) | 146.6 s | 0.322 | 0.709 |

The Hebrew text is byte-identical between the 04-15 reference and the revert branch (`diff` output is empty), so the residual sp_it WER gap comes purely from audio rendering, not text content. The length-ratio of 0.709 is the same Whisper-silence-detector fingerprint as #83 (~29% of words missing).
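For readers reproducing the length-ratio column, here is a minimal sketch. It assumes the ratio is the hypothesis word count divided by the reference word count; the issue does not state the exact definition used by the harness, and both function names are hypothetical.

```python
def length_ratio(reference: str, hypothesis: str) -> float:
    """Hypothesis word count over reference word count (assumed definition)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return len(hypothesis.split()) / len(ref_words)


def missing_fraction(ratio: float) -> float:
    """Share of the reference length absent from the transcript, floored at 0."""
    return max(0.0, 1.0 - ratio)


# The sp_it revert render from the table above.
print(round(missing_fraction(0.709), 3))
```

A ratio near 1.0 with a high WER would point at substitutions; a ratio well below 1.0, as here, points at dropped spans.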

What this means

There is a second TTS-side change in the post-2026-04-15 window that fires at I3+. PR #86's central claim of "sole material cause" was wrong; #86 lands as a partial fix that closes nothing automatically. This issue tracks the actual residual cause.

Suspects (sorted by prior, re-derived)

This list is broader than the original #83 list because the original list missed a candidate (#74).

  1. #74 (fix(mixer): #65 Lombard spectral tilt at I4–I5), May 4, `mixer.py`. Modifies the spectral envelope of high-intensity audio. Strong prior: it acts at the intensities where the residual regression manifests, and it changes spectral content, which is what Whisper's encoder consumes. It was missed in the original #83 suspect list.
  2. #51 (feat(m15): SSML prosody tuning with research-validated Hebrew parameters), M15: SSML prosody tuning at I3+. At I3/I4/I5 the AGG rate multipliers are 1.06×/1.10×/1.14×. It was previously dismissed as "negligible" based on sp_neu's low-intensity arc (which never exceeds I2); that dismissal is invalid for sp_it.
  3. #68 (fix(config): halve pitch escalation at I4–I5 to eliminate helium effect): pitch caps at I4/I5. It was dismissed for sp_neu (which never reaches I4); it is back in scope for sp_it.
  4. #75 (fix(mixer): #66 BARGE_IN audible-overlap crossfade past TTS trailing silence): BARGE_IN crossfade. Lower prior; only relevant if sp_it has barge-ins.
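The scoping argument in this list (a suspect can only affect a scene whose intensity arc reaches the level where that change fires) can be sketched as below. The minimum intensities are taken from the descriptions above; the table layout and function name are hypothetical, and #75 is left out because it depends on barge-ins rather than on intensity.

```python
# (PR, description, lowest intensity at which the change fires)
SUSPECTS = [
    ("#74", "Lombard spectral tilt", 4),
    ("#51", "SSML prosody tuning (M15)", 3),
    ("#68", "pitch caps", 4),
]


def suspects_in_scope(intensity_arc: list[int]) -> list[str]:
    """Return the PRs whose change can fire somewhere in this arc."""
    peak = max(intensity_arc)
    return [pr for pr, _, min_intensity in SUSPECTS if peak >= min_intensity]


print(suspects_in_scope([1, 1, 1, 2, 1]))           # sp_neu_a_0001 -> []
print(suspects_in_scope([1, 2, 3, 4, 5, 4, 3]))     # sp_it_a_0001 -> ['#74', '#51', '#68']
```

This is why the sp_neu verification could not rule any of these three in or out.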

Suggested investigation path

  1. Render `sp_it_a_0001` on current main (~17 Azure calls) to establish the actual "bad" baseline number, which is currently missing: PR #86 only has the partial-fix WER, not the un-reverted WER.
  2. Bisect by additionally reverting one suspect at a time on top of the PR #86 branch (or its merged equivalent), in the order above. Render `sp_it_a_0001` with the same `random_seed` at each step, run Whisper, and log WER and length-ratio.
  3. Identify the dominant contributor: the PR whose additional revert drops sp_it WER below ~0.10 (the 04-15 baseline range).
  4. Propose remediation to Shay before writing it. Same constraint as #83: do not change SSML/prosody defaults from the M15 listening-test calibration without sign-off.
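Steps 2 and 3 amount to logging one record per revert and picking the first one that lands in the baseline range. A minimal sketch of that bookkeeping follows; `BisectStep` and `dominant_contributor` are hypothetical names, not part of the repo, and the only data point included is the PR #86 measurement already reported above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Threshold from step 3: the 04-15 baseline range is WER below ~0.10.
BASELINE_WER_CEILING = 0.10


@dataclass(frozen=True)
class BisectStep:
    """One render of sp_it_a_0001 with a given set of reverts."""
    reverted: str        # e.g. "#86 only" or "#86 + #74"
    wer: float
    length_ratio: float


def dominant_contributor(steps: Iterable[BisectStep],
                         ceiling: float = BASELINE_WER_CEILING) -> Optional[BisectStep]:
    """Return the first step, in suspect order, whose WER falls below the ceiling."""
    for step in steps:
        if step.wer < ceiling:
            return step
    return None


# The only sp_it measurement on record so far is the PR #86 revert itself.
log = [BisectStep(reverted="#86 only", wer=0.322, length_ratio=0.709)]
print(dominant_contributor(log))  # None: #86 alone does not reach the baseline range
```

Appending one `BisectStep` per additional revert, in the suspect order, keeps the log directly comparable to the tables in the Background section.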

Reproduction (same harness as #83)

.venv/bin/python -c "
from pathlib import Path
import soundfile as sf, torch, sys
from jiwer import wer
from transformers import pipeline
sys.path.insert(0, 'scripts')
from m17_phase_a_validation import normalize_for_wer

# Whisper large-v3 on Apple MPS when available, otherwise CPU.
asr = pipeline('automatic-speech-recognition', model='openai/whisper-large-v3',
               device=torch.device('mps' if torch.backends.mps.is_available() else 'cpu'),
               torch_dtype=torch.float32, chunk_length_s=30)

# Add further (label, path) pairs here to compare other renders.
for label, path in [
    ('04-15 ref', 'data/m2a_wettest/agg_m_30-45_001/sp_it_a_0001_00.wav'),
]:
    wav, sr = sf.read(path, dtype='float32')
    # Downmix multi-channel audio to mono.
    if wav.ndim > 1: wav = wav.mean(axis=1)
    txt = Path(path).with_suffix('.txt').read_text(encoding='utf-8')
    # Reference transcript: drop empty lines and lines starting with a bracket.
    ref = '\n'.join(l for l in txt.splitlines() if l and not l.startswith('[')).strip()
    # Greedy decoding (one beam, no sampling) keeps runs comparable.
    out = asr({'raw': wav.copy(), 'sampling_rate': sr},
              generate_kwargs={'language': 'he', 'task': 'transcribe',
                               'num_beams': 1, 'do_sample': False})
    print(f'{label}: WER={wer(normalize_for_wer(ref), normalize_for_wer(out[\"text\"])):.3f}')
"

The 04-15 reference WAV exists at `data/m2a_wettest/agg_m_30-45_001/sp_it_a_0001_00.wav` (155.9 s, WER 0.056).

Things NOT to do

References
