fix(tts): #72 — phrase prosody volume must be %, not dB (root cause)#106

Merged
shaypal5 merged 1 commit into main from fix/ssml-phrase-prosody-volume-units on May 11, 2026
Conversation

@shaypal5

Member

Problem

Reliable root cause for the long-standing intermittent #72: Azure SSML parsing error 0x80045003 / Connection was closed by the remote host.

Reproduced during the delivery-003 corpus regen (avdp-synth-corpus, on top of recent PRs #102/#103/#105): 6 of 8 elephant Tier B scenes fail every time on the first uncached turn. Bisected with 9 hand-crafted A/B SSML round-trips against live Azure.

What triggers it

When the M2b phrase-prosody system fires a stress hint, the renderer produces nested <prosody> with the inner volume in dB and the outer in %:

<prosody rate="+9%" pitch="+7%" volume="+5%">     <!-- outer: _volume_to_string emits % -->
  text...
  <prosody rate="+15%" pitch="+1st" volume="+3dB">stress span</prosody>  <!-- _HINT_DEFAULTS["stress"] -->
  text...
</prosody>

Azure rejects volume unit mismatch (dB inner under % outer). Pitch unit mismatch (st under %) is tolerated. Mid-word <break /> is tolerated. Both confirmed against live Azure.
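
The mismatch described above can be checked mechanically before SSML is sent to the service. What follows is a minimal sketch and not part of this PR: `volume_unit` and `find_volume_unit_mismatches` are hypothetical helper names, and the check assumes only numeric `dB` / `%` volume values matter (named values such as `loud` are ignored).

```python
import re
import xml.etree.ElementTree as ET

_NUMERIC_VOLUME = re.compile(r"[+-]?\d+(?:\.\d+)?(dB|%)")


def volume_unit(value):
    """Return 'dB' or '%' for a numeric volume value, else None."""
    if value is None:
        return None
    match = _NUMERIC_VOLUME.fullmatch(value.strip())
    return match.group(1) if match else None


def find_volume_unit_mismatches(ssml_fragment):
    """Return (outer, inner) volume pairs where a nested <prosody> switches unit."""
    root = ET.fromstring(ssml_fragment)
    mismatches = []

    def walk(elem, inherited):
        current = inherited
        # Strip an XML namespace prefix such as '{http://...}prosody'.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "prosody":
            volume = elem.get("volume")
            unit = volume_unit(volume)
            if unit is not None:
                if inherited is not None and volume_unit(inherited) != unit:
                    mismatches.append((inherited, volume))
                current = volume
        for child in elem:
            walk(child, current)

    walk(root, None)
    return mismatches


# The failing shape from this PR: dB inner volume under a % outer volume.
failing = (
    '<prosody rate="+9%" pitch="+7%" volume="+5%">text '
    '<prosody rate="+15%" pitch="+1st" volume="+3dB">stress span</prosody>'
    " text</prosody>"
)
# The same structure with the inner volume expressed in %.
passing = failing.replace('volume="+3dB"', 'volume="+3%"')

print(find_volume_unit_mismatches(failing))
print(find_volume_unit_mismatches(passing))
```

The check deliberately ignores `pitch`, since the st-under-% pitch combination was observed to be tolerated.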

Fix

_HINT_DEFAULTS["stress"]["volume"]: "+3dB" → "+3%". This matches the lossy 1:1 dB→% mapping that _volume_to_string already uses for the outer prosody. No SSML-builder code changes are needed; only the hint default and its docstring are updated.
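
To make the unit convention concrete, here is a small sketch of how an outer and an inner prosody end up in the same unit system after the change. It is an illustration under assumptions, not the project's code: `volume_to_string`, `HINT_DEFAULTS`, and `render_stress` are hypothetical stand-ins for `_volume_to_string`, `_HINT_DEFAULTS`, and the SSML builder, whose real signatures are not shown in this PR. Only the `rate`, `pitch`, and `volume` values for `stress` are taken from the description above.

```python
from xml.sax.saxutils import escape

# Hypothetical stand-in for _HINT_DEFAULTS after the fix (volume in %).
HINT_DEFAULTS = {
    "stress": {"rate": "+15%", "pitch": "+1st", "volume": "+3%"},
}


def volume_to_string(volume_db):
    """Assumed lossy 1:1 mapping: an N dB offset is emitted as an N% offset."""
    return f"{round(volume_db):+d}%"


def render_stress(before, span, after, outer_volume_db):
    """Wrap a stress span in an inner <prosody> nested in an outer <prosody>."""
    inner = HINT_DEFAULTS["stress"]
    return (
        f'<prosody volume="{volume_to_string(outer_volume_db)}">'
        f"{escape(before)}"
        f'<prosody rate="{inner["rate"]}" pitch="{inner["pitch"]}" '
        f'volume="{inner["volume"]}">{escape(span)}</prosody>'
        f"{escape(after)}</prosody>"
    )


ssml = render_stress("text ", "stress span", " text", outer_volume_db=5.0)
print(ssml)
```

With both volumes produced in `%`, the nested structure matches the all-% shape that the A/B runs report as accepted.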

A/B isolation results (live Azure)

| Test | Outer | Inner | Result |
|---|---|---|---|
| Original (the bug) | `vol="+5%"` | `vol="+3dB"`, mid-word | FAIL |
| No nested prosody | `vol="+5%"` | | OK |
| Mid-word `<break />` (no nested prosody) | `vol="+5%"` | break only | OK |
| Word-aligned nest, mixed units | `vol="+5%"` | `vol="+3dB"` | FAIL |
| Word-aligned nest, all-% units | `vol="+5%"` | `vol="+3%"` | OK |
| Word-aligned nest, `pitch="+1st"`, no inner volume | `vol="+5%"` | `pitch="+1st"` | OK |
| Word-aligned nest, `pitch="+6%"`, `volume="+3dB"` | `vol="+5%"` | `vol="+3dB"` | FAIL |
| Word-aligned nest, `pitch="+1st"`, `volume="+3%"` | `vol="+5%"` | `vol="+3%"` | OK |

Volume dB inside volume % is the trigger; nothing else.

Files changed

| File | Change |
|---|---|
| `synthbanshee/tts/ssml_types.py` | `_HINT_DEFAULTS["stress"]["volume"]`: `"+3dB"` → `"+3%"`; `PhraseProsody.volume` docstring updated to pin the %-only constraint and reference #72 |
| `tests/unit/test_phrase_prosody.py` | `test_hint_defaults_applied_stress` updated to assert `"+3%"`; new class `TestHintDefaultUnits` with two structural tests pinning the "no dB for volume" / "must end in %" invariant across `_HINT_DEFAULTS` |
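
The two structural tests could look roughly like the sketch below. This is an assumption about their shape, not a copy of `TestHintDefaultUnits`: the `HINT_DEFAULTS` dictionary here is a hypothetical stand-in containing only the `stress` entry, and the helper names `check_no_db_volume` and `check_volumes_are_percent` are invented for illustration.

```python
import re

# Hypothetical stand-in for _HINT_DEFAULTS; only the stress entry is known from this PR.
HINT_DEFAULTS = {
    "stress": {"rate": "+15%", "pitch": "+1st", "volume": "+3%"},
}

_PERCENT_VOLUME = re.compile(r"[+-]?\d+(?:\.\d+)?%")


def check_no_db_volume(defaults):
    """Fail if any hint default expresses volume in dB."""
    for hint, attrs in defaults.items():
        volume = attrs.get("volume")
        if volume is not None:
            assert "db" not in volume.lower(), f"{hint}: volume {volume!r} uses dB"


def check_volumes_are_percent(defaults):
    """Fail unless every volume default is a signed number followed by %."""
    for hint, attrs in defaults.items():
        volume = attrs.get("volume")
        if volume is not None:
            assert _PERCENT_VOLUME.fullmatch(volume), f"{hint}: {volume!r} is not N%"


def test_no_hint_default_uses_db_for_volume():
    check_no_db_volume(HINT_DEFAULTS)


def test_hint_default_volumes_parse_as_percent():
    check_volumes_are_percent(HINT_DEFAULTS)
```

Checking the whole dictionary rather than only the `stress` entry means a future hint that reintroduces a dB volume fails in unit tests instead of against live Azure.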

Test plan

Tier-3 ASR sanity (local)

This change will alter audio output for any scene that emits a stress phrase hint — the inner prosody volume drops from a real +3 dB (~+41% linear) to the lossy synthbanshee convention of +3% (matching how the outer prosody has always been emitted). Per CLAUDE.md's ASR sanity policy, this is in-scope; will run qa-report --asr on the delivery-003 corpus once those scenes regenerate and paste the result into the upcoming corpus PR.
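
The "~+41% linear" figure follows from the standard amplitude relation for decibels, gain = 10^(dB/20). A short worked check (the function name `db_to_amplitude_gain` is illustrative only):

```python
def db_to_amplitude_gain(db):
    """Convert a decibel offset to a linear amplitude ratio (20*log10 convention)."""
    return 10 ** (db / 20)


gain = db_to_amplitude_gain(3.0)
percent_increase = (gain - 1.0) * 100.0

# A real +3 dB is roughly a 41% amplitude increase, far more than the
# +3% that the 1:1 convention emits.
print(f"gain={gain:.4f}, increase={percent_increase:.1f}%")
```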

Unblocks

avdp-synth-corpus delivery-003 (in-flight) — the 6 elephant Tier B scenes that hit this can now regenerate cleanly.

Refs #72.

🤖 Generated with Claude Code

While running the delivery-003 corpus regen, 6 of 8 elephant Tier B
scenes reliably hit #72 (`Azure SSML parsing error 0x80045003`).
Bisected the failing SSML and isolated the trigger:

  <prosody volume="+5%">                ← outer, from _volume_to_string
    text
    <prosody volume="+3dB">stress</prosody>  ← inner, from _HINT_DEFAULTS
    text
  </prosody>

Confirmed against Azure with 9 A/B SSML tests:

  - nested word-aligned, all-% units              → OK
  - nested word-aligned, inner pitch="+1st"+vol%  → OK
  - nested word-aligned, inner pitch=%+vol="+3dB" → FAIL
  - nested mid-word, mixed units                  → FAIL
  - mid-word <break /> (no nested prosody)        → OK

Pitch unit mismatch (`+1st` inner inside `+N%` outer) is tolerated;
volume unit mismatch (`+NdB` inside `+N%`) is the trigger.

Fix: `_HINT_DEFAULTS["stress"]["volume"]` changed from `"+3dB"` to
`"+3%"`.  This matches the lossy 1:1 dB→% mapping convention that
`_volume_to_string` already uses, so the inner and outer prosody
elements live in the same unit system.

Two regression tests added:

1. `test_no_hint_default_uses_db_for_volume` — structural check that
   no entry in `_HINT_DEFAULTS` emits volume in `dB`, since the outer
   emitter is always `%`.
2. `test_hint_default_volumes_parse_as_percent` — companion: any
   volume default must end in `%` and parse as numeric.

Updates the `PhraseProsody.volume` docstring to explain the invariant
and reference #72.

Reliable repro from the delivery-003 attempt: any elephant Tier B
scene with intensity ≥ 3 (where the LLM emits `stress` hints on
aggressive BEN turns) hits this; the failing scene/turn manifest is
captured in `/tmp/ssml-diag/intercept_call_01.{xml,status}` during
investigation.  After this fix, re-running those 6 scenes succeeds.

Test plan:
  - `pytest tests/unit/` — 1696 passed (1694 + 2 new)
  - `ruff check synthbanshee/ tests/` — clean
  - Manual Azure round-trip with TEST H (nested, vol=%) confirms
    the fix on live Azure.

Refs #72.  Unblocks delivery-003 corpus PR (avdp-synth-corpus).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5 added the type: fix (Bug fix) and comp: tts (TTS rendering, SSML, Azure/Google providers) labels on May 11, 2026
Copilot AI review requested due to automatic review settings May 11, 2026 21:47
@shaypal5 shaypal5 merged commit d92d61e into main May 11, 2026
5 checks passed
@shaypal5 shaypal5 deleted the fix/ssml-phrase-prosody-volume-units branch May 11, 2026 21:47

Copilot AI left a comment

Pull request overview

Fixes the root cause of Azure SSML parse failures (#72) by ensuring phrase-level prosody volume defaults use % units (matching the outer <prosody volume="..."> emitter), preventing invalid nested unit combinations.

Changes:

  • Update _HINT_DEFAULTS["stress"]["volume"] from "+3dB" to "+3%" to avoid Azure nested <prosody> volume unit mismatch.
  • Clarify PhraseProsody.volume docstring to document the %-only constraint and link the Azure failure mode (#72).
  • Strengthen unit tests to pin the “no dB volumes in hint defaults” invariant and update existing stress-default assertion.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| `synthbanshee/tts/ssml_types.py` | Switch stress hint default volume to % and document the %-only nesting constraint to prevent Azure SSML errors. |
| `tests/unit/test_phrase_prosody.py` | Update stress default expectation and add structural tests enforcing % volume units in `_HINT_DEFAULTS`. |

@github-actions

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #106 in repository https://github.com/DataHackIL/SynthBanshee. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: pull request opened
Workflow run: 25699253567 attempt 1
Comment timestamp: 2026-05-11T21:48:57.158120+00:00
PR head commit: fb2e0da6553f1f91087abc549d4ab15981de2196

shaypal5 added a commit to DataHackIL/avdp-synth-corpus that referenced this pull request May 12, 2026
* feat(delivery-003): 20-clip multi-project toy corpus on synthbanshee main

Replaces delivery 002.  First handoff target for the She-Proves and
Elephant consumer teams.

## Contents

- **She-Proves Tier A — Azure pair (10 clips)** in `agg_m_30-45_001/`:
  2 IT, 2 SV, 3 NEG, 3 NEU (Avri + Hila).
- **She-Proves Tier A — Google Chirp HD pair (2 clips)** in
  `agg_m_30-45_002/`: 1 IT, 1 SV (sister scenes to sp_*_a_0001,
  authored as PR DataHackIL/SynthBanshee#105).  Provides the
  voice + backend diversity vehicle for this delivery.
- **Elephant Tier B (8 clips)** in `ben_m_40-55_003/`: 2 each of
  IT/SV/NEG/NEU with `acoustic_scene` (clinic_office room IR +
  pi_budget_mic device + HVAC ambient).

Total: 20 clips, ~41.7 min.  All pass `synthbanshee validate` and
`synthbanshee qa-report` (failure rate 0.0%).  Full QA snapshot at
[`deliveries/003-multi-project-multi-voice/qa-report.json`](deliveries/003-multi-project-multi-voice/qa-report.json).

## Pipeline corrections delivered

This delivery is the first to surface 4 synthbanshee fixes landed in
the past day:

- DataHackIL/SynthBanshee#102 — `preprocessing_applied.normalized_dbfs`
  now records the *measured* post-preprocess peak (was hardcoded
  `-1.0`).  Pair with `generation_metadata.loudness_target_peak_dbfs`
  to diagnose loudness drift; the schema docstring at
  `labels/schema.py:175` pins the measured-vs-target split.
- DataHackIL/SynthBanshee#103 — `docs/spec.md` pins the
  `has_violence` derivation rule (`any(e.tier1_category != "NONE")`),
  adds the §2.5 identifier-casing table, rewrites §5.1 field notes.
- DataHackIL/SynthBanshee#105 — adds `sp_sv_a_0003` + `sp_it_a_0003`
  Google-pair shadow scenes.
- DataHackIL/SynthBanshee#106 — root cause for #72: `_HINT_DEFAULTS`
  was emitting nested `<prosody volume="+NdB">` inside outer
  `<prosody volume="+N%">`, which Azure rejects with SSML parse
  error 0x80045003.  Required to unblock 6 of 8 elephant Tier B
  scenes; without the fix, every scene whose LLM script carries a
  `stress` phrase hint at intensity ≥ 3 failed reliably.
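
For readers unfamiliar with the `has_violence` derivation rule referenced under #103 above, a minimal sketch of the events-based rule follows. The `Event` dataclass and the `"PHYS"` category value are hypothetical placeholders; only the `tier1_category` field name and the `"NONE"` sentinel come from the rule as quoted.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical minimal event record; the real schema has more fields.
    tier1_category: str


def has_violence(events):
    """A clip has violence if any event carries a non-NONE tier-1 category."""
    return any(e.tier1_category != "NONE" for e in events)


# "PHYS" is a made-up category used purely for illustration.
print(has_violence([Event("NONE"), Event("PHYS")]))
print(has_violence([Event("NONE")]))
```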

## Doc updates in this PR

- `README.md`: tightened "Clip ID and filename conventions" to
  point at SynthBanshee `docs/spec.md` §2.5; rewrote the
  `has_violence` paragraph to the events-based rule; updated the
  audio-format section to the measured-vs-target split; replaced
  the v1-limitations block with a pointer to per-delivery notes.
- `CLAUDE.md`: replaced the wrong `has_violence` formula with the
  events-based rule; expanded the audio-format table to match the
  spec's measured-vs-target distinction.
- `DELIVERIES.md`: delivery 002 marked `superseded`; new row for 003.
- `deliveries/003-multi-project-multi-voice/`:
  - `metadata.yaml` — structured delivery record.
  - `notes.md` — full per-clip table, voice/backend matrix,
    closed-vs-open qa-report findings.
  - `qa-report.json` — raw qa-report output (committed for audit).

## QA snapshot

Closed since delivery 002:

| Finding | 002 | 003 |
|---|---|---|
| `agg_no_escalation` | 3 clips | 0 |
| `warn_no_overlap` | 4 clips | 0 (overlap_ratio 100% on I4+) |
| `warn_emotion_downgrade` | 4 clips | 0 |
| `generation_metadata` absent | 0 of 8 had it | 20 of 20 have it |
| `dirty_file_path` null | 7 of 8 | 0 of 20 |
| `normalized_dbfs` hardcoded `-1.0` | 8 of 8 | fixed (#102) |

Still open: `low_voice_diversity_*` (now 2 voices per gender, threshold
is ≥3 — partial progress 1 → 2); `single_backend` (misleading; see
notes for explanation of the hardcoded `tts_engine` labeling bug);
`vic_f0_high` on the 2 Google Chirp HD female-voice clips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>