Skip to content

fix(cli,manifest): #108 — write repo-relative paths in clip JSON and manifest#111

Merged
shaypal5 merged 3 commits into
mainfrom
fix/metadata-relative-paths
May 12, 2026
Merged

fix(cli,manifest): #108 — write repo-relative paths in clip JSON and manifest#111
shaypal5 merged 3 commits into
mainfrom
fix/metadata-relative-paths

Conversation

@shaypal5

Copy link
Copy Markdown
Member

Problem

Newly generated ClipMetadata.dirty_file_path / transcript_path and
ManifestRow.wav_path / strong_labels_path were written as absolute,
machine-specific paths (with the developer's home dir baked in).
delivery-002 committed repo-relative paths; delivery-003
(avdp-synth-corpus#4) had to post-process them on the way in. Any
consumer cloning the corpus on a different machine would hit
FileNotFoundError on those paths.

spec.md §5.1 already documents the repo-relative form (updated in #103);
the spec is right, the code was wrong.

Solution

Two coordinated changes, mirroring the recipe in the issue:

  1. synthbanshee/cli.py — add _relative_to_data_root(path, data_root)
    and _infer_data_root(output_dir, override) helpers. Add --data-root
    to generate and generate-batch (envvar SYNTHBANSHEE_DATA_ROOT,
    default = two parents above --output-dir, matching the corpus
    convention <data_root>/data/he/). Pass data_root through to
    _run_generate_pipeline and _render_one; use the helper to rewrite
    dirty_file_path / transcript_path before constructing
    ClipMetadata.

  2. synthbanshee/package/manifest.py — add relative_to: Path | None = None kwarg on generate_manifest; when set, wav_path and
    strong_labels_path are written relative to that root. The CLI's
    generate-batch passes the inferred data root.

Fallback semantics: paths outside the configured data_root keep their
absolute form (no crash). This preserves back-compat for callers that
don't opt in (relative_to=None).

Changes per file

File Change
synthbanshee/cli.py New helpers _relative_to_data_root / _infer_data_root; new --data-root Click option on generate and generate-batch; threaded through _run_generate_pipeline and _render_one; applied to dirty_file_path / transcript_path and to the manifest call
synthbanshee/package/manifest.py New _maybe_relative helper; new relative_to kwarg on generate_manifest; applied to wav_path and strong_labels_path columns
synthbanshee/labels/schema.py Attribute docstrings on dirty_file_path / transcript_path documenting the relative-to-data-root contract
tests/unit/test_cli.py Two new TestRunGeneratePipeline cases — paths are repo-relative under corpus layout, and fall back to absolute when paths sit outside the configured root. Also fixed a mock signature (_fail_then_succeed) to accept the new data_root kwarg
tests/unit/test_manifest.py Two new TestGenerateManifest cases — relative_to rewrites columns; relative_to=None keeps absolute paths (back-compat)

Test plan

  • .venv/bin/pytest tests/unit/test_cli.py::TestRunGeneratePipeline tests/unit/test_manifest.py -v — 33/33 passing (4 new tests + 29 unchanged)
  • .venv/bin/pytest tests/unit/ -q — 1700/1700 passing (full unit suite, no regressions)
  • ruff check + mypy on all changed files — clean
  • Pre-commit hooks (ruff, ruff-format, mypy) — clean at commit time
  • Manual sanity: synthbanshee generate --help shows --data-root (envvar SYNTHBANSHEE_DATA_ROOT) on both commands

Tier-3 ASR sanity (local)

Not applicable — this PR touches synthbanshee/cli.py (path-shape change
only, no audio rendering touched), synthbanshee/package/manifest.py,
and synthbanshee/labels/schema.py (docstrings on str | None fields).
None of synthbanshee/tts/, synthbanshee/script/, synthbanshee/augment/,
nor any speaker / scene / acoustic / project YAML config is modified. Per
CLAUDE.md "ASR sanity check policy", this PR is exempt from the local
qa-report --asr run.

Corpus impact

Existing on-disk corpus clips (delivery-003 included) still have absolute
paths baked in — the next regen will overwrite them. The delivery-003 PR
already post-processed its 20 clips to relative, so no follow-up is
needed on the corpus side.

Fixes #108.

🤖 Generated with Claude Code

…manifest

ClipMetadata.dirty_file_path / transcript_path and ManifestRow.wav_path /
strong_labels_path used to be written as absolute, machine-specific paths
(e.g. /Users/<dev>/clones/avdp-synth-corpus/...). delivery-002 committed
relative paths, and delivery-003 (avdp-synth-corpus#4) had to post-process
them on the way in. spec.md §5.1 already specifies repo-relative form
(updated in #103); this brings the generator in line.

Changes:
- synthbanshee/cli.py: new _relative_to_data_root + _infer_data_root
  helpers; new --data-root option on `generate` and `generate-batch`
  (envvar SYNTHBANSHEE_DATA_ROOT, default = two parents above output-dir,
  matching the corpus convention <data_root>/data/he/). Threaded through
  _run_generate_pipeline and _render_one. dirty_file_path / transcript_path
  go through the helper before metadata is built.
- synthbanshee/package/manifest.py: new relative_to kwarg on
  generate_manifest; applied via _maybe_relative to wav_path and
  strong_labels_path. CLI passes the inferred data root.
- synthbanshee/labels/schema.py: attribute docstrings on dirty_file_path
  / transcript_path documenting the relative-to-data-root contract.
- tests/unit/test_cli.py: two new tests under TestRunGeneratePipeline —
  metadata paths are repo-relative under the corpus layout, and fall
  back to absolute when paths land outside the configured root.
- tests/unit/test_manifest.py: two new tests on generate_manifest —
  relative_to rewrites wav_path / strong_labels_path columns;
  relative_to=None preserves the legacy absolute form (back-compat).

Fixes #108.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 12, 2026 06:30
@shaypal5 shaypal5 added type: fix Bug fix comp: labels Label generation, schema, IAA utilities labels May 12, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes #108 by making generated clip metadata JSON and manifest CSV record repo-relative paths (anchored at a configurable “data root”) instead of absolute, machine-specific paths, improving corpus portability.

Changes:

  • CLI: introduce --data-root (env SYNTHBANSHEE_DATA_ROOT) plus helpers to infer a default root and rewrite dirty_file_path / transcript_path to be relative when possible.
  • Manifest generator: add relative_to kwarg to optionally write wav_path / strong_labels_path relative to a root (default keeps legacy behavior).
  • Tests/docs: add unit coverage for both relative rewriting and absolute fallback; document the contract in ClipMetadata field docstrings.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
synthbanshee/cli.py Adds data-root inference + relative path rewriting for per-clip JSON; wires --data-root through generate + generate-batch and into manifest generation.
synthbanshee/package/manifest.py Adds relative_to support and applies it to path columns via _maybe_relative.
synthbanshee/labels/schema.py Documents the new “relative to data root” contract for path fields.
tests/unit/test_cli.py Adds unit tests for relative path emission and out-of-root absolute fallback; updates a mock signature for the new kwarg.
tests/unit/test_manifest.py Adds unit tests covering relative_to rewriting and relative_to=None back-compat.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread synthbanshee/cli.py Outdated
Comment thread synthbanshee/package/manifest.py Outdated
@github-actions

This comment has been minimized.

…per + add real tests

Self-review caught five gaps in the first cut. This commit addresses them:

1. Silent fallback to absolute path now logs a WARNING. The original
   code returned ``str(path)`` whenever a path landed outside data_root
   — i.e. the same bug #108 was supposed to fix, silently. A
   misconfigured ``--data-root`` is now visible in CLI output via
   ``logger.warning``, making the failure mode loud.

2. Deduped helper. ``_relative_to_data_root`` in cli.py and
   ``_maybe_relative`` in manifest.py were near-identical duplicates.
   Extracted to ``synthbanshee/package/_paths.relative_to_data_root``;
   both call sites delegate to the single helper.

3. Direct tests for ``_infer_data_root`` and ``relative_to_data_root``.
   Previously the helpers were only tested through their call sites
   (which exercise the success path); the new ``TestInferDataRoot``
   and ``TestRelativeToDataRoot`` classes cover every branch including
   the loud-warn fallback. ``test_path_outside_root_falls_back_to_absolute_with_warning``
   uses ``caplog`` to verify the warning fires.

4. End-to-end batch CLI test. The new
   ``test_full_batch_writes_relative_paths_to_manifest_and_clip_json``
   invokes ``generate-batch`` with ``--data-root`` set and asserts the
   resulting ``manifest.csv`` ``wav_path`` column AND the per-clip
   JSON ``transcript_path`` are both repo-relative. Catches the
   regression where ``data_root`` is threaded through one path but not
   the other — both must be relative-anchored, and the previous test
   suite only exercised them in isolation.

5. Parallel ``_render_one`` call now uses keyword args. The
   ``ThreadPoolExecutor.submit`` call mixed positional and keyword
   arguments (kwargs in the sequential path, positional in the
   parallel path); a single signature drift would silently transpose
   ``data_root`` and another arg. Both call sites now use the same
   kwarg shape.

Full unit suite: 1708/1708 passing (up from 1700 — eight new tests).
ruff + mypy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

Copilot caught two real gaps in the fallback path:

1. **Relative input + None data_root → relative output.** The original
   ``str(path)`` fallback claimed "absolute form" in the docstring but
   returned whatever shape the caller passed. A relative ``Path("data/he/clip.wav")``
   produced a relative string in the JSON, making metadata depend on the
   process's working directory at write time. Fixed by always calling
   ``Path.resolve()`` before stringifying — every code path now produces
   a stable absolute string when no relative form is achievable.

2. **Windows path separators in JSON.** ``str(Path)`` on Windows emits
   backslashes; the same corpus loaded on Linux then breaks. Fixed by
   using ``Path.as_posix()`` on every path the helper returns, both
   for the relative happy path and the absolute fallback. The corpus
   is now portable across host OSes by contract, not by accident.

New regression tests cover the combined edge cases:
- relative input + data_root=None → absolute output
- relative input + path outside root → absolute output (not "../../foo")
- output strings carry no backslashes regardless of host OS

Full unit suite: 1710/1710 passing (up from 1708 — two new edge-case
tests in TestRelativeToDataRoot).

Resolves Copilot review threads on _relative_to_data_root and
_maybe_relative.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 12, 2026 13:28
@github-actions

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #111 in repository https://github.com/DataHackIL/SynthBanshee

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: synthbanshee/cli.py
URL: https://github.com/DataHackIL/SynthBanshee/pull/111#discussion_r3224209657
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    _relative_to_data_root() claims it “falls back to the legacy absolute form”, but it returns str(path) when data_root is None or when relative_to() fails. If the caller passes a relative Path, this fallback will also be relative, contradicting the docstring and making JSON path shape dependent on the current working directory. Consider returning a resolved absolute string on fallback (and/or clarifying the contract), and using a POSIX string for the relative case so metadata stays stable across OS path separators.

## COPILOT-2
Location: synthbanshee/package/manifest.py
URL: https://github.com/DataHackIL/SynthBanshee/pull/111#discussion_r3224209690
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    _maybe_relative() docstring says it “falls back to the absolute form” when the path is outside root, but the exception branch returns str(path) (which may be relative if data_dir/root were passed as relative Paths). To keep the manifest contract consistent and portable, consider falling back to a resolved absolute path string, and emitting POSIX-style separators for the relative form (similar to how archiving code uses .as_posix()).

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25737600089 attempt 1
Comment timestamp: 2026-05-12T13:30:11.914449+00:00
PR head commit: d78f1fdeb7072ebfb42351b8c11ebee630799a0e

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment on lines 218 to +231
dirty_file_path: str | None = None
"""Path to the retained pre-preprocessing WAV (#108).

Written relative to the data root passed to ``synthbanshee generate``
via ``--data-root`` / ``SYNTHBANSHEE_DATA_ROOT`` (default: two parents
above ``--output-dir``). Older corpus snapshots may still carry
absolute paths; consumers should treat both forms as accepted but
only emit relative paths going forward."""

transcript_path: str | None = None
"""Path to the ``.txt`` transcript file (#108).

Written relative to the data root — see ``dirty_file_path`` for the
anchor convention."""
Comment on lines +194 to +198
def test_relative_to_none_keeps_absolute_paths(self, tmp_path):
"""Default ``relative_to=None`` preserves pre-#108 absolute-path behaviour.

Documents the back-compat contract — callers that don't opt into
the new ``relative_to`` argument see no behaviour change.
@shaypal5 shaypal5 merged commit 58b047f into main May 12, 2026
10 checks passed
@shaypal5 shaypal5 deleted the fix/metadata-relative-paths branch May 12, 2026 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp: labels Label generation, schema, IAA utilities type: fix Bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(cli,manifest): generated metadata records absolute, machine-specific paths

2 participants