fix(cli,manifest): #108 — write repo-relative paths in clip JSON and manifest#111
Merged
Conversation
…manifest ClipMetadata.dirty_file_path / transcript_path and ManifestRow.wav_path / strong_labels_path used to be written as absolute, machine-specific paths (e.g. /Users/<dev>/clones/avdp-synth-corpus/...). delivery-002 committed relative paths, and delivery-003 (avdp-synth-corpus#4) had to post-process them on the way in. spec.md §5.1 already specifies repo-relative form (updated in #103); this brings the generator in line. Changes: - synthbanshee/cli.py: new _relative_to_data_root + _infer_data_root helpers; new --data-root option on `generate` and `generate-batch` (envvar SYNTHBANSHEE_DATA_ROOT, default = two parents above output-dir, matching the corpus convention <data_root>/data/he/). Threaded through _run_generate_pipeline and _render_one. dirty_file_path / transcript_path go through the helper before metadata is built. - synthbanshee/package/manifest.py: new relative_to kwarg on generate_manifest; applied via _maybe_relative to wav_path and strong_labels_path. CLI passes the inferred data root. - synthbanshee/labels/schema.py: attribute docstrings on dirty_file_path / transcript_path documenting the relative-to-data-root contract. - tests/unit/test_cli.py: two new tests under TestRunGeneratePipeline — metadata paths are repo-relative under the corpus layout, and fall back to absolute when paths land outside the configured root. - tests/unit/test_manifest.py: two new tests on generate_manifest — relative_to rewrites wav_path / strong_labels_path columns; relative_to=None preserves the legacy absolute form (back-compat). Fixes #108. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Fixes #108 by making generated clip metadata JSON and manifest CSV record repo-relative paths (anchored at a configurable “data root”) instead of absolute, machine-specific paths, improving corpus portability.
Changes:
- CLI: introduce
--data-root(envSYNTHBANSHEE_DATA_ROOT) plus helpers to infer a default root and rewritedirty_file_path/transcript_pathto be relative when possible. - Manifest generator: add
relative_tokwarg to optionally writewav_path/strong_labels_pathrelative to a root (default keeps legacy behavior). - Tests/docs: add unit coverage for both relative rewriting and absolute fallback; document the contract in
ClipMetadatafield docstrings.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| synthbanshee/cli.py | Adds data-root inference + relative path rewriting for per-clip JSON; wires --data-root through generate + generate-batch and into manifest generation. |
| synthbanshee/package/manifest.py | Adds relative_to support and applies it to path columns via _maybe_relative. |
| synthbanshee/labels/schema.py | Documents the new “relative to data root” contract for path fields. |
| tests/unit/test_cli.py | Adds unit tests for relative path emission and out-of-root absolute fallback; updates a mock signature for the new kwarg. |
| tests/unit/test_manifest.py | Adds unit tests covering relative_to rewriting and relative_to=None back-compat. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This comment has been minimized.
This comment has been minimized.
…per + add real tests Self-review caught five gaps in the first cut. This commit addresses them: 1. Silent fallback to absolute path now logs a WARNING. The original code returned ``str(path)`` whenever a path landed outside data_root — i.e. the same bug #108 was supposed to fix, silently. A misconfigured ``--data-root`` is now visible in CLI output via ``logger.warning``, making the failure mode loud. 2. Deduped helper. ``_relative_to_data_root`` in cli.py and ``_maybe_relative`` in manifest.py were near-identical duplicates. Extracted to ``synthbanshee/package/_paths.relative_to_data_root``; both call sites delegate to the single helper. 3. Direct tests for ``_infer_data_root`` and ``relative_to_data_root``. Previously the helpers were only tested through their call sites (which exercise the success path); the new ``TestInferDataRoot`` and ``TestRelativeToDataRoot`` classes cover every branch including the loud-warn fallback. ``test_path_outside_root_falls_back_to_absolute_with_warning`` uses ``caplog`` to verify the warning fires. 4. End-to-end batch CLI test. The new ``test_full_batch_writes_relative_paths_to_manifest_and_clip_json`` invokes ``generate-batch`` with ``--data-root`` set and asserts the resulting ``manifest.csv`` ``wav_path`` column AND the per-clip JSON ``transcript_path`` are both repo-relative. Catches the regression where ``data_root`` is threaded through one path but not the other — both must be relative-anchored, and the previous test suite only exercised them in isolation. 5. Parallel ``_render_one`` call now uses keyword args. The ``ThreadPoolExecutor.submit`` call mixed positional and keyword arguments (kwargs in the sequential path, positional in the parallel path); a single signature drift would silently transpose ``data_root`` and another arg. Both call sites now use the same kwarg shape. Full unit suite: 1708/1708 passing (up from 1700 — eight new tests). ruff + mypy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Copilot caught two real gaps in the fallback path:
1. **Relative input + None data_root → relative output.** The original
``str(path)`` fallback claimed "absolute form" in the docstring but
returned whatever shape the caller passed. A relative ``Path("data/he/clip.wav")``
produced a relative string in the JSON, making metadata depend on the
process's working directory at write time. Fixed by always calling
``Path.resolve()`` before stringifying — every code path now produces
a stable absolute string when no relative form is achievable.
2. **Windows path separators in JSON.** ``str(Path)`` on Windows emits
backslashes; the same corpus loaded on Linux then breaks. Fixed by
using ``Path.as_posix()`` on every path the helper returns, both
for the relative happy path and the absolute fallback. The corpus
is now portable across host OSes by contract, not by accident.
New regression tests cover the combined edge cases:
- relative input + data_root=None → absolute output
- relative input + path outside root → absolute output (not "../../foo")
- output strings carry no backslashes regardless of host OS
Full unit suite: 1710/1710 passing (up from 1708 — two new edge-case
tests in TestRelativeToDataRoot).
Resolves Copilot review threads on _relative_to_data_root and
_maybe_relative.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
pr-agent-context report: This run includes unresolved review comments on PR #111 in repository https://github.com/DataHackIL/SynthBanshee
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: synthbanshee/cli.py
URL: https://github.com/DataHackIL/SynthBanshee/pull/111#discussion_r3224209657
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
_relative_to_data_root() claims it “falls back to the legacy absolute form”, but it returns str(path) when data_root is None or when relative_to() fails. If the caller passes a relative Path, this fallback will also be relative, contradicting the docstring and making JSON path shape dependent on the current working directory. Consider returning a resolved absolute string on fallback (and/or clarifying the contract), and using a POSIX string for the relative case so metadata stays stable across OS path separators.
## COPILOT-2
Location: synthbanshee/package/manifest.py
URL: https://github.com/DataHackIL/SynthBanshee/pull/111#discussion_r3224209690
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
_maybe_relative() docstring says it “falls back to the absolute form” when the path is outside root, but the exception branch returns str(path) (which may be relative if data_dir/root were passed as relative Paths). To keep the manifest contract consistent and portable, consider falling back to a resolved absolute path string, and emitting POSIX-style separators for the relative form (similar to how archiving code uses .as_posix()).Run metadata: |
Comment on lines
218
to
+231
| dirty_file_path: str | None = None | ||
| """Path to the retained pre-preprocessing WAV (#108). | ||
|
|
||
| Written relative to the data root passed to ``synthbanshee generate`` | ||
| via ``--data-root`` / ``SYNTHBANSHEE_DATA_ROOT`` (default: two parents | ||
| above ``--output-dir``). Older corpus snapshots may still carry | ||
| absolute paths; consumers should treat both forms as accepted but | ||
| only emit relative paths going forward.""" | ||
|
|
||
| transcript_path: str | None = None | ||
| """Path to the ``.txt`` transcript file (#108). | ||
|
|
||
| Written relative to the data root — see ``dirty_file_path`` for the | ||
| anchor convention.""" |
Comment on lines
+194
to
+198
| def test_relative_to_none_keeps_absolute_paths(self, tmp_path): | ||
| """Default ``relative_to=None`` preserves pre-#108 absolute-path behaviour. | ||
|
|
||
| Documents the back-compat contract — callers that don't opt into | ||
| the new ``relative_to`` argument see no behaviour change. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Newly generated
ClipMetadata.dirty_file_path/transcript_pathandManifestRow.wav_path/strong_labels_pathwere written as absolute,machine-specific paths (with the developer's home dir baked in).
delivery-002 committed repo-relative paths; delivery-003
(
avdp-synth-corpus#4) had to post-process them on the way in. Anyconsumer cloning the corpus on a different machine would hit
FileNotFoundErroron those paths.spec.md§5.1 already documents the repo-relative form (updated in #103);the spec is right, the code was wrong.
Solution
Two coordinated changes, mirroring the recipe in the issue:
synthbanshee/cli.py— add_relative_to_data_root(path, data_root)and
_infer_data_root(output_dir, override)helpers. Add--data-rootto
generateandgenerate-batch(envvarSYNTHBANSHEE_DATA_ROOT,default = two parents above
--output-dir, matching the corpusconvention
<data_root>/data/he/). Passdata_rootthrough to_run_generate_pipelineand_render_one; use the helper to rewritedirty_file_path/transcript_pathbefore constructingClipMetadata.synthbanshee/package/manifest.py— addrelative_to: Path | None = Nonekwarg ongenerate_manifest; when set,wav_pathandstrong_labels_pathare written relative to that root. The CLI'sgenerate-batchpasses the inferred data root.Fallback semantics: paths outside the configured
data_rootkeep theirabsolute form (no crash). This preserves back-compat for callers that
don't opt in (
relative_to=None).Changes per file
synthbanshee/cli.py_relative_to_data_root/_infer_data_root; new--data-rootClick option ongenerateandgenerate-batch; threaded through_run_generate_pipelineand_render_one; applied todirty_file_path/transcript_pathand to the manifest callsynthbanshee/package/manifest.py_maybe_relativehelper; newrelative_tokwarg ongenerate_manifest; applied towav_pathandstrong_labels_pathcolumnssynthbanshee/labels/schema.pydirty_file_path/transcript_pathdocumenting the relative-to-data-root contracttests/unit/test_cli.pyTestRunGeneratePipelinecases — paths are repo-relative under corpus layout, and fall back to absolute when paths sit outside the configured root. Also fixed a mock signature (_fail_then_succeed) to accept the newdata_rootkwargtests/unit/test_manifest.pyTestGenerateManifestcases —relative_torewrites columns;relative_to=Nonekeeps absolute paths (back-compat)Test plan
.venv/bin/pytest tests/unit/test_cli.py::TestRunGeneratePipeline tests/unit/test_manifest.py -v— 33/33 passing (4 new tests + 29 unchanged).venv/bin/pytest tests/unit/ -q— 1700/1700 passing (full unit suite, no regressions)ruff check+mypyon all changed files — cleansynthbanshee generate --helpshows--data-root(envvarSYNTHBANSHEE_DATA_ROOT) on both commandsTier-3 ASR sanity (local)
Not applicable — this PR touches
synthbanshee/cli.py(path-shape changeonly, no audio rendering touched),
synthbanshee/package/manifest.py,and
synthbanshee/labels/schema.py(docstrings onstr | Nonefields).None of
synthbanshee/tts/,synthbanshee/script/,synthbanshee/augment/,nor any speaker / scene / acoustic / project YAML config is modified. Per
CLAUDE.md "ASR sanity check policy", this PR is exempt from the local
qa-report --asrrun.Corpus impact
Existing on-disk corpus clips (delivery-003 included) still have absolute
paths baked in — the next regen will overwrite them. The delivery-003 PR
already post-processed its 20 clips to relative, so no follow-up is
needed on the corpus side.
Fixes #108.
🤖 Generated with Claude Code