Summary
Some benchmark metadata files advertise behavior that does not match the checked-in converter scripts/tests. This can mislead users trying to run external benchmark adapters.
Findings
1. ProgramBench metadata advertises a CLI that is a no-op
benchmarks/programbench/benchmark.yaml says benchflow.py supports flags such as:
--output-dir
--limit
--overwrite
--task-ids
But benchmarks/programbench/benchflow.py has no main() / argparse entrypoint.
Repro:
uv run python benchmarks/programbench/benchflow.py --help
Actual: exits 0 with empty stdout/stderr.
Expected: either a working converter CLI/help output or corrected metadata that says this is an import-only module.
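The repro above can also be expressed as a small check that a test could call. This is only a sketch: the helper name `missing_help_flags` is hypothetical, and it assumes that a working CLI prints each advertised flag somewhere in its `--help` output.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def missing_help_flags(script: Path, flags: List[str]) -> List[str]:
    """Run `script --help` and return the advertised flags absent from its output.

    Hypothetical helper: a script with no CLI entrypoint prints nothing,
    so every advertised flag is reported as missing.
    """
    result = subprocess.run(
        [sys.executable, str(script), "--help"],
        capture_output=True,
        text=True,
        timeout=60,
    )
    output = result.stdout + result.stderr
    return [flag for flag in flags if flag not in output]
```

Against the current benchmarks/programbench/benchflow.py, which reportedly prints nothing, this would be expected to return all four advertised flags.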
2. OpaqueToolsBench metadata says no oracle solutions, but converter emits them
benchmarks/opaquetoolsbench/benchmark.yaml says:
has_oracle_solutions: false
But benchmarks/opaquetoolsbench/benchflow.py emits solution/solve.sh, and tests/test_adapter_scripts.py appears to guard that behavior.
Expected: descriptor metadata should match converter output.
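A consistency check for this finding could look roughly like the sketch below. It assumes `has_oracle_solutions` is a flat `true`/`false` key in benchmark.yaml and that an oracle solution always lands at solution/solve.sh inside a generated task directory. Both function names are hypothetical, and the regex stands in for a real YAML parser to keep the example stdlib-only.

```python
import re
from pathlib import Path
from typing import Optional


def declared_oracle_flag(descriptor_text: str) -> Optional[bool]:
    """Read a flat `has_oracle_solutions: true|false` line; None if the key is absent."""
    match = re.search(
        r"^has_oracle_solutions:\s*(true|false)\s*$",
        descriptor_text,
        re.MULTILINE | re.IGNORECASE,
    )
    return None if match is None else match.group(1).lower() == "true"


def oracle_flag_matches(descriptor_text: str, task_dir: Path) -> bool:
    """Compare the declared flag with whether the task actually ships solve.sh.

    A missing key counts as a mismatch, since None equals neither True nor False.
    """
    declared = declared_oracle_flag(descriptor_text)
    emitted = (task_dir / "solution" / "solve.sh").is_file()
    return declared == emitted
```

With the OpaqueToolsBench descriptor as described above and a generated task that contains solution/solve.sh, `oracle_flag_matches` would return False.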
Impact
P2/P3 adapter metadata drift. This is less severe than converter output corruption, but it makes the benchmark catalog unreliable for users and for automation that reads descriptor capabilities.
Suggested fix
- Add lightweight tests that execute python benchmarks/*/benchflow.py --help where metadata advertises CLI flags.
- Validate benchmark.yaml capability fields against converter output for representative generated tasks.
- Either implement ProgramBench CLI entrypoint or remove CLI flag claims from metadata.
- Correct OpaqueToolsBench has_oracle_solutions or adjust converter/tests.
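If the ProgramBench CLI is implemented rather than removed from the metadata, the entrypoint might look something like this sketch. The four flag names come from benchmark.yaml as quoted above. Their types, defaults, and help strings are assumptions, and the call into the actual conversion logic is left as a placeholder.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four advertised flags (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Convert ProgramBench tasks (sketch).")
    parser.add_argument("--output-dir", default="out", help="directory for generated tasks")
    parser.add_argument("--limit", type=int, default=None, help="maximum number of tasks")
    parser.add_argument("--overwrite", action="store_true", help="replace existing output")
    parser.add_argument("--task-ids", nargs="*", default=None, help="only convert these task ids")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    # Placeholder: the module's existing conversion code would be called here.
    print(
        f"would convert into {args.output_dir} "
        f"(limit={args.limit}, overwrite={args.overwrite}, task_ids={args.task_ids})"
    )
    return 0


# In benchflow.py this would be wired up with the usual guard:
#   if __name__ == "__main__":
#       raise SystemExit(main())
```

Keeping parser construction in its own function lets a test inspect the help text or the parsed arguments without spawning a subprocess.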