Benchmark descriptors advertise stale CLIs/metadata that do not match converter behavior #369

## Summary

Some benchmark metadata files advertise behavior that does not match the checked-in converter scripts/tests. This can mislead users trying to run external benchmark adapters.

## Findings

### 1. ProgramBench metadata advertises a CLI that is a no-op

`benchmarks/programbench/benchmark.yaml` says `benchflow.py` supports flags such as:

- `--output-dir`
- `--limit`
- `--overwrite`
- `--task-ids`

But `benchmarks/programbench/benchflow.py` has no `main()` / argparse entrypoint.

Repro:

```bash
uv run python benchmarks/programbench/benchflow.py --help
```

Actual: exits `0` with empty stdout/stderr.

Expected: either a working converter CLI/help output or corrected metadata that says this is an import-only module.
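If the first option is taken, a minimal sketch of an entrypoint exposing the four advertised flags could look like the following. Only the flag names come from `benchmark.yaml`; their types, defaults, and help strings, and the `build_parser()` helper, are assumptions for illustration, and the body of `main()` is a placeholder rather than the real conversion logic.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags that benchmark.yaml advertises."""
    parser = argparse.ArgumentParser(
        prog="benchflow.py",
        description="Convert ProgramBench tasks (illustrative sketch).",
    )
    # Types, defaults, and help text below are assumptions, not repo facts.
    parser.add_argument("--output-dir", type=Path, default=Path("out"),
                        help="Directory to write converted tasks to.")
    parser.add_argument("--limit", type=int, default=None,
                        help="Convert at most this many tasks.")
    parser.add_argument("--overwrite", action="store_true",
                        help="Replace existing task directories.")
    parser.add_argument("--task-ids", nargs="*", default=None,
                        help="Only convert the listed task IDs.")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    # Placeholder: a real entrypoint would call the module's converter here.
    print(f"would convert into {args.output_dir} (limit={args.limit})")
    return 0


# In benchflow.py this would be wired up with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

With a guard like the commented one in place, `--help` would print usage text instead of exiting silently.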

### 2. OpaqueToolsBench metadata says no oracle solutions, but converter emits them

`benchmarks/opaquetoolsbench/benchmark.yaml` says:

```yaml
has_oracle_solutions: false
```

But `benchmarks/opaquetoolsbench/benchflow.py` emits `solution/solve.sh`, and `tests/test_adapter_scripts.py` appears to guard that behavior.

Expected: descriptor metadata should match converter output.
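One way to check this automatically is to compare the declared flag with what a generated task directory actually contains. The sketch below assumes the descriptor has already been parsed into a dict and that oracle solutions always live at `solution/solve.sh`. The function name `oracle_flag_mismatches` is hypothetical.

```python
from pathlib import Path


def oracle_flag_mismatches(descriptor: dict, task_dirs: list) -> list:
    """Return one message per task whose output contradicts the descriptor.

    `descriptor` is the parsed benchmark.yaml; `task_dirs` are generated
    task directories (Path objects). Both are supplied by the caller.
    """
    declared = bool(descriptor.get("has_oracle_solutions", False))
    mismatches = []
    for task_dir in task_dirs:
        # Assumption: an oracle solution means solution/solve.sh exists.
        emitted = (Path(task_dir) / "solution" / "solve.sh").is_file()
        if emitted != declared:
            mismatches.append(
                f"{Path(task_dir).name}: declared={declared}, emitted={emitted}"
            )
    return mismatches
```

An empty return value means the descriptor and the converter output agree for the sampled tasks.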

## Impact

P2/P3 adapter metadata drift. This is not as severe as converter output corruption, but it makes the benchmark catalog unreliable for users and automation that reads descriptor capabilities.

## Suggested fix

- Add lightweight tests that run `python benchmarks/<name>/benchflow.py --help` for each benchmark whose metadata advertises CLI flags, and assert that the help text is non-empty.
- Validate `benchmark.yaml` capability fields against converter output for representative generated tasks.
- Either implement a ProgramBench CLI entrypoint or remove the CLI flag claims from its metadata.
- Correct OpaqueToolsBench `has_oracle_solutions`, or adjust the converter/tests.
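
The `--help` check in the first bullet could be sketched roughly as below. The helper name `cli_advertises` and the idea of passing the advertised flags in as a list are assumptions. How the flags are read out of `benchmark.yaml` is left to the caller.

```python
import subprocess
import sys
from pathlib import Path


def cli_advertises(script: Path, advertised_flags: list) -> bool:
    """Return True if `script --help` succeeds and mentions every flag."""
    result = subprocess.run(
        [sys.executable, str(script), "--help"],
        capture_output=True,
        text=True,
        timeout=60,
    )
    # An import-only module exits 0 with empty output, so the exit code
    # alone is not enough; the help text must also name each flag.
    return result.returncode == 0 and all(
        flag in result.stdout for flag in advertised_flags
    )
```

A test would call this once per benchmark directory and fail when it returns `False` for a descriptor that lists CLI flags.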

