Benchmark descriptors advertise stale CLIs/metadata that do not match converter behavior #369

@xdotli

Summary

Some benchmark metadata files advertise behavior that does not match the checked-in converter scripts/tests. This can mislead users trying to run external benchmark adapters.

Findings

1. ProgramBench metadata advertises a CLI that is a no-op

benchmarks/programbench/benchmark.yaml says benchflow.py supports flags such as:

  • --output-dir
  • --limit
  • --overwrite
  • --task-ids

But benchmarks/programbench/benchflow.py has no main() / argparse entrypoint.

Repro:

uv run python benchmarks/programbench/benchflow.py --help

Actual: exits 0 with empty stdout/stderr.

Expected: either a working converter CLI/help output or corrected metadata that says this is an import-only module.
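
If the CLI route is chosen, a minimal sketch of an argparse entrypoint exposing the four advertised flags might look like the following. The flag types, defaults, and the print placeholder are assumptions for illustration; the real conversion function in benchmarks/programbench/benchflow.py is not shown in this issue, so the hand-off is left hypothetical.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the four flags listed in benchmark.yaml."""
    parser = argparse.ArgumentParser(
        prog="benchflow.py",
        description="Convert ProgramBench tasks (illustrative sketch).",
    )
    # Types and defaults below are assumptions, not taken from the repository.
    parser.add_argument("--output-dir", type=Path, default=Path("out"))
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--task-ids", nargs="*", default=None)
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments and report what would be converted."""
    args = build_parser().parse_args(argv)
    # Hypothetical placeholder: the module's real conversion call would go here.
    print(
        f"would convert: output_dir={args.output_dir} limit={args.limit} "
        f"overwrite={args.overwrite} task_ids={args.task_ids}"
    )
    return 0
```

Wiring this up would then only need the usual `if __name__ == "__main__": raise SystemExit(main())` guard at the bottom of the module, after which `--help` prints usage text instead of exiting silently.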

2. OpaqueToolsBench metadata says no oracle solutions, but converter emits them

benchmarks/opaquetoolsbench/benchmark.yaml says:

has_oracle_solutions: false

But benchmarks/opaquetoolsbench/benchflow.py emits solution/solve.sh, and tests/test_adapter_scripts.py appears to guard that behavior.

Expected: descriptor metadata should match converter output.
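
A consistency check for this finding could be sketched as below. It assumes the descriptor value has already been loaded into a bool and that a generated task directory is available; the function name and the idea of checking a single representative task are illustrative, not existing repository code.

```python
from pathlib import Path


def oracle_flag_matches_output(has_oracle_solutions: bool, task_dir: Path) -> bool:
    """Return True when the descriptor flag agrees with the generated task.

    A task counts as shipping an oracle solution when solution/solve.sh
    exists under task_dir (the path named in this issue).
    """
    emits_oracle = (task_dir / "solution" / "solve.sh").is_file()
    return has_oracle_solutions == emits_oracle
```

With the current state described above (flag false, solve.sh emitted), this check would return False for an OpaqueToolsBench task.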

Impact

P2/P3 adapter metadata drift. This is not as severe as converter output corruption, but it makes the benchmark catalog unreliable for users and automation that reads descriptor capabilities.

Suggested fix

  • Add lightweight tests that execute python benchmarks/*/benchflow.py --help where metadata advertises CLI flags.
  • Validate benchmark.yaml capability fields against converter output for representative generated tasks.
  • Either implement a ProgramBench CLI entrypoint or remove the CLI flag claims from its metadata.
  • Correct OpaqueToolsBench has_oracle_solutions or adjust converter/tests.
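
The first suggestion could be sketched roughly as follows. The helper name is hypothetical, and the list of advertised flags is assumed to be read from benchmark.yaml elsewhere; the check is a loose substring match on the help text, so it only catches flags that are missing outright.

```python
import subprocess
import sys
from pathlib import Path
from typing import Iterable, List


def missing_help_flags(script: Path, flags: Iterable[str]) -> List[str]:
    """Run `python <script> --help` and return advertised flags absent from its output."""
    result = subprocess.run(
        [sys.executable, str(script), "--help"],
        capture_output=True,
        text=True,
        timeout=60,
    )
    help_text = result.stdout + result.stderr
    # An import-only module prints nothing, so every advertised flag is reported.
    return [flag for flag in flags if flag not in help_text]
```

A test could then assert that the returned list is empty for each benchmark whose descriptor advertises CLI flags, which would have flagged the ProgramBench case above.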

Metadata

    Labels

    fixed: Verified fixed by running the patched code
