BenchFlow uses a resource-verb pattern: bench <resource> <verb>.
List all registered agents with their protocol and native/default auth
requirements. Provider-prefixed models may use provider-specific credentials;
Azure Foundry models use AZURE_API_KEY plus AZURE_API_ENDPOINT.
bench agent list

Show details for a specific agent, including native/default auth and a note about provider-specific credentials.

bench agent show gemini

Create and run an evaluation. Use this command for YAML configs and batch runs; it also accepts a single task directory.
# From YAML config
bench eval create --config benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml

# From remote repo (fast Daytona batch; token usage may be unavailable)
bench eval create \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --concurrency 64 \
  --sandbox-setup-timeout 300

# From remote repo with required token usage telemetry through an external tunnel
bench eval create \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --usage-tracking required \
  --usage-proxy-url https://your-tunnel.example.com \
  --usage-proxy-port 18081 \
  --concurrency 1 \
  --sandbox-setup-timeout 300

# From local directory
bench eval create --tasks-dir ./tasks --agent gemini --model gemini-3.1-flash-lite-preview

# From a hosted PrimeIntellect / Verifiers environment
bench eval create \
  --source-env primeintellect/general-agent \
  --source-env-version 0.1.1 \
  --source-env-arg task=calendar_scheduling_t0 \
  --agent gemini \
  --model google/gemini-2.5-flash-lite

# Single task with mounted skills and the recommended skill nudge
bench eval create \
  --tasks-dir tasks/pdf-fix \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --skills-dir tasks/pdf-fix/environment/skills \
  --agent-env BENCHFLOW_SKILL_NUDGE=name

| Flag | Default | Description |
|---|---|---|
| --config | — | YAML config file |
| --tasks-dir | — | Local task dir (single task with task.toml, or parent of many) |
| --source-repo | — | Remote repo as org/repo (e.g. benchflow-ai/skillsbench) |
| --source-path | — | Subpath within the repo (e.g. tasks) |
| --source-ref | — | Branch or tag to clone (e.g. main) |
| --source-env | — | Hosted environment source (e.g. primeintellect/general-agent) |
| --source-env-version | — | Hosted environment version |
| --source-env-arg | — | Hosted environment argument as KEY=VALUE; repeatable |
| --source-env-num-examples | 1 | Number of hosted environment examples |
| --source-env-rollouts-per-example | 1 | Rollouts per hosted environment example |
| --source-env-max-tokens | 1024 | Max tokens for hosted environment model calls |
| --source-env-temperature | 0.0 | Temperature for hosted environment model calls |
| --source-env-sampling-arg | — | Verifiers sampling argument as KEY=VALUE; repeatable (for example reasoning_effort=minimal) |
| --agent | claude-agent-acp | Agent name |
| --model | Agent default | Model ID |
| --sandbox | docker | Sandbox: docker, daytona, or modal |
| --usage-tracking | auto | Token usage telemetry policy: auto, required, or off |
| --usage-proxy-url | — | Externally reachable usage-proxy base URL for remote sandboxes such as Daytona |
| --usage-proxy-bind-host | auto | Local interface for the usage proxy; external proxy mode defaults to 127.0.0.1 |
| --usage-proxy-port | random | Fixed local port for externally tunneled usage tracking |
| --environment-manifest | — | Path to an Environment-plane manifest (environment.toml); applied to every rollout in the batch |
| --concurrency | 4 | Max concurrent tasks (batch mode only) |
| --agent-idle-timeout | (built-in default) | Abort ACP prompts after this many idle seconds; 0 disables idle detection |
| --jobs-dir | jobs | Output directory |
| --sandbox-user | agent | Sandbox user (null for root) |
| --sandbox-setup-timeout | 120 | Timeout in seconds for sandbox user setup |
| --skills-dir | — | Skills directory to deploy into each task sandbox; use auto for each task's environment/skills |
| --skill-mode | default | Skill mode: default or self-gen |
| --skill-creator-dir | — | Path to a skill-creator directory (or a skills root containing it); used when --skill-mode self-gen |
| --self-gen-no-internet | false | Disable web tools for the self-generated skill run |
| --agent-env | — | Agent environment variable as KEY=VALUE; repeatable |
| --include | — | Only run these task names; repeatable (e.g. --include jax-computing-basics --include data-to-d3) |
| --exclude | — | Skip these task names; repeatable (e.g. --exclude quantum-numerical-simulation) |

When mounting skills, the default recommended in these docs is
--agent-env BENCHFLOW_SKILL_NUDGE=name. See
Architecture: skill loading for how
--skills-dir is registered with each agent and how the nudge modes differ.
For official Daytona batch runs that must report provider token/cost telemetry,
use --usage-tracking required with a tunnel or ingress URL pointing at the
fixed --usage-proxy-port. The fixed-port tunnel mode supports one rollout per
BenchFlow process; use --concurrency 1, or run multiple jobs with separate
ports/tunnels. This limit applies only to metered external-tunnel mode; Daytona
batch runs that do not require usage telemetry can still use higher concurrency.
Without an external URL, Daytona runs continue in auto mode and record
usage_source=unavailable because the remote sandbox cannot reach a host-bound
proxy.
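
The constraints in the paragraph above can be restated as a small pre-flight check. This is a minimal sketch, not BenchFlow's own validation logic: `check_usage_flags` is a hypothetical helper, and its rules only mirror what this page says about Daytona, --usage-tracking, --usage-proxy-url, and --concurrency.

```python
from typing import List, Optional


def check_usage_flags(
    sandbox: str,
    usage_tracking: str,
    usage_proxy_url: Optional[str],
    concurrency: int,
) -> List[str]:
    """Return warnings for a planned `bench eval create` invocation.

    Hypothetical helper: it restates the documented guidance and is not
    part of BenchFlow.
    """
    warnings: List[str] = []
    if usage_tracking not in {"auto", "required", "off"}:
        warnings.append(f"unknown --usage-tracking value: {usage_tracking!r}")
        return warnings

    # A remote Daytona sandbox cannot reach a host-bound proxy without an external URL.
    if sandbox == "daytona" and usage_proxy_url is None:
        if usage_tracking == "required":
            warnings.append("required tracking on Daytona needs --usage-proxy-url")
        elif usage_tracking == "auto":
            warnings.append("auto mode on Daytona will record usage_source=unavailable")

    # Fixed-port tunnel mode supports one rollout per BenchFlow process.
    if usage_proxy_url is not None and concurrency != 1:
        warnings.append("external-tunnel mode should use --concurrency 1")
    return warnings


if __name__ == "__main__":
    print(check_usage_flags("daytona", "required", "https://your-tunnel.example.com", 1))
    print(check_usage_flags("daytona", "auto", None, 64))
```

The check returns a list instead of raising, so a wrapper script can decide whether to abort or only log the warnings before launching a batch.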
--source-env is for external hosted environment hubs. The first supported
runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity
(env_uid, hub_url), installs the versioned package into an isolated local
virtual environment, and runs vf-eval. --sandbox remains the BenchFlow task
sandbox selector for local/repo task sources; Verifiers source environments own
their own harness and sandbox behavior. --model is passed to the Verifiers
model endpoint; use a model id available to that provider. Provider-specific
sampling options are not inferred; pass them explicitly with
--source-env-sampling-arg.
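
To make the flag layout concrete, the following sketch assembles a `bench eval create` argument list for a hosted environment. `build_source_env_argv` is a hypothetical helper, not part of BenchFlow; it uses only flags documented on this page and emits --source-env-arg and --source-env-sampling-arg once per KEY=VALUE pair, since both flags are repeatable.

```python
import shlex
from typing import Dict, List, Optional


def build_source_env_argv(
    source_env: str,
    version: str,
    agent: str,
    model: str,
    env_args: Optional[Dict[str, str]] = None,
    sampling_args: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build an argv list for a hosted-environment run (hypothetical helper)."""
    argv = [
        "bench", "eval", "create",
        "--source-env", source_env,
        "--source-env-version", version,
    ]
    # Repeatable flags: one flag occurrence per KEY=VALUE pair.
    for key, value in (env_args or {}).items():
        argv += ["--source-env-arg", f"{key}={value}"]
    # Sampling options are not inferred, so they are passed explicitly.
    for key, value in (sampling_args or {}).items():
        argv += ["--source-env-sampling-arg", f"{key}={value}"]
    argv += ["--agent", agent, "--model", model]
    return argv


if __name__ == "__main__":
    argv = build_source_env_argv(
        "primeintellect/general-agent",
        "0.1.1",
        "gemini",
        "google/gemini-2.5-flash-lite",
        env_args={"task": "calendar_scheduling_t0"},
        sampling_args={"reasoning_effort": "minimal"},
    )
    # shlex.join quotes each element so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Building the command as a list and joining it only for display avoids quoting mistakes when a value contains spaces or shell metacharacters.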
List completed evaluations from a jobs directory.
bench eval list jobs/

List skills discovered under the default skills roots (or --dir).

bench skills list
bench skills list --dir ./skills

Evaluate a skill against its evals.json test cases.

bench skills eval skills/my-skill/ \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona

Scaffold a new benchmark task.

bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/

Validate a task directory (Dockerfile, instruction.md, tests/).

bench tasks check tasks/my-task

Generate benchmark task directories from real agent traces.

bench tasks generate --from-local --project my-repo --limit 5
bench tasks generate --from-file session.jsonl --dry-run
bench tasks generate --from-hf opentraces-test --limit 50

| Flag | Default | Description |
|---|---|---|
| --from-local | — | Generate from local Claude Code sessions |
| --from-file | — | Generate from a JSONL trace file |
| --from-hf | — | Generate from a HuggingFace dataset ID or alias |
| --output | tasks | Output directory for generated tasks |
| --projects-dir | ~/.claude/projects/ | Claude Code projects directory |
| --project | — | Filter local sessions by project path substring |
| --format | auto | Trace format override |
| --split | train | HuggingFace dataset split |
| --max-rows | 100 | Max rows to download from HuggingFace |
| --limit | 20 | Max traces to process |
| --min-steps | 2 | Minimum steps per trace |
| --outcome | — | Filter by outcome: success, failure, unknown |
| --author | benchflow-traces | Author name for generated task metadata |
| --dry-run | false | Preview traces without generating tasks |

List known HuggingFace trace datasets. The aliases listed here can be passed
to bench tasks generate --from-hf.
bench tasks list-sources

Create an environment object from a task directory. This validates environment construction but does not start the sandbox.

bench environment create tasks/my-task --sandbox daytona

List active Daytona sandboxes, or list a hosted hub.

bench environment list
bench environment list --hub primeintellect --owner primeintellect --search general-agent --limit 5

Show hosted environment metadata.

bench environment show primeintellect/general-agent --version 0.1.1

Inspect a file from a hosted environment package.

bench environment inspect primeintellect/general-agent --version 0.1.1 --path README.md

Clean up orphaned Daytona sandboxes. By default this deletes sandboxes older
than 24 hours; use --dry-run to preview what would be deleted.

bench environment cleanup --dry-run --max-age 1440

Third-party framework compatibility checks.

Inventory or structurally check representative Harbor registry tasks. Defaults to running an inventory pass against the public Harbor registry JSON.
# Inventory the public Harbor registry
bench compat harbor-registry
# Structural check, two tasks per dataset, JSONL output
bench compat harbor-registry --level check --tasks-per-dataset 2 --out compat.jsonl

| Flag | Default | Description |
|---|---|---|
| --registry | Harbor public registry URL | Harbor registry JSON URL or local file |
| --tasks-per-dataset | 2 | Representative tasks selected per dataset |
| --level | inventory | Compatibility level: inventory or check |
| --out | — | Optional JSONL output path |
| --cache-dir | .cache/compat/harbor | Cache directory for sparse clones |
| --limit | — | Optional cap on selected task refs |

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: gemini
model: gemini-3.1-flash-lite-preview
skills_dir: shared-skills/
agent_env:
  BENCHFLOW_SKILL_NUDGE: name
max_retries: 2

Use the Python API for multi-scene experiments. bench eval create --config is for
batch job configs; scene configs are loaded with benchflow._utils.yaml_loader or built
directly in Python.

task_dir: tasks/my-task
environment: daytona
sandbox_setup_timeout: 300
scenes:
  - name: skill-gen
    roles:
      - name: creator
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: creator
        prompt: "Analyze the task and write a skill document to /app/generated-skill.md"
  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver