Agent Eval Suite is an opinionated open-source evaluation layer for agent runs, built for teams that need release-quality evidence instead of one-off spot checks. It focuses on deterministic scoring, trajectory-aware analysis, and CI-friendly pass/fail outcomes so model, prompt, and toolchain changes can be evaluated with the same rigor as software changes.
The project generalizes the bench + replay pattern into a reusable product surface: ingest traces from common agent stacks, score them with deterministic judges, compare against baselines, and gate releases automatically. It also ships a local-first evidence pack format that preserves run config, case verdicts, and replay artifacts so every decision is auditable and reproducible.
Ship agents with evidence, not guesswork:
- Stable trace schema and replay contract.
- Deterministic-first judges for hard correctness and policy conformance.
- Baseline-vs-candidate comparison with CI gate exit codes.
- Local artifact outputs that make regressions explainable.
- Trace schema + replay contract for agent runs.
- Offline eval runner + baseline compare + CI gate exit codes.
- Judge plugin system.
- Deterministic judges:
  - ToolContractJudge
  - PolicyJudge
  - RegexJudge / JSONSchemaJudge
- Release-risk judges:
  - CostBudgetJudge, LatencySLOJudge, RetryStormJudge, LoopGuardJudge, ToolAbuseJudge, PromptInjectionJudge
- Optional LeanJudge plugin via external adapter command contract.
- Local artifact outputs:
  - machine-readable JSON report
  - evidence pack folder structure
- Stability/flakiness runner (stability-check) with quarantine recommendations.
- Baseline governance primitives: promotions, approvals, waivers, audit logs.
- Integrity/provenance tools: manifest hashes + attest/verify commands.
- Framework import adapters and public benchmark generation.
- trace: ordered events from an agent run, including tool calls and outcomes.
- replay: deterministic re-execution of a trace under a pinned config.
- judge: scoring component that emits pass/fail, score, and evidence.
- baseline: reference run used for regression comparison.
- gate: policy that converts eval deltas into CI pass/fail behavior.
Minimum required entities:
- run: run_id, dataset_id, agent_version, model, started_at, seed
- event: idx, ts, actor, type, input, output, tool, error, latency_ms
- judge_result: judge_id, case_id, score, passed, reason, evidence_refs
- aggregate_result: pass_rate, hard_fail_rate, regression_delta, gate_status
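The entities above map naturally onto typed records. The following is a minimal sketch using Python dataclasses, not the project's actual schema code: the field types, the optional defaults, and the ISO 8601 timestamp format are all assumptions, since only the field names are listed above.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class Run:
    run_id: str
    dataset_id: str
    agent_version: str
    model: str
    started_at: str  # assumed: ISO 8601 timestamp string
    seed: int


@dataclass
class Event:
    idx: int
    ts: str
    actor: str
    type: str
    input: Any = None
    output: Any = None
    tool: Optional[str] = None  # assumed: set only for tool-related events
    error: Optional[str] = None
    latency_ms: Optional[float] = None


@dataclass
class JudgeResult:
    judge_id: str
    case_id: str
    score: float
    passed: bool
    reason: str = ""
    evidence_refs: list[str] = field(default_factory=list)


@dataclass
class AggregateResult:
    pass_rate: float
    hard_fail_rate: float
    regression_delta: float
    gate_status: str


# Serialize one event the way a line of events.jsonl might look.
event = Event(idx=0, ts="2024-01-01T00:00:00Z", actor="agent",
              type="tool_call", tool="search", input={"q": "refund policy"})
record = json.loads(json.dumps(asdict(event), sort_keys=True))
```

Keeping optional fields explicit (rather than omitting them) makes every serialized event carry the same keys, which simplifies backward-compatible schema checks.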
Contract priorities:
- Backward compatibility for schema revisions.
- Deterministic replay when config and artifacts are pinned.
- Explicit failure taxonomy (timeouts, contract violations, policy failures, parse/type errors where applicable).
Each run writes a local evidence pack:
evidence_pack/
manifest.json
run/
config.json
summary.json
events.jsonl
judges/
tool_contract.json
policy.json
regex.json
json_schema.json
lean.json # optional
compare/
baseline_delta.json
gate_decision.json
cases/
<case_id>/
trajectory.json
verdicts.json
artifacts/
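To illustrate how manifest.json could support integrity checks over a pack like the one above, here is a hedged sketch that hashes every file in the folder. The manifest layout used here (a flat mapping of relative path to SHA-256 digest) is an assumption for illustration; the real manifest format is not specified in this document, and `build_manifest` and `changed_files` are hypothetical helper names.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def build_manifest(pack_dir: Path) -> dict[str, str]:
    """Map each file's pack-relative path to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(pack_dir.rglob("*")):
        # Files named manifest.json are skipped so the manifest never hashes itself.
        if path.is_file() and path.name != "manifest.json":
            rel = path.relative_to(pack_dir).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def changed_files(pack_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified since the manifest."""
    current = build_manifest(pack_dir)
    paths = set(manifest) | set(current)
    return sorted(p for p in paths if manifest.get(p) != current.get(p))


# Demo on a throwaway directory: record hashes, then tamper with one file.
with tempfile.TemporaryDirectory() as tmp:
    pack = Path(tmp) / "evidence_pack"
    (pack / "run").mkdir(parents=True)
    (pack / "run" / "config.json").write_text('{"seed": 7}')
    manifest = build_manifest(pack)
    untouched = changed_files(pack, manifest)
    (pack / "run" / "config.json").write_text('{"seed": 8}')
    tampered = changed_files(pack, manifest)
```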
This repository is the open foundation:
- stable contracts
- deterministic eval primitives
- local-first evidence portability
Enterprise packaging is intentionally out-of-repo.
LeanJudge is independent from any specific prover API. Configure a command in judge config:
{
  "lean": {
    "command": ["my-lean-checker", "--json-stdin"]
  }
}
Contract:
- Input on stdin: JSON lean_payload from each case metadata.
- Output on stdout: JSON object with at least passed (bool), optional reason, evidence.
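The stdin/stdout contract above can be satisfied by a very small external command. The sketch below shows the shape of such a checker in Python. The `source` key inside lean_payload and the "reject proofs containing sorry" rule are placeholders invented for illustration; a real checker would invoke an actual prover.

```python
import io
import json


def check(lean_payload: dict) -> dict:
    """Return a verdict object with the required 'passed' field."""
    # Assumption: the payload carries proof text under a 'source' key.
    source = lean_payload.get("source", "")
    if "sorry" in source:
        return {
            "passed": False,
            "reason": "proof contains an unfinished 'sorry' marker",
            "evidence": {"marker": "sorry"},
        }
    return {"passed": True, "reason": "no unfinished proof markers found"}


def run_checker(stdin, stdout) -> None:
    """Read one JSON payload from stdin and write one JSON verdict to stdout."""
    payload = json.load(stdin)
    json.dump(check(payload), stdout)


# Demo with in-memory streams; a real command would pass sys.stdin and sys.stdout.
out = io.StringIO()
run_checker(io.StringIO(json.dumps({"source": "theorem t : 1 = 1 := rfl"})), out)
verdict = json.loads(out.getvalue())
```

Because the judge only depends on this JSON exchange, the checker can be written in any language and swapped without changing the suite.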
python -m pip install -e .
agent-eval init --out .
agent-eval run --suite examples/suite_good.json --out runs/baseline --run-id baseline-1
agent-eval run --suite examples/suite_bad.json --out runs/candidate --run-id candidate-1
agent-eval compare --baseline runs/baseline --candidate runs/candidate --out runs/candidate/compare/baseline_delta.json
agent-eval gate \
--compare runs/candidate/compare/baseline_delta.json \
--min-pass-rate 0.95 \
--max-hard-fail-increase 0.00 \
--max-regressed-cases 0 \
--max-new-hard-fail-cases 0
CI usage: agent-eval gate returns exit code 0 on pass and 1 on gate failure.
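Conceptually, the gate turns a comparison report plus thresholds into an exit code. The following sketch shows that decision under stated assumptions: the threshold names mirror the CLI flags above, but the keys read from the delta dictionary (`candidate_pass_rate`, `hard_fail_rate_delta`, `regressed_cases`, `new_hard_fail_cases`) are hypothetical and may not match the real baseline_delta.json.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Thresholds:
    min_pass_rate: float = 0.95
    max_hard_fail_increase: float = 0.0
    max_regressed_cases: int = 0
    max_new_hard_fail_cases: int = 0


def gate(delta: dict, t: Thresholds) -> tuple[int, list[str]]:
    """Return (exit_code, violations); exit code 0 means the gate passed."""
    violations = []
    if delta["candidate_pass_rate"] < t.min_pass_rate:
        violations.append("pass rate below minimum")
    if delta["hard_fail_rate_delta"] > t.max_hard_fail_increase:
        violations.append("hard-fail rate increased too much")
    if len(delta["regressed_cases"]) > t.max_regressed_cases:
        violations.append("too many regressed cases")
    if len(delta["new_hard_fail_cases"]) > t.max_new_hard_fail_cases:
        violations.append("too many new hard-fail cases")
    return (1 if violations else 0), violations


# Illustrative delta with made-up numbers.
delta = {
    "candidate_pass_rate": 0.90,
    "hard_fail_rate_delta": 0.05,
    "regressed_cases": ["case-42"],
    "new_hard_fail_cases": [],
}
exit_code, violations = gate(delta, Thresholds())
```

Returning the list of violated thresholds alongside the exit code keeps CI failures explainable rather than a bare non-zero status.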
agent-eval compare includes aggregate deltas and per-case regression details (case_regressions) when run artifacts are available.
It also emits richer report sections: overview, top_regressed_judges, ranked failure_clusters,
triage.top_clusters, and release_impact scoring/recommendation.
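As a rough illustration of what aggregate deltas and per-case regressions mean, the sketch below compares two maps of case verdicts. It is a simplification under assumptions: the input shape (case ID to pass/fail) and the `case_improvements` and `pass_rate_delta` keys are hypothetical, and the real report contains considerably more than this.

```python
from __future__ import annotations


def pass_rate(verdicts: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(verdicts.values()) / len(verdicts) if verdicts else 0.0


def compare_verdicts(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compute an aggregate delta and the cases whose verdict flipped."""
    shared = sorted(set(baseline) & set(candidate))
    return {
        "pass_rate_delta": pass_rate(candidate) - pass_rate(baseline),
        # Regression: passed in the baseline, fails in the candidate.
        "case_regressions": [c for c in shared if baseline[c] and not candidate[c]],
        # Improvement: failed in the baseline, passes in the candidate.
        "case_improvements": [c for c in shared if not baseline[c] and candidate[c]],
    }


baseline = {"case-1": True, "case-2": True, "case-3": False, "case-4": True}
candidate = {"case-1": True, "case-2": False, "case-3": False, "case-4": True}
report = compare_verdicts(baseline, candidate)
```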
Use the local registry to track datasets and named baselines:
agent-eval registry dataset-add --suite suites/starter_suite.json --dataset-id starter-suite
agent-eval registry baseline-set --name main --run runs/baseline
agent-eval registry baseline-promote --name main --run runs/baseline --approved-by qa@company --rationale "release baseline"
agent-eval compare --baseline main --candidate runs/candidate
Registry default path: .agent_eval/registry.json (override with --registry-path).
By default, compare enforces baseline/candidate compatibility (dataset and case checks). Use --allow-incompatible to bypass.
Waivers (scoped by baseline/case/judge) are supported and can be applied during gate:
agent-eval registry waiver-add --baseline-name main --case-id case-42 --approved-by qa@company --reason "known issue"
agent-eval gate --compare runs/candidate/compare/baseline_delta.json --max-regressed-cases 0 --apply-waivers --baseline-name main
run-loop executes iterative agent attempts with deterministic scoring on each iteration.
agent-eval run-loop \
--suite suites/starter_suite.json \
--out runs/loop \
--propose-command "python my_agent_adapter.py" \
--strict-side-effects \
--max-repairs 2
Adapter command contract:
- Reads JSON payload from stdin (mode, case_id, input, expected_output, attempt, previous_attempts, contracts/policy).
- Writes JSON to stdout:
  - assistant_output
  - tool_calls as [{ "tool": "...", "arguments": {...} }]
Tool execution is deterministic from per-case metadata.tool_responses.
For argument-level determinism, provide metadata.tool_response_cassette.
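A propose command such as the my_agent_adapter.py referenced above only needs to honor the stdin/stdout contract. The sketch below shows one possible shape. The repair behavior and the `search` tool name are invented for illustration, and a real adapter would call an actual agent instead of formatting strings.

```python
import io
import json


def propose(payload: dict) -> dict:
    """Build one attempt from the run-loop payload."""
    text = str(payload.get("input", ""))
    if payload.get("previous_attempts"):
        # Repair attempt: answer directly instead of repeating the tool call.
        return {"assistant_output": f"Revised answer for: {text}", "tool_calls": []}
    return {
        "assistant_output": f"Looking up: {text}",
        # 'search' is a hypothetical tool name used only in this sketch.
        "tool_calls": [{"tool": "search", "arguments": {"query": text}}],
    }


def run_adapter(stdin, stdout) -> None:
    """Read the JSON payload from stdin and write the JSON attempt to stdout."""
    json.dump(propose(json.load(stdin)), stdout)


# Demo with in-memory streams; a real adapter would pass sys.stdin and sys.stdout.
payload = {"mode": "propose", "case_id": "case-1", "input": "order status",
           "attempt": 0, "previous_attempts": []}
out = io.StringIO()
run_adapter(io.StringIO(json.dumps(payload)), out)
first_attempt = json.loads(out.getvalue())
```

Since the adapter never executes tools itself, the same payload always produces the same tool_calls, which is what lets the runner answer them deterministically from case metadata.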
Every run records pinned environment metadata in run/config.json.
Replay verifies:
- summary parity against saved artifacts
- per-case verdict parity
- pinned environment compatibility
agent-eval replay --run runs/candidate --out runs/candidate/compare/replay_report.json
agent-eval replay returns exit code 0 on full replay match and 1 otherwise.
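The per-case verdict parity check can be pictured as a comparison of saved and replayed verdicts. The sketch below is one way to express it, assuming verdicts are shaped as case ID to judge ID to pass/fail; that shape and the function names are hypothetical, not the tool's actual internals.

```python
from __future__ import annotations

Verdicts = dict[str, dict[str, bool]]  # case_id -> judge_id -> passed


def verdict_mismatches(saved: Verdicts, replayed: Verdicts) -> list[tuple[str, str]]:
    """List every (case_id, judge_id) whose verdict differs or is missing."""
    mismatches = []
    for case_id in sorted(set(saved) | set(replayed)):
        old = saved.get(case_id, {})
        new = replayed.get(case_id, {})
        for judge_id in sorted(set(old) | set(new)):
            if old.get(judge_id) != new.get(judge_id):
                mismatches.append((case_id, judge_id))
    return mismatches


def replay_exit_code(saved: Verdicts, replayed: Verdicts) -> int:
    """0 on a full match, 1 otherwise, mirroring the documented exit codes."""
    return 0 if not verdict_mismatches(saved, replayed) else 1


saved = {"case-1": {"regex": True, "policy": True}, "case-2": {"regex": False}}
replayed = {"case-1": {"regex": True, "policy": False}, "case-2": {"regex": False}}
diffs = verdict_mismatches(saved, replayed)
```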
For propose/execute/repair runs, execution replay re-runs adapter commands and checks trajectory/verdict parity:
agent-eval replay-exec --run runs/loop --out runs/loop/compare/replay_exec_report.json
agent-eval replay-exec returns exit code 0 only when execution replay fully matches.
Generate and verify evidence integrity attestations:
agent-eval attest --run runs/candidate --secret "$ATTESTATION_SECRET"
agent-eval verify-attestation --run runs/candidate --secret "$ATTESTATION_SECRET"
Use import-trace to normalize external exports into this repo's trace schema:
agent-eval import-trace --provider auto --input exports/provider_dump.json --out suites/imported.json --dataset-id imported-suite
Supported providers:
- auto (detect format per record)
- openai
- anthropic
- vertex
- foundry
Imported events are enriched with OTel-style trace/span identifiers and GenAI attributes.
Use --strict to fail on unknown top-level provider fields or empty parsed traces.
Adapter conformance tests are included under tests/fixtures/adapters/ and tests/test_adapter_conformance.py.
Run strict conformance checks:
agent-eval adapter-conformance --fixtures-dir tests/fixtures/adapters --min-fixtures-per-provider 2
Framework-native imports are also supported:
agent-eval import-framework --framework auto --input exports/framework_dump.json --out suites/framework_imported.json
Supported frameworks:
- auto
- langgraph
- openai_agents
- autogen
- crewai
- semantic_kernel
Available deterministic risk judges:
- cost_budget
- latency_slo
- retry_storm
- loop_guard
- tool_abuse
- prompt_injection
Enable any subset with repeated --judge flags and pass config via --judge-config.
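To give a feel for what a deterministic risk judge does, here is a hedged sketch of a loop-guard style check: it fails a trajectory when the same tool is called with identical input too many times in a row. This is not the project's loop_guard implementation. The `max_repeats` parameter, the `tool_call` event type value, and the returned fields are assumptions modeled on the event and judge_result field names listed earlier.

```python
import json


def loop_guard(events: list, max_repeats: int = 3) -> dict:
    """Fail when one identical tool call repeats more than max_repeats times in a row."""
    longest, streak, previous, last_idx = 0, 0, None, None
    for event in events:
        if event.get("type") != "tool_call":
            continue  # only tool calls count toward a loop in this sketch
        # Canonical JSON makes argument comparison independent of key order.
        key = (event.get("tool"), json.dumps(event.get("input"), sort_keys=True))
        streak = streak + 1 if key == previous else 1
        previous = key
        if streak > longest:
            longest, last_idx = streak, event.get("idx")
    passed = longest <= max_repeats
    return {
        "judge_id": "loop_guard",
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"longest identical tool-call streak: {longest}",
        "evidence_refs": [] if passed else [f"event:{last_idx}"],
    }


# Four identical calls in a row exceed the assumed limit of three.
events = [{"idx": i, "type": "tool_call", "tool": "search", "input": {"q": "x"}}
          for i in range(4)]
result = loop_guard(events)
```

The check uses no model calls or randomness, so the same trajectory always yields the same verdict.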
Run repeated evals to detect flaky cases and produce quarantine recommendations:
agent-eval stability-check --suite suites/starter_suite.json --runs 10 --out runs/stability.json
stability-check returns exit code 1 when flaky cases are detected.
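One simple definition of a flaky case is a case whose verdict differs across repeated runs of the same suite. The sketch below applies that definition; it is an illustrative assumption and may not match the criteria or the quarantine logic the real stability-check uses.

```python
from __future__ import annotations

from collections import defaultdict


def flaky_cases(runs: list[dict[str, bool]]) -> dict[str, float]:
    """Return the pass rate of every case whose verdict varied across runs."""
    outcomes = defaultdict(list)
    for run in runs:
        for case_id, passed in run.items():
            outcomes[case_id].append(passed)
    return {
        case_id: sum(results) / len(results)
        for case_id, results in sorted(outcomes.items())
        if len(set(results)) > 1  # mixed pass/fail means the case is unstable
    }


# Four repeated runs: case-a is stable, case-b alternates.
runs = [
    {"case-a": True, "case-b": True},
    {"case-a": True, "case-b": False},
    {"case-a": True, "case-b": True},
    {"case-a": True, "case-b": False},
]
flaky = flaky_cases(runs)
exit_code = 1 if flaky else 0
```

A consistently failing case is not flaky under this definition: it is a stable failure and belongs in regression triage rather than quarantine.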
Generate synthetic public benchmark suites by archetype:
agent-eval benchmark-generate --archetype support_agent --cases 50 --out benchmarks/public/support_agent_50.json
Export any run to OpenTelemetry-style GenAI JSONL:
agent-eval export-otel --run runs/candidate --out runs/candidate/otel_events.jsonl
Runtime failures return machine-readable JSON on stderr:
{"error":{"code":"validation_error","message":"...","details":{...}}}
Validate suites:
agent-eval schema validate --input suites/starter_suite.json --strict --require-version 1.0.0
Migrate legacy suites:
agent-eval schema migrate --input legacy_suite.json --output suites/migrated_suite.json
Run combined schema back-compat + adapter checks:
agent-eval contracts-check \
--schema-fixtures-dir tests/fixtures/schema_backcompat \
--adapter-fixtures-dir tests/fixtures/adapters \
--min-fixtures-per-provider 2
Generate a human-readable report from compare/gate/replay artifacts:
agent-eval report markdown \
--compare runs/candidate/compare/baseline_delta.json \
--gate runs/candidate/compare/gate_decision.json \
--replay runs/candidate/compare/replay_report.json \
--out runs/candidate/compare/report.md \
--title "Release Eval Report"
No hosted CI integration is required for packaging:
./scripts/check_contracts.sh
./scripts/release_local.sh
docker build -t agent-eval-suite:0.1.2 .
init scaffolds CI templates for:
- GitLab CI
- Buildkite
- CircleCI
- Jenkins