Add eval harness for testing AGENTS.md changes #69308

Open
RoyLee1224 wants to merge 16 commits into apache:main from RoyLee1224:feat/skill-eval-harness


@RoyLee1224 RoyLee1224 commented Jul 3, 2026

Contributor

What

Adds an eval harness (dev/skill-evals/) for testing AGENTS.md guidance against real scenarios. It answers: "does my AGENTS.md change actually affect agent behavior?"

uv run dev/skill-evals/eval.py --repeat 3

Compares the main branch AGENTS.md against your working tree. Each arm is a git worktree with the full repo, so the agent sees real source files. If AGENTS.md is unchanged, the working arm is skipped automatically.
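
The skip rule described above can be sketched as a small pure function. This is only an illustration, not the code in eval.py: the function name `plan_arms` and the arm names `main` and `working` are assumptions made for the example.

```python
def plan_arms(main_agents_md: bytes, working_agents_md: bytes) -> list[str]:
    """Return the names of the arms to run (hypothetical helper).

    The baseline arm always runs. The working arm is added only when the
    working-tree AGENTS.md differs from the copy on main, since two
    identical arms would produce no meaningful comparison.
    """
    arms = ["main"]
    if working_agents_md != main_agents_md:
        arms.append("working")
    return arms


if __name__ == "__main__":
    baseline = b"# Guidance\n"
    print(plan_arms(baseline, baseline))
    print(plan_arms(baseline, baseline + b"- new rule\n"))
```

Comparing file contents rather than modification times keeps the decision independent of how the working tree was checked out.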

No API key needed — authenticates via Claude Code OAuth (claude /login).

Demo: testing the newsfragment golden rule

Following the discussion in #release-management, I tested the golden rule (#67982) against real PRs where reviewers asked to remove newsfragments:

| Case | with AGENTS.md | without |
| --- | --- | --- |
| Provider pod leak #67333 | 3/3 ✓ | 0/3 |
| API optimization #66696 | 3/3 ✓ | 1/3 |
| Scheduler fix #64322 | 2/3 | 0/3 |
| i18n cache fix #65720 | 1/3 | 0/3 |
| Worker connection testing #62343 (positive: should create) | 3/3 ✓ | 3/3 |
| Coordinator layer #65958 (positive: should create) | 3/3 ✓ | 3/3 |

The golden rule works for clear-cut cases but struggles with ambiguous fixes, where the model reasons "this bug affects users" and creates a newsfragment anyway.

Open question: I'm not sure the current case selection is the right design. Would appreciate maintainers' thoughts on what cases are worth covering.

CLI demo: (screenshot)

UI demo (run npx promptfoo@0.121.17 view): (screenshot)

How it works

  1. Creates git worktrees — one with main's AGENTS.md, one with your working tree version. Both are full repo checkouts.
  2. Generates a promptfoo config with anthropic:claude-agent-sdk provider and structured JSON output.
  3. Runs each case against all arms in parallel, reports diff.
  4. Worktrees cleaned up on exit.
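
Steps 1 and 4 above can be sketched as a context manager that creates one worktree per arm and always removes what it created. This is a minimal sketch under stated assumptions, not the harness's actual code: the names `eval_arms`, `add_cmd`, and `remove_cmd` are hypothetical, and copying the working-tree AGENTS.md into the second worktree is left out. The `run` parameter exists only so the logic can be exercised without a real repository.

```python
import contextlib
import subprocess
from pathlib import Path


def add_cmd(path, ref):
    # `git worktree add --detach <path> <commit-ish>` checks out the ref
    # in a new directory without creating a branch.
    return ["git", "worktree", "add", "--detach", str(path), ref]


def remove_cmd(path):
    # `--force` lets git remove a worktree that contains local changes.
    return ["git", "worktree", "remove", "--force", str(path)]


@contextlib.contextmanager
def eval_arms(root, refs, run=subprocess.run):
    """Create one worktree per arm and remove them all on exit.

    `refs` maps an arm name to a git ref (both hypothetical here).
    Any failure while adding a worktree propagates immediately, and the
    worktrees that were already created are still removed.
    """
    created = []
    try:
        for name, ref in refs.items():
            path = Path(root) / name
            run(add_cmd(path, ref), check=True)
            created.append(path)
        yield {path.name: path for path in created}
    finally:
        for path in reversed(created):
            # Best-effort cleanup: do not mask the original error.
            run(remove_cmd(path), check=False)
```

A caller would wrap the promptfoo run in `with eval_arms(tmp_dir, {"main": "main", "working": "HEAD"}) as arms:` so that cleanup happens on normal exit and on errors alike.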

Files

dev/skill-evals/
  eval.py                      # entry point (Python, per dev/ guidelines)
  cases/newsfragment.yaml      # cases from real PRs (#64322, #65720, #66696, #67333)
  README.md                    # setup and usage
.pre-commit-config.yaml        # prek hook: remind to run eval on AGENTS.md changes
scripts/ci/prek/check_eval_reminder.py
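
The reminder hook's core decision can be sketched as a basename match over the changed files. This is an illustrative sketch, not the contents of check_eval_reminder.py: the set of guidance basenames and the helper names are assumptions, and the suggested command is the one quoted in this description.

```python
from pathlib import PurePosixPath

# Assumed set of guidance file basenames; the real hook may use a different list.
GUIDANCE_BASENAMES = frozenset({"AGENTS.md", "CLAUDE.md"})


def changed_guidance_files(paths):
    """Return the paths whose basename is exactly a guidance file name.

    Exact matching avoids false positives such as `MY_AGENTS.md` or
    `AGENTS.md.bak`, which a substring or suffix check would catch.
    """
    return [p for p in paths if PurePosixPath(p).name in GUIDANCE_BASENAMES]


def reminder(paths):
    """Return a reminder message, or None when no guidance file changed."""
    hits = changed_guidance_files(paths)
    if not hits:
        return None
    return (
        "Guidance files changed: " + ", ".join(hits) + "\n"
        "Consider running: uv run dev/skill-evals/eval.py --repeat 3"
    )
```

Returning a message instead of printing directly keeps the check easy to unit test; a thin `main()` would pass in the file names the hook receives and print the result.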

Future use

  • Auto-generate a summary table (like the one in this PR) from promptfoo results. Might be useful for pasting into PRs that change AGENTS.md; open to discussion
  • As models improve, some guidance may become unnecessary — the eval helps determine which rules the model still depends on and which can be safely removed
  • Add cases for routing rules that matter — the prek hook reminds contributors to run the eval when AGENTS.md changes
  • Use the eval to validate moving guidance from AGENTS.md into skills — run before and after to confirm no regression
  • Currently tests Claude only; the architecture (promptfoo + structured output) could extend to other agent runtimes
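
As a rough illustration of the first item above, a summary table could be rendered from per-run results like this. The record shape (`case`, `arm`, `passed`) is a hypothetical intermediate format, not promptfoo's actual output schema, so a real implementation would first need to map promptfoo's JSON report onto it.

```python
from collections import defaultdict


def tally(records):
    """Count (passed, total) runs per (case, arm) pair."""
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        key = (rec["case"], rec["arm"])
        counts[key][1] += 1
        if rec["passed"]:
            counts[key][0] += 1
    return {key: tuple(value) for key, value in counts.items()}


def render_summary(records, arms=("with", "without")):
    """Render a Markdown table with one row per case and one column per arm."""
    counts = tally(records)
    cases = []
    for rec in records:
        # Preserve the order in which cases first appear.
        if rec["case"] not in cases:
            cases.append(rec["case"])
    lines = [
        "| Case | " + " | ".join(arms) + " |",
        "|" + " --- |" * (len(arms) + 1),
    ]
    for case in cases:
        cells = []
        for arm in arms:
            passed, total = counts.get((case, arm), (0, 0))
            cells.append(f"{passed}/{total}")
        lines.append(f"| {case} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

With `--repeat 3`, each case would contribute three records per arm, giving cells such as `3/3` or `1/3` like the table in this description.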

Was generative AI tooling used to co-author this PR?
  • Yes, Claude Code (Opus 4.8)


@RoyLee1224 RoyLee1224 force-pushed the feat/skill-eval-harness branch from b73f705 to e95339b Compare July 3, 2026 08:02
@RoyLee1224 RoyLee1224 changed the title Feat/skill eval harness Add eval harness for testing AGENTS.md changes Jul 3, 2026
@RoyLee1224

Contributor Author

This PR focuses on the harness design plus one base case (the newsfragment golden rule); expanding coverage can come in follow-up PRs. I'm also not sure the current case design is right; maintainers' thoughts are welcome!

@potiuk

potiuk commented Jul 3, 2026

Member

Nice. Just a static check failure :)

Comment thread scripts/ci/prek/check_eval_reminder.py Outdated
Comment thread dev/skill-evals/eval.py

@jason810496 jason810496 left a comment

Member

Nice! It's really simple and lightweight (from the user setup perspective; I didn't check whether promptfoo itself is lightweight).

I will shortly add the cases I just thought of that are also suitable for the eval.

Comment thread dev/skill-evals/cases/newsfragment.yaml
Comment thread dev/skill-evals/eval.py Outdated
Comment thread dev/skill-evals/eval.py Outdated
Comment thread dev/skill-evals/eval.py Outdated
Comment thread dev/skill-evals/README.md Outdated
Comment thread scripts/ci/prek/check_eval_reminder.py Outdated
@jason810496 jason810496 removed the backport-to-v3-3-test Backport to v3-3-test label Jul 3, 2026
@RoyLee1224 RoyLee1224 force-pushed the feat/skill-eval-harness branch 4 times, most recently from fa09f92 to 1354e54 Compare July 4, 2026 15:54
@RoyLee1224 RoyLee1224 force-pushed the feat/skill-eval-harness branch from 254e3ee to bf5eedd Compare July 5, 2026 12:38
RoyLee1224 added 13 commits July 5, 2026 21:48
  - Verify CLAUDE.md → AGENTS.md symlink before running; baseline arm removes both
  - Pin promptfoo to 0.121.17
  - Fail fast on git worktree errors and clean up worktrees on partial failure
  - Reject SKILL_NAME + --full (baseline arm would always fail skill-used assert)
  - Add tests for check_eval_reminder; match guidance file basenames exactly
  - each run exports a JSON report to files/skill-evals/
  - promptfoo db/cache and the SDK install move from the home directory
    to .build/ (tool state, not output) — nothing lands in /Users/lizhechen anymore
  - run-skill-eval only triggers on guidance changes during bulk manual
    runs; documented invocation gains --all-files
  - README: cleanup section, stage-before-prek-run reminder
  Direct npx invocation needed a manually installed SDK (promptfoo
  resolves it from the config directory); the prek env provides it.
  eval.py now points at  when promptfoo is
  absent. Promptfoo flags remain wireable via hook-variant entry args.
  - 62343: merged with 62343.feature.rst, ground truth should_create=true
  - 65958: proposed as a should-create example by its author in review
  - view-skill-eval manual hook reuses run-skill-eval's node env
@RoyLee1224 RoyLee1224 force-pushed the feat/skill-eval-harness branch from bf5eedd to 344296a Compare July 5, 2026 12:48
3 participants