Add eval harness for testing AGENTS.md changes #69308
Open
RoyLee1224 wants to merge 16 commits into
Contributor, Author:
This PR focuses on the harness design plus one base case (the newsfragment golden rule); expanding coverage can come in follow-up PRs. I'm also not sure the current case design is right; maintainers' thoughts are welcome!
Member:
Nice. Just a static check failure :)
potiuk reviewed Jul 3, 2026
jason810496 reviewed Jul 3, 2026
jason810496 (Member) left a comment:
Nice! It's really simple and lightweight (from the user-setup perspective; I didn't check whether promptfoo itself is lightweight).
I will shortly post the cases I just thought of that are also suitable for eval.
- Verify CLAUDE.md → AGENTS.md symlink before running; baseline arm removes both
- Pin promptfoo to 0.121.17
- Fail fast on git worktree errors and clean up worktrees on partial failure
- Reject SKILL_NAME + --full (baseline arm would always fail skill-used assert)
- Add tests for check_eval_reminder; match guidance file basenames exactly
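The "fail fast on git worktree errors and clean up worktrees on partial failure" item can be pictured with a minimal sketch. The function name `create_worktrees`, its `specs` argument, and the injectable `run` parameter are assumptions made for illustration; this is not the PR's actual code.

```python
import subprocess
from pathlib import Path


def create_worktrees(repo, specs, run=subprocess.run):
    """Create one detached git worktree per (path, ref) pair.

    Hypothetical helper: if any `git worktree add` fails, the worktrees
    created so far are removed and the error is re-raised, so a partial
    failure leaves nothing behind.
    """
    created = []
    try:
        for path, ref in specs:
            # check=True raises CalledProcessError on a non-zero exit (fail fast).
            run(
                ["git", "-C", str(repo), "worktree", "add", "--detach", str(path), ref],
                check=True,
                capture_output=True,
                text=True,
            )
            created.append(path)
    except subprocess.CalledProcessError:
        for path in created:
            # Best-effort cleanup; a failure here must not mask the original error.
            run(
                ["git", "-C", str(repo), "worktree", "remove", "--force", str(path)],
                check=False,
                capture_output=True,
                text=True,
            )
        raise
    return created
```

Passing `run` as a parameter is only a convenience so the cleanup path can be exercised in a unit test without touching a real repository.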
- each run exports a JSON report to files/skill-evals/
- promptfoo db/cache and the SDK install move from the home directory
to .build/ (tool state, not output) — nothing lands in /Users/lizhechen anymore
- run-skill-eval only triggers on guidance changes during bulk manual
runs; documented invocation gains --all-files
- README: cleanup section, stage-before-prek-run reminder
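Triggering only on guidance changes, with basenames matched exactly, could look roughly like the sketch below. The helper name `touches_guidance` and the contents of `GUIDANCE_BASENAMES` are assumptions for illustration; the PR's own check_eval_reminder may be structured differently.

```python
from pathlib import PurePosixPath

# Hypothetical set of guidance file names; the PR's real list may differ.
GUIDANCE_BASENAMES = frozenset({"AGENTS.md", "CLAUDE.md"})


def touches_guidance(changed_paths):
    """Return True if any changed path's final component is exactly a guidance file name.

    Comparing the basename for equality (rather than using a substring or
    suffix test) avoids false positives such as "MY_AGENTS.md" or "AGENTS.md.bak".
    """
    return any(PurePosixPath(p).name in GUIDANCE_BASENAMES for p in changed_paths)
```

A substring test such as `"AGENTS.md" in path` would also match unrelated files, which is the kind of over-triggering that exact basename matching is meant to prevent.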
Direct npx invocation needed a manually installed SDK (promptfoo resolves it from the config directory); the prek env provides it. eval.py now points at when promptfoo is absent. Promptfoo flags remain wireable via hook-variant entry args.
- 62343: merged with 62343.feature.rst, ground truth should_create=true
- 65958: proposed as a should-create example by its author in review
- view-skill-eval manual hook reuses run-skill-eval's node env
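To make the "ground truth should_create" idea concrete, here is a minimal sketch of checking one agent answer against one case. The `Case` dataclass, the `passes` function, and the assumption that the agent's JSON output carries a boolean `should_create` field are all illustrative; the harness's real schema and assertions are not shown on this page.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """Hypothetical case record: a PR number plus its ground-truth label."""
    pr_number: int
    should_create: bool


def passes(case, agent_output):
    """Return True when the agent's JSON answer matches the ground truth.

    Assumes the agent replies with a JSON object containing a boolean
    "should_create" field; anything else counts as a failure.
    """
    try:
        answer = json.loads(agent_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(answer, dict):
        return False
    value = answer.get("should_create")
    return isinstance(value, bool) and value == case.should_create
```

Treating malformed or non-boolean answers as failures keeps a badly formatted reply from being counted as agreement by accident.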
What
Adds an eval harness (dev/skill-evals/) for testing AGENTS.md guidance against real scenarios. It answers: "does my AGENTS.md change actually affect agent behavior?"
Compares the main branch AGENTS.md against your working tree. Each arm is a git worktree with the full repo, so the agent sees real source files. If AGENTS.md is unchanged, the working arm is skipped automatically.
No API key needed: authenticates via Claude Code OAuth (claude /login).
Demo: testing the newsfragment golden rule
Following the discussion in #release-management
Tested the golden rule (#67982) against real PRs where reviewers asked to remove newsfragments:
The golden rule works for clear cases but struggles with ambiguous fixes, where the model reasons "this bug affects users."
Open question: I'm not sure the current case selection is the right design. Would appreciate maintainers' thoughts on what cases are worth covering.
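One way to read a two-arm run is to list only the cases whose outcome differs between the baseline arm and the working arm. The sketch below assumes each arm's results have already been reduced to a mapping from a case identifier to pass/fail; the function name `flipped_cases` and that input shape are hypothetical and not taken from the harness or from promptfoo's report format.

```python
def flipped_cases(baseline, working):
    """Map case id -> "fixed" or "regressed" for cases whose result differs between arms.

    `baseline` and `working` are hypothetical dicts of case id -> bool (True = passed).
    Cases present in only one arm are ignored.
    """
    changes = {}
    for case_id in sorted(baseline.keys() & working.keys()):
        if baseline[case_id] != working[case_id]:
            changes[case_id] = "fixed" if working[case_id] else "regressed"
    return changes
```

Cases that pass or fail in both arms drop out of the result, so an empty mapping would suggest the guidance change had no observable effect on the covered cases.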
CLI demo
UI demo (run npx promptfoo@0.121.17 view):
How it works
main's AGENTS.md, one with your working tree version. Both are full repo checkouts.
anthropic:claude-agent-sdk provider and structured JSON output.
Files
Future use
Was generative AI tooling used to co-author this PR?