Add eval harness for testing AGENTS.md changes by RoyLee1224 · Pull Request #69308 · apache/airflow

RoyLee1224 · 2026-07-03T08:02:27Z

What

Adds an eval harness (dev/skill-evals/) for testing AGENTS.md guidance against real scenarios. It answers: "does my AGENTS.md change actually affect agent behavior?"

uv run dev/skill-evals/eval.py --repeat 3

Compares the main branch AGENTS.md against your working tree. Each arm is a git worktree with the full repo, so the agent sees real source files. If AGENTS.md is unchanged, the working arm is skipped automatically.
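A minimal sketch of the skip-if-unchanged check described above, for illustration only. It is not the code in eval.py: `read_main_agents_md` and `arms_to_run` are hypothetical names, and the arm labels `main` and `working` are assumed.

```python
import subprocess
from pathlib import Path


def read_main_agents_md(repo: Path, ref: str = "main") -> str:
    """Return AGENTS.md as committed on `ref` (hypothetical helper)."""
    result = subprocess.run(
        ["git", "show", f"{ref}:AGENTS.md"],
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def arms_to_run(main_text: str, working_text: str) -> list[str]:
    """Always run the main arm; add the working arm only if the file differs."""
    arms = ["main"]
    if working_text != main_text:
        arms.append("working")
    return arms
```

Keeping the comparison in a pure function means the skip decision can be unit-tested without a git checkout.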

No API key needed — authenticates via Claude Code OAuth (claude /login).

Demo: testing the newsfragment golden rule

Following the discussion in #release-management, I tested the golden rule (#67982) against real PRs where reviewers asked to remove newsfragments:

| Case | With AGENTS.md | Without AGENTS.md |
|---|---|---|
| Provider pod leak #67333 | 3/3 ✓ | 0/3 |
| API optimization #66696 | 3/3 ✓ | 1/3 |
| Scheduler fix #64322 | 2/3 | 0/3 |
| i18n cache fix #65720 | 1/3 | 0/3 |
| Worker connection testing #62343 (positive, should create) | 3/3 ✓ | 3/3 |
| Coordinator layer #65958 (positive, should create) | 3/3 ✓ | 3/3 |

The golden rule works for clear cases but struggles with ambiguous fixes, where the model reasons "this bug affects users" and creates a newsfragment anyway.

Open question: I'm not sure the current case selection is the right design. Would appreciate maintainers' thoughts on what cases are worth covering.

CLI demo

UI demo (run `npx promptfoo@0.121.17 view`):

How it works

Creates git worktrees — one with main's AGENTS.md, one with your working tree version. Both are full repo checkouts.
Generates a promptfoo config with anthropic:claude-agent-sdk provider and structured JSON output.
Runs each case against all arms in parallel, reports diff.
Worktrees cleaned up on exit.
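The steps above can be sketched roughly as follows. This is an illustrative outline, not the actual eval.py: the helper names are hypothetical, and the `working_dir` key under the provider's `config` is an assumption that should be checked against the promptfoo documentation for the pinned version.

```python
import json
from pathlib import Path


def worktree_add_cmd(path: Path, ref: str) -> list[str]:
    """Command that checks out `ref` into a detached worktree at `path`."""
    return ["git", "worktree", "add", "--detach", str(path), ref]


def worktree_remove_cmd(path: Path) -> list[str]:
    """Command that removes the worktree again, for cleanup on exit."""
    return ["git", "worktree", "remove", "--force", str(path)]


def build_config(arms: dict[str, Path], cases: list[dict]) -> dict:
    """Build a promptfoo-style config: one provider per arm, one test per case."""
    providers = [
        {
            "id": "anthropic:claude-agent-sdk",
            "label": name,
            # Assumed option name: points the agent at this arm's checkout.
            "config": {"working_dir": str(path)},
        }
        for name, path in arms.items()
    ]
    tests = [
        {"description": case["name"], "vars": {"task": case["task"]}}
        for case in cases
    ]
    return {
        "description": "AGENTS.md eval (sketch)",
        "prompts": ["{{task}}"],
        "providers": providers,
        "tests": tests,
    }


if __name__ == "__main__":
    arms = {"main": Path("/tmp/eval-main"), "working": Path("/tmp/eval-working")}
    cases = [{"name": "example case", "task": "Decide whether a newsfragment is needed."}]
    print(json.dumps(build_config(arms, cases), indent=2))
```

Building the config as a plain dict and serializing it to JSON matches the direction of the "build promptfoo config as a dict" commit in this PR, and avoids hand-writing YAML.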

Files

dev/skill-evals/
  eval.py                      # entry point (Python, per dev/ guidelines)
  cases/newsfragment.yaml      # cases from real PRs (#64322, #65720, #66696, #67333)
  README.md                    # setup and usage
.pre-commit-config.yaml        # prek hook: remind to run eval on AGENTS.md changes
scripts/ci/prek/check_eval_reminder.py
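For orientation, a reminder hook of this kind could be as small as the sketch below. It is not the contents of check_eval_reminder.py; the function names and message text are hypothetical. The only details taken from this PR are that the hook watches AGENTS.md and SKILL.md and matches file basenames exactly.

```python
import sys
from pathlib import PurePosixPath

# Guidance files the hook watches (basenames, matched exactly).
GUIDANCE_BASENAMES = frozenset({"AGENTS.md", "SKILL.md"})


def changed_guidance_files(paths: list[str]) -> list[str]:
    """Return the paths whose basename is exactly a guidance file name."""
    return [p for p in paths if PurePosixPath(p).name in GUIDANCE_BASENAMES]


def main(argv: list[str]) -> int:
    """Print a reminder when guidance files are among the changed files."""
    hits = changed_guidance_files(argv)
    if hits:
        print("Guidance files changed: " + ", ".join(hits))
        print("Consider running: uv run dev/skill-evals/eval.py")
    # A reminder never blocks the commit.
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

Matching on the exact basename keeps a file such as `MY_AGENTS.md` from triggering the reminder.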

Future use

Auto-generate a summary table (like the one in this PR) from promptfoo results. This might be useful for pasting into PRs that change AGENTS.md; open to discussion.
As models improve, some guidance may become unnecessary — the eval helps determine which rules the model still depends on and which can be safely removed
Add cases for routing rules that matter — the prek hook reminds contributors to run the eval when AGENTS.md changes
Use the eval to validate moving guidance from AGENTS.md into skills — run before and after to confirm no regression
Currently tests Claude only; the architecture (promptfoo + structured output) could extend to other agent runtimes
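As a rough idea of the summary-table item above, the sketch below tallies pass counts per case and arm and renders markdown rows. The input shape (a list of dicts with `case`, `arm`, and `passed` keys) is an assumption made for illustration; it is not the promptfoo export format, so a real implementation would first need to map promptfoo's JSON report onto it.

```python
from collections import defaultdict


def summarize(results: list[dict]) -> dict[str, dict[str, tuple[int, int]]]:
    """Count (passed, total) runs for every case and arm."""
    counts: dict = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in results:
        cell = counts[r["case"]][r["arm"]]
        cell[0] += int(r["passed"])
        cell[1] += 1
    return {
        case: {arm: (p, t) for arm, (p, t) in arms.items()}
        for case, arms in counts.items()
    }


def render_rows(summary: dict[str, dict[str, tuple[int, int]]], arms: list[str]) -> str:
    """Render the summary as a markdown table with one column per arm."""
    lines = [
        "| Case | " + " | ".join(arms) + " |",
        "|---|" + "---|" * len(arms),
    ]
    for case, cells in summary.items():
        parts = []
        for arm in arms:
            passed, total = cells.get(arm, (0, 0))
            parts.append(f"{passed}/{total}")
        lines.append(f"| {case} | " + " | ".join(parts) + " |")
    return "\n".join(lines)
```

With `--repeat 3`, each cell would read like the `3/3` and `0/3` entries in the table in this PR description.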

Was generative AI tooling used to co-author this PR?

Yes, Claude Code (Opus 4.8)

Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
When adding dependency, check compliance with the ASF 3rd Party License Policy.
For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

RoyLee1224 · 2026-07-03T08:14:33Z

This PR focuses on the harness design plus one base case (the newsfragment golden rule); expanding coverage can come in follow-up PRs. I'm also not sure the current case design is right, so maintainers' thoughts are welcome!

potiuk · 2026-07-03T10:17:39Z

Nice. Just a static check failure :)

jason810496

Nice! It's really simple and lightweight (from the user setup perspective; I didn't check whether promptfoo itself is lightweight).

I will shortly add the cases I just thought of that are also suitable for the eval.

- Verify CLAUDE.md → AGENTS.md symlink before running; baseline arm removes both
- Pin promptfoo to 0.121.17
- Fail fast on git worktree errors and clean up worktrees on partial failure
- Reject SKILL_NAME + --full (baseline arm would always fail skill-used assert)
- Add tests for check_eval_reminder; match guidance file basenames exactly

- each run exports a JSON report to files/skill-evals/
- promptfoo db/cache and the SDK install move from the home directory to .build/ (tool state, not output); nothing lands in /Users/lizhechen anymore
- run-skill-eval only triggers on guidance changes during bulk manual runs; documented invocation gains --all-files
- README: cleanup section, stage-before-prek-run reminder

Direct npx invocation needed a manually installed SDK (promptfoo resolves it from the config directory); the prek env provides it. eval.py now points at when promptfoo is absent. Promptfoo flags remain wireable via hook-variant entry args.

RoyLee1224 requested review from amoghrajesh, ashb, bugraoz93, choo121600, ephraimbuddy, gopidesupavan, jason810496, jedcunningham, jscheffl, potiuk and vatsrahul1001 as code owners July 3, 2026 08:02

boring-cyborg Bot added the area:dev-tools and backport-to-v3-3-test labels Jul 3, 2026

RoyLee1224 force-pushed the feat/skill-eval-harness branch from b73f705 to e95339b July 3, 2026 08:02

RoyLee1224 changed the title ~~Feat/skill eval harness~~ Add eval harness for testing AGENTS.md changes Jul 3, 2026

potiuk reviewed Jul 3, 2026

Comment thread scripts/ci/prek/check_eval_reminder.py Outdated

potiuk reviewed Jul 3, 2026

Comment thread dev/skill-evals/eval.py

jason810496 reviewed Jul 3, 2026

jason810496 removed the backport-to-v3-3-test label Jul 3, 2026

RoyLee1224 force-pushed the feat/skill-eval-harness branch 4 times, most recently from fa09f92 to 1354e54 July 4, 2026 15:54

RoyLee1224 mentioned this pull request Jul 5, 2026

Add internal development MCP server for Airflow API as part of Breeze #69381

Open

1 task

RoyLee1224 force-pushed the feat/skill-eval-harness branch from 254e3ee to bf5eedd July 5, 2026 12:38

RoyLee1224 added 3 commits July 5, 2026 21:48

Add skill-eval harness scaffold with promptfoo

ae419c1

ci: Add prek hook to remind eval on AGENTS.md and SKILL.md changes

5ba0a65

ci: add OAuth auth and runtime config generation to skill-eval harness

8b1aeca

RoyLee1224 added 13 commits July 5, 2026 21:48

feat: add skill-eval harness for AGENTS.md regression testing

f4e702e

docs: update skill-eval README and use Helm routing as starter case

8e8243d

fix: point skill-eval reminder hook at eval.py, not nonexistent eval.sh

e5eda04

refactor: skip working arm when unchanged, show model in output

1ddad7d

refactor: skip working arm when unchanged, show model, use promptfoo@…

c603d6a

…latest

refactor: replace command-routing cases with newsfragment cases from …

7cef7d7

…real PRs

feat: replace eval reminder hook with hash-based proof gate

c525503

feat: run skill-eval via prek-managed node env, guard partial runs

1d1218f

refactor: build promptfoo config as a dict, serialize to JSON

384a1d4

Add positive newsfragment cases and view-skill-eval hook

344296a

- 62343: merged with 62343.feature.rst, ground truth should_create=true
- 65958: proposed as a should-create example by its author in review
- view-skill-eval manual hook reuses run-skill-eval's node env

RoyLee1224 force-pushed the feat/skill-eval-harness branch from bf5eedd to 344296a July 5, 2026 12:48

RoyLee1224 wants to merge 16 commits into apache:main from RoyLee1224:feat/skill-eval-harness

3 participants
