Add the LLM Validation gate for AGENTS.md + the libdatadog-bump skill #8845
NachoEchevarria wants to merge 13 commits
Benchmarks
Benchmark execution time: 2026-07-03 14:58:18
Comparing candidate commit 94fd792 in PR branch
Found 0 performance improvements and 1 performance regression! Performance is the same for 71 metrics, 0 unstable metrics, 59 known flaky benchmarks, 67 flaky benchmarks without significant changes.
Execution-Time Benchmarks Report ⏱️
Execution-time results for samples comparing This PR (8845) and master.
✅ No regressions detected - check the details below
Full Metrics Comparison
FakeDbCommand
HttpMessageHandler
Comparison explanation
Execution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).
Duration charts
Execution time (ms) for each sample, runtime, and scenario, given as mean (interval lower bound, interval upper bound):
| Sample | Runtime | Scenario | This PR (8845) | master |
| --- | --- | --- | --- | --- |
| FakeDbCommand | .NET Framework 4.8 | Baseline | 70 (68, 72) | 72 (69, 75) |
| FakeDbCommand | .NET Framework 4.8 | Bailout | 74 (73, 75) | 74 (72, 77) |
| FakeDbCommand | .NET Framework 4.8 | CallTarget+Inlining+NGEN | 1,086 (1036, 1135) | 1,082 (1039, 1124) |
| FakeDbCommand | .NET Core 3.1 | Baseline | 109 (106, 112) | 113 (107, 119) |
| FakeDbCommand | .NET Core 3.1 | Bailout | 110 (108, 112) | 110 (108, 113) |
| FakeDbCommand | .NET Core 3.1 | CallTarget+Inlining+NGEN | 778 (759, 798) | 781 (757, 805) |
| FakeDbCommand | .NET 6 | Baseline | 97 (93, 101) | 96 (93, 99) |
| FakeDbCommand | .NET 6 | Bailout | 98 (96, 99) | 98 (95, 100) |
| FakeDbCommand | .NET 6 | CallTarget+Inlining+NGEN | 945 (903, 988) | 940 (898, 983) |
| FakeDbCommand | .NET 8 | Baseline | 98 (91, 104) | 99 (92, 105) |
| FakeDbCommand | .NET 8 | Bailout | 99 (93, 104) | 96 (93, 100) |
| FakeDbCommand | .NET 8 | CallTarget+Inlining+NGEN | 812 (778, 846) | 814 (781, 847) |
| HttpMessageHandler | .NET Framework 4.8 | Baseline | 202 (197, 206) | 203 (198, 208) |
| HttpMessageHandler | .NET Framework 4.8 | Bailout | 206 (203, 210) | 206 (202, 210) |
| HttpMessageHandler | .NET Framework 4.8 | CallTarget+Inlining+NGEN | 1,217 (1151, 1283) | 1,210 (1173, 1246) |
| HttpMessageHandler | .NET Core 3.1 | Baseline | 293 (286, 300) | 291 (283, 298) |
| HttpMessageHandler | .NET Core 3.1 | Bailout | 295 (289, 301) | 294 (289, 298) |
| HttpMessageHandler | .NET Core 3.1 | CallTarget+Inlining+NGEN | 973 (955, 992) | 975 (952, 998) |
| HttpMessageHandler | .NET 6 | Baseline | 285 (278, 292) | 286 (278, 294) |
| HttpMessageHandler | .NET 6 | Bailout | 285 (280, 290) | 286 (282, 290) |
| HttpMessageHandler | .NET 6 | CallTarget+Inlining+NGEN | 1,173 (1140, 1207) | 1,171 (1132, 1210) |
| HttpMessageHandler | .NET 8 | Baseline | 282 (276, 289) | 284 (277, 291) |
| HttpMessageHandler | .NET 8 | Bailout | 283 (278, 287) | 284 (278, 290) |
| HttpMessageHandler | .NET 8 | CallTarget+Inlining+NGEN | 1,050 (1004, 1095) | 1,051 (1005, 1097) |
LLM Validation
LLM Validation Gate: dotnet-tracer-agent ✅ PASS
Analysis
Changed instruction file(s): 1 case(s) dipped but stayed within the pass bar.
Results
Cases
Per-dimension scores, token usage, latency, and estimated cost are in the CI job logs.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3d3520a32a
through everything I have to update in this repo and how to verify it.
expected_criteria:
- Updates BOTH version pins — build/cmake/FindLibdatadog.cmake (Linux/macOS) and the Windows vcpkg port (vcpkg.json + portfile.cmake).
- FindLibdatadog.cmake uses SHA-256 hashes and a v-prefixed version (e.g. v32.0.0); the vcpkg version-string has NO v prefix.
Use the libdatadog-dotnet version in this benchmark
When .claude/skills/bump-libdatadog/SKILL.md changes, this is the blocking skill case, but the example here uses upstream libdatadog v32.0.0; the existing skill explicitly says the repo pins the separate libdatadog-dotnet release version, and the current CMake pin is v2.0.0. This can make the judge reward an answer that bumps to the wrong GitHub release tag (or penalize the correct upstream-vs-dotnet distinction), so the criterion should use a libdatadog-dotnet example and call out that upstream versions are not the pinned value.
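
The criterion under review hinges on two mechanical facts: the CMake pin carries a `v` prefix while the vcpkg version-string does not, and the hashes are SHA-256. A minimal Python sketch of that consistency check follows. It assumes the two version strings and a digest have already been extracted from the files; `pins_agree` and `is_sha256` are hypothetical helpers, not part of the repo or the skill, and the `v2.0.0` value is taken from the review comment above.

```python
import re

# Assumption for this sketch: versions are plain MAJOR.MINOR.PATCH.
SEMVER_RE = re.compile(r"\d+\.\d+\.\d+")
# A SHA-256 digest is 64 hexadecimal characters.
SHA256_RE = re.compile(r"[0-9a-f]{64}")


def pins_agree(cmake_version, vcpkg_version):
    """True if the v-prefixed CMake pin and the bare vcpkg pin name the same release."""
    if not cmake_version.startswith("v"):
        return False  # the CMake pin is expected to carry the v prefix
    if vcpkg_version.startswith("v"):
        return False  # the vcpkg version-string is expected to have no prefix
    bare = cmake_version[1:]
    return SEMVER_RE.fullmatch(bare) is not None and bare == vcpkg_version


def is_sha256(digest):
    """True if the string has the shape of a SHA-256 hex digest."""
    return SHA256_RE.fullmatch(digest.lower()) is not None


if __name__ == "__main__":
    print(pins_agree("v2.0.0", "2.0.0"))
    print(pins_agree("v2.0.0", "v2.0.0"))
```

A check like this only confirms that the two pins match each other. It cannot tell a libdatadog-dotnet release from an upstream libdatadog release, which is the distinction the review asks the criterion to spell out.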
# List more than one to validate the combined instruction set (e.g. AGENTS.md + a Claude skill file).
instruction_files:
- AGENTS.md
- .claude/skills/bump-libdatadog/SKILL.md # also gate the libdatadog-bump skill
Gate the skill's helper script too
When only .claude/skills/bump-libdatadog/scripts/fetch-release-hashes.sh changes, the comments here say the CI trigger derives its watch paths from instruction_files, but the only watched skill path is SKILL.md. A broken checksum-fetching script would therefore bypass dotnet-tracer-bump-libdatadog-013 even though the skill tells agents to run it, so include the whole skill directory or the script path in the watched inputs.
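
The gap described here can be illustrated with a small Python sketch. It assumes the trigger compares each changed file against the watched paths and fires on an exact match or when the file lies under a watched directory; `triggers_gate` is a hypothetical function, since the platform's actual matching rules are not shown in this PR.

```python
from pathlib import PurePosixPath


def triggers_gate(changed_files, watch_paths):
    """True if any changed file equals a watched path or lies under a watched directory."""
    for changed in map(PurePosixPath, changed_files):
        for watched in map(PurePosixPath, watch_paths):
            if changed == watched or watched in changed.parents:
                return True
    return False


if __name__ == "__main__":
    script = ".claude/skills/bump-libdatadog/scripts/fetch-release-hashes.sh"

    # Watching only the listed files: a script-only change does not match.
    files_only = ["AGENTS.md", ".claude/skills/bump-libdatadog/SKILL.md"]
    print(triggers_gate([script], files_only))

    # Watching the whole skill directory: the same change now matches.
    with_directory = ["AGENTS.md", ".claude/skills/bump-libdatadog"]
    print(triggers_gate([script], with_directory))
```

Under these assumptions, listing the skill directory instead of `SKILL.md` alone is enough for a script-only change to run the gate.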
# footprint in this repo is the local .llm-validation/ suite. Pinned to the platform repo's `main`, which
# carries ci/ + the CLI on the gitlab.ddbuild.io mirror. See that repo's ci/README.md for configuration.
- project: 'DataDog/llm-validation-platform'
  ref: main
Pin the external CI template to an immutable ref
For every pipeline after this merge, this include follows the platform repo's moving main branch, so a later template change can alter or break dd-trace-dotnet's merge gate without any change in this repo. Since this job can block PRs, use a tag or commit SHA and bump it deliberately instead of relying on a mutable branch.
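
One way to enforce this in a lint step is to reject any include whose `ref` is not a full commit SHA. The Python sketch below assumes a SHA-1 repository (40 hexadecimal characters); `is_full_commit_sha` is a hypothetical helper, not an existing tool. Tags cannot be told apart from branch names by their spelling, so a check like this only covers the SHA case.

```python
import re

# Assumption: the platform repo uses SHA-1 object names (40 hex characters).
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def is_full_commit_sha(ref):
    """True if ref has the shape of a full, unabbreviated commit SHA."""
    return FULL_SHA_RE.fullmatch(ref) is not None


if __name__ == "__main__":
    print(is_full_commit_sha("main"))        # branch name: mutable
    print(is_full_commit_sha("3d3520a32a"))  # abbreviated SHA: rejected by this check
```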
What
Adopts the LLM Validation gate in dd-trace-dotnet. When a PR changes an AI-behavior file (`AGENTS.md` or a gated Claude skill), CI runs a benchmark suite that compares the baseline (master's instructions) against the candidate (this PR's) under an identical model / judge / case set, then posts a PASS/WARN/FAIL comment. It answers "did this doc edit actually make the agent better or worse?" with a blind, repeated pairwise signal instead of eyeballing prose.
Footprint in this repo
- `.llm-validation/config.yaml`: what's monitored (`AGENTS.md` + `.claude/skills/bump-libdatadog/SKILL.md`), the level presets, and the gate policy.
- `.llm-validation/suites/dotnet-tracer-agent-v0.1.yaml`: 17 tracer-specific benchmark cases (repo navigation, config keys, logging terminology, instrumentation debugging, the libdatadog bump, plus control cases).
- `.gitlab-ci.yml`: a single `include:` of the reusable job shipped by the platform repo. That's the whole footprint; the engine (CLI, judge, runner) lives in the platform repo.
How it behaves
- gate: ~6 curated high-signal cases at 8 runs each; `files:` targeting narrows further to the file(s) that actually changed.
- [...] `auth` anywhere), so it can't gate fork/external PRs.
The AGENTS.md change
The one-character em-dash tweak in `AGENTS.md` is intentional: a trivial, unarguably non-regressing edit that trips the gate so this PR's own pipeline demonstrates it running, rather than merging the gate unexercised. Expected verdict: PASS (no regression).
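
The comparison described under What (baseline versus candidate, judged blind over repeated runs, reduced to PASS/WARN/FAIL) can be sketched in a few lines of Python. This is an illustration only, not the platform's implementation: the function names, the "net preference" score, and both margins are assumptions invented for the example. The real policy lives in `.llm-validation/config.yaml` and the platform repo.

```python
from collections import Counter

# Hypothetical margins; the actual gate policy is configured elsewhere.
WARN_MARGIN = 0.25
FAIL_MARGIN = 0.50

SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def net_preference(judgments):
    """(candidate wins - baseline wins) / runs for one benchmark case.

    judgments holds one blind pairwise result per run:
    "candidate", "baseline", or "tie".
    """
    if not judgments:
        raise ValueError("a case needs at least one run")
    counts = Counter(judgments)
    return (counts["candidate"] - counts["baseline"]) / len(judgments)


def case_verdict(judgments):
    """Map one case's net preference onto PASS / WARN / FAIL."""
    score = net_preference(judgments)
    if score <= -FAIL_MARGIN:
        return "FAIL"
    if score <= -WARN_MARGIN:
        return "WARN"
    return "PASS"


def gate_verdict(results):
    """Overall verdict is the worst per-case verdict; no cases means PASS."""
    verdicts = [case_verdict(runs) for runs in results.values()]
    return max(verdicts, key=SEVERITY.__getitem__, default="PASS")


if __name__ == "__main__":
    results = {
        # 8 runs per case, matching the "8 runs each" preset described above.
        "dotnet-tracer-bump-libdatadog-013": ["candidate"] * 3 + ["baseline"] * 4 + ["tie"],
        "hypothetical-control-case": ["tie"] * 8,
    }
    print(gate_verdict(results))
```

In this sketch a case can dip (negative net preference) and still pass, as long as it stays above the warn margin, which mirrors the "dipped but stayed within the pass bar" wording in the gate's comment.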