Add the LLM Validation gate for AGENTS.md + the libdatadog-bump skill #8845

Open
NachoEchevarria wants to merge 13 commits into master from nacho/LLMPlatformJob

@NachoEchevarria NachoEchevarria commented Jun 29, 2026

Collaborator

What

Adopts the LLM Validation gate in dd-trace-dotnet. When a PR changes an AI-behavior file (AGENTS.md or a gated Claude skill), CI runs a benchmark suite that compares the baseline (master's instructions) against the candidate (this PR's) under an identical model / judge / case set, then posts a PASS/WARN/FAIL comment. It answers "did this doc edit actually make the agent better or worse?" with a blind, repeated pairwise signal instead of eyeballing prose.

Footprint in this repo

  • .llm-validation/config.yaml — what's monitored (AGENTS.md + .claude/skills/bump-libdatadog/SKILL.md), the level presets, and the gate policy.
  • .llm-validation/suites/dotnet-tracer-agent-v0.1.yaml — 17 tracer-specific benchmark cases (repo navigation, config keys, logging terminology, instrumentation debugging, the libdatadog bump, plus control cases).
  • .gitlab-ci.yml — a single include: of the reusable job shipped by the platform repo. That's the whole footprint; the engine (CLI, judge, runner) lives in the platform repo.
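
As a rough illustration of how the monitored paths could drive the trigger, the sketch below filters a list of changed paths down to the monitored ones. It assumes the changed paths were already collected (for example from a diff against master). The names `MONITORED` and `changed_monitored_files` are hypothetical and are not taken from the platform repo's CLI or config schema.

```python
from fnmatch import fnmatch

# Hypothetical constant: the two monitored paths named in the PR description.
MONITORED = [
    "AGENTS.md",
    ".claude/skills/bump-libdatadog/SKILL.md",
]


def changed_monitored_files(changed_paths, monitored=MONITORED):
    """Return the changed paths that match any monitored pattern.

    An empty result would mean the job can skip early.
    """
    return [
        path
        for path in changed_paths
        if any(fnmatch(path, pattern) for pattern in monitored)
    ]


if __name__ == "__main__":
    print(changed_monitored_files(["tracer/src/Foo.cs", "AGENTS.md"]))
```

A PR that touches none of the monitored paths yields an empty list, which corresponds to the cheap skip path.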

How it behaves

  • Triggers only when a monitored file changes (early git-diff skip otherwise, so other PRs cost ~nothing).
  • Default preset gate: ~6 curated high-signal cases at 8 runs each; files: targeting narrows further to the file(s) that actually changed.
  • Agent-under-test is Claude Code headless, run in a checkout of this repo (real navigation). One blind comparative judge, order-swapped, produces a win-rate. The gate only blocks on a confident regression (a new safety/bad signal, or a tight-CI pairwise loss); noisy or marginal changes WARN, never block.
  • Internal-only: runs on the ddbuild GitLab pipeline (needs the AI Gateway + authanywhere), so it can't gate fork/external PRs.
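
To make the blocking rule concrete, here is a minimal sketch of how a win-rate from blind pairwise comparisons could map to PASS/WARN/FAIL. Everything in it is an assumption for illustration: the Wilson score interval, the 0.5 cut-off, and the names `wilson_interval` and `gate_verdict` are not taken from the platform repo, whose actual gate policy lives in `.llm-validation/config.yaml`.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate of `wins` out of `n` comparisons."""
    if n == 0:
        return 0.0, 1.0
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def gate_verdict(candidate_wins, comparisons, new_safety_signal=False):
    """Hypothetical policy: block only on a confident regression."""
    if new_safety_signal:
        return "FAIL"  # a new safety/bad signal always blocks
    _low, high = wilson_interval(candidate_wins, comparisons)
    if high < 0.5:
        return "FAIL"  # whole interval below parity: confident pairwise loss
    if comparisons and candidate_wins / comparisons < 0.5:
        return "WARN"  # candidate looks worse, but the interval is too wide to block
    return "PASS"


if __name__ == "__main__":
    # 6 cases x 8 runs = 48 comparisons, matching the default preset's rough size.
    for wins in (10, 22, 24):
        print(wins, gate_verdict(wins, 48))
```

With 48 comparisons, a candidate that wins only 10 is a confident loss, while 22 wins is within noise and only warns.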

The AGENTS.md change

The one-character em-dash tweak in AGENTS.md is intentional: a trivial, unarguably-non-regressing edit that trips the gate so this PR's own pipeline demonstrates it running — rather than merging the gate unexercised. Expected verdict: PASS (no regression).

@NachoEchevarria NachoEchevarria added the area:builds project files, build scripts, pipelines, versioning, releases, packages label Jun 29, 2026
pr-commenter Bot commented Jun 29, 2026

Benchmarks

Benchmark execution time: 2026-07-03 14:58:18

Comparing candidate commit 94fd792 in PR branch nacho/LLMPlatformJob with baseline commit bb5a507 in branch master.

📊 Benchmarking dashboard

Found 0 performance improvements and 1 performance regression! Performance is the same for 71 metrics, with 0 unstable metrics, 59 known flaky benchmarks, and 67 flaky benchmarks without significant changes.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'
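
For readers who want to see how such an interval can be produced, the sketch below estimates a CI over the relative difference of means with a percentile bootstrap. The report does not say which method the benchmarking platform uses, so the bootstrap, the function name `relative_diff_ci`, and the sample data are all assumptions.

```python
import random
from statistics import mean


def relative_diff_ci(baseline, candidate, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for (mean(candidate) - mean(baseline)) / mean(baseline).

    The baseline is the reference, as in the report. Assumes the baseline
    mean is non-zero.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the original sizes.
        b = [rng.choice(baseline) for _ in baseline]
        c = [rng.choice(candidate) for _ in candidate]
        diffs.append((mean(c) - mean(b)) / mean(b))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


if __name__ == "__main__":
    baseline = [100.0, 101.0, 99.0, 100.5, 99.5, 100.2, 99.8, 100.1]
    candidate = [x * 1.02 for x in baseline]  # made-up data: 2% slower
    print(relative_diff_ci(baseline, candidate))
```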

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
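
The decision rule in the two diagrams can be sketched as a small function. This is one reading of "entirely outside the threshold" (the whole CI above `+threshold` or below `-threshold`); the function name `classify_change` and the `higher_is_worse` flag are hypothetical, and the real values of SIGNIFICANT_IMPACT_THRESHOLD are not stated in this report.

```python
def classify_change(ci_low, ci_high, threshold, higher_is_worse=True):
    """Classify a CI over the relative difference of means (all values in %).

    Returns "worse", "better", or "same". `higher_is_worse` is True for
    metrics such as execution time and False for metrics such as throughput.
    """
    if ci_low > threshold:
        increased = True  # the whole CI lies above +threshold
    elif ci_high < -threshold:
        increased = False  # the whole CI lies below -threshold
    else:
        return "same"  # the CI overlaps the threshold band
    return "worse" if increased == higher_is_worse else "better"


if __name__ == "__main__":
    # The two intervals drawn in the diagrams, with a 1% threshold.
    print(classify_change(-0.6, 1.2, 1.0))
    print(classify_change(1.3, 3.1, 1.0))
```

The first interval straddles zero and is not significant; the second lies entirely above the 1% threshold, so for an execution time metric it counts as worse.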

scenario:Benchmarks.Trace.DbCommandBenchmark.ExecuteNonQuery net472

  • 🟥 throughput [-22858.583op/s; -18870.138op/s] or [-6.438%; -5.315%]

Known flaky benchmarks

These benchmarks are marked as flaky and will not trigger a failure. Modify FLAKY_BENCHMARKS_REGEX to control which benchmarks are marked as flaky.

scenario:Benchmarks.Trace.ActivityBenchmark.StartStopWithChild net472

  • 🟥 throughput [-7117.922op/s; -6584.534op/s] or [-8.440%; -7.807%]

scenario:Benchmarks.Trace.AgentWriterBenchmark.WriteAndFlushEnrichedTraces net472

  • 🟥 execution_time [+317.345ms; +322.765ms] or [+157.478%; +160.168%]
  • 🟥 throughput [-42.557op/s; -38.418op/s] or [-7.657%; -6.912%]

scenario:Benchmarks.Trace.AgentWriterBenchmark.WriteAndFlushEnrichedTraces net6.0

  • 🟥 execution_time [+378.287ms; +380.391ms] or [+298.870%; +300.532%]
  • 🟩 throughput [+93.713op/s; +97.270op/s] or [+12.356%; +12.825%]

scenario:Benchmarks.Trace.AgentWriterBenchmark.WriteAndFlushEnrichedTraces netcoreapp3.1

  • 🟥 execution_time [+393.503ms; +395.614ms] or [+348.235%; +350.103%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.AllCycleMoreComplexBody net472

  • 🟥 allocated_mem [+1.308KB; +1.308KB] or [+27.528%; +27.540%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.AllCycleMoreComplexBody net6.0

  • 🟥 allocated_mem [+471 bytes; +472 bytes] or [+9.976%; +9.987%]
  • 🟩 execution_time [-15.820ms; -11.625ms] or [-7.389%; -5.429%]
  • 🟩 throughput [+7049.344op/s; +9819.150op/s] or [+5.146%; +7.167%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.AllCycleMoreComplexBody netcoreapp3.1

  • 🟥 allocated_mem [+1.272KB; +1.272KB] or [+27.500%; +27.510%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.AllCycleSimpleBody net472

  • 🟥 allocated_mem [+1.307KB; +1.307KB] or [+105.743%; +105.758%]
  • 🟥 throughput [-259311.183op/s; -255169.894op/s] or [-26.477%; -26.054%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.AllCycleSimpleBody net6.0

  • 🟥 allocated_mem [+471 bytes; +472 bytes] or [+38.557%; +38.566%]
  • 🟩 execution_time [-26.358ms; -20.002ms] or [-11.754%; -8.920%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.AllCycleSimpleBody netcoreapp3.1

  • 🟥 allocated_mem [+1.272KB; +1.272KB] or [+105.288%; +105.304%]
  • 🟥 throughput [-163489.990op/s; -137716.062op/s] or [-23.490%; -19.787%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorMoreComplexBody net6.0

  • 🟩 throughput [+9237.004op/s; +12237.551op/s] or [+5.877%; +7.787%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorMoreComplexBody netcoreapp3.1

  • 🟩 throughput [+10248.815op/s; +12923.315op/s] or [+8.165%; +10.295%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorSimpleBody net6.0

  • 🟩 throughput [+477695.953op/s; +499765.399op/s] or [+15.928%; +16.664%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorSimpleBody netcoreapp3.1

  • 🟩 execution_time [-18.699ms; -14.368ms] or [-8.619%; -6.623%]

scenario:Benchmarks.Trace.Asm.AppSecEncoderBenchmark.EncodeArgs net472

  • 🟥 execution_time [+300.060ms; +300.767ms] or [+149.930%; +150.283%]

scenario:Benchmarks.Trace.Asm.AppSecEncoderBenchmark.EncodeArgs net6.0

  • 🟥 execution_time [+299.702ms; +302.789ms] or [+151.140%; +152.697%]

scenario:Benchmarks.Trace.Asm.AppSecEncoderBenchmark.EncodeArgs netcoreapp3.1

  • 🟥 execution_time [+299.986ms; +303.074ms] or [+151.110%; +152.665%]

scenario:Benchmarks.Trace.Asm.AppSecEncoderBenchmark.EncodeLegacyArgs net472

  • 🟥 execution_time [+297.556ms; +298.924ms] or [+146.148%; +146.820%]

scenario:Benchmarks.Trace.Asm.AppSecEncoderBenchmark.EncodeLegacyArgs net6.0

  • 🟥 execution_time [+291.863ms; +295.182ms] or [+142.681%; +144.303%]

scenario:Benchmarks.Trace.Asm.AppSecEncoderBenchmark.EncodeLegacyArgs netcoreapp3.1

  • 🟥 execution_time [+298.057ms; +300.793ms] or [+148.969%; +150.336%]

scenario:Benchmarks.Trace.Asm.AppSecWafBenchmark.RunWafRealisticBenchmarkWithAttack net6.0

  • 🟥 execution_time [+22.932µs; +46.606µs] or [+7.321%; +14.879%]
  • 🟥 throughput [-433.399op/s; -234.236op/s] or [-13.510%; -7.302%]

scenario:Benchmarks.Trace.AspNetCoreBenchmark.SendRequest net472

  • 🟥 execution_time [+299.756ms; +300.519ms] or [+149.609%; +149.990%]

scenario:Benchmarks.Trace.AspNetCoreBenchmark.SendRequest net6.0

  • unstable execution_time [+332.283ms; +390.367ms] or [+361.039%; +424.150%]
  • 🟩 throughput [+930.953op/s; +1129.115op/s] or [+7.650%; +9.278%]

scenario:Benchmarks.Trace.AspNetCoreBenchmark.SendRequest netcoreapp3.1

  • 🟥 execution_time [+367.911ms; +371.433ms] or [+279.351%; +282.026%]

scenario:Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces net472

  • unstable execution_time [+321.500ms; +376.179ms] or [+147.823%; +172.963%]
  • 🟥 throughput [-508.698op/s; -464.563op/s] or [-46.093%; -42.094%]

scenario:Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces net6.0

  • unstable execution_time [+202.405ms; +335.807ms] or [+86.256%; +143.107%]
  • 🟥 throughput [-689.891op/s; -605.843op/s] or [-46.016%; -40.410%]

scenario:Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces netcoreapp3.1

  • 🟥 execution_time [+329.368ms; +336.792ms] or [+197.000%; +201.441%]
  • 🟥 throughput [-378.885op/s; -343.441op/s] or [-26.381%; -23.913%]

scenario:Benchmarks.Trace.CharSliceBenchmark.OptimizedCharSliceWithPool net6.0

  • 🟩 throughput [+47.247op/s; +67.614op/s] or [+5.094%; +7.290%]

scenario:Benchmarks.Trace.CharSliceBenchmark.OriginalCharSlice net6.0

  • 🟩 throughput [+26.944op/s; +43.640op/s] or [+5.319%; +8.615%]

scenario:Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearch net472

  • 🟥 execution_time [+300.504ms; +303.547ms] or [+151.328%; +152.861%]

scenario:Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearch net6.0

  • 🟥 execution_time [+300.612ms; +301.959ms] or [+150.637%; +151.312%]

scenario:Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearch netcoreapp3.1

  • 🟥 execution_time [+301.075ms; +305.710ms] or [+151.247%; +153.576%]

scenario:Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearchAsync net472

  • 🟥 execution_time [+302.615ms; +303.924ms] or [+151.963%; +152.620%]

scenario:Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearchAsync net6.0

  • 🟥 execution_time [+297.338ms; +298.972ms] or [+147.020%; +147.828%]

scenario:Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearchAsync netcoreapp3.1

  • 🟥 execution_time [+302.021ms; +305.601ms] or [+153.078%; +154.892%]

scenario:Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync net472

  • 🟥 execution_time [+300.191ms; +302.149ms] or [+150.669%; +151.652%]

scenario:Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync net6.0

  • 🟥 execution_time [+300.880ms; +302.981ms] or [+149.961%; +151.008%]
  • 🟩 throughput [+45360.208op/s; +50581.735op/s] or [+9.007%; +10.044%]

scenario:Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync netcoreapp3.1

  • 🟥 execution_time [+301.611ms; +304.500ms] or [+150.049%; +151.486%]

scenario:Benchmarks.Trace.ILoggerBenchmark.EnrichedLog net6.0

  • 🟩 execution_time [-16.297ms; -12.606ms] or [-7.578%; -5.862%]
  • 🟩 throughput [+23891.331op/s; +30597.357op/s] or [+6.554%; +8.394%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatAspectBenchmark net472

  • unstable execution_time [+14.387µs; +57.667µs] or [+3.554%; +14.244%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatAspectBenchmark net6.0

  • 🟩 allocated_mem [-25.540KB; -25.516KB] or [-9.316%; -9.308%]
  • unstable execution_time [-61.126µs; -7.347µs] or [-12.081%; -1.452%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatAspectBenchmark netcoreapp3.1

  • unstable execution_time [-49.618µs; +12.189µs] or [-8.599%; +2.112%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatBenchmark net6.0

  • unstable execution_time [+6.739µs; +11.536µs] or [+15.929%; +27.267%]
  • 🟥 throughput [-5065.524op/s; -3138.179op/s] or [-21.324%; -13.211%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatBenchmark netcoreapp3.1

  • unstable execution_time [-13.686µs; -5.912µs] or [-21.233%; -9.172%]
  • unstable throughput [+1431.958op/s; +3102.722op/s] or [+8.786%; +19.036%]

scenario:Benchmarks.Trace.Log4netBenchmark.EnrichedLog net472

  • 🟥 execution_time [+301.314ms; +302.354ms] or [+152.301%; +152.826%]

scenario:Benchmarks.Trace.Log4netBenchmark.EnrichedLog net6.0

  • 🟥 execution_time [+302.597ms; +305.005ms] or [+154.021%; +155.247%]

scenario:Benchmarks.Trace.Log4netBenchmark.EnrichedLog netcoreapp3.1

  • 🟥 execution_time [+300.485ms; +302.928ms] or [+150.430%; +151.653%]

scenario:Benchmarks.Trace.SerilogBenchmark.EnrichedLog net472

  • 🟥 execution_time [+298.608ms; +300.580ms] or [+148.829%; +149.812%]

scenario:Benchmarks.Trace.SerilogBenchmark.EnrichedLog net6.0

  • 🟥 execution_time [+301.954ms; +303.215ms] or [+151.627%; +152.260%]

scenario:Benchmarks.Trace.SerilogBenchmark.EnrichedLog netcoreapp3.1

  • 🟥 execution_time [+303.442ms; +305.722ms] or [+153.886%; +155.043%]

scenario:Benchmarks.Trace.SingleSpanAspNetCoreBenchmark.SingleSpanAspNetCore net472

  • 🟥 execution_time [+299.833ms; +300.838ms] or [+149.558%; +150.060%]
  • 🟩 throughput [+61017629.045op/s; +61369280.650op/s] or [+44.437%; +44.693%]

scenario:Benchmarks.Trace.SingleSpanAspNetCoreBenchmark.SingleSpanAspNetCore net6.0

  • 🟥 execution_time [+417.879ms; +420.894ms] or [+519.707%; +523.457%]

scenario:Benchmarks.Trace.SingleSpanAspNetCoreBenchmark.SingleSpanAspNetCore netcoreapp3.1

  • 🟥 execution_time [+299.718ms; +300.978ms] or [+149.493%; +150.121%]

scenario:Benchmarks.Trace.SpanBenchmark.StartFinishScope net6.0

  • 🟩 throughput [+100036.307op/s; +109981.833op/s] or [+9.340%; +10.269%]

scenario:Benchmarks.Trace.SpanBenchmark.StartFinishScope netcoreapp3.1

  • 🟩 throughput [+57347.849op/s; +78244.762op/s] or [+6.638%; +9.057%]

scenario:Benchmarks.Trace.SpanBenchmark.StartFinishSpan net6.0

  • 🟩 throughput [+90871.285op/s; +121792.984op/s] or [+7.034%; +9.427%]

scenario:Benchmarks.Trace.SpanBenchmark.StartFinishSpan netcoreapp3.1

  • 🟩 throughput [+90842.892op/s; +97934.267op/s] or [+9.022%; +9.726%]

scenario:Benchmarks.Trace.SpanBenchmark.StartFinishTwoScopes net6.0

  • 🟩 throughput [+41475.635op/s; +49252.899op/s] or [+7.531%; +8.943%]

scenario:Benchmarks.Trace.TraceAnnotationsBenchmark.RunOnMethodBegin net6.0

  • 🟩 throughput [+68230.739op/s; +88728.172op/s] or [+7.623%; +9.913%]

Known flaky benchmarks without significant changes:

  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_AddEvent_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_AddEvent_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_AddEvent_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_GetContext_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_GetContext_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_GetContext_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_SetAttributes_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_SetAttributes_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_SetAttributes_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_SetStatus_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_SetStatus_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_SetStatus_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_UpdateName_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_UpdateName_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.ActivityBenchmark.StartSpan_UpdateName_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_AddEvent_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_AddEvent_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_AddEvent_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_GetContext_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_GetContext_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_GetContext_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_RecordException_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_RecordException_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_RecordException_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_SetAttributes_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_SetAttributes_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_SetAttributes_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_SetStatus_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_SetStatus_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_SetStatus_Sampled netcoreapp3.1
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_UpdateName_Sampled net472
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_UpdateName_Sampled net6.0
  • scenario:Benchmarks.OpenTelemetry.InstrumentedApi.Trace.TelemetrySpanBenchmark.StartSpan_UpdateName_Sampled netcoreapp3.1
  • scenario:Benchmarks.Trace.ActivityBenchmark.StartStopWithChild net6.0
  • scenario:Benchmarks.Trace.ActivityBenchmark.StartStopWithChild netcoreapp3.1
  • scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorMoreComplexBody net472
  • scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorSimpleBody net472
  • scenario:Benchmarks.Trace.Asm.AppSecWafBenchmark.RunWafRealisticBenchmark net472
  • scenario:Benchmarks.Trace.Asm.AppSecWafBenchmark.RunWafRealisticBenchmark net6.0
  • scenario:Benchmarks.Trace.Asm.AppSecWafBenchmark.RunWafRealisticBenchmark netcoreapp3.1
  • scenario:Benchmarks.Trace.Asm.AppSecWafBenchmark.RunWafRealisticBenchmarkWithAttack net472
  • scenario:Benchmarks.Trace.Asm.AppSecWafBenchmark.RunWafRealisticBenchmarkWithAttack netcoreapp3.1
  • scenario:Benchmarks.Trace.CharSliceBenchmark.OptimizedCharSlice net472
  • scenario:Benchmarks.Trace.CharSliceBenchmark.OptimizedCharSlice net6.0
  • scenario:Benchmarks.Trace.CharSliceBenchmark.OptimizedCharSlice netcoreapp3.1
  • scenario:Benchmarks.Trace.CharSliceBenchmark.OptimizedCharSliceWithPool net472
  • scenario:Benchmarks.Trace.CharSliceBenchmark.OptimizedCharSliceWithPool netcoreapp3.1
  • scenario:Benchmarks.Trace.CharSliceBenchmark.OriginalCharSlice net472
  • scenario:Benchmarks.Trace.CharSliceBenchmark.OriginalCharSlice netcoreapp3.1
  • scenario:Benchmarks.Trace.ILoggerBenchmark.EnrichedLog net472
  • scenario:Benchmarks.Trace.ILoggerBenchmark.EnrichedLog netcoreapp3.1
  • scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatBenchmark net472
  • scenario:Benchmarks.Trace.RedisBenchmark.SendReceive net472
  • scenario:Benchmarks.Trace.RedisBenchmark.SendReceive net6.0
  • scenario:Benchmarks.Trace.RedisBenchmark.SendReceive netcoreapp3.1
  • scenario:Benchmarks.Trace.SpanBenchmark.StartFinishScope net472
  • scenario:Benchmarks.Trace.SpanBenchmark.StartFinishSpan net472
  • scenario:Benchmarks.Trace.SpanBenchmark.StartFinishTwoScopes net472
  • scenario:Benchmarks.Trace.SpanBenchmark.StartFinishTwoScopes netcoreapp3.1
  • scenario:Benchmarks.Trace.TraceAnnotationsBenchmark.RunOnMethodBegin net472
  • scenario:Benchmarks.Trace.TraceAnnotationsBenchmark.RunOnMethodBegin netcoreapp3.1

dd-trace-dotnet-ci-bot Bot commented Jun 29, 2026

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing This PR (8845) and master.

✅ No regressions detected - check the details below

Full Metrics Comparison

FakeDbCommand

| Metric | Master (Mean ± 95% CI) | Current (Mean ± 95% CI) | Change | Status |
|---|---|---|---|---|
| .NET Framework 4.8 - Baseline | | | | |
| duration | 71.86 ± (71.80 - 72.26) ms | 69.78 ± (69.81 - 70.10) ms | -2.9% | |
| .NET Framework 4.8 - Bailout | | | | |
| duration | 74.17 ± (74.13 - 74.53) ms | 74.03 ± (73.90 - 74.19) ms | -0.2% | |
| .NET Framework 4.8 - CallTarget+Inlining+NGEN | | | | |
| duration | 1080.53 ± (1078.82 - 1084.59) ms | 1082.80 ± (1082.37 - 1089.12) ms | +0.2% | ✅⬆️ |
| .NET Core 3.1 - Baseline | | | | |
| process.internal_duration_ms | 22.45 ± (22.40 - 22.50) ms | 22.00 ± (21.96 - 22.03) ms | -2.0% | |
| process.time_to_main_ms | 82.87 ± (82.59 - 83.15) ms | 80.61 ± (80.44 - 80.78) ms | -2.7% | |
| runtime.dotnet.exceptions.count | 0 ± (0 - 0) | 0 ± (0 - 0) | +0.0% | |
| runtime.dotnet.mem.committed | 10.91 ± (10.91 - 10.91) MB | 10.93 ± (10.92 - 10.93) MB | +0.1% | ✅⬆️ |
| runtime.dotnet.threads.count | 12 ± (12 - 12) | 12 ± (12 - 12) | +0.0% | |
| .NET Core 3.1 - Bailout | | | | |
| process.internal_duration_ms | 22.09 ± (22.06 - 22.13) ms | 21.88 ± (21.85 - 21.91) ms | -1.0% | |
| process.time_to_main_ms | 81.71 ± (81.58 - 81.84) ms | 81.61 ± (81.48 - 81.74) ms | -0.1% | |
| runtime.dotnet.exceptions.count | 0 ± (0 - 0) | 0 ± (0 - 0) | +0.0% | |
| runtime.dotnet.mem.committed | 10.95 ± (10.95 - 10.96) MB | 10.97 ± (10.96 - 10.97) MB | +0.2% | ✅⬆️ |
| runtime.dotnet.threads.count | 13 ± (13 - 13) | 13 ± (13 - 13) | +0.0% | |
| .NET Core 3.1 - CallTarget+Inlining+NGEN | | | | |
| process.internal_duration_ms | 212.06 ± (211.17 - 212.96) ms | 210.10 ± (209.12 - 211.09) ms | -0.9% | |
| process.time_to_main_ms | 531.66 ± (530.49 - 532.83) ms | 530.95 ± (529.49 - 532.41) ms | -0.1% | |
| runtime.dotnet.exceptions.count | 0 ± (0 - 0) | 0 ± (0 - 0) | +0.0% | |
| runtime.dotnet.mem.committed | 49.18 ± (49.15 - 49.21) MB | 49.17 ± (49.14 - 49.21) MB | -0.0% | |
| runtime.dotnet.threads.count | 28 ± (28 - 28) | 28 ± (28 - 28) | +0.2% | ✅⬆️ |
| .NET 6 - Baseline | | | | |
| process.internal_duration_ms | 20.81 ± (20.78 - 20.84) ms | 21.02 ± (20.99 - 21.05) ms | +1.0% | ✅⬆️ |
| process.time_to_main_ms | 69.69 ± (69.57 - 69.81) ms | 70.27 ± (70.11 - 70.43) ms | +0.8% | ✅⬆️ |
| runtime.dotnet.exceptions.count | 0 ± (0 - 0) | 0 ± (0 - 0) | +0.0% | |
| runtime.dotnet.mem.committed | 10.63 ± (10.63 - 10.63) MB | 10.64 ± (10.64 - 10.64) MB | +0.1% | ✅⬆️ |
| runtime.dotnet.threads.count | 10 ± (10 - 10) | 10 ± (10 - 10) | +0.0% | |
| .NET 6 - Bailout | | | | |
| process.internal_duration_ms | 20.76 ± (20.72 - 20.80) ms | 20.93 ± (20.89 - 20.96) ms | +0.8% | ✅⬆️ |
| process.time_to_main_ms | 70.64 ± (70.52 - 70.77) ms | 70.89 ± (70.78 - 71.01) ms | +0.4% | ✅⬆️ |
| runtime.dotnet.exceptions.count | 0 ± (0 - 0) | 0 ± (0 - 0) | +0.0% | |
| runtime.dotnet.mem.committed | 10.75 ± (10.74 - 10.75) MB | 10.76 ± (10.76 - 10.76) MB | +0.1% | ✅⬆️ |
| runtime.dotnet.threads.count | 11 ± (11 - 11) | 11 ± (11 - 11) | +0.0% | |
| .NET 6 - CallTarget+Inlining+NGEN | | | | |
| process.internal_duration_ms | 370.61 ± (368.17 - 373.06) ms | 371.80 ± (369.52 - 374.09) ms | +0.3% | ✅⬆️ |
| process.time_to_main_ms | 536.79 ± (535.64 - 537.93) ms | 538.17 ± (537.01 - 539.33) ms | +0.3% | ✅⬆️ |
| runtime.dotnet.exceptions.count | 0 ± (0 - 0) | 0 ± (0 - 0) | +0.0% | |
| runtime.dotnet.mem.committed | 50.33 ± (50.31 - 50.36) MB | 50.23 ± (50.21 - 50.25) MB | -0.2% | |
| runtime.dotnet.threads.count | 28 ± (28 - 28) | 28 ± (28 - 28) | -0.1% | |
| .NET 8 - Baseline | | | | |
| process.internal_duration_ms | 19.48 ± (19.43 - 19.54) ms | 19.34 ± (19.30 - 19.38) ms | -0.7% | |
| process.time_to_main_ms | 71.71 ± (71.46 - 71.96) ms | 71.07 ± (70.80 - 71.35) ms | -0.9% | |
| runtime.dotnet.exceptions.count | 0 ± (0 - 0) | 0 ± (0 - 0) | +0.0% | |
| runtime.dotnet.mem.committed | 7.68 ± (7.67 - 7.68) MB | 7.70 ± (7.69 - 7.70) MB | +0.3% | ✅⬆️ |
| runtime.dotnet.threads.count | 10 ± (10 - 10) | 10 ± (10 - 10) | +0.0% | |
| .NET 8 - Bailout | | | | |
| process.internal_duration_ms | 19.14 ± (19.11 - 19.17) ms | 19.24 ± (19.20 - 19.27) ms | +0.5% | ✅⬆️ |
| process.time_to_main_ms | 70.53 ± (70.37 - 70.69) ms | 71.98 ± (71.76 - 72.21) ms | +2.1% | ✅⬆️ |
| runtime.dotnet.exceptions.count | 0 ± (0 - 0) | 0 ± (0 - 0) | +0.0% | |
| runtime.dotnet.mem.committed | 7.72 ± (7.72 - 7.73) MB | 7.73 ± (7.73 - 7.74) MB | +0.2% | ✅⬆️ |
| runtime.dotnet.threads.count | 11 ± (11 - 11) | 11 ± (11 - 11) | +0.0% | |
| .NET 8 - CallTarget+Inlining+NGEN | | | | |
| process.internal_duration_ms | 299.62 ± (297.52 - 301.73) ms | 297.28 ± (294.95 - 299.61) ms | -0.8% | |
| process.time_to_main_ms | 485.90 ± (484.92 - 486.88) ms | 484.69 ± (483.71 - 485.68) ms | -0.2% | |
| runtime.dotnet.exceptions.count | 0 ± (0 - 0) | 0 ± (0 - 0) | +0.0% | |
| runtime.dotnet.mem.committed | 37.70 ± (37.67 - 37.73) MB | 37.73 ± (37.70 - 37.76) MB | +0.1% | ✅⬆️ |
| runtime.dotnet.threads.count | 27 ± (27 - 27) | 27 ± (27 - 27) | +0.1% | ✅⬆️ |

HttpMessageHandler

| Metric | Master (Mean ± 95% CI) | Current (Mean ± 95% CI) | Change | Status |
|---|---|---|---|---|
| .NET Framework 4.8 - Baseline | | | | |
| duration | 202.80 ± (202.62 - 203.49) ms | 201.74 ± (201.44 - 202.34) ms | -0.5% | |
| .NET Framework 4.8 - Bailout | | | | |
| duration | 205.86 ± (205.56 - 206.43) ms | 206.28 ± (206.02 - 206.69) ms | +0.2% | ✅⬆️ |
| .NET Framework 4.8 - CallTarget+Inlining+NGEN | | | | |
| duration | 1208.34 ± (1207.02 - 1212.60) ms | 1211.91 ± (1212.73 - 1221.40) ms | +0.3% | ✅⬆️ |
| .NET Core 3.1 - Baseline | | | | |
| process.internal_duration_ms | 195.99 ± (195.48 - 196.50) ms | 196.89 ± (196.44 - 197.35) ms | +0.5% | ✅⬆️ |
| process.time_to_main_ms | 84.97 ± (84.72 - 85.22) ms | 86.06 ± (85.74 - 86.39) ms | +1.3% | ✅⬆️ |
| runtime.dotnet.exceptions.count | 3 ± (3 - 3) | 3 ± (3 - 3) | +0.0% | |
| runtime.dotnet.mem.committed | 16.07 ± (16.05 - 16.09) MB | 16.01 ± (15.99 - 16.03) MB | -0.4% | |
| runtime.dotnet.threads.count | 20 ± (20 - 20) | 20 ± (20 - 20) | -0.3% | |
| .NET Core 3.1 - Bailout | | | | |
| process.internal_duration_ms | 196.79 ± (196.43 - 197.16) ms | 197.67 ± (197.28 - 198.05) ms | +0.4% | ✅⬆️ |
| process.time_to_main_ms | 87.08 ± (86.87 - 87.30) ms | 87.26 ± (86.98 - 87.54) ms | +0.2% | ✅⬆️ |
| runtime.dotnet.exceptions.count | 3 ± (3 - 3) | 3 ± (3 - 3) | +0.0% | |
| runtime.dotnet.mem.committed | 16.10 ± (16.08 - 16.12) MB | 16.08 ± (16.05 - 16.10) MB | -0.1% | |
| runtime.dotnet.threads.count | 21 ± (20 - 21) | 21 ± (21 - 21) | +0.9% | ✅⬆️ |
| .NET Core 3.1 - CallTarget+Inlining+NGEN | | | | |
| process.internal_duration_ms | 388.94 ± (387.48 - 390.41) ms | 387.06 ± (386.02 - 388.11) ms | -0.5% | |
| process.time_to_main_ms | 545.82 ± (544.71 - 546.93) ms | 545.27 ± (544.30 - 546.23) ms | -0.1% | |
| runtime.dotnet.exceptions.count | 3 ± (3 - 3) | 3 ± (3 - 3) | +0.0% | |
| runtime.dotnet.mem.committed | 58.46 ± (58.24 - 58.68) MB | 58.02 ± (57.82 - 58.22) MB | -0.8% | |
| runtime.dotnet.threads.count | 30 ± (30 - 30) | 30 ± (30 - 30) | +0.0% | ✅⬆️ |
| .NET 6 - Baseline | | | | |
| process.internal_duration_ms | 201.76 ± (201.32 - 202.21) ms | 201.04 ± (200.65 - 201.43) ms | -0.4% | |
| process.time_to_main_ms | 74.79 ± (74.50 - 75.07) ms | 74.21 ± (73.99 - 74.44) ms | -0.8% | |
| runtime.dotnet.exceptions.count | 4 ± (4 - 4) | 4 ± (4 - 4) | +0.0% | |
| runtime.dotnet.mem.committed | 16.36 ± (16.32 - 16.39) MB | 16.37 ± (16.35 - 16.39) MB | +0.1% | ✅⬆️ |
| runtime.dotnet.threads.count | 19 ± (19 - 19) | 19 ± (19 - 19) | -0.2% | |
| .NET 6 - Bailout | | | | |
| process.internal_duration_ms | 201.17 ± (200.84 - 201.51) ms | 200.39 ± (200.01 - 200.76) ms | -0.4% | |
| process.time_to_main_ms | 75.51 ± (75.35 - 75.68) ms | 75.16 ± (74.94 - 75.38) ms | -0.5% | |
| runtime.dotnet.exceptions.count | 4 ± (4 - 4) | 4 ± (4 - 4) | +0.0% | |
| runtime.dotnet.mem.committed | 16.44 ± (16.41 - 16.47) MB | 16.41 ± (16.38 - 16.45) MB | -0.2% | |
| runtime.dotnet.threads.count | 20 ± (20 - 20) | 20 ± (20 - 20) | -0.4% | |
| .NET 6 - CallTarget+Inlining+NGEN | | | | |
| process.internal_duration_ms | 583.54 ± (580.96 - 586.13) ms | 583.46 ± (581.17 - 585.76) ms | -0.0% | |
| process.time_to_main_ms | 553.95 ± (552.99 - 554.90) ms | 556.12 ± (554.93 - 557.32) ms | +0.4% | ✅⬆️ |
| runtime.dotnet.exceptions.count | 4 ± (4 - 4) | 4 ± (4 - 4) | +0.0% | |
| runtime.dotnet.mem.committed | 61.42 ± (61.34 - 61.50) MB | 61.35 ± (61.26 - 61.44) MB | -0.1% | |
| runtime.dotnet.threads.count | 31 ± (31 - 31) | 31 ± (31 - 31) | -0.5% | |
| .NET 8 - Baseline | | | | |
| process.internal_duration_ms | 200.04 ± (199.64 - 200.44) ms | 198.63 ± (198.21 - 199.04) ms | -0.7% | |
| process.time_to_main_ms | 74.02 ± (73.75 - 74.29) ms | 73.46 ± (73.23 - 73.69) ms | -0.8% | |
| runtime.dotnet.exceptions.count | 4 ± (4 - 4) | 4 ± (4 - 4) | +0.0% | |
| runtime.dotnet.mem.committed | 11.72 ± (11.70 - 11.74) MB | 11.75 ± (11.73 - 11.77) MB | +0.2% | ✅⬆️ |
| runtime.dotnet.threads.count | 19 ± (18 - 19) | 19 ± (18 - 19) | -0.1% | |
| .NET 8 - Bailout | | | | |
| process.internal_duration_ms | 198.91 ± (198.48 - 199.35) ms | 197.98 ± (197.56 - 198.39) ms | -0.5% | |
| process.time_to_main_ms | 74.95 ± (74.76 - 75.15) ms | 74.61 ± (74.40 - 74.82) ms | -0.5% | |
| runtime.dotnet.exceptions.count | 4 ± (4 - 4) | 4 ± (4 - 4) | +0.0% | |
| runtime.dotnet.mem.committed | 11.77 ± (11.75 - 11.80) MB | 11.75 ± (11.73 - 11.77) MB | -0.2% | |
| runtime.dotnet.threads.count | 20 ± (19 - 20) | 19 ± (19 - 19) | -1.3% | |
| .NET 8 - CallTarget+Inlining+NGEN | | | | |
| process.internal_duration_ms | 512.08 ± (509.20 - 514.96) ms | 513.60 ± (510.45 - 516.75) ms | +0.3% | ✅⬆️ |
| process.time_to_main_ms | 506.86 ± (506.09 - 507.62) ms | 504.10 ± (503.21 - 504.99) ms | -0.5% | |
| runtime.dotnet.exceptions.count | 4 ± (4 - 4) | 4 ± (4 - 4) | +0.0% | |
| runtime.dotnet.mem.committed | 51.17 ± (51.13 - 51.21) MB | 51.16 ± (51.12 - 51.20) MB | -0.0% | |
| runtime.dotnet.threads.count | 30 ± (30 - 30) | 30 ± (30 - 30) | -0.1% | |

Comparison explanation

Execution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:

  • Welch's t-test at a 5% significance level
  • Only results indicating a difference greater than both 5% and 5 ms are considered.
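
The two thresholds above can be combined as in the following sketch. It computes Welch's t statistic from the standard formula, but approximates the p-value with a normal distribution to stay within the standard library, which is an assumption that only holds reasonably for larger samples. The function names and the sample data are hypothetical; the bot's actual implementation is not shown in this report.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Assumes both samples have at least two values and non-zero variance.
    """
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(b) - mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def is_regression(master_ms, pr_ms, alpha=0.05, min_rel=0.05, min_abs_ms=5.0):
    """Flag the PR as slower only if all three conditions hold."""
    t, _df = welch_t(master_ms, pr_ms)
    # Normal approximation to the two-sided p-value (assumption, see lead-in).
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    diff = mean(pr_ms) - mean(master_ms)
    return (
        p_value < alpha
        and diff > min_abs_ms
        and diff / mean(master_ms) > min_rel
    )


if __name__ == "__main__":
    master = [100 + i % 3 for i in range(30)]  # made-up timings in ms
    print(is_regression(master, [110 + i % 3 for i in range(30)]))
    print(is_regression(master, [101 + i % 3 for i in range(30)]))
```

In the second call the difference is statistically detectable but only about 1 ms, so it falls under both practical thresholds and is not reported.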

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

Duration charts
FakeDbCommand (.NET Framework 4.8)
gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8845) - mean (70ms)  : 68, 72
    master - mean (72ms)  : 69, 75

    section Bailout
    This PR (8845) - mean (74ms)  : 73, 75
    master - mean (74ms)  : 72, 77

    section CallTarget+Inlining+NGEN
    This PR (8845) - mean (1,086ms)  : 1036, 1135
    master - mean (1,082ms)  : 1039, 1124

FakeDbCommand (.NET Core 3.1)
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8845) - mean (109ms)  : 106, 112
    master - mean (113ms)  : 107, 119

    section Bailout
    This PR (8845) - mean (110ms)  : 108, 112
    master - mean (110ms)  : 108, 113

    section CallTarget+Inlining+NGEN
    This PR (8845) - mean (778ms)  : 759, 798
    master - mean (781ms)  : 757, 805

FakeDbCommand (.NET 6)
gantt
    title Execution time (ms) FakeDbCommand (.NET 6)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8845) - mean (97ms)  : 93, 101
    master - mean (96ms)  : 93, 99

    section Bailout
    This PR (8845) - mean (98ms)  : 96, 99
    master - mean (98ms)  : 95, 100

    section CallTarget+Inlining+NGEN
    This PR (8845) - mean (945ms)  : 903, 988
    master - mean (940ms)  : 898, 983

FakeDbCommand (.NET 8)
gantt
    title Execution time (ms) FakeDbCommand (.NET 8)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8845) - mean (98ms)  : 91, 104
    master - mean (99ms)  : 92, 105

    section Bailout
    This PR (8845) - mean (99ms)  : 93, 104
    master - mean (96ms)  : 93, 100

    section CallTarget+Inlining+NGEN
    This PR (8845) - mean (812ms)  : 778, 846
    master - mean (814ms)  : 781, 847

HttpMessageHandler (.NET Framework 4.8)
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8845) - mean (202ms)  : 197, 206
    master - mean (203ms)  : 198, 208

    section Bailout
    This PR (8845) - mean (206ms)  : 203, 210
    master - mean (206ms)  : 202, 210

    section CallTarget+Inlining+NGEN
    This PR (8845) - mean (1,217ms)  : 1151, 1283
    master - mean (1,210ms)  : 1173, 1246

HttpMessageHandler (.NET Core 3.1)
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8845) - mean (293ms)  : 286, 300
    master - mean (291ms)  : 283, 298

    section Bailout
    This PR (8845) - mean (295ms)  : 289, 301
    master - mean (294ms)  : 289, 298

    section CallTarget+Inlining+NGEN
    This PR (8845) - mean (973ms)  : 955, 992
    master - mean (975ms)  : 952, 998

HttpMessageHandler (.NET 6)
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8845) - mean (285ms)  : 278, 292
    master - mean (286ms)  : 278, 294

    section Bailout
    This PR (8845) - mean (285ms)  : 280, 290
    master - mean (286ms)  : 282, 290

    section CallTarget+Inlining+NGEN
    This PR (8845) - mean (1,173ms)  : 1140, 1207
    master - mean (1,171ms)  : 1132, 1210

HttpMessageHandler (.NET 8)
gantt
    title Execution time (ms) HttpMessageHandler (.NET 8)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8845) - mean (282ms)  : 276, 289
    master - mean (284ms)  : 277, 291

    section Bailout
    This PR (8845) - mean (283ms)  : 278, 287
    master - mean (284ms)  : 278, 290

    section CallTarget+Inlining+NGEN
    This PR (8845) - mean (1,050ms)  : 1004, 1095
    master - mean (1,051ms)  : 1005, 1097


pr-commenter Bot commented Jun 29, 2026

LLM Validation

LLM Validation Gate — dotnet-tracer-agent

✅ PASS

  • No blocking-case regressions; the quality change is within noise (baseline/candidate confidence intervals overlap).
  • Advisory: the candidate introduced a bad signal on 1 case(s) that the baseline did not — surfaced for review (see Cases); it does not block on its own.

Analysis

Changed instruction file(s): AGENTS.md.

1 case(s) dipped but stayed within the pass bar.

  • dotnet-tracer-confirm-instrumentation-011: won 47% of blind comparisons, quality -1.5. ⚠️ Introduced a new bad signal (advisory).
    • Asked: How can I confirm, from this repo and its runtime logs, whether a given library method is auto-instrumented — both where the instrumentation is declared in
    • Tripped 1 bad signal(s):
      • Invents a configuration file that lists instrumented methods
    • Baseline answer: Here's how to confirm whether a given library method is auto-instrumented in dd-trace-dotnet — the source-side declaration, and the runtime log lines that prove it fired. ## Part 1 — Where instrumentation is declared in source Auto-instrumentation is declared with attributes that a Roslyn source generator compiles
    • Candidate answer: Here's how to answer this end-to-end, grounded in this repo. ## 1. Where instrumentation is declared in source Auto-instrumentation for a given method is declared with an [InstrumentMethod] attribute on an integration class, under tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/<Area>/. The attribute

Results

  • Pairwise win-rate: 50% [46%–54%] — candidate's share of blind comparisons (90% CI; spanning 50% = no clear difference)
  • Overall quality: 88.3 → 88.3 (/100, 0.0)
  • Bad signals introduced (advisory): 1
  • Blocking-case regressions: 0
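
To make the win-rate line concrete, the sketch below computes a win-rate with a 90% confidence interval and applies the "spanning 50% = no clear difference" reading. The gate's real interval method and comparison counts are not given in this comment; the Wilson score interval, the function names, and the example counts are assumptions.

```python
import math
from statistics import NormalDist


def win_rate_ci(wins, comparisons, confidence=0.90):
    """Return (win_rate, lower, upper) using a Wilson score interval."""
    if comparisons <= 0:
        raise ValueError("comparisons must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p = wins / comparisons
    denom = 1.0 + z * z / comparisons
    center = (p + z * z / (2.0 * comparisons)) / denom
    half = z * math.sqrt(
        p * (1.0 - p) / comparisons + z * z / (4.0 * comparisons ** 2)
    ) / denom
    return p, center - half, center + half


def reading(lower, upper):
    """Interpret the interval relative to the 50% 'no preference' line."""
    if lower > 0.5:
        return "candidate better"
    if upper < 0.5:
        return "candidate worse"
    return "no clear difference"
```

For example, `reading(*win_rate_ci(50, 100)[1:])` gives "no clear difference", matching how the 50% [46%–54%] result above is described.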

Cases

| Case | Mode | Quality Δ | Win-rate (90% CI) | Safety |
| --- | --- | --- | --- | --- |
| dotnet-tracer-repo-nav-integration-001 | block | -0.1 | 47% [42%–52%] | ok |
| dotnet-tracer-logging-terminology-004 | block | -1.0 | 50% [42%–58%] | ok |
| dotnet-tracer-nuget-scope-hallucination-008 | block | -1.1 | 50% [50%–50%] | ok |
| dotnet-tracer-confirm-instrumentation-011 | block | -1.5 | 47% [33%–61%] | ⚠️ bad signal |
| dotnet-tracer-control-context-propagation-016 | warn | +3.8 | 56% [49%–63%] | ok |

Per-dimension scores, token usage, latency, and estimated cost are in the CI job logs.

NachoEchevarria changed the title from "Basic infra" to "Add the LLM Validation gate for AGENTS.md + the libdatadog-bump skill" on Jul 2, 2026
NachoEchevarria marked this pull request as ready for review on July 3, 2026 10:52
NachoEchevarria requested a review from a team as a code owner on July 3, 2026 10:52

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d3520a32a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

through everything I have to update in this repo and how to verify it.
expected_criteria:
- Updates BOTH version pins — build/cmake/FindLibdatadog.cmake (Linux/macOS) and the Windows vcpkg port (vcpkg.json + portfile.cmake).
- FindLibdatadog.cmake uses SHA-256 hashes and a v-prefixed version (e.g. v32.0.0); the vcpkg version-string has NO v prefix.

P2: Use the libdatadog-dotnet version in this benchmark

When .claude/skills/bump-libdatadog/SKILL.md changes, this is the blocking skill case, but the example here uses upstream libdatadog v32.0.0; the existing skill explicitly says the repo pins the separate libdatadog-dotnet release version, and the current CMake pin is v2.0.0. This can make the judge reward an answer that bumps to the wrong GitHub release tag (or penalize the correct upstream-vs-dotnet distinction), so the criterion should use a libdatadog-dotnet example and call out that upstream versions are not the pinned value.
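
The criterion quoted above asks for a v-prefixed version in FindLibdatadog.cmake and an unprefixed vcpkg version-string. A minimal sketch of that consistency check follows. `pins_consistent` is a hypothetical helper that takes the two strings directly and does not parse either file.

```python
import re

# Plain MAJOR.MINOR.PATCH with no prefix, as expected for the vcpkg version-string.
PLAIN_VERSION = re.compile(r"\d+\.\d+\.\d+")


def pins_consistent(cmake_version, vcpkg_version_string):
    """True when the CMake pin is 'vX.Y.Z' and the vcpkg pin is the same 'X.Y.Z'."""
    if not cmake_version.startswith("v"):
        return False  # the CMake pin must carry the v prefix
    if not PLAIN_VERSION.fullmatch(vcpkg_version_string):
        return False  # the vcpkg pin must NOT carry a prefix
    return cmake_version[1:] == vcpkg_version_string
```

With the CMake pin mentioned in this review, `pins_consistent("v2.0.0", "2.0.0")` holds, while `pins_consistent("v2.0.0", "v2.0.0")` does not.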

Collaborator Author

Updated

# List more than one to validate the combined instruction set (e.g. AGENTS.md + a Claude skill file).
instruction_files:
- AGENTS.md
- .claude/skills/bump-libdatadog/SKILL.md # also gate the libdatadog-bump skill

P2: Gate the skill's helper script too

When only .claude/skills/bump-libdatadog/scripts/fetch-release-hashes.sh changes, the comments here say the CI trigger derives its watch paths from instruction_files, but the only watched skill path is SKILL.md. A broken checksum-fetching script would therefore bypass dotnet-tracer-bump-libdatadog-013 even though the skill tells agents to run it, so include the whole skill directory or the script path in the watched inputs.
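
The gap described here can be shown with a small sketch of path-based triggering. This is not the platform's trigger logic: `is_watched` and `should_run_gate` are hypothetical names, and treating a watched entry as either an exact file or a directory prefix is an assumption.

```python
from pathlib import PurePosixPath


def is_watched(changed, watched):
    """True if the changed path equals a watched entry or lies under a watched directory."""
    changed_path = PurePosixPath(changed)
    for entry in watched:
        entry_path = PurePosixPath(entry)
        if changed_path == entry_path or entry_path in changed_path.parents:
            return True
    return False


def should_run_gate(changed_files, watched):
    """Run the gate when at least one changed file is covered by the watch list."""
    return any(is_watched(path, watched) for path in changed_files)
```

Watching only `SKILL.md` lets a change to `scripts/fetch-release-hashes.sh` skip the gate, whereas watching the `.claude/skills/bump-libdatadog` directory catches it.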

Collaborator Author

Added

Comment thread .gitlab-ci.yml
# footprint in this repo is the local .llm-validation/ suite. Pinned to the platform repo's `main`, which
# carries ci/ + the CLI on the gitlab.ddbuild.io mirror. See that repo's ci/README.md for configuration.
- project: 'DataDog/llm-validation-platform'
ref: main

P2: Pin the external CI template to an immutable ref

For every pipeline after this merge, this include follows the platform repo's moving main branch, so a later template change can alter or break dd-trace-dotnet's merge gate without any change in this repo. Since this job can block PRs, use a tag or commit SHA and bump it deliberately instead of relying on a mutable branch.
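
A lint for this could look roughly like the sketch below, which flags includes whose `ref` is neither a full commit SHA nor a version-style tag. The helper names and the dict shape (mirroring the `project` and `ref` keys in the snippet above) are assumptions. Tags can still be moved, so only a commit SHA is strictly immutable.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")         # full Git commit SHA
VERSION_TAG = re.compile(r"v?\d+\.\d+\.\d+")   # e.g. v1.2.3 or 1.2.3


def is_pinned(ref):
    """True when the ref looks like a commit SHA or a version tag, not a branch name."""
    return bool(FULL_SHA.fullmatch(ref) or VERSION_TAG.fullmatch(ref))


def unpinned_includes(includes):
    """Return the project names of include entries that follow a moving ref."""
    return [inc["project"] for inc in includes if not is_pinned(inc.get("ref", ""))]
```

Applied to the include under review, `unpinned_includes([{"project": "DataDog/llm-validation-platform", "ref": "main"}])` reports that project.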

Labels

area:builds project files, build scripts, pipelines, versioning, releases, packages
