Skip to content

Add CI and SLAs#22

Merged
kiryldz merged 17 commits into
mainfrom
kdz-macrobenchmark-ci
May 28, 2026
Merged

Add CI and SLAs#22
kiryldz merged 17 commits into
mainfrom
kdz-macrobenchmark-ci

Conversation

@kiryldz

@kiryldz kiryldz commented May 27, 2026

Copy link
Copy Markdown
Owner

No description provided.

@kiryldz kiryldz self-assigned this May 27, 2026
kiryldz and others added 5 commits May 27, 2026 12:43
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…enchmark

beforeVariants with buildType check may silently disable the com.android.test
variant. The find step reveals actual APK paths on the next run.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
AGP 8.5.1 assembleRelease leaves APKs in build/intermediates/apk/release/
rather than build/outputs/apk/release/. Stage them with find+cp so the
upload path is always a single known file and gcloud --app/--test refs hold.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Multiline backslash continuation caused gcloud to treat each env var as
a separate CLI argument. Single-quoted string fixes the parsing. pipefail
ensures gcloud failures propagate through the tee pipe.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The macrobenchmark module was a JUnit shell whose only function was to let
FTL invoke perfetto; its TraceSectionMetric output (avg/min/max only) was
discarded anyway since scripts/aggregate-traces.py re-parses the raw traces
for p50/p90/p99. Replace it with a ~60-line FrameLatencyCapture
instrumentation test in :app/androidTest that drives CameraActivity and
shells out perfetto via UiAutomation, mirroring scripts/measure-frame-latency.sh.

One fewer Gradle module, no AndroidX Macrobenchmark dependency, identical
.pftrace output feeding the existing aggregate-and-gate pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kiryldz kiryldz force-pushed the kdz-macrobenchmark-ci branch from 8470bd8 to 21209fa Compare May 27, 2026 19:18
kiryldz and others added 5 commits May 27, 2026 23:17
perfetto's short-form CLI (-t, -a, positional categories) requires API 31+.
FTL only stocks redfin (Pixel 5 / Adreno 620) on Android 11 (API 30), so
swap the Adreno job to a52sxq (Galaxy A52s, Adreno 642L) on Android 14 —
the next-closest mid-range Snapdragon device available. Bump the Mali job
from oriole-32 → oriole-33 (same Pixel 6 hardware, Android 13) for the
same reason.

While diagnosing the previous CI run on FTL, the Pixel 6 hit a real NPE in
CoreEngine::nativeSendCameraFrame: AHardwareBuffer_lock returned non-zero
for the camera-side buffer, leaving cpuData null, and the subsequent
memcpy SIGSEGV'd. Check both lock return codes + pointer non-null before
copying, and drop the frame on failure instead of crashing.

Also add an `ls` assertion after each perfetto capture in
FrameLatencyCapture — UiAutomation.executeShellCommand swallows exit codes
and stderr, so a misbehaving perfetto used to silently pass the test with
zero traces produced. The assertion gives us a clear failure with the
output-dir listing in the message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The FTL run finally produced .pftrace files but the workflow couldn't find
them: --directories-to-pull preserves the full on-device path under the
GCS artifacts/ prefix, so /sdcard/Android/media/<pkg>/additional_test_output
lands at .../artifacts/sdcard/Android/media/<pkg>/additional_test_output,
not .../artifacts/additional_test_output. Update the gsutil cp URL and
drop the rsync fallback (it was masking this exact bug).

Then on Mali (Pixel 6), every frame was dropped: AHardwareBuffer_lock
returned 0 (success) but with a NULL pointer for the GPU buffer side. The
buffer was allocated with only GPU_SAMPLED_IMAGE | GPU_FRAMEBUFFER —
strict drivers refuse to CPU-map a buffer not allocated CPU-writable and
signal that by returning success+null. Add CPU_WRITE_OFTEN to the
allocation. Adreno was lenient and worked without it; Mali is strict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gsutil cp -r refuses to copy multiple files into a non-existent destination
("Destination URL must name a directory, bucket, or bucket subdirectory")
even when the dest ends with /. Pre-create the dir with mkdir -p.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compare-baseline.py guards against silent FTL pool swaps by exiting 3 on
ftl_model_id mismatch; the adreno baseline still claimed redfin from the
pre-device-swap commit and aborted before reaching the placeholder happy
path. Sync the placeholder metadata to a52sxq/34 (adreno) and bump the
mali android_sdk to 33 to match the version bump. Empty stages, so first
real run will still flow through the placeholder branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes building on the now-green Benchmark pipeline:

1. Drop .github/workflows/build.yml — its job is byte-identical to the
   build job in benchmark.yml (which already gates PRs). Main-branch push
   builds go away; re-add as a separate workflow if/when needed.

2. Per-PR comment with the consolidated p50/p90/p99 delta table:
   - compare-baseline.py learns --output-md FILE.
   - Each benchmark-{adreno,mali} job writes comparison-*.md alongside
     results-*.json and uploads it with the existing artifact.
   - New `comment` job runs after both (if: always() so regressions still
     show), downloads both artifacts, and upserts a single PR comment via
     actions/github-script (marker comment to find-and-update). Merge
     gating is unchanged — the benchmark-{adreno,mali} jobs still fail on
     regression, so branch protection blocks merge as before.

3. New .github/workflows/baselines.yml — manual workflow_dispatch:
   - Optional run_id input (default: latest Benchmark run on the branch).
   - Downloads benchmark-results-{adreno,mali}, copies results-*.json over
     baseline-*.json, commits to the same branch.
   - Next Benchmark run sees a populated baseline and turns green, making
     a previously-red "needs baseline refresh" PR mergeable without a
     manual commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown

Frame-latency benchmark

Adreno (Galaxy A52s 5G, Adreno 642L)

Frame-latency benchmark results

✅ All gated metrics within tolerance.

Baseline: benchmark/baselines/baseline-adreno.json | Results: results-adreno.json

OpenGL ES

metric tier baseline observed Δabs Δ% status
dz.frame_e2e.p90 tight 21.087 20.275 -0.812 -3.9% ✅ pass
dz.frame_e2e.p99 loose 23.800 23.009 -0.791 -3.3% ✅ pass
dz.frame_to_screen.p90 tight 17.132 16.792 -0.340 -2.0% ✅ pass
Watch-only metrics (16) — informational, never fail the build
metric tier baseline observed Δabs Δ% status
dz.frame_e2e.avg watch 13.929 13.395 -0.534 -3.8% 👁 watch
dz.frame_native_proc.avg watch 0.856 0.845 -0.011 -1.3% 👁 watch
dz.frame_native_proc.p50 watch 0.835 0.786 -0.049 -5.9% 👁 watch
dz.frame_native_proc.p90 watch 1.322 1.346 +0.024 +1.8% 👁 watch
dz.frame_native_proc.p99 watch 2.037 2.129 +0.092 +4.5% 👁 watch
dz.frame_render.avg watch 0.531 0.503 -0.028 -5.3% 👁 watch
dz.frame_render.p50 watch 0.503 0.459 -0.044 -8.7% 👁 watch
dz.frame_render.p90 watch 0.847 0.828 -0.019 -2.2% 👁 watch
dz.frame_render.p99 watch 1.555 1.292 -0.263 -16.9% 👁 watch
dz.frame_to_native.avg watch 2.238 2.152 -0.086 -3.8% 👁 watch
dz.frame_to_native.max watch 12.714 8.448 -4.266 -33.6% 👁 watch
dz.frame_to_native.p50 watch 2.016 1.940 -0.076 -3.8% 👁 watch
dz.frame_to_native.p90 watch 3.278 2.972 -0.306 -9.3% 👁 watch
dz.frame_to_native.p99 watch 5.339 4.895 -0.444 -8.3% 👁 watch
dz.frame_to_screen.avg watch 10.791 10.357 -0.434 -4.0% 👁 watch
dz.frame_to_screen.p99 watch 19.397 18.886 -0.511 -2.6% 👁 watch

Vulkan

metric tier baseline observed Δabs Δ% status
dz.frame_e2e.p90 tight 21.597 21.630 +0.033 +0.2% ✅ pass
dz.frame_e2e.p99 loose 25.001 24.432 -0.569 -2.3% ✅ pass
dz.frame_to_screen.p90 tight 18.048 17.725 -0.323 -1.8% ✅ pass
Watch-only metrics (16) — informational, never fail the build
metric tier baseline observed Δabs Δ% status
dz.frame_e2e.avg watch 14.857 14.782 -0.075 -0.5% 👁 watch
dz.frame_native_proc.avg watch 1.307 1.296 -0.011 -0.8% 👁 watch
dz.frame_native_proc.p50 watch 1.158 1.096 -0.062 -5.4% 👁 watch
dz.frame_native_proc.p90 watch 2.201 2.242 +0.041 +1.9% 👁 watch
dz.frame_native_proc.p99 watch 2.863 2.879 +0.016 +0.6% 👁 watch
dz.frame_render.avg watch 1.464 1.494 +0.030 +2.0% 👁 watch
dz.frame_render.p50 watch 1.409 1.463 +0.054 +3.8% 👁 watch
dz.frame_render.p90 watch 1.915 1.977 +0.062 +3.2% 👁 watch
dz.frame_render.p99 watch 2.421 2.973 +0.552 +22.8% 👁 watch
dz.frame_to_native.avg watch 2.126 2.124 -0.002 -0.1% 👁 watch
dz.frame_to_native.max watch 6.117 6.945 +0.828 +13.5% 👁 watch
dz.frame_to_native.p50 watch 1.896 1.865 -0.031 -1.6% 👁 watch
dz.frame_to_native.p90 watch 3.156 3.198 +0.042 +1.3% 👁 watch
dz.frame_to_native.p99 watch 5.085 4.794 -0.291 -5.7% 👁 watch
dz.frame_to_screen.avg watch 11.387 11.320 -0.067 -0.6% 👁 watch
dz.frame_to_screen.p99 watch 20.157 19.787 -0.370 -1.8% 👁 watch

Mali (Pixel 6, Mali-G78)

Frame-latency benchmark results

✅ All gated metrics within tolerance.

Baseline: benchmark/baselines/baseline-mali.json | Results: results-mali.json

OpenGL ES

metric tier baseline observed Δabs Δ% status
dz.frame_e2e.p90 tight 18.973 18.654 -0.319 -1.7% ✅ pass
dz.frame_e2e.p99 loose 21.117 23.136 +2.019 +9.6% ✅ pass
dz.frame_to_screen.p90 tight 17.472 17.081 -0.391 -2.2% ✅ pass
Watch-only metrics (16) — informational, never fail the build
metric tier baseline observed Δabs Δ% status
dz.frame_e2e.avg watch 12.685 13.438 +0.753 +5.9% 👁 watch
dz.frame_native_proc.avg watch 0.772 0.807 +0.035 +4.5% 👁 watch
dz.frame_native_proc.p50 watch 0.636 0.717 +0.081 +12.7% 👁 watch
dz.frame_native_proc.p90 watch 1.095 1.112 +0.017 +1.6% 👁 watch
dz.frame_native_proc.p99 watch 2.844 2.554 -0.290 -10.2% 👁 watch
dz.frame_render.avg watch 0.861 0.881 +0.020 +2.3% 👁 watch
dz.frame_render.p50 watch 0.681 0.709 +0.028 +4.1% 👁 watch
dz.frame_render.p90 watch 1.552 1.590 +0.038 +2.4% 👁 watch
dz.frame_render.p99 watch 2.296 3.233 +0.937 +40.8% 👁 watch
dz.frame_to_native.avg watch 0.853 0.853 +0.000 +0.0% 👁 watch
dz.frame_to_native.max watch 2.425 3.140 +0.715 +29.5% 👁 watch
dz.frame_to_native.p50 watch 0.780 0.777 -0.003 -0.4% 👁 watch
dz.frame_to_native.p90 watch 1.275 1.352 +0.077 +6.0% 👁 watch
dz.frame_to_native.p99 watch 1.943 2.505 +0.562 +28.9% 👁 watch
dz.frame_to_screen.avg watch 11.029 11.749 +0.720 +6.5% 👁 watch
dz.frame_to_screen.p99 watch 19.233 19.859 +0.626 +3.3% 👁 watch

Vulkan

metric tier baseline observed Δabs Δ% status
dz.frame_e2e.p90 tight 20.135 19.880 -0.255 -1.3% ✅ pass
dz.frame_e2e.p99 loose 22.355 24.222 +1.867 +8.4% ✅ pass
dz.frame_to_screen.p90 tight 18.290 17.907 -0.383 -2.1% ✅ pass
Watch-only metrics (16) — informational, never fail the build
metric tier baseline observed Δabs Δ% status
dz.frame_e2e.avg watch 13.786 14.514 +0.728 +5.3% 👁 watch
dz.frame_native_proc.avg watch 1.064 1.152 +0.088 +8.3% 👁 watch
dz.frame_native_proc.p50 watch 0.900 1.087 +0.187 +20.8% 👁 watch
dz.frame_native_proc.p90 watch 1.463 1.662 +0.199 +13.6% 👁 watch
dz.frame_native_proc.p99 watch 2.462 2.804 +0.342 +13.9% 👁 watch
dz.frame_render.avg watch 1.861 1.846 -0.015 -0.8% 👁 watch
dz.frame_render.p50 watch 1.796 1.753 -0.043 -2.4% 👁 watch
dz.frame_render.p90 watch 2.580 2.661 +0.081 +3.1% 👁 watch
dz.frame_render.p99 watch 4.060 4.290 +0.230 +5.7% 👁 watch
dz.frame_to_native.avg watch 0.886 0.861 -0.025 -2.8% 👁 watch
dz.frame_to_native.max watch 4.330 2.374 -1.956 -45.2% 👁 watch
dz.frame_to_native.p50 watch 0.811 0.829 +0.018 +2.2% 👁 watch
dz.frame_to_native.p90 watch 1.309 1.285 -0.024 -1.8% 👁 watch
dz.frame_to_native.p99 watch 2.179 1.845 -0.334 -15.3% 👁 watch
dz.frame_to_screen.avg watch 11.790 12.490 +0.700 +5.9% 👁 watch
dz.frame_to_screen.p99 watch 19.701 20.486 +0.785 +4.0% 👁 watch

To re-seed baselines from this run, manually trigger the Baselines workflow under Actions → Baselines and pick this branch as the ref. (Only visible after the workflow file lands on the default branch — GitHub limitation for workflow_dispatch.)

kiryldz and others added 6 commits May 28, 2026 13:13
Manually seeded what .github/workflows/baselines.yml will do once it's
landed on the default branch (workflow_dispatch isn't available before
that). Real stages now populate baseline-{adreno,mali}.json so the next
Benchmark run produces actual delta percentages in the PR comment
instead of "missing" placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual equivalent of .github/workflows/baselines.yml (which isn't
dispatchable until merged to main). Previous mali run tripped the tight
±5% gate on frame_to_screen.vk.p90 at +5.3% — natural FTL run-to-run
variance, no code regression. Reseed baselines from the latest run so the
next Benchmark cycle compares against fresh data and the PR can go green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes to the comparison output and gating logic.

Gate calibration — sub-ms metrics like frame_native_proc.avg (~0.7 ms
baseline) trivially trip the percent gate on noise: a 0.1 ms jitter
becomes +14% even though it's far below any frame-budget significance.
Add an absolute floor per tier; a metric passes when EITHER |Δ%| ≤
tolerance_pct OR |Δabs| ≤ abs_floor. Real regressions exceed both
thresholds, pure relative noise on tiny absolutes is filtered.

  tight: ±5%  AND ±0.5 ms
  loose: ±10% AND ±0.5 ms (or ±5 frames for dropped_frames counters)

PR-comment layout — group metrics by renderer (OpenGL ES vs Vulkan)
in two sub-tables with the renderer prefix stripped from the row keys,
so the same stage in gl/vk lines up visually for side-by-side reading.
New Δabs column next to Δ% makes the absolute jitter obvious at a glance
(handy when a flagged metric turns out to be sub-ms noise).

Also clarifies the PR-comment text about how to dispatch the Baselines
workflow now that the file is finally landing on the default branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After enabling the dual gate, frame_e2e metrics kept tripping with ~1 ms
shifts between identical commits on FTL Pixel 6 / Galaxy A52s. Local
SM-F936B drifts ~0.25 ms — the source of CLAUDE.md's original calibration
— but FTL devices show observably higher between-run jitter, so the
0.5 ms floor was too tight there. 1.5 ms absorbs the empirical FTL noise
without giving up regression detection on 13–25 ms baselines (any
>1.5 ms slowdown still fails both the percent and absolute bounds).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
p99 is the worst 1% of frames per iteration, structurally outlier-
sensitive, and after five CI cycles it's empirically the noisiest tight-
tier metric on FTL (Pixel 6 hit +8.9% / +1.878 ms even between identical
commits, exceeding both tight bounds). avg + p90 — for which CLAUDE.md
documents sub-3% CV — stay tight; p99 moves to loose (±10% / ±0.5 ms)
where its natural variance fits. Real >10% p99 regressions still fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avg metrics aren't representative — one slow frame in 500+ skews the mean
even when steady-state behavior is fine. p90 captures the steady state,
p99 the tail, but avg is noisy in a way that mixes both. Move every
*.avg metric to the watch tier (still emitted in the PR comment, just
never fails the build).

The PR comment now shows only gated metrics by default per renderer
(p90 tight, p99 loose, to_screen.p90 tight) and collapses watch metrics
under a <details><summary>Watch-only metrics (N) — informational, never
fail the build</summary> block. Reduces ~38 rows per renderer to ~3 in
the default view; reviewers expand if they want to inspect the tail
distributions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kiryldz kiryldz merged commit a3fa5ef into main May 28, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant