Add CI and SLAs by kiryldz · Pull Request #22 · kiryldz/android-hardware-buffer-camera

kiryldz · 2026-05-27T09:39:35Z

No description provided.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…enchmark beforeVariants with buildType check may silently disable the com.android.test variant. The find step reveals actual APK paths on the next run. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

AGP 8.5.1 assembleRelease leaves APKs in build/intermediates/apk/release/ rather than build/outputs/apk/release/. Stage them with find+cp so the upload path is always a single known file and gcloud --app/--test refs hold. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Multiline backslash continuation caused gcloud to treat each env var as a separate CLI argument. Single-quoted string fixes the parsing. pipefail ensures gcloud failures propagate through the tee pipe. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The macrobenchmark module was a JUnit shell whose only function was to let FTL invoke perfetto; its TraceSectionMetric output (avg/min/max only) was discarded anyway since scripts/aggregate-traces.py re-parses the raw traces for p50/p90/p99. Replace it with a ~60-line FrameLatencyCapture instrumentation test in :app/androidTest that drives CameraActivity and shells out perfetto via UiAutomation, mirroring scripts/measure-frame-latency.sh. One fewer Gradle module, no AndroidX Macrobenchmark dependency, identical .pftrace output feeding the existing aggregate-and-gate pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
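As a rough illustration of the p50/p90/p99 step mentioned above, the following sketch summarizes a list of per-frame durations. It is not the contents of scripts/aggregate-traces.py: the function names are hypothetical, and the nearest-rank percentile definition is an assumption, since the commit message does not say which percentile method the script uses.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile (an assumed definition, not confirmed by the source)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Smallest rank such that at least pct% of the samples are at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations_ms):
    """Collapse per-frame durations (milliseconds) into summary statistics."""
    return {
        "avg": sum(durations_ms) / len(durations_ms),
        "p50": percentile(durations_ms, 50),
        "p90": percentile(durations_ms, 90),
        "p99": percentile(durations_ms, 99),
        "max": max(durations_ms),
    }
```

Calling `summarize` once per trace section would give one row of statistics per stage, which is the shape the later comparison tables appear to use.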

perfetto's short-form CLI (-t, -a, positional categories) requires API 31+. FTL only stocks redfin (Pixel 5 / Adreno 620) on Android 11 (API 30), so swap the Adreno job to a52sxq (Galaxy A52s, Adreno 642L) on Android 14, the next-closest mid-range Snapdragon device available. Bump the Mali job from oriole-32 → oriole-33 (same Pixel 6 hardware, Android 13) for the same reason.

While diagnosing the previous CI run on FTL, the Pixel 6 hit a real NPE in CoreEngine::nativeSendCameraFrame: AHardwareBuffer_lock returned non-zero for the camera-side buffer, leaving cpuData null, and the subsequent memcpy SIGSEGV'd. Check both lock return codes + pointer non-null before copying, and drop the frame on failure instead of crashing.

Also add an `ls` assertion after each perfetto capture in FrameLatencyCapture: UiAutomation.executeShellCommand swallows exit codes and stderr, so a misbehaving perfetto used to silently pass the test with zero traces produced. The assertion gives us a clear failure with the output-dir listing in the message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The FTL run finally produced .pftrace files but the workflow couldn't find them: --directories-to-pull preserves the full on-device path under the GCS artifacts/ prefix, so /sdcard/Android/media/<pkg>/additional_test_output lands at .../artifacts/sdcard/Android/media/<pkg>/additional_test_output, not .../artifacts/additional_test_output. Update the gsutil cp URL and drop the rsync fallback (it was masking this exact bug).

Then on Mali (Pixel 6), every frame was dropped: AHardwareBuffer_lock returned 0 (success) but with a NULL pointer for the GPU buffer side. The buffer was allocated with only GPU_SAMPLED_IMAGE | GPU_FRAMEBUFFER; strict drivers refuse to CPU-map a buffer not allocated CPU-writable and signal that by returning success+null. Add CPU_WRITE_OFTEN to the allocation. Adreno was lenient and worked without it; Mali is strict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
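The path mapping described in this commit can be sketched as a small helper. This is a hedged illustration only: `pulled_dir_url` is a hypothetical name, and the bucket prefix and package name in the usage line are placeholders, because the real values are elided in the commit message.

```python
import posixpath


def pulled_dir_url(results_prefix: str, device_dir: str) -> str:
    """Build the GCS location of a directory pulled with --directories-to-pull.

    Per the commit message, the full on-device path is preserved under the
    artifacts/ prefix rather than being flattened to its last component.
    """
    return posixpath.join(
        results_prefix.rstrip("/"),  # hypothetical per-device results prefix
        "artifacts",
        device_dir.lstrip("/"),      # keep every on-device path segment
    )


# Placeholder bucket and package name, for illustration only.
url = pulled_dir_url(
    "gs://example-bucket/run-1/device-x",
    "/sdcard/Android/media/com.example.app/additional_test_output",
)
```

Here `url` ends in `artifacts/sdcard/Android/media/com.example.app/additional_test_output`, which mirrors the corrected location the commit describes.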

gsutil cp -r refuses to copy multiple files into a non-existent destination ("Destination URL must name a directory, bucket, or bucket subdirectory") even when the dest ends with /. Pre-create the dir with mkdir -p. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

compare-baseline.py guards against silent FTL pool swaps by exiting 3 on ftl_model_id mismatch; the adreno baseline still claimed redfin from the pre-device-swap commit and aborted before reaching the placeholder happy path. Sync the placeholder metadata to a52sxq/34 (adreno) and bump the mali android_sdk to 33 to match the version bump. Empty stages, so first real run will still flow through the placeholder branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
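A minimal sketch of the device guard described above, assuming the baseline and results files are JSON objects with a top-level `ftl_model_id` field. The function name and the dictionary layout are assumptions; only the field name and exit code 3 come from the commit message.

```python
EXIT_OK = 0
EXIT_DEVICE_MISMATCH = 3  # exit code named in the commit message


def device_guard(baseline: dict, results: dict) -> int:
    """Return the exit code for the FTL device check.

    A mismatch means the baseline was recorded on a different device model
    than the current run, so the latency numbers are not comparable.
    """
    if baseline.get("ftl_model_id") != results.get("ftl_model_id"):
        return EXIT_DEVICE_MISMATCH
    return EXIT_OK
```

With this shape, a baseline that still claims `redfin` compared against an `a52sxq` run returns 3 before any metric is evaluated, which matches the failure the commit fixes by syncing the placeholder metadata.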

Three changes building on the now-green Benchmark pipeline:

1. Drop .github/workflows/build.yml: its job is byte-identical to the build job in benchmark.yml (which already gates PRs). Main-branch push builds go away; re-add as a separate workflow if/when needed.

2. Per-PR comment with the consolidated p50/p90/p99 delta table:
   - compare-baseline.py learns --output-md FILE.
   - Each benchmark-{adreno,mali} job writes comparison-*.md alongside results-*.json and uploads it with the existing artifact.
   - New `comment` job runs after both (if: always() so regressions still show), downloads both artifacts, and upserts a single PR comment via actions/github-script (marker comment to find-and-update).

   Merge gating is unchanged: the benchmark-{adreno,mali} jobs still fail on regression, so branch protection blocks merge as before.

3. New .github/workflows/baselines.yml, a manual workflow_dispatch:
   - Optional run_id input (default: latest Benchmark run on the branch).
   - Downloads benchmark-results-{adreno,mali}, copies results-*.json over baseline-*.json, commits to the same branch.
   - Next Benchmark run sees a populated baseline and turns green, making a previously-red "needs baseline refresh" PR mergeable without a manual commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-28T09:36:16Z

Frame-latency benchmark

Adreno (Galaxy A52s 5G, Adreno 642L)

Frame-latency benchmark results

✅ All gated metrics within tolerance.

Baseline: benchmark/baselines/baseline-adreno.json | Results: results-adreno.json

OpenGL ES

metric	tier	baseline	observed	Δabs	Δ%	status
`dz.frame_e2e.p90`	tight	21.087	20.275	-0.812	-3.9%	✅ pass
`dz.frame_e2e.p99`	loose	23.800	23.009	-0.791	-3.3%	✅ pass
`dz.frame_to_screen.p90`	tight	17.132	16.792	-0.340	-2.0%	✅ pass

Watch-only metrics (16) — informational, never fail the build

metric	tier	baseline	observed	Δabs	Δ%	status
`dz.frame_e2e.avg`	watch	13.929	13.395	-0.534	-3.8%	👁 watch
`dz.frame_native_proc.avg`	watch	0.856	0.845	-0.011	-1.3%	👁 watch
`dz.frame_native_proc.p50`	watch	0.835	0.786	-0.049	-5.9%	👁 watch
`dz.frame_native_proc.p90`	watch	1.322	1.346	+0.024	+1.8%	👁 watch
`dz.frame_native_proc.p99`	watch	2.037	2.129	+0.092	+4.5%	👁 watch
`dz.frame_render.avg`	watch	0.531	0.503	-0.028	-5.3%	👁 watch
`dz.frame_render.p50`	watch	0.503	0.459	-0.044	-8.7%	👁 watch
`dz.frame_render.p90`	watch	0.847	0.828	-0.019	-2.2%	👁 watch
`dz.frame_render.p99`	watch	1.555	1.292	-0.263	-16.9%	👁 watch
`dz.frame_to_native.avg`	watch	2.238	2.152	-0.086	-3.8%	👁 watch
`dz.frame_to_native.max`	watch	12.714	8.448	-4.266	-33.6%	👁 watch
`dz.frame_to_native.p50`	watch	2.016	1.940	-0.076	-3.8%	👁 watch
`dz.frame_to_native.p90`	watch	3.278	2.972	-0.306	-9.3%	👁 watch
`dz.frame_to_native.p99`	watch	5.339	4.895	-0.444	-8.3%	👁 watch
`dz.frame_to_screen.avg`	watch	10.791	10.357	-0.434	-4.0%	👁 watch
`dz.frame_to_screen.p99`	watch	19.397	18.886	-0.511	-2.6%	👁 watch

Vulkan

metric	tier	baseline	observed	Δabs	Δ%	status
`dz.frame_e2e.p90`	tight	21.597	21.630	+0.033	+0.2%	✅ pass
`dz.frame_e2e.p99`	loose	25.001	24.432	-0.569	-2.3%	✅ pass
`dz.frame_to_screen.p90`	tight	18.048	17.725	-0.323	-1.8%	✅ pass

Watch-only metrics (16) — informational, never fail the build

metric	tier	baseline	observed	Δabs	Δ%	status
`dz.frame_e2e.avg`	watch	14.857	14.782	-0.075	-0.5%	👁 watch
`dz.frame_native_proc.avg`	watch	1.307	1.296	-0.011	-0.8%	👁 watch
`dz.frame_native_proc.p50`	watch	1.158	1.096	-0.062	-5.4%	👁 watch
`dz.frame_native_proc.p90`	watch	2.201	2.242	+0.041	+1.9%	👁 watch
`dz.frame_native_proc.p99`	watch	2.863	2.879	+0.016	+0.6%	👁 watch
`dz.frame_render.avg`	watch	1.464	1.494	+0.030	+2.0%	👁 watch
`dz.frame_render.p50`	watch	1.409	1.463	+0.054	+3.8%	👁 watch
`dz.frame_render.p90`	watch	1.915	1.977	+0.062	+3.2%	👁 watch
`dz.frame_render.p99`	watch	2.421	2.973	+0.552	+22.8%	👁 watch
`dz.frame_to_native.avg`	watch	2.126	2.124	-0.002	-0.1%	👁 watch
`dz.frame_to_native.max`	watch	6.117	6.945	+0.828	+13.5%	👁 watch
`dz.frame_to_native.p50`	watch	1.896	1.865	-0.031	-1.6%	👁 watch
`dz.frame_to_native.p90`	watch	3.156	3.198	+0.042	+1.3%	👁 watch
`dz.frame_to_native.p99`	watch	5.085	4.794	-0.291	-5.7%	👁 watch
`dz.frame_to_screen.avg`	watch	11.387	11.320	-0.067	-0.6%	👁 watch
`dz.frame_to_screen.p99`	watch	20.157	19.787	-0.370	-1.8%	👁 watch

Mali (Pixel 6, Mali-G78)

Frame-latency benchmark results

✅ All gated metrics within tolerance.

Baseline: benchmark/baselines/baseline-mali.json | Results: results-mali.json

OpenGL ES

metric	tier	baseline	observed	Δabs	Δ%	status
`dz.frame_e2e.p90`	tight	18.973	18.654	-0.319	-1.7%	✅ pass
`dz.frame_e2e.p99`	loose	21.117	23.136	+2.019	+9.6%	✅ pass
`dz.frame_to_screen.p90`	tight	17.472	17.081	-0.391	-2.2%	✅ pass

Watch-only metrics (16) — informational, never fail the build

metric	tier	baseline	observed	Δabs	Δ%	status
`dz.frame_e2e.avg`	watch	12.685	13.438	+0.753	+5.9%	👁 watch
`dz.frame_native_proc.avg`	watch	0.772	0.807	+0.035	+4.5%	👁 watch
`dz.frame_native_proc.p50`	watch	0.636	0.717	+0.081	+12.7%	👁 watch
`dz.frame_native_proc.p90`	watch	1.095	1.112	+0.017	+1.6%	👁 watch
`dz.frame_native_proc.p99`	watch	2.844	2.554	-0.290	-10.2%	👁 watch
`dz.frame_render.avg`	watch	0.861	0.881	+0.020	+2.3%	👁 watch
`dz.frame_render.p50`	watch	0.681	0.709	+0.028	+4.1%	👁 watch
`dz.frame_render.p90`	watch	1.552	1.590	+0.038	+2.4%	👁 watch
`dz.frame_render.p99`	watch	2.296	3.233	+0.937	+40.8%	👁 watch
`dz.frame_to_native.avg`	watch	0.853	0.853	+0.000	+0.0%	👁 watch
`dz.frame_to_native.max`	watch	2.425	3.140	+0.715	+29.5%	👁 watch
`dz.frame_to_native.p50`	watch	0.780	0.777	-0.003	-0.4%	👁 watch
`dz.frame_to_native.p90`	watch	1.275	1.352	+0.077	+6.0%	👁 watch
`dz.frame_to_native.p99`	watch	1.943	2.505	+0.562	+28.9%	👁 watch
`dz.frame_to_screen.avg`	watch	11.029	11.749	+0.720	+6.5%	👁 watch
`dz.frame_to_screen.p99`	watch	19.233	19.859	+0.626	+3.3%	👁 watch

Vulkan

metric	tier	baseline	observed	Δabs	Δ%	status
`dz.frame_e2e.p90`	tight	20.135	19.880	-0.255	-1.3%	✅ pass
`dz.frame_e2e.p99`	loose	22.355	24.222	+1.867	+8.4%	✅ pass
`dz.frame_to_screen.p90`	tight	18.290	17.907	-0.383	-2.1%	✅ pass

Watch-only metrics (16) — informational, never fail the build

metric	tier	baseline	observed	Δabs	Δ%	status
`dz.frame_e2e.avg`	watch	13.786	14.514	+0.728	+5.3%	👁 watch
`dz.frame_native_proc.avg`	watch	1.064	1.152	+0.088	+8.3%	👁 watch
`dz.frame_native_proc.p50`	watch	0.900	1.087	+0.187	+20.8%	👁 watch
`dz.frame_native_proc.p90`	watch	1.463	1.662	+0.199	+13.6%	👁 watch
`dz.frame_native_proc.p99`	watch	2.462	2.804	+0.342	+13.9%	👁 watch
`dz.frame_render.avg`	watch	1.861	1.846	-0.015	-0.8%	👁 watch
`dz.frame_render.p50`	watch	1.796	1.753	-0.043	-2.4%	👁 watch
`dz.frame_render.p90`	watch	2.580	2.661	+0.081	+3.1%	👁 watch
`dz.frame_render.p99`	watch	4.060	4.290	+0.230	+5.7%	👁 watch
`dz.frame_to_native.avg`	watch	0.886	0.861	-0.025	-2.8%	👁 watch
`dz.frame_to_native.max`	watch	4.330	2.374	-1.956	-45.2%	👁 watch
`dz.frame_to_native.p50`	watch	0.811	0.829	+0.018	+2.2%	👁 watch
`dz.frame_to_native.p90`	watch	1.309	1.285	-0.024	-1.8%	👁 watch
`dz.frame_to_native.p99`	watch	2.179	1.845	-0.334	-15.3%	👁 watch
`dz.frame_to_screen.avg`	watch	11.790	12.490	+0.700	+5.9%	👁 watch
`dz.frame_to_screen.p99`	watch	19.701	20.486	+0.785	+4.0%	👁 watch

To re-seed baselines from this run, manually trigger the Baselines workflow under Actions → Baselines and pick this branch as the ref. (Only visible after the workflow file lands on the default branch — GitHub limitation for workflow_dispatch.)

Manually seeded what .github/workflows/baselines.yml will do once it's landed on the default branch (workflow_dispatch isn't available before that). Real stages now populate baseline-{adreno,mali}.json so the next Benchmark run produces actual delta percentages in the PR comment instead of "missing" placeholders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Manual equivalent of .github/workflows/baselines.yml (which isn't dispatchable until merged to main). Previous mali run tripped the tight ±5% gate on frame_to_screen.vk.p90 at +5.3% — natural FTL run-to-run variance, no code regression. Reseed baselines from the latest run so the next Benchmark cycle compares against fresh data and the PR can go green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two changes to the comparison output and gating logic.

Gate calibration: sub-ms metrics like frame_native_proc.avg (~0.7 ms baseline) trivially trip the percent gate on noise: a 0.1 ms jitter becomes +14% even though it's far below any frame-budget significance. Add an absolute floor per tier; a metric passes when EITHER |Δ%| ≤ tolerance_pct OR |Δabs| ≤ abs_floor. Real regressions exceed both thresholds, pure relative noise on tiny absolutes is filtered.

  tight: ±5% AND ±0.5 ms
  loose: ±10% AND ±0.5 ms (or ±5 frames for dropped_frames counters)

PR-comment layout: group metrics by renderer (OpenGL ES vs Vulkan) in two sub-tables with the renderer prefix stripped from the row keys, so the same stage in gl/vk lines up visually for side-by-side reading. New Δabs column next to Δ% makes the absolute jitter obvious at a glance (handy when a flagged metric turns out to be sub-ms noise).

Also clarifies the PR-comment text about how to dispatch the Baselines workflow now that the file is finally landing on the default branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
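The dual gate from this commit can be sketched as follows. The thresholds are the ones quoted in the commit message (a later commit raises the floor), while the `Tier` class, the `evaluate` function, and the zero-baseline handling are hypothetical choices made for this illustration rather than the code in compare-baseline.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    tolerance_pct: float  # allowed |Δ%|
    abs_floor_ms: float   # allowed |Δabs| in milliseconds


# Values as stated in this commit message.
TIERS = {
    "tight": Tier(tolerance_pct=5.0, abs_floor_ms=0.5),
    "loose": Tier(tolerance_pct=10.0, abs_floor_ms=0.5),
}


def evaluate(baseline: float, observed: float, tier: Tier):
    """Return (passed, delta_abs, delta_pct) for one metric."""
    delta_abs = observed - baseline
    if baseline == 0:
        # Assumed handling: a zero baseline has no meaningful percentage.
        delta_pct = 0.0 if delta_abs == 0 else float("inf")
    else:
        delta_pct = delta_abs / baseline * 100
    # Pass when EITHER bound holds; a failure must exceed both.
    passed = (
        abs(delta_pct) <= tier.tolerance_pct
        or abs(delta_abs) <= tier.abs_floor_ms
    )
    return passed, delta_abs, delta_pct
```

For example, `evaluate(0.7, 0.8, TIERS["tight"])` passes even though the shift is about +14%, because 0.1 ms is under the absolute floor, whereas `evaluate(20.0, 21.5, TIERS["tight"])` fails because +7.5% and 1.5 ms exceed both bounds.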

After enabling the dual gate, frame_e2e metrics kept tripping with ~1 ms shifts between identical commits on FTL Pixel 6 / Galaxy A52s. Local SM-F936B drifts ~0.25 ms — the source of CLAUDE.md's original calibration — but FTL devices show observably higher between-run jitter, so the 0.5 ms floor was too tight there. 1.5 ms absorbs the empirical FTL noise without giving up regression detection on 13–25 ms baselines (any >1.5 ms slowdown still fails both the percent and absolute bounds). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

p99 is the worst 1% of frames per iteration, structurally outlier-sensitive, and after five CI cycles it's empirically the noisiest tight-tier metric on FTL (Pixel 6 hit +8.9% / +1.878 ms even between identical commits, exceeding both tight bounds). avg + p90, for which CLAUDE.md documents sub-3% CV, stay tight; p99 moves to loose (±10% / ±0.5 ms) where its natural variance fits. Real >10% p99 regressions still fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

avg metrics aren't representative — one slow frame in 500+ skews the mean even when steady-state behavior is fine. p90 captures the steady state, p99 the tail, but avg is noisy in a way that mixes both. Move every *.avg metric to the watch tier (still emitted in the PR comment, just never fails the build). The PR comment now shows only gated metrics by default per renderer (p90 tight, p99 loose, to_screen.p90 tight) and collapses watch metrics under a <details><summary>Watch-only metrics (N) — informational, never fail the build</summary> block. Reduces ~38 rows per renderer to ~3 in the default view; reviewers expand if they want to inspect the tail distributions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
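The comment layout described above, gated rows in the default view and watch rows collapsed, can be sketched like this. The row dictionary shape, the function names, the `###` heading level, and the shortened summary label are assumptions for illustration; the actual --output-md renderer may differ.

```python
HEADER = "| metric | tier | baseline | observed | Δabs | Δ% | status |"
DIVIDER = "|---|---|---|---|---|---|---|"


def _table(rows):
    """Render rows as a markdown table with the comment's column order."""
    lines = [HEADER, DIVIDER]
    for r in rows:
        d_abs = r["observed"] - r["baseline"]
        d_pct = d_abs / r["baseline"] * 100
        lines.append(
            f"| `{r['metric']}` | {r['tier']} | {r['baseline']:.3f} "
            f"| {r['observed']:.3f} | {d_abs:+.3f} | {d_pct:+.1f}% "
            f"| {r['status']} |"
        )
    return lines


def render_section(title, rows):
    """One renderer section: gated metrics visible, watch metrics collapsed."""
    gated = [r for r in rows if r["tier"] != "watch"]
    watch = [r for r in rows if r["tier"] == "watch"]
    out = [f"### {title}", ""] + _table(gated)
    if watch:
        # The real summary label carries a longer description; shortened here.
        out += ["", "<details>",
                f"<summary>Watch-only metrics ({len(watch)})</summary>", ""]
        out += _table(watch)
        out += ["", "</details>"]
    return "\n".join(out)
```

Keeping the blank line after `<summary>` matters on GitHub, since markdown tables inside a `<details>` element are only rendered when separated from the HTML tags by an empty line.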

Add CI and SLAs

2d8e7fc

kiryldz self-assigned this May 27, 2026

kiryldz and others added 5 commits May 27, 2026 12:43

benchmark: opt in to ExperimentalMetricApi for TraceSectionMetric

b11af43

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

kiryldz force-pushed the kdz-macrobenchmark-ci branch from 8470bd8 to 21209fa Compare May 27, 2026 19:18

kiryldz and others added 5 commits May 27, 2026 23:17

kiryldz and others added 6 commits May 28, 2026 13:13

kiryldz merged commit a3fa5ef into main May 28, 2026
4 checks passed


Add CI and SLAs #22: kiryldz merged 17 commits into main from kdz-macrobenchmark-ci
