Add CI and SLAs#22
Merged
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…enchmark beforeVariants with buildType check may silently disable the com.android.test variant. The find step reveals actual APK paths on the next run. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
AGP 8.5.1 assembleRelease leaves APKs in build/intermediates/apk/release/ rather than build/outputs/apk/release/. Stage them with find+cp so the upload path is always a single known file and gcloud --app/--test refs hold. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Multiline backslash continuation caused gcloud to treat each env var as a separate CLI argument. Single-quoted string fixes the parsing. pipefail ensures gcloud failures propagate through the tee pipe. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The macrobenchmark module was a JUnit shell whose only function was to let FTL invoke perfetto; its TraceSectionMetric output (avg/min/max only) was discarded anyway since scripts/aggregate-traces.py re-parses the raw traces for p50/p90/p99. Replace it with a ~60-line FrameLatencyCapture instrumentation test in :app/androidTest that drives CameraActivity and shells out perfetto via UiAutomation, mirroring scripts/measure-frame-latency.sh. One fewer Gradle module, no AndroidX Macrobenchmark dependency, identical .pftrace output feeding the existing aggregate-and-gate pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
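The p50/p90/p99 aggregation mentioned above could look roughly like the sketch below. It assumes frame durations have already been extracted from the raw traces as a list of milliseconds. The function names and the nearest-rank percentile method are illustrative assumptions, not taken from scripts/aggregate-traces.py.

```python
def percentile(sorted_ms, pct):
    """Nearest-rank percentile of an ascending list (assumed method)."""
    if not sorted_ms:
        raise ValueError("no frame durations")
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(sorted_ms) + 99) // 100)
    return sorted_ms[rank - 1]


def summarize(frame_ms):
    """Return the p50/p90/p99 of a list of frame durations in ms."""
    ordered = sorted(frame_ms)
    return {f"p{p}": percentile(ordered, p) for p in (50, 90, 99)}
```

Unlike an avg/min/max summary, this keeps the tail of the distribution visible, which is what the gate relies on.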
8470bd8 to 21209fa
perfetto's short-form CLI (-t, -a, positional categories) requires API 31+. FTL only stocks redfin (Pixel 5 / Adreno 620) on Android 11 (API 30), so swap the Adreno job to a52sxq (Galaxy A52s, Adreno 642L) on Android 14, the next-closest mid-range Snapdragon device available. Bump the Mali job from oriole-32 → oriole-33 (same Pixel 6 hardware, Android 13) for the same reason.

While diagnosing the previous CI run on FTL, the Pixel 6 hit a real null-pointer crash in CoreEngine::nativeSendCameraFrame: AHardwareBuffer_lock returned non-zero for the camera-side buffer, leaving cpuData null, and the subsequent memcpy SIGSEGV'd. Check both lock return codes and that both pointers are non-null before copying, and drop the frame on failure instead of crashing.

Also add an `ls` assertion after each perfetto capture in FrameLatencyCapture: UiAutomation.executeShellCommand swallows exit codes and stderr, so a misbehaving perfetto used to silently pass the test with zero traces produced. The assertion gives us a clear failure with the output-dir listing in the message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The FTL run finally produced .pftrace files but the workflow couldn't find them: --directories-to-pull preserves the full on-device path under the GCS artifacts/ prefix, so /sdcard/Android/media/<pkg>/additional_test_output lands at .../artifacts/sdcard/Android/media/<pkg>/additional_test_output, not .../artifacts/additional_test_output. Update the gsutil cp URL and drop the rsync fallback (it was masking this exact bug).

Then on Mali (Pixel 6), every frame was dropped: AHardwareBuffer_lock returned 0 (success) but with a NULL pointer for the GPU buffer side. The buffer was allocated with only GPU_SAMPLED_IMAGE | GPU_FRAMEBUFFER; strict drivers refuse to CPU-map a buffer not allocated CPU-writable and signal that by returning success+null. Add CPU_WRITE_OFTEN to the allocation. Adreno was lenient and worked without it; Mali is strict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
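The path mapping described above can be sketched as a small helper. The bucket name, run prefix, and package name in the usage line are hypothetical; only the rule itself (the full on-device path is appended under artifacts/) comes from the commit message.

```python
import posixpath


def pulled_artifact_url(artifacts_prefix, device_dir):
    """Append the full on-device path to the GCS artifacts prefix.

    --directories-to-pull keeps the whole device path, so the leading
    slash is dropped and the rest is appended unchanged.
    """
    return posixpath.join(artifacts_prefix.rstrip("/"), device_dir.lstrip("/"))


# Hypothetical bucket, run prefix and package name, for illustration only.
url = pulled_artifact_url(
    "gs://example-bucket/run-1/device/artifacts/",
    "/sdcard/Android/media/com.example.app/additional_test_output",
)
```

Here `url` ends in artifacts/sdcard/Android/media/com.example.app/additional_test_output, which is the location the gsutil cp URL has to point at.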
gsutil cp -r refuses to copy multiple files into a non-existent destination
("Destination URL must name a directory, bucket, or bucket subdirectory")
even when the dest ends with /. Pre-create the dir with mkdir -p.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
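A minimal sketch of the same fix from a Python driver, assuming the copy were scripted there and not in the workflow shell step. The helper name is hypothetical; it only creates the destination and builds the argument list, leaving the actual gsutil invocation to the caller.

```python
import os


def gsutil_copy_command(src_url, dest_dir):
    """Pre-create dest_dir (like mkdir -p) and return the gsutil argv."""
    os.makedirs(dest_dir, exist_ok=True)
    # Trailing slash kept for readability; gsutil still needs the dir to exist.
    return ["gsutil", "cp", "-r", src_url, dest_dir.rstrip("/") + "/"]
```

The returned list can be passed to `subprocess.run(..., check=True)` so a failed copy stops the job.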
compare-baseline.py guards against silent FTL pool swaps by exiting 3 on ftl_model_id mismatch; the adreno baseline still claimed redfin from the pre-device-swap commit and aborted before reaching the placeholder happy path. Sync the placeholder metadata to a52sxq/34 (adreno) and bump the mali android_sdk to 33 to match the version bump. Empty stages, so first real run will still flow through the placeholder branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
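The guard described above might be structured like this sketch. The exit code 3 and the ftl_model_id field come from the commit message; the `stages` key, the function names, and the overall JSON layout are assumptions made for illustration.

```python
import sys

EXIT_DEVICE_MISMATCH = 3  # exit code named in the commit message


def check_device(baseline, results):
    """Return an error string if the FTL device differs, else None."""
    expected = baseline.get("ftl_model_id")
    actual = results.get("ftl_model_id")
    if expected != actual:
        return f"ftl_model_id mismatch: baseline={expected!r} run={actual!r}"
    return None


def compare(baseline, results):
    """Return a process exit code for the comparison."""
    error = check_device(baseline, results)
    if error:
        print(error, file=sys.stderr)
        return EXIT_DEVICE_MISMATCH
    if not baseline.get("stages"):  # assumed key: empty means placeholder
        print("placeholder baseline: nothing to compare yet")
        return 0
    return 0  # the real per-metric comparison would run here
```

Checking the device first is what made the stale redfin metadata abort before the placeholder branch could be reached.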
Three changes building on the now-green Benchmark pipeline:
1. Drop .github/workflows/build.yml — its job is byte-identical to the
build job in benchmark.yml (which already gates PRs). Main-branch push
builds go away; re-add as a separate workflow if/when needed.
2. Per-PR comment with the consolidated p50/p90/p99 delta table:
- compare-baseline.py learns --output-md FILE.
- Each benchmark-{adreno,mali} job writes comparison-*.md alongside
results-*.json and uploads it with the existing artifact.
- New `comment` job runs after both (if: always() so regressions still
show), downloads both artifacts, and upserts a single PR comment via
actions/github-script (marker comment to find-and-update). Merge
gating is unchanged — the benchmark-{adreno,mali} jobs still fail on
regression, so branch protection blocks merge as before.
3. New .github/workflows/baselines.yml — manual workflow_dispatch:
- Optional run_id input (default: latest Benchmark run on the branch).
- Downloads benchmark-results-{adreno,mali}, copies results-*.json over
baseline-*.json, commits to the same branch.
- Next Benchmark run sees a populated baseline and turns green, making
a previously-red "needs baseline refresh" PR mergeable without a
manual commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
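The find-and-update step of the comment job can be sketched as pure logic. The workflow itself uses actions/github-script (JavaScript); this Python version only illustrates the decision, and the marker string and the dict shapes are hypothetical.

```python
MARKER = "<!-- frame-latency-benchmark -->"  # hypothetical hidden marker


def plan_upsert(existing_comments, body_md):
    """Decide whether to update the marked comment or create a new one.

    existing_comments: list of dicts with "id" and "body" keys (assumed shape).
    """
    body = f"{MARKER}\n{body_md}"
    for comment in existing_comments:
        if MARKER in comment.get("body", ""):
            return {"action": "update", "comment_id": comment["id"], "body": body}
    return {"action": "create", "body": body}
```

Because the marker travels inside the comment body, repeated runs keep rewriting the same comment and do not stack new ones.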
Frame-latency benchmark
Adreno (Galaxy A52s 5G, Adreno 642L)
Frame-latency benchmark results
Baseline: OpenGL ES
Watch-only metrics (16) — informational, never fail the build
Vulkan
Watch-only metrics (16) — informational, never fail the build
Mali (Pixel 6, Mali-G78)
Frame-latency benchmark results
Baseline: OpenGL ES
Watch-only metrics (16) — informational, never fail the build
Vulkan
Watch-only metrics (16) — informational, never fail the build
To re-seed baselines from this run, manually trigger the Baselines workflow under Actions → Baselines and pick this branch as the ref. (Only visible after the workflow file lands on the default branch — GitHub limitation for |
Manually seeded what .github/workflows/baselines.yml will do once it's
landed on the default branch (workflow_dispatch isn't available before
that). Real stages now populate baseline-{adreno,mali}.json so the next
Benchmark run produces actual delta percentages in the PR comment
instead of "missing" placeholders.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual equivalent of .github/workflows/baselines.yml (which isn't dispatchable until merged to main). Previous mali run tripped the tight ±5% gate on frame_to_screen.vk.p90 at +5.3% — natural FTL run-to-run variance, no code regression. Reseed baselines from the latest run so the next Benchmark cycle compares against fresh data and the PR can go green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes to the comparison output and gating logic.

Gate calibration: sub-ms metrics like frame_native_proc.avg (~0.7 ms baseline) trivially trip the percent gate on noise: a 0.1 ms jitter becomes +14% even though it's far below any frame-budget significance. Add an absolute floor per tier; a metric passes when EITHER |Δ%| ≤ tolerance_pct OR |Δabs| ≤ abs_floor. Real regressions exceed both thresholds, while pure relative noise on tiny absolutes is filtered.

tight: ±5% AND ±0.5 ms
loose: ±10% AND ±0.5 ms (or ±5 frames for dropped_frames counters)

PR-comment layout: group metrics by renderer (OpenGL ES vs Vulkan) in two sub-tables with the renderer prefix stripped from the row keys, so the same stage in gl/vk lines up visually for side-by-side reading. New Δabs column next to Δ% makes the absolute jitter obvious at a glance (handy when a flagged metric turns out to be sub-ms noise).

Also clarifies the PR-comment text about how to dispatch the Baselines workflow now that the file is finally landing on the default branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
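The dual gate reads naturally as a single predicate. This is a minimal sketch under the rule stated in the commit message (pass when either bound holds); the function name and the handling of a zero baseline are assumptions.

```python
def passes_gate(baseline, current, tolerance_pct, abs_floor):
    """Pass when EITHER the percent delta OR the absolute delta is in bounds."""
    delta_abs = current - baseline
    # Assumed convention: a zero baseline has no meaningful percent delta.
    delta_pct = delta_abs / baseline * 100.0 if baseline else float("inf")
    return abs(delta_pct) <= tolerance_pct or abs(delta_abs) <= abs_floor


# Tight tier from the commit message: ±5% and ±0.5 ms.
noise = passes_gate(0.7, 0.8, tolerance_pct=5.0, abs_floor=0.5)
regression = passes_gate(20.0, 22.0, tolerance_pct=5.0, abs_floor=0.5)
```

In the first call the 0.1 ms shift is about +14% but stays under the 0.5 ms floor, so it passes. In the second the 2 ms shift is +10% and exceeds both bounds, so it fails.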
After enabling the dual gate, frame_e2e metrics kept tripping with ~1 ms shifts between identical commits on FTL Pixel 6 / Galaxy A52s. Local SM-F936B drifts ~0.25 ms — the source of CLAUDE.md's original calibration — but FTL devices show observably higher between-run jitter, so the 0.5 ms floor was too tight there. 1.5 ms absorbs the empirical FTL noise without giving up regression detection on 13–25 ms baselines (any >1.5 ms slowdown still fails both the percent and absolute bounds). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
p99 is the worst 1% of frames per iteration, structurally outlier-sensitive, and after five CI cycles it's empirically the noisiest tight-tier metric on FTL (Pixel 6 hit +8.9% / +1.878 ms even between identical commits, exceeding both tight bounds). avg + p90, for which CLAUDE.md documents sub-3% CV, stay tight; p99 moves to loose (±10% / ±0.5 ms) where its natural variance fits. Real >10% p99 regressions still fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avg metrics aren't representative — one slow frame in 500+ skews the mean even when steady-state behavior is fine. p90 captures the steady state, p99 the tail, but avg is noisy in a way that mixes both. Move every *.avg metric to the watch tier (still emitted in the PR comment, just never fails the build). The PR comment now shows only gated metrics by default per renderer (p90 tight, p99 loose, to_screen.p90 tight) and collapses watch metrics under a <details><summary>Watch-only metrics (N) — informational, never fail the build</summary> block. Reduces ~38 rows per renderer to ~3 in the default view; reviewers expand if they want to inspect the tail distributions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
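The tier assignment and the collapsed layout can be sketched as follows. The suffix rules mirror the commit message (p90 tight, p99 loose, avg watch); treating every other metric as watch-only, the row format, and the shortened summary text are assumptions.

```python
def tier_for(metric):
    """Map a metric name to its gate tier by suffix."""
    if metric.endswith(".p90"):
        return "tight"
    if metric.endswith(".p99"):
        return "loose"
    return "watch"  # *.avg and, by assumption, anything else


def render_comment(rows):
    """rows: list of (metric, delta_text) pairs for one renderer."""
    gated = [(m, t) for m, t in rows if tier_for(m) != "watch"]
    watch = [(m, t) for m, t in rows if tier_for(m) == "watch"]
    lines = [f"{m} [{tier_for(m)}]: {t}" for m, t in gated]
    if watch:
        # Collapsed by default; reviewers expand it when they need the detail.
        lines.append(f"<details><summary>Watch-only metrics ({len(watch)})</summary>")
        lines.append("")
        lines.extend(f"{m}: {t}" for m, t in watch)
        lines.append("")
        lines.append("</details>")
    return "\n".join(lines)
```

Only the gated rows appear in the default view, which is how the comment shrinks from dozens of rows per renderer to a handful.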
No description provided.