feat(check): action-parity metric engine (MMD two-sample + embodied) #198
Open
rylinjames wants to merge 1 commit into
…metrics

The v1 core of reflex verify, pure-NumPy (no GPU/Modal). A distributional two-sample test (MMD / energy / binned-KL) with an FPR-controlled permutation p-value decides whether an optimized policy's action distribution matches the original's (Model Equality Testing, arXiv 2410.20247; MMD locked per ADR 2026-05-31-parity-metric-mmd-provisional). Plus embodied metrics (jerk / motion-energy / path-length, arXiv 2603.19131) that catch regressions task-success hides. compute_parity() returns a structured ParityVerdict.

Validates the known-good vs known-bad discrimination the runaway Modal spike was meant to confirm, now as 9 deterministic unit tests, $0 spend: MMD rejects a shifted/broken optimization (p<0.05) and passes a faithful one (p>=0.05); the embodied gate flags a jerkier trajectory; FPR is bounded on same-distribution data. The Modal bake-off now only needs to confirm correlation with REAL-robot success.

Next: wire this into verify.py (replaces PR #196's sentinel gates) + the rollout-success non-inferiority tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
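For orientation, the embodied metrics named in the commit message (jerk / motion-energy / path-length) might look roughly like the sketch below. The PR does not give the formulas, so the finite-difference definitions, the `(T, D)` trajectory layout, the `dt` parameter, and the `*_sketch` function names are all assumptions for illustration, not the code in `reflex/check/metrics.py`.

```python
import numpy as np


def jerk_rms_sketch(traj, dt=1.0):
    """RMS jerk of a (T, D) trajectory (assumed: third finite difference / dt^3)."""
    jerk = np.diff(traj, n=3, axis=0) / dt**3
    return float(np.sqrt(np.mean(np.sum(jerk**2, axis=1))))


def motion_energy_sketch(traj, dt=1.0):
    """Integrated squared velocity (assumed definition of motion energy)."""
    vel = np.diff(traj, axis=0) / dt
    return float(np.sum(vel**2) * dt)


def path_length_sketch(traj):
    """Sum of Euclidean step lengths along the trajectory."""
    steps = np.diff(traj, axis=0)
    return float(np.sum(np.linalg.norm(steps, axis=1)))


# Usage: a smooth arc versus the same arc with small per-step jitter.
t = np.linspace(0.0, np.pi, 100)
smooth = np.stack([np.cos(t), np.sin(t)], axis=1)
jittery = smooth + np.random.default_rng(0).normal(0.0, 0.005, smooth.shape)

for name, traj in [("smooth", smooth), ("jittery", jittery)]:
    print(
        f"{name}: jerk_rms={jerk_rms_sketch(traj):.5f} "
        f"motion_energy={motion_energy_sketch(traj):.5f} "
        f"path_length={path_length_sketch(traj):.5f}"
    )
```

The jitter barely changes where the arc ends up, which is why a task-success check could miss it, while the third difference amplifies it strongly. An embodied-regression flag could then compare the optimized policy's jerk against the original's with some tolerance; that threshold is a design choice this sketch does not try to guess.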
The v1 core of `reflex verify`: the action-parity engine. Pure NumPy, no GPU/Modal, so the verdict math runs anywhere and is fully unit-testable.

What this is

A VLA policy samples actions, so per-sample `atol`/MSE between an original and an optimized policy anti-correlates with real-robot success. The right question is distributional. This module, `reflex/check/metrics.py`, answers it:

- `mmd2_rbf` (multi-bandwidth, median-heuristic), `energy_distance`, `binned_kl`.
- `two_sample_test(...)`: permutation test, FPR-controlled p-value. H0 = "same distribution"; `p < alpha` ⇒ the optimized action distribution differs ⇒ parity broken. (Model Equality Testing, arXiv 2410.20247; MMD locked per ADR `2026-05-31-parity-metric-mmd-provisional`.)
- `jerk_rms`, `motion_energy`, `path_length` (arXiv 2603.19131): catch regressions that aggregate task-success hides.
- `compute_parity(...) -> ParityVerdict`: combines the distribution gate and an embodied-regression flag into a structured, serializable verdict.

Why it matters

It validates the known-good vs known-bad discrimination the (runaway, ~$48) Modal spike was meant to confirm, now as 9 fast deterministic unit tests, $0 spend:

- MMD rejects a shifted/broken optimization (`p<0.05`) and passes a faithful one (`p>=0.05`).

So the metric choice is de-risked cheaply; the Modal bake-off now only needs to confirm correlation with real-robot success.
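The known-good vs known-bad discrimination described above can be sketched in pure NumPy as a permutation test on a multi-bandwidth RBF MMD statistic. This is a minimal illustration under stated assumptions, not the PR's implementation: the function names (`pooled_rbf_kernel_sketch`, `permutation_pvalue_sketch`), the bandwidth scales, the biased MMD² estimator, and the synthetic Gaussian "actions" are all hypothetical.

```python
import numpy as np


def pooled_rbf_kernel_sketch(x, y, scales=(0.5, 1.0, 2.0)):
    """Sum of RBF kernels on the pooled sample.

    Bandwidths are the median pairwise distance (median heuristic)
    multiplied by each entry of `scales` (assumed choice).
    """
    z = np.vstack([x, y])
    sq = np.sum(z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * z @ z.T, 0.0)
    med2 = np.median(d2[np.triu_indices_from(d2, k=1)])
    med2 = med2 if med2 > 0 else 1.0  # guard against degenerate data
    return sum(np.exp(-d2 / (2.0 * med2 * s * s)) for s in scales)


def mmd2_from_kernel(k, idx_x, idx_y):
    """Biased MMD^2 estimate for one split of the pooled kernel matrix."""
    kxx = k[np.ix_(idx_x, idx_x)].mean()
    kyy = k[np.ix_(idx_y, idx_y)].mean()
    kxy = k[np.ix_(idx_x, idx_y)].mean()
    return kxx + kyy - 2.0 * kxy


def permutation_pvalue_sketch(x, y, n_perm=200, seed=0):
    """Return (MMD^2, p-value) under H0 = 'same distribution'."""
    rng = np.random.default_rng(seed)
    k = pooled_rbf_kernel_sketch(x, y)  # computed once, reused per permutation
    n, total = len(x), len(x) + len(y)
    idx = np.arange(total)
    observed = mmd2_from_kernel(k, idx[:n], idx[n:])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(total)
        if mmd2_from_kernel(k, perm[:n], perm[n:]) >= observed:
            hits += 1
    # The +1 correction keeps the p-value valid (never exactly zero).
    return observed, (hits + 1) / (n_perm + 1)


# Usage: synthetic stand-ins for original vs optimized action samples.
rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=(100, 4))
faithful = rng.normal(0.0, 1.0, size=(100, 4))  # same distribution
broken = rng.normal(1.0, 1.0, size=(100, 4))    # mean-shifted
for name, optimized in [("faithful", faithful), ("broken", broken)]:
    stat, p = permutation_pvalue_sketch(original, optimized)
    print(f"{name}: MMD^2={stat:.4f} p={p:.3f} parity_ok={p >= 0.05}")
```

Because the bandwidth is taken from the pooled sample, the kernel matrix does not change under relabeling, so each permutation only re-indexes it. Comparing the p-value against a fixed `alpha` is what bounds the false-positive rate on same-distribution data.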
Honest scope
- … The rollout-success non-inferiority tier (`TODO(reflex-verify)`) closes that and is what ties to real-robot success. Marked, not hidden.
- Wiring this into `verify.py` (replacing the sentinel gates of PR #196, "feat: reflex verify + comply engine (parity gate, signed cert, EU conformity pack)") is the next PR.

Test plan
- 9 passed: `pytest tests/test_check_metrics.py`

🤖 Generated with Claude Code