feat(check): action-parity metric engine (MMD two-sample + embodied) by rylinjames · Pull Request #198 · FastCrest/tether

rylinjames · 2026-05-31T23:37:49Z

The v1 core of reflex verify — the action-parity engine. Pure NumPy, no GPU/Modal, so the verdict math runs anywhere and is fully unit-testable.

What this is

A VLA policy samples actions, so per-sample atol/MSE between an original and an optimized policy anti-correlates with real-robot success. The right question is distributional. This module answers it:

reflex/check/metrics.py
- Distributional two-sample tests — mmd2_rbf (multi-bandwidth, median-heuristic), energy_distance, binned_kl.
- two_sample_test(...) — permutation test, FPR-controlled p-value. H0 = "same distribution"; p < alpha ⇒ the optimized action distribution differs ⇒ parity broken. (Model Equality Testing, arXiv 2410.20247; MMD locked per ADR 2026-05-31-parity-metric-mmd-provisional.)
- Embodied metrics — jerk_rms, motion_energy, path_length (arXiv 2603.19131) — catch regressions aggregate task-success hides.
- compute_parity(...) -> ParityVerdict — combines the distribution gate + an embodied-regression flag into a structured, serializable verdict.
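The distribution gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in reflex/check/metrics.py: the function names `sketch_mmd2_rbf` and `sketch_two_sample_test`, their signatures, the bandwidth scales, and the permutation count are all assumptions made for the example.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    d = (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off


def sketch_mmd2_rbf(x, y, scales=(0.5, 1.0, 2.0)):
    """Biased MMD^2 estimate under a sum of RBF kernels (hypothetical signature).

    The base bandwidth is the median squared pairwise distance of the pooled
    sample; each entry of `scales` rescales it (assumed values).
    """
    z = np.vstack([x, y])
    d = sq_dists(z, z)
    n = len(x)
    med = np.median(d[np.triu_indices_from(d, k=1)])
    med = med if med > 0 else 1.0  # guard against a degenerate sample
    k = sum(np.exp(-d / (2.0 * s * s * med)) for s in scales)
    return k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean()


def sketch_two_sample_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0 'x and y come from the same distribution'."""
    rng = np.random.default_rng(seed)
    observed = sketch_mmd2_rbf(x, y)
    z = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(z))  # relabel the pooled sample
        if sketch_mmd2_rbf(z[idx[:n]], z[idx[n:]]) >= observed:
            exceed += 1
    # The +1 terms count the observed split itself, keeping the p-value valid.
    return (exceed + 1) / (n_perm + 1)


# Usage: 60 sampled actions of dimension 7 per policy (synthetic data).
rng = np.random.default_rng(1)
original = rng.normal(0.0, 1.0, size=(60, 7))
broken = rng.normal(1.5, 1.0, size=(60, 7))  # shifted action distribution
print("p (shifted):", sketch_two_sample_test(original, broken))
```

Because the statistic is recomputed on relabelled splits of the pooled sample, the p-value needs no distributional assumption about the actions, which is what makes the false-positive rate controllable at alpha.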

Why it matters

It validates the known-good vs known-bad discrimination the (runaway, ~$48) Modal spike was meant to confirm — now as 9 fast deterministic unit tests, $0 spend:

- MMD rejects a shifted/broken optimization (p<0.05) and passes a faithful one (p>=0.05).
- The embodied gate flags a materially jerkier trajectory.
- FPR is bounded on same-distribution data (the test doesn't over-reject).

So the metric choice is de-risked cheaply; the Modal bake-off now only needs to confirm correlation with real-robot success.
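For illustration, the embodied metrics and the combined verdict might look something like the sketch below. Everything here is an assumption: the finite-difference definitions (which may differ from those in arXiv 2603.19131), the `sketch_*` names, the `SketchVerdict` fields, and the `jerk_tolerance` threshold are hypothetical and are not taken from the PR's `ParityVerdict`.

```python
import json
from dataclasses import asdict, dataclass

import numpy as np


def sketch_jerk_rms(traj, dt=1.0):
    """RMS jerk of a (T, D) position trajectory via third finite differences."""
    jerk = np.diff(traj, n=3, axis=0) / dt**3
    return float(np.sqrt((jerk ** 2).sum(axis=1).mean()))


def sketch_motion_energy(traj, dt=1.0):
    """Sum of squared finite-difference velocities times dt (assumed definition)."""
    vel = np.diff(traj, axis=0) / dt
    return float((vel ** 2).sum() * dt)


def sketch_path_length(traj):
    """Total Euclidean length of the piecewise-linear path through traj."""
    return float(np.linalg.norm(np.diff(traj, axis=0), axis=1).sum())


@dataclass
class SketchVerdict:
    """Hypothetical stand-in for a structured, serializable verdict."""
    p_value: float
    distribution_ok: bool
    embodied_regression: bool
    passed: bool


def sketch_verdict(p_value, ref_traj, opt_traj, alpha=0.05, jerk_tolerance=1.5):
    """Combine the distribution gate with an embodied-regression flag."""
    distribution_ok = p_value >= alpha  # fail to reject H0 at level alpha
    ref_jerk = sketch_jerk_rms(ref_traj)
    opt_jerk = sketch_jerk_rms(opt_traj)
    # Flag a regression when the optimized rollout is materially jerkier.
    regression = opt_jerk > jerk_tolerance * max(ref_jerk, 1e-12)
    return SketchVerdict(
        p_value=float(p_value),
        distribution_ok=distribution_ok,
        embodied_regression=regression,
        passed=distribution_ok and not regression,
    )


# Usage: a straight-line reference against a noisy version of the same path.
rng = np.random.default_rng(0)
smooth = np.linspace(0.0, 1.0, 50)[:, None] * np.ones(3)
noisy = smooth + rng.normal(0.0, 0.05, size=smooth.shape)
print(json.dumps(asdict(sketch_verdict(0.5, smooth, noisy))))
```

The usage case is one where the distribution gate alone would pass (p = 0.5 is supplied by hand) while the jerk comparison still fails the verdict, which is the kind of regression the embodied gate is there to catch.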

Honest scope

- "No detectable difference at alpha" is not a proof of equivalence. The rollout-success non-inferiority tier (TODO(reflex-verify)) closes that gap and is what ties the verdict to real-robot success. Marked, not hidden.
- This is the engine; wiring it into verify.py, replacing the sentinel gates from PR #196 (feat: reflex verify + comply engine (parity gate, signed cert, EU conformity pack)), is the next PR.

Test plan

- 9 passed: pytest tests/test_check_metrics.py
- ruff clean
- No new dependency (NumPy only); no other files touched

🤖 Generated with Claude Code

…metrics

The v1 core of reflex verify, pure-NumPy (no GPU/Modal). A distributional two-sample test (MMD / energy / binned-KL) with an FPR-controlled permutation p-value decides whether an optimized policy's action distribution matches the original's (Model Equality Testing 2410.20247; MMD locked per ADR 2026-05-31-parity-metric-mmd-provisional). Plus embodied metrics (jerk / motion-energy / path-length, 2603.19131) that catch regressions task-success hides. compute_parity() returns a structured ParityVerdict.

Validates the known-good vs known-bad discrimination the runaway Modal spike was meant to confirm, now as 9 deterministic unit tests, $0 spend: MMD rejects a shifted/broken optimization (p<0.05) and passes a faithful one (p>=0.05); the embodied gate flags a jerkier trajectory; FPR is bounded on same-distribution data. The Modal bake-off now only needs to confirm correlation with REAL-robot success.

Next: wire this into verify.py (replaces PR #196's sentinel gates) + the rollout-success non-inferiority tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rylinjames wants to merge 1 commit into main from feat/parity-metric-engine
