Open
131 commits
671e50d
feat: add Fleet Task environment for skyrl-gym
Mar 28, 2026
35d9513
feat: add Fleet training integration with entrypoints, scripts, and c…
Mar 28, 2026
ae7934e
Add task generation environment for skyrl-gym
Mar 28, 2026
91776e1
Add hint augmentation support for Fleet task training
Mar 28, 2026
f1c3f1b
merge: resolve fleet_task + task_gen registration conflict
Mar 29, 2026
66bf318
Merge remote-tracking branch 'origin/fleet/training' into fleet/all
Mar 29, 2026
73251e1
merge: resolve hint-augmentation + task_gen config conflict
Mar 29, 2026
ff094d9
Add VL/CUA multimodal support ported from SkyRL PR #288
Mar 29, 2026
1af0a8e
Add GCP spot H200 option for VL YAML
Mar 29, 2026
9660f8c
Fix setup: use fsdp extra instead of non-existent vllm extra
Mar 29, 2026
ce0cef4
Fix causal-conv1d build: use pip instead of uv for CUDA extension
Mar 29, 2026
8b16d85
Fix cd path in run scripts for SkyRL-v2 repo layout
Mar 29, 2026
282f5fe
Add missing config fields for Fleet training overrides
Mar 29, 2026
d1342f4
Apply legacy config translation in Hydra entrypoints
Mar 29, 2026
1f196e2
Add inference_engine defaults to YAML for Hydra entrypoints
Mar 29, 2026
f223bba
Fix fleet_task double registration and task-gen data path
Mar 29, 2026
4b6c15e
Fix legacy config sync, registration, and task-gen data path
Mar 29, 2026
0dc3a17
Handle OmegaConf DictConfig in get_config_as_yaml_str
Mar 29, 2026
8393415
Replace @hydra.main with from_cli_overrides in Fleet entrypoints
Mar 29, 2026
061fcdc
Upgrade accelerate in extra-setup to fix _is_hf_initialized TypeError
Mar 29, 2026
d011232
Fix accelerate install: use --no-deps to avoid torch re-resolution
Mar 29, 2026
baa37a5
Patch Parameter.__new__ to fix _is_hf_initialized TypeError
Mar 29, 2026
acafea3
Fix config parsing and per-record env_class for fleet training
Mar 29, 2026
31e96c1
Fix task_gen rollout dir and OmegaConf struct flag for config overrides
Mar 29, 2026
e0bd62c
Export FLEET_API_KEY to Ray runtime env and improve task import error…
Mar 29, 2026
bfd547a
fix(task_gen): use data_source as fallback for env_key in extras
Mar 29, 2026
d5620dc
fix(35b): add --no-pytorch-alloc-conf to prevent vLLM CuMem crash
Mar 29, 2026
205252f
fix: sanitize multimodal content for text-only chat templates
Mar 29, 2026
3d4ad77
fix: update uids after hint augmentation extends trajectory_ids
Mar 29, 2026
0f921a7
fix: catch env.init failures in agent_loop to prevent training crash
Mar 29, 2026
35de0bb
fix: hardcode flash_attn=false in all fleet run scripts + add "$@" to…
Mar 29, 2026
763f056
fix: use [0.0] rollout_logprobs in env_init_error fallback (not None)
Mar 29, 2026
244f40d
fix: match reward format in env_init_error fallback
Mar 29, 2026
79700ee
add per-env task-gen launcher script
Mar 29, 2026
c1ceec0
fix: handle per-trajectory exceptions in generate() instead of crashi…
Mar 29, 2026
9d51e33
fix: multi-node FSDP2 stability + hint batch size for 35B training
Mar 29, 2026
09d2b43
docs: add CLAUDE.md and fleet changelog for multi-node fixes
Mar 29, 2026
b98b240
Add training trajectory logging + S3 upload
Mar 29, 2026
56cd76a
fix: add dump_training_trajectories to TrainerConfig dataclass
Mar 29, 2026
22d6ace
fix: match system prompt with old fork for VL parity
Mar 29, 2026
d40c5f7
fix: match system prompt with old fork for VL parity
Mar 29, 2026
0f023ed
Add data.env_filter for per-env dataset filtering at training time
Mar 29, 2026
9cf2d5d
fix: re-add --no-pytorch-alloc-conf for vLLM 0.18.0 CuMemAllocator co…
Mar 29, 2026
95450e2
docs: update changelog + CLAUDE.md for vLLM 0.18.0 CuMemAllocator fix
Mar 29, 2026
acf8783
merge fleet/training: re-add --no-pytorch-alloc-conf for vLLM 0.18.0
Mar 29, 2026
90af7d9
Pass data_key, data_version, env_version, env_variables through to pa…
Mar 29, 2026
5d7c878
chore: rename wandb project to fleet-tool-use-grpo
Mar 29, 2026
ee19cd9
chore: rename wandb project to fleet-tool-use-grpo
Mar 29, 2026
afb8875
merge fleet/training: rename wandb project
Mar 29, 2026
89e8799
fix: enable flash_attn for 35B training (OOM without it)
Mar 29, 2026
b115b31
fix: enable flash_attn for 35B training (OOM without it)
Mar 29, 2026
fd4be8e
merge fleet/training: enable flash_attn for 35B
Mar 29, 2026
2db6c69
Pass representative env_variables + env_variable_keys per-env to parquet
Mar 30, 2026
42fe8fd
feat(tinker): add stop-sequences, top-p, loss-fn args and fix avg_raw…
Mar 30, 2026
b0371f9
Merge branch 'fleet/training' into fleet/all
Mar 30, 2026
4a1b65d
fix: use async env methods to prevent event loop isolation
Mar 30, 2026
3a69f08
Port chunked lm_head forward + rewrite CHANGELOG as coherent document
Mar 30, 2026
3c7d2e1
Fix apply_overlong_filtering call signature
Mar 30, 2026
833a7ae
chore: reduce VL eval samples to 1 for faster iteration
Mar 30, 2026
3dca648
revert: restore eval_n_samples_per_prompt=3 for pass@3
Mar 30, 2026
786b2af
Switch to flash_attn=true + update docs with corrected diagnosis
Mar 30, 2026
a31d087
Fix logprobs/tokens shape mismatch and cap max_input_length
Mar 30, 2026
5e8ac67
fix: add retry logic to _execute_meta_tool for transient connection e…
Mar 30, 2026
3b2dc02
Pin vLLM 0.17.0 + re-enable expandable_segments + update docs
Mar 30, 2026
0f34391
chore: temporarily disable eval_before_train for training verification
Mar 30, 2026
9c92108
Revert vllm_engine.py to pre-0.18 for vLLM 0.17.0 compatibility
Mar 30, 2026
dd93554
Keep vLLM 0.18.0, reduce seq length to 72K, restore vllm_engine.py
Mar 30, 2026
5b7bb43
Update YAML MAX_INPUT_LENGTH to 72000 to match fleet-35b-run.sh
Mar 30, 2026
d7a6b48
chore: re-enable eval_before_train for production VL run
Mar 30, 2026
2d12453
chore: disable eval_before_train to verify backward pass
Mar 30, 2026
a004025
fix: parse all tool calls per turn + remove exploration gate
Mar 30, 2026
621ae49
Clarify --no-pytorch-alloc-conf mechanism in CHANGELOG
Mar 30, 2026
a50ca53
Switch to flash_attn=false — flash_attn=true causes Xid 31 with vLLM …
Mar 30, 2026
816064b
Re-enable eval_before_train for production VL run
Mar 30, 2026
32a8022
docs: update CHANGELOG fix #4 — flash_attn=false, verified working st…
Mar 30, 2026
54d74ad
docs: update CHANGELOG — 10 steps verified, checkpoint at step 10
Mar 30, 2026
ef7687f
fix: disable hints during training by default
Mar 30, 2026
36d7f32
feat: tool-call reward shaping + increase context to 65K
Mar 31, 2026
efe1fb0
feat: LLM classifier gate to filter broken tasks before harness
Mar 31, 2026
f653a30
fix: remove read-write mismatch check, use Sonnet 4.5 for classifier
Mar 31, 2026
3cfd44e
Merge pull request #6 from fleet-ai/fleet/all
dzorlu Mar 31, 2026
4fa377c
chore: update workdir.ref to main for task-gen YAML
Mar 31, 2026
cf0098c
chore: update workdir.ref to main for VL and 35B YAMLs
Mar 31, 2026
9990be6
feat(task-gen): v4 reward hacking fixes — judge, exploration, schema …
dzorlu Apr 1, 2026
690aad7
fix: update per-env launch script for v4
Apr 1, 2026
ffa5238
fix: use case statement for seed counts (bash compat)
Apr 1, 2026
b967f43
Reduce VL max_input_length from 128K to 96K to prevent OOM
Apr 1, 2026
519293d
Compact schema + remove describe_db tool
Apr 2, 2026
c6773a3
feat(task-gen): verifier hardening — exploration gate, anti-permissiv…
dzorlu Apr 3, 2026
35efb8e
feat(task-gen): add 35B task-gen YAML and run script (#15)
dzorlu Apr 3, 2026
086807e
feat: LLM-synthesized hints for failed trajectories
Apr 4, 2026
19ad98c
Enable partial_reward for VL training
Apr 4, 2026
b6174df
fix: use OpenRouter via litellm instead of direct Anthropic API
Apr 4, 2026
8cf2fd8
fix: use correct OpenRouter model ID for hint synthesis
Apr 4, 2026
c03cf79
Binary reward + truncate query_db responses
Apr 7, 2026
bb2e5de
Fix binary reward: restore base_quality + ablation config
Apr 7, 2026
1b08b60
Fix submission nudge: append to tool results, not dead branch
Apr 8, 2026
d89caf6
CLAUDE.md: report binary variance reward, not just pass@8
Apr 8, 2026
06cb395
v5.1: verifier dry-run, MCP tool prompt, earlier nudge, zero_variance…
Apr 9, 2026
8a8fbde
35b: baseline on v6, disable hints
Apr 10, 2026
d7d087e
VL training: add browser_use modality support, switch to v6 data
Apr 12, 2026
e9ebbef
fix: allow browser_use modality in fleet-common-setup.sh validation
Apr 12, 2026
1ff4065
fix: add 900s trajectory timeout to VL training
Apr 12, 2026
3109829
35b: use triton GDN prefill to avoid FlashInfer JIT hang
Apr 12, 2026
af81a43
Revert "35b: use triton GDN prefill to avoid FlashInfer JIT hang"
Apr 12, 2026
8a87a25
35b: enable eval_before_train for step 0 baseline
Apr 13, 2026
0dc943e
fix: CLAUDE.md primary branch is main, not fleet/all
Apr 14, 2026
9b7fc1f
fix: wire S3 upload for eval results after every eval
Apr 14, 2026
5a84fde
35b: eval_interval=10 (was 20)
Apr 14, 2026
e95a47b
feat: add checkpoint broadcast to workers for
sumi-fleet-hub Apr 17, 2026
d643e41
fix: gather checkpoint shards from workers before S3
sumi-fleet-hub Apr 17, 2026
9f7137e
fix: increase broadcast timeout to 30min for large checkpoints
sumi-fleet-hub Apr 17, 2026
d1b9d87
fix: dynamic rsync timeout based on checkpoint size
sumi-fleet-hub Apr 17, 2026
bc184d7
Merge pull request #17 from fleet-ai/fix/multi-node-checkpoint
sumi-fleet-hub Apr 17, 2026
c7631c7
VL v1: lr 5e-7, max_turns 64, eval_before_train false
Apr 18, 2026
6a5a81a
VL v1.1: max_input_length 96K → 72K (fix NaN gradients)
Apr 18, 2026
179b23c
Revert "VL v1.1: max_input_length 96K → 72K (fix NaN gradients)"
Apr 18, 2026
3360624
VL v2: max_input_length 96K → 80K (fix NaN gradients)
Apr 18, 2026
2718251
VL v3: max_input_length 64K, zero_variance_filter=true
Apr 19, 2026
3b9a85e
feat: port HybridEnvSampler from SkyRL-archived
Apr 25, 2026
7fc32ab
VL: 2-node, batch_size=50, min_samples_per_env=2 (#18)
dzorlu Apr 25, 2026
1ab8ef4
feat: add Fleet eval-only entrypoint with S3 checkpoint resume
sumi-fleet-hub Apr 26, 2026
0865c8a
eval_before_train=true for checkpoint resume eval
Apr 26, 2026
62d4b73
Merge pull request #19 from fleet-ai/feat/eval-only-entrypoint
sumi-fleet-hub Apr 28, 2026
5dfd198
Prioritize RunPod reserved H200s in SkyPilot task configs
Apr 29, 2026
e64bc7e
VL: increase max_input_length to 80K for longer browser trajectories
Apr 30, 2026
6ad8c76
VL: increase max_turns 64→80 for browser-use turn limit ablation
May 1, 2026
84b49de
feat: save screenshots in trajectory dumps for VL training
May 4, 2026
646d5a9
feat: save screenshots in eval trajectory dumps too
May 4, 2026
abf4008
VL: set eval_before_train=false to skip 10h eval overhead
May 4, 2026
9e9f648
Add taste-reward shaping on top of main (rebased)
May 4, 2026
42 changes: 42 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,42 @@
# SkyRL-v2 (fleet-ai/SkyRL-v2)

Fork of SkyRL with Fleet-specific optimizations for multi-node FSDP2 training at scale.

## Fleet Integration

Fleet-specific changes, fixes, and context are documented in:
- **[integrations/fleet/CHANGELOG.md](integrations/fleet/CHANGELOG.md)** — detailed changelog with root causes and fixes

Always consult the changelog before modifying Fleet training paths (`fsdp_worker.py`, `worker.py`, `model_wrapper.py`, `dispatch.py`, `fleet-*.sh`).

## Key Differences from Upstream SkyRL

1. **Multi-node FSDP2 stability**: Synchronous ref model offload/backload with `torch.distributed.barrier()` in `fsdp_worker.py`. Required because cross-node colocated training has no shared CUDA context.

2. **Chunked lm_head forward**: `model_wrapper.py` has `loss_chunk_size` support ported from the old fork. Avoids materializing full `(B, S, vocab_size)` logits — critical for 35B with 131K vocab at 97K sequence length. Without it, OOM/Xid 31 during training forward.

3. **CUDA memory management for 35B**: `torch.cuda.empty_cache()` before backward pass in `worker.py` (policy + critic). Prevents OOM from fragmentation.

4. **Reduced sequence length (72K) for 35B**: `fleet-35b-run.sh` uses `MAX_INPUT_LENGTH=72000` (down from 96000) with `--no-pytorch-alloc-conf` (disables `expandable_segments` which conflicts with vLLM 0.18.0's `CuMemAllocator`). At 97K, SDPA OOM'd and flash_attn hit Xid 31 in GatedDeltaNet. At 72K, flash_attn=true + chunked lm_head + empty_cache fits without expandable_segments.

5. **`stage_chunks` pre-staging**: `dispatch.py` has a `stage_chunks` optimization (not in upstream) that pre-stages mini-batch chunks in Ray object store. Includes dynamic `mini_batch_size` adjustment for hint augmentation's variable batch sizes.
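
To make the memory argument in item 2 concrete, here is a back-of-the-envelope estimate of the logits tensor size. It is only a sketch under stated assumptions: bf16 logits (2 bytes per element), batch size 1, and an illustrative chunk of 4096 tokens. The actual `loss_chunk_size` value is not given here, so `CHUNK_TOKENS` is hypothetical.

```python
# Rough logits-memory estimate for the chunked lm_head forward.
# Assumptions (not taken from the repo): bf16 logits (2 bytes each),
# batch size 1, and an illustrative chunk of 4096 tokens.
BYTES_PER_ELEMENT = 2
VOCAB_SIZE = 131_000   # "131K vocab", rounded
SEQ_LEN = 97_000       # "97K sequence length", rounded
CHUNK_TOKENS = 4_096   # hypothetical stand-in for loss_chunk_size


def logits_bytes(batch: int, tokens: int, vocab: int = VOCAB_SIZE) -> int:
    """Bytes needed to hold a (batch, tokens, vocab) logits tensor."""
    return batch * tokens * vocab * BYTES_PER_ELEMENT


full = logits_bytes(1, SEQ_LEN)
chunked = logits_bytes(1, CHUNK_TOKENS)
print(f"full logits: {full / 1e9:.1f} GB")
print(f"one chunk:   {chunked / 1e9:.2f} GB")
```

This counts only the forward logits for one sequence. Any fp32 upcast for the softmax and the matching gradients would add to it, which is why chunking matters at this sequence length.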

## Training Scripts

- `scripts/fleet-common-run.sh` — shared infra (Ray, NCCL, gIB detection, deps). Used by all runs.
- `scripts/fleet-35b-run.sh` — Qwen3.5-35B config. Calls `fleet-common-run.sh`.
- `scripts/fleet-9b-run.sh` — Qwen3.5-9B config. Calls `fleet-common-run.sh`.

All training flags live in these scripts. Never duplicate flags in SkyPilot YAMLs or fleet-research scripts.

## Task-Gen Metrics

When reporting task-gen training metrics, distinguish between:
- **pass@8 / avg_raw_reward**: includes `base_quality=0.1` for passing sandbox+judge. Misleading — inflated by gate-passing alone.
- **binary variance reward**: the actual learning signal. `1.0` when solver rollouts are mixed (at least 1 pass + 1 fail), `0.0` otherwise. This is what matters.

Report binary variance reward count (how many tasks got `reward >= 1.0`) separately from gate-pass count. Check `EVAL` log lines for `total=1.0000` vs `total=0.0000`.
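
The distinction above can be sketched as follows. This is a minimal illustration, assuming each task carries a list of per-rollout solver pass/fail flags. The function names and the sample data are hypothetical and do not come from the training code.

```python
from typing import Dict, Sequence


def binary_variance_reward(solver_passes: Sequence[bool]) -> float:
    """1.0 when solver rollouts are mixed (at least 1 pass and 1 fail), else 0.0."""
    return 1.0 if any(solver_passes) and not all(solver_passes) else 0.0


def count_learning_signal(tasks: Dict[str, Sequence[bool]]) -> int:
    """Number of tasks whose binary variance reward reached 1.0."""
    return sum(binary_variance_reward(p) >= 1.0 for p in tasks.values())


# Hypothetical example: three generated tasks that all passed the gates.
tasks = {
    "task_a": [True, False, True, False],    # mixed: useful signal
    "task_b": [True, True, True, True],      # too easy: no signal
    "task_c": [False, False, False, False],  # too hard or broken: no signal
}
print(f"gate-pass count: {len(tasks)}")
print(f"binary variance reward count: {count_learning_signal(tasks)}")
```

All three tasks would contribute `base_quality` to `avg_raw_reward`, but only `task_a` carries the learning signal, which is why the two counts should be reported separately.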

## Branch

Primary development branch: `main`
147 changes: 147 additions & 0 deletions docs/taste/LAUNCH.md
@@ -0,0 +1,147 @@
# Taste-Judge GRPO Launch Recipe

Wires `research/judge/judge.py` into the SkyRL Fleet GRPO training loop.
Reward shape is **GATED TASTE**:

```
effective_taste = max(taste_floor, taste_score) # 1.0 if judge fails / None
reward = verifier_reward * effective_taste
```

Blended only on the terminal step of each rollout, with a 10s judge timeout
and verifier-only fallback (`effective_taste = 1.0`, so reward collapses to
`verifier_reward`) on timeout/exception/None.
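
A minimal sketch of the gate with its timeout fallback is shown below. It is not the code in the env patch: `gated_reward`, `terminal_reward`, and the `judge` callable are hypothetical names, and the only behavior assumed is what is described above (10s timeout, `effective_taste = 1.0` on timeout, exception, or `None`).

```python
import asyncio
from typing import Awaitable, Callable, Optional

JUDGE_TIMEOUT_S = 10.0  # judge timeout described above


def gated_reward(
    verifier_reward: float,
    taste_score: Optional[float],
    taste_floor: float = 0.1,
) -> float:
    """reward = verifier_reward * max(taste_floor, taste_score); 1.0 if no score."""
    effective_taste = 1.0 if taste_score is None else max(taste_floor, taste_score)
    return verifier_reward * effective_taste


async def terminal_reward(
    verifier_reward: float,
    judge: Callable[[], Awaitable[Optional[float]]],  # hypothetical judge call
    taste_floor: float = 0.1,
    timeout_s: float = JUDGE_TIMEOUT_S,
) -> float:
    """Blend on the terminal step; fall back to verifier-only on any judge failure."""
    try:
        taste_score = await asyncio.wait_for(judge(), timeout=timeout_s)
    except Exception:  # includes asyncio.TimeoutError
        taste_score = None
    return gated_reward(verifier_reward, taste_score, taste_floor)
```

Treating a missing score as `1.0` rather than as the floor keeps a judge outage from lowering rewards: the run degrades to the verifier-only baseline instead of penalizing every success.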

### Why gated > additive

The previous additive shape `R = alpha * verifier + (1-alpha) * taste`
rewarded "pretty failures" — a trajectory that fails the verifier (v=0)
but narrates clean intent (t high) earned `(1-alpha) * t > 0`, which
incentivized the policy to learn good-looking failure modes. Gated taste
closes this hack: `verifier=0` forces `reward=0` regardless of taste, so
there is zero gradient toward pretty-failure mimicry. Among successes,
ugly successes still earn `floor * verifier` (default `floor=0.1`) so GRPO
sees within-group taste variance and can prefer pretty successes; setting
`floor=1.0` collapses the shape to pure verifier and serves as a clean
ablation baseline. **The floor is set to 0.1 (not 0.3) because offline
analysis showed mean rescaled taste of verifier=1 trajectories is ~0.13;
floor=0.3 would clip nearly all successes and kill within-group variance.
Re-tune floor after a 50-100 step pilot using the empirical effective_taste
P25 logged in WandB.**
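
The difference between the two shapes can be checked on a few hand-picked trajectories. This is an illustrative sketch only: `ALPHA = 0.7` and the taste scores are made-up values, and the two functions simply restate the formulas above.

```python
ALPHA = 0.7  # hypothetical blend weight for the old additive shape
FLOOR = 0.1  # default taste floor


def additive(verifier: float, taste: float, alpha: float = ALPHA) -> float:
    """Previous shape: R = alpha * verifier + (1 - alpha) * taste."""
    return alpha * verifier + (1 - alpha) * taste


def gated(verifier: float, taste: float, floor: float = FLOOR) -> float:
    """Current shape: R = verifier * max(floor, taste)."""
    return verifier * max(floor, taste)


# (verifier, taste) pairs chosen for illustration.
cases = {
    "pretty failure": (0.0, 0.9),
    "ugly success": (1.0, 0.05),
    "pretty success": (1.0, 0.9),
}
for name, (v, t) in cases.items():
    print(f"{name:15s} additive={additive(v, t):.2f} gated={gated(v, t):.2f}")
```

Under the additive shape the pretty failure still earns a positive reward, while under the gated shape it earns exactly zero and the two successes remain separated by their taste.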

## One-block launch

```bash
# 0. From your machine:
cd /tmp && rm -rf skyrl-fleet && git clone https://github.com/fleet-ai/skyrl-fleet.git
cd /tmp/skyrl-fleet

# 1. Apply the env patch (adds taste_floor config, _apply_taste_reward helper,
# and updates the three terminal returns + get_metrics).
git apply /Users/alliegu/Desktop/fleet/integration/env.py.diff

# 2. Vendor the taste-judge package into the workdir Python path.
cp -r /Users/alliegu/Desktop/fleet/integration/skyrl_taste skyrl-gym/skyrl_taste
cp -r /Users/alliegu/Desktop/fleet/research/judge research/judge

# 3. Drop the new YAML config into tasks/.
cp /Users/alliegu/Desktop/fleet/integration/configs/openenv-fleet-grpo-vl-taste.yaml \
tasks/openenv-fleet-grpo-vl-taste.yaml
# Review comment (Contributor, severity: medium) on lines +41 to +49:
# The launch and rollback instructions include hardcoded, user-specific
# absolute paths (e.g., /Users/alliegu/...), which prevents other developers
# from following them directly. Replace these with relative paths from the
# repository root, or use placeholders like <path-to-repo>, to make the
# documentation reproducible.


# 4. Sky launch with the new yaml + new env vars (judge keys are NEW; the rest
# are unchanged from the existing VL launch).
sky launch tasks/openenv-fleet-grpo-vl-taste.yaml \
--env FLEET_API_KEY="$FLEET_API_KEY" \
--env WANDB_API_KEY="$WANDB_API_KEY" \
--env AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
--env AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
--env ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
--env OPENAI_API_KEY="$OPENAI_API_KEY"
```

## Required env vars

- `ANTHROPIC_API_KEY` — **required**. Default judge backend (Claude via
`research/judge/judge.py`). Without it the judge import fails and the env
silently falls back to verifier-only reward (you'll see
`taste_judge_failed=True` in WandB).
- `OPENAI_API_KEY` — **only required if running inter-rater agreement
passes** (GPT-4o judge for cross-checking Claude scores during eval). Not
needed for the standard training run.
- `FLEET_API_KEY`, `WANDB_API_KEY`, `AWS_ACCESS_KEY_ID`,
`AWS_SECRET_ACCESS_KEY` — same as the upstream VL launch.

**Important:** Invoke `judge.py` with `blind_outcome=True` at training time
to suppress outcome bleed (Stream 4 finding — when the judge sees the
verifier outcome, taste scores correlate ~0.7 with verifier and the
shaping signal collapses to a noisy duplicate of the binary reward). The
async wrapper in `skyrl_taste/judge.py` handles this; double-check the
flag is forwarded if you swap the wrapper.

## WandB metrics to watch

- `reward/train/mean` — gated reward; bounded above by verifier mean.
- `env/taste_reward` — judge's [0,1] raw score per trajectory.
- `env/effective_taste` — `max(floor, taste_reward)`; what actually
multiplies the verifier.
- `env/verifier_reward` — raw binary verifier per trajectory.
- `env/taste_floor` — the configured floor; sanity-check.
- `env/taste_judge_failed` — should stay near 0; spikes mean Anthropic
outage or judge parse failures (auto-fallback to pure verifier engaged).
- **Cross-check**: in within-group runs, plot Pearson(`taste_reward`,
`verifier_reward`). If correlation collapses below ~0.3, the judge is
scoring a different signal than the verifier — that's the expected case
and where the shaped-reward gradient comes from. If it climbs above
~0.7, suspect outcome bleed (re-verify `blind_outcome=True`).
- `reward/train/variance_per_prompt` and `signal_ratio` (from
`integrations/fleet/reward_metrics.py`) should *increase* relative to a
verifier-only baseline on groups with mixed pretty/ugly successes.

## Rollback

**Runtime kill switch (no redeploy):**
```bash
sky exec <cluster> "echo 'export SKYRL_TASTE_DISABLED=1' >> ~/.bashrc && pkill -HUP -f main_fleet"
# or update the SkyPilot env block and re-launch with --env SKYRL_TASTE_DISABLED=1
```
This makes `score_trajectory_async` return `None`, the env's
`effective_taste` becomes `1.0`, and reward collapses to pure verifier.
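
One way such a switch might look inside an async wrapper is sketched below. This is not the code in `skyrl_taste/judge.py`: `score_with_kill_switch` and the `judge` callable are hypothetical, and the sketch assumes the variable is read on every call rather than once at import time.

```python
import asyncio
import os
from typing import Awaitable, Callable, Optional


def taste_disabled() -> bool:
    """True when the kill switch is set in this process's environment."""
    return os.environ.get("SKYRL_TASTE_DISABLED") == "1"


async def score_with_kill_switch(
    judge: Callable[[], Awaitable[Optional[float]]],  # hypothetical judge call
) -> Optional[float]:
    """Return None (verifier-only reward) when the switch is on."""
    if taste_disabled():
        return None
    return await judge()
```

A check like this only sees variables present in the training process's own environment, so it is worth confirming in WandB that `env/effective_taste` actually moves to `1.0` after flipping the switch.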

**Full revert (reverse-apply the patch):**
```bash
cd /tmp/skyrl-fleet
git apply -R /Users/alliegu/Desktop/fleet/integration/env.py.diff
rm -rf skyrl-gym/skyrl_taste research/judge
```

## Two-knob ablation (floor x grpo_norm_by_std)

| floor \ grpo_norm_by_std | true (default) | false (recommended w/ gated taste) |
|---|---|---|
| 0.0 (pure multiplicative) | Ugly successes get R=0; group std collapses on all-ugly groups. Heavy gradient damping; expect slow learning. | Same dynamics, undamped; risk of policy ignoring ugly successes entirely. |
| 0.1 | Tiny within-success variance; std-norm wipes most of the gradient. | Tight bonus for pretty successes; conservative shaping. |
| 0.1 (default) | Tiny within-success variance from floor itself; std-norm still wipes most of the gradient. | **Headline candidate.** Multiplicative-with-cushion; closes hack and matches the empirical taste distribution. |
| 0.3 | Within-success std damped; offline data shows nearly all successes clip to floor — kills the signal. | Heavier shaping; only sensible if live taste distribution skews high. |
| 0.5 | Floor close to pretty-mid; less taste differentiation among successes. | Shallower shaping; useful as sensitivity check. |
| 1.0 (pure verifier) | **Identical to upstream baseline.** A/B control, no taste in std. | Identical to upstream too (no taste in std). |

Recommended order: run cell `(0.1, false)` first as the headline candidate,
then `(0.1, true)` to measure the std-norm effect, then `(1.0, true)` as
the upstream baseline. `(0.0, false)` is a diagnostic: confirms the gate
itself bites (ugly successes get zero) without floor compensation.
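
To compare cells of the table before launching, the group advantages can be computed by hand for a small rollout group. The sketch below assumes the standard GRPO form, mean-centering within the group with an optional division by the group standard deviation, and assumes `grpo_norm_by_std` toggles that division. The trainer's exact estimator (population vs. sample std, epsilon) may differ, and the scores are made up.

```python
from statistics import mean, pstdev
from typing import List


def gated(verifier: float, taste: float, floor: float) -> float:
    return verifier * max(floor, taste)


def group_advantages(
    rewards: List[float], norm_by_std: bool, eps: float = 1e-6
) -> List[float]:
    """Mean-centered group advantages, optionally divided by the group std."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not norm_by_std:
        return centered
    sigma = pstdev(rewards)  # assumption: population std
    return [c / (sigma + eps) for c in centered]


# One hypothetical group: three successes of varying taste and one failure.
verifier = [1.0, 1.0, 1.0, 0.0]
taste = [0.9, 0.05, 0.3, 0.8]

for floor in (0.1, 1.0):
    rewards = [gated(v, t, floor) for v, t in zip(verifier, taste)]
    for norm in (True, False):
        adv = group_advantages(rewards, norm)
        formatted = ", ".join(f"{a:+.2f}" for a in adv)
        print(f"floor={floor} norm_by_std={norm}: {formatted}")
```

With `floor=1.0` the three successes receive identical advantages, matching the pure-verifier row. With `floor=0.1` they are ordered by taste, and the two `norm_by_std` settings differ only in scale within a single group.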

## Risks / gotchas

- **Judge latency budget**: `n_samples_per_prompt=4` x `train_batch_size=50`
= ~200 concurrent judge calls per training step, each with a 10s timeout.
Anthropic rate limits will throttle you before the GPU does. Watch
`taste_judge_failed`: sustained >10% means raise the limit or batch.
- **Reward range**: gated reward is in `[0, 1]`, same as the verifier, so the
pass@n threshold (`reward >= 1.0` in `reward_metrics.py:79-82`) only
triggers when `verifier=1` and `effective_taste=1.0`, that is, when
`taste_score=1.0` or the judge fell back to verifier-only. **Pass@n will look
worse than verifier-only**; report it alongside the new gated-reward
mean, and consider plotting `verifier_reward >= 1.0` as a separate
pass@n line for direct comparison to the baseline.
- **Outcome bleed**: confirmed Stream 4 risk if the judge ever sees the
verifier outcome. Keep `blind_outcome=True` in `score_trajectory_async`.
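
The pass@n effect described in the reward-range item can be reproduced on toy data. This sketch assumes pass@n means "at least one rollout in the prompt's group reaches the threshold". The real logic lives in `reward_metrics.py` and may differ, and the scores below are invented.

```python
from typing import List


def pass_at_n(group_rewards: List[List[float]], threshold: float = 1.0) -> float:
    """Fraction of prompts with at least one rollout at or above the threshold."""
    hits = sum(any(r >= threshold for r in group) for group in group_rewards)
    return hits / len(group_rewards)


def gate(verifier: float, taste: float, floor: float = 0.1) -> float:
    return verifier * max(floor, taste)


# Three prompts, four rollouts each (hypothetical values).
verifier = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
taste = [
    [0.4, 0.9, 1.0, 0.2],
    [0.5, 0.7, 0.1, 0.3],
    [0.3, 0.6, 0.2, 0.8],
]
gated = [
    [gate(v, t) for v, t in zip(vs, ts)] for vs, ts in zip(verifier, taste)
]

print(f"pass@n on verifier_reward: {pass_at_n(verifier):.2f}")
print(f"pass@n on gated reward:    {pass_at_n(gated):.2f}")
```

The third prompt is solved by every rollout but never reaches `reward >= 1.0` after gating, so the gated pass@n drops even though task success is unchanged. Plotting both lines keeps the comparison to the baseline honest.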