
Add FoldGRPO advantage estimator and process_rewards pipeline #1514

Open
sumi-fleet-hub wants to merge 5 commits into NovaSky-AI:main from sumi-fleet-hub:feat/fold-grpo



@sumi-fleet-hub sumi-fleet-hub commented Apr 15, 2026

Adds the FoldGRPO advantage estimator from the paper "Scaling Long-Horizon LLM Agent via Context-Folding", along with the process_rewards plumbing needed to support it.

What is FoldGRPO

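FoldGRPO is GRPO with token-level process rewards. For trajectory $i$ with outcome reward $R_i$ and process reward $Q_{i,t}$ at token $t$, the sum $R_i + Q_{i,t}$ is clipped to $[\text{low}, \text{high}]$, while the group baseline (mean and standard deviation) is computed from outcome rewards only: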

$$\hat{A}_{i,t} = \frac{\text{clip}(R_i + Q_{i,t},\; \text{low},\; \text{high}) - \text{mean}(\{R_j\})}{\text{std}(\{R_j\})}$$
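The formula above can be sketched in plain Python for a single group of trajectories. This is a framework-free illustration, not the `fold_grpo` estimator from this PR: the function name `fold_grpo_advantages`, the use of population standard deviation, and the `eps` term added to avoid division by zero are all assumptions made for the example.

```python
from statistics import mean, pstdev


def fold_grpo_advantages(outcome_rewards, process_rewards, clip_low, clip_high, eps=1e-6):
    """Per-token advantages for one group (illustrative sketch only).

    outcome_rewards: list of floats, one R_i per trajectory in the group.
    process_rewards: list of lists, Q_{i,t} for each token of trajectory i.
    clip_low / clip_high: bounds applied to R_i + Q_{i,t}.
    eps: assumed stabilizer so a zero-variance group does not divide by zero.
    """
    # The baseline uses outcome rewards only, never the process rewards.
    baseline = mean(outcome_rewards)
    scale = pstdev(outcome_rewards) + eps  # population std is an assumption here

    advantages = []
    for r_i, q_i in zip(outcome_rewards, process_rewards):
        row = []
        for q_it in q_i:
            # Clip the combined token-level reward before subtracting the baseline.
            clipped = min(max(r_i + q_it, clip_low), clip_high)
            row.append((clipped - baseline) / scale)
        advantages.append(row)
    return advantages


# Two trajectories; the first one is penalized on its second token.
adv = fold_grpo_advantages(
    outcome_rewards=[1.0, 0.0],
    process_rewards=[[0.0, -0.5], [0.0, 0.0]],
    clip_low=-1.0,
    clip_high=1.0,
)
# adv is roughly [[1.0, 0.0], [-1.0, -1.0]]
```

Because the mean and standard deviation ignore $Q_{i,t}$, a process penalty lowers the advantage of the affected tokens without shifting the baseline for the rest of the group.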

Today SkyRL's reward path (GeneratorOutput, TrainingInputBatch, advantage estimator) carries only a single channel, used for outcome rewards, so there is no way to pass process rewards alongside them. This PR adds that second channel.

Changes

  • GeneratorOutput: optional process_rewards field
  • TrainingInputBatch / TrainingInput: optional process_rewards tensor
  • convert_prompts_responses_to_batch_tensors: pads and right-aligns process rewards
  • RayPPOTrainer: forwards process_rewards through to advantage estimator via kwargs
  • ppo_utils.py: new fold_grpo advantage estimator registered in enum and registry
  • config.py: FoldGRPOConfig with reward_clip_low / reward_clip_high
  • Example script and entry point under examples/train/algorithms/fold_grpo/
  • Test fixups for the new 8-tuple return and GeneratorOutput field list

All new fields are optional and None-safe, so existing algorithms and environments are unaffected.
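One way to picture the None-safe forwarding is the toy dispatcher below. It is a sketch under stated assumptions, not the code in this PR: `REGISTRY`, `register`, `compute_advantages`, and both toy estimators are hypothetical stand-ins, and the estimators only mean-center rewards (no clipping or std normalization) to keep the focus on how the optional argument travels.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for an advantage-estimator registry.
REGISTRY: Dict[str, Callable[..., List[List[float]]]] = {}


def register(name: str):
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


def _centered(outcome_rewards: List[float]) -> List[float]:
    baseline = sum(outcome_rewards) / len(outcome_rewards)
    return [r - baseline for r in outcome_rewards]


@register("grpo")
def toy_grpo(outcome_rewards, response_lengths, **kwargs):
    # Existing-style estimator: extra kwargs such as process_rewards are ignored.
    return [[a] * n for a, n in zip(_centered(outcome_rewards), response_lengths)]


@register("fold_grpo")
def toy_fold_grpo(outcome_rewards, response_lengths, process_rewards=None, **kwargs):
    # None-safe: with no process rewards, every Q_{i,t} is treated as zero.
    if process_rewards is None:
        process_rewards = [[0.0] * n for n in response_lengths]
    centered = _centered(outcome_rewards)
    return [[a + q for q in q_i] for a, q_i in zip(centered, process_rewards)]


def compute_advantages(
    name: str,
    outcome_rewards: List[float],
    response_lengths: List[int],
    process_rewards: Optional[List[List[float]]] = None,
) -> List[List[float]]:
    # Only forward the second channel when the environment actually emitted it.
    kwargs = {}
    if process_rewards is not None:
        kwargs["process_rewards"] = process_rewards
    return REGISTRY[name](outcome_rewards, response_lengths, **kwargs)
```

In this sketch an environment that never emits process rewards takes the same path as before, and an estimator that does not know about the new channel simply ignores it.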

What this does not include

The branching/folding scaffold (branch/return tools, KV-cache rollback, context management during rollout) and the process reward computation (Unfolded-Token Penalty, Out-of-Scope Penalty, Failure Penalty) are application-specific and live outside the core framework. This PR provides the training-side plumbing so environments that emit process rewards can use them.



devin-ai-integration[bot]

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

sumi-fleet-hub and others added 4 commits April 14, 2026 18:29
   fix test unpackings for 8-tuple return
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

3 participants