Add FoldGRPO advantage estimator and process_rewards pipeline by sumi-fleet-hub · Pull Request #1514 · NovaSky-AI/SkyRL

sumi-fleet-hub · 2026-04-15T01:17:19Z

Adds the FoldGRPO advantage estimator from the paper "Scaling Long-Horizon LLM Agent via Context-Folding", along with the process_rewards plumbing needed to support it.

What is FoldGRPO

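FoldGRPO is GRPO with token-level process rewards. For rollout $i$ in a group, the process reward $Q_{i,t}$ at token $t$ is added to the outcome reward $R_i$ and the sum is clipped to $[\text{low}, \text{high}]$. The group baseline (mean and standard deviation) is computed from the outcome rewards $\{R_j\}$ only: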

$$\hat{A}_{i,t} = \frac{\text{clip}(R_i + Q_{i,t},\; \text{low},\; \text{high}) - \text{mean}(\{R_j\})}{\text{std}(\{R_j\})}$$
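As a rough illustration of the formula above, here is a minimal pure-Python sketch for a single group. It is not the code added in ppo_utils.py. The function name `fold_grpo_advantages`, the default clip bounds, the `eps` stabilizer, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import List


def fold_grpo_advantages(
    outcome_rewards: List[float],        # R_i, one scalar per rollout in the group
    process_rewards: List[List[float]],  # Q_{i,t}, one list of per-token values per rollout
    clip_low: float = 0.0,               # assumed default, stands in for "low"
    clip_high: float = 1.0,              # assumed default, stands in for "high"
    eps: float = 1e-6,                   # assumed stabilizer against a zero std
) -> List[List[float]]:
    """Hypothetical sketch: per-token advantages for one group of rollouts."""
    # The baseline and scale use outcome rewards only, as in the formula.
    baseline = mean(outcome_rewards)
    scale = pstdev(outcome_rewards) + eps

    advantages = []
    for r_i, q_i in zip(outcome_rewards, process_rewards):
        row = []
        for q_it in q_i:
            # Clip the combined reward, then normalize against the group.
            shaped = min(max(r_i + q_it, clip_low), clip_high)
            row.append((shaped - baseline) / scale)
        advantages.append(row)
    return advantages


# Two rollouts, two tokens each; the second token of rollout 0 is penalized.
example = fold_grpo_advantages(
    outcome_rewards=[1.0, 0.0],
    process_rewards=[[0.0, -0.5], [0.0, 0.0]],
    eps=0.0,
)
```

Because the process reward enters only the numerator, a penalized token gets a lower advantage than the other tokens of the same rollout, while the group statistics stay the same as in plain GRPO.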

Today SkyRL's reward path (GeneratorOutput, then TrainingInputBatch, then the advantage estimator) carries a single reward channel, used for outcome rewards, so there is no way to pass process rewards alongside them. This PR adds that second channel.

Changes

GeneratorOutput: optional process_rewards field
TrainingInputBatch / TrainingInput: optional process_rewards tensor
convert_prompts_responses_to_batch_tensors: pads and right-aligns process rewards
RayPPOTrainer: forwards process_rewards through to advantage estimator via kwargs
ppo_utils.py: new fold_grpo advantage estimator registered in enum and registry
config.py: FoldGRPOConfig with reward_clip_low / reward_clip_high
Example script and entry point under examples/train/algorithms/fold_grpo/
Test fixups for the new 8-tuple return and GeneratorOutput field list

All changes are optional and None-safe. Existing algorithms and environments are unaffected.
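The None-safe, kwargs-based forwarding described above could look roughly like the sketch below. This is an illustration of the pattern only, not SkyRL's actual registry or trainer code. The names `ESTIMATORS`, `register`, `outcome_only`, `with_process`, and `compute` are hypothetical, and the estimators omit baseline normalization to keep the focus on the plumbing.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry mapping estimator names to functions.
ESTIMATORS: Dict[str, Callable[..., List[List[float]]]] = {}


def register(name: str):
    """Decorator that adds an estimator to the hypothetical registry."""
    def deco(fn):
        ESTIMATORS[name] = fn
        return fn
    return deco


@register("outcome_only")
def outcome_only(rewards, lengths, **kwargs):
    # An existing-style estimator: extra kwargs such as process_rewards
    # are accepted and ignored, so its behavior does not change.
    return [[r] * n for r, n in zip(rewards, lengths)]


@register("with_process")
def with_process(rewards, lengths, process_rewards=None, **kwargs):
    # A new-style estimator: falls back to outcome-only behavior when
    # the environment did not emit process rewards.
    if process_rewards is None:
        return outcome_only(rewards, lengths)
    return [[r + q for q in qs] for r, qs in zip(rewards, process_rewards)]


def compute(
    name: str,
    rewards: List[float],
    lengths: List[int],
    process_rewards: Optional[List[List[float]]] = None,
) -> List[List[float]]:
    # The caller always forwards process_rewards as a keyword argument;
    # None means "no second channel".
    return ESTIMATORS[name](rewards, lengths, process_rewards=process_rewards)
```

With this shape, estimators that predate the second channel need no changes, and an environment that never emits process rewards takes the same path as before.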

What this does not include

The branching/folding scaffold (branch/return tools, KV-cache rollback, context management during rollout) and the process reward computation (Unfolded-Token Penalty, Out-of-Scope Penalty, Failure Penalty) are application-specific and live outside the core framework. This PR provides the training-side plumbing so environments that emit process rewards can use them.

fix test unpackings for 8-tuple return

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

visibility branch for fold-grpo ideas

0a632db


SumanthRH assigned CharlieFRuan Apr 15, 2026

sumi-fleet-hub and others added 4 commits April 14, 2026 18:29

Add FoldGRPO unit tests and

637b75a

fix test unpackings for 8-tuple return

Update skyrl/backends/skyrl_train/utils/ppo_utils.py

d2909ff

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update skyrl/backends/skyrl_train/utils/ppo_utils.py

f4e31d2

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Fix test unpacking for rollout_expert_indices (Devin review)

29f7b4e

SumanthRH approved these changes May 5, 2026

sumi-fleet-hub wants to merge 5 commits into NovaSky-AI:main from sumi-fleet-hub:feat/fold-grpo
