Modify SkyRL Generator to Append Router Indices in Multi-Turn#1530
Modify SkyRL Generator to Append Router Indices in Multi-Turn#1530devpatelio wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements router replay (R3) for multi-turn search tasks, transitioning from overwriting to appending rollout expert indices to maintain routing decisions across turns. It introduces a new execution script, a utility function for merging indices, and expanded test coverage for multi-turn invariants. Review feedback focuses on correcting outdated documentation and removing development TODOs in the generator logic.
I am having trouble creating individual review comments. Click here to see my feedback.
skyrl/train/generators/skyrl_gym_generator.py (991-997)
The comment at lines 991-993 is now incorrect as it states the indices are overwritten, whereas the implementation has been updated to append them. Additionally, the comment at line 995 is a redundant development note that should be removed.
# Append the new rollout expert indices. The inference engine returns indices
# for the entire sequence (including previous turns), so we only append
# the new indices to maintain the original routing decisions for earlier turns.
agent_loop_state.rollout_expert_indices = append_rollout_expert_indices(
agent_loop_state.rollout_expert_indices, rollout_expert_indices_for_turn)
skyrl/train/generators/skyrl_gym_generator.py (1072-1078)
The comment at lines 1072-1074 is now incorrect as it states the indices are overwritten, whereas the implementation has been updated to append them. Additionally, there is a leftover TODO at line 1076 that should be removed.
# Append the new rollout expert indices. The inference engine returns indices
# for the entire sequence (including previous turns), so we only append
# the new indices to maintain the original routing decisions for earlier turns.
agent_loop_state.rollout_expert_indices = append_rollout_expert_indices(
agent_loop_state.rollout_expert_indices, turn_output.rollout_expert_indices)
skyrl/train/generators/skyrl_gym_generator.py (972)
This leftover TODO comment should be removed as the logic for appending is handled later in the function.
vLLM returns new expert indices across prior turns which can often differ from the actual routing decisions that were used in earlier turns.
Turn 1:
Input: [A] w/ R3 indices A_r
Output: [B] w/ R3 indices B_r
Observation: O
Record [A_r, B_r].
Turn 2:
Input: [A, B, O] w/ R3 indices [AB_r, O_r]
Output: [C] w/ R3 indices C_r
The previous approach was to override the recorded indices, but in the case that [AB_r, O_r] != [A_r, B_r], we need to instead append O and C to the original recorded indices instead.