Is your feature request related to a problem? Please describe.
GRPO inherently compresses the reward signal: with multiple rewards, they are combined into a single scalar before group normalization, losing information in the advantage estimates. Normalizing each reward separately within the group (group-wise decoupled normalization) better preserves cross-reward distinctions and enables more accurate multi-reward optimization.
Describe the solution you'd like
https://arxiv.org/abs/2601.05242
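A minimal sketch of the difference, assuming advantages are combined by averaging the per-reward normalized values (the exact combination rule and naming here are my assumptions, not necessarily the paper's):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO: sum rewards per sample, then normalize within the group.

    rewards: (group_size, n_rewards) matrix for one prompt's sampled completions.
    """
    total = rewards.sum(axis=1)                       # (group_size,)
    return (total - total.mean()) / (total.std() + 1e-8)

def decoupled_advantages(rewards: np.ndarray) -> np.ndarray:
    """Decoupled variant (sketch): normalize each reward column separately
    within the group, then combine. Per-reward scale differences are no longer
    collapsed by a single pooled mean/std."""
    mu = rewards.mean(axis=0, keepdims=True)          # per-reward group mean
    sigma = rewards.std(axis=0, keepdims=True)        # per-reward group std
    normed = (rewards - mu) / (sigma + 1e-8)          # (group_size, n_rewards)
    return normed.mean(axis=1)                        # hypothetical combiner
```

In the summed version, a reward with a large scale dominates the pooled statistics; in the decoupled version each reward contributes on equal footing after its own normalization.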
Additional context
The authors report substantial improvements in results.
How I came to this
As part of the Google Tunix hackathon, I observed this problem with GRPO in a multi-reward optimization setting. This recent paper directly addresses the loss of information in GRPO's advantage estimates, so I would love to add it to Tunix!
Checklist