RVPO: Risk-sensitive alignment with variance regularization

Machine Learning


Current critic-free RLHF methods aggregate the rewards of multiple objectives via an arithmetic average, making them vulnerable to constraint neglect: high scores on one objective numerically offset severe failures on others (such as safety or formatting), masking poor performance on the “bottleneck” rewards that is essential to catch for reliable multi-objective alignment. We propose Reward Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes dispersion across rewards during reward aggregation, shifting the objective from “maximize the sum” to “maximize consistency.” Through a Taylor expansion, we show that the LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 simultaneous LLM-judge reward signals (Qwen2.5-3B/7B/14B) and on tool invocation with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from ignoring difficult constraints to exploit easier objectives, RVPO improves the overall score on HealthBench (0.261 vs. 0.215 for GDPO on 14B, p < 0.001) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed with other multi-reward methods. This shows that variance regularization alleviates constraint neglect across model scales without sacrificing general capabilities.
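
To make the variance-penalty claim concrete, the following is the standard second-order Taylor expansion behind it, reconstructed from the abstract alone; the temperature β and the exact normalization over K rewards are our assumptions, not details taken from the paper:

$$
\operatorname{softmin}_\beta(r_1,\dots,r_K)
= -\frac{1}{\beta}\log\!\Big(\frac{1}{K}\sum_{i=1}^{K} e^{-\beta r_i}\Big)
\approx \bar r - \frac{\beta}{2}\,\operatorname{Var}(r),
$$

since expanding each $e^{-\beta r_i} = e^{-\beta \bar r} e^{-\beta (r_i - \bar r)}$ to second order around the mean $\bar r$ gives $\frac{1}{K}\sum_i e^{-\beta r_i} \approx e^{-\beta \bar r}\big(1 + \frac{\beta^2}{2}\operatorname{Var}(r)\big)$ (the linear term sums to zero), and $\log(1+x)\approx x$ for small $x$. The aggregate is thus the arithmetic mean minus a smooth variance penalty, recovering the mean as $\beta \to 0$ and the hard min as $\beta \to \infty$.

A minimal numerical sketch of this aggregation, with a hypothetical function name and made-up reward values rather than the authors' implementation:

```python
import numpy as np

def softmin_aggregate(rewards: np.ndarray, beta: float = 5.0) -> float:
    """Negative LogSumExp (smooth-min) over per-objective rewards.

    Approximately mean(rewards) - (beta / 2) * var(rewards), so one
    bottleneck failure cannot be averaged away by high scores elsewhere.
    """
    z = -beta * rewards
    m = z.max()  # shift for numerical stability before exponentiating
    lse = m + np.log(np.exp(z - m).mean())
    return -lse / beta

# One failed safety/formatting reward among otherwise high scores:
rewards = np.array([0.9, 0.95, 0.1, 0.9])
print(rewards.mean())                      # ~0.713: the failure is masked
print(softmin_aggregate(rewards, beta=5))  # ~0.367: pulled toward the bottleneck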



Source link