Personalized Group Relative Policy Optimization for Heterogeneous Preference Alignment

Despite their sophisticated general-purpose capabilities, large language models (LLMs) often fail to align with diverse individual preferences. This is because standard post-training techniques, such as reinforcement learning from human feedback (RLHF), optimize a single global objective. Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, but its group-based normalization implicitly assumes that all samples are exchangeable, a limitation that carries over to personalized settings. This assumption conflates the reward distributions of different users and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against preference-group-specific reward histories rather than against the concurrently generated group, P-GRPO preserves the contrastive signal needed to learn distinct preferences. Across a variety of tasks, P-GRPO consistently converges faster and achieves higher rewards than standard GRPO, improving its ability to recover and align with heterogeneous preference signals. Our results demonstrate that accounting for reward heterogeneity at the optimization level is essential to building models that faithfully align with diverse human preferences without sacrificing general capabilities.
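The difference between batch-level and history-based normalization can be illustrated with a minimal sketch. This is not the authors' implementation: the names `batch_normalized_advantages`, `RewardHistory`, and `history_normalized_advantages` are hypothetical, the running statistics use Welford's method as an assumed choice, and the sketch assumes each sample carries a known preference-group label and that histories are updated before normalizing. The reward values are made up for illustration only.

```python
import math
from collections import defaultdict


def batch_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: z-score each reward against the current batch."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


class RewardHistory:
    """Running mean and variance of rewards for one preference group (Welford)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, reward):
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    def std(self):
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


def history_normalized_advantages(rewards, group_ids, histories, eps=1e-8):
    """Hypothetical P-GRPO-style step: normalize per preference-group history."""
    # Assumption: histories absorb the new rewards before normalization.
    for r, g in zip(rewards, group_ids):
        histories[g].update(r)
    return [
        (r - histories[g].mean) / (histories[g].std() + eps)
        for r, g in zip(rewards, group_ids)
    ]


# Illustrative data: group "a" dominates the batch, group "b" is a minority
# whose rewards sit on a lower scale.
rewards = [0.9, 0.7, 0.8, 0.6, 0.2, 0.1]
group_ids = ["a", "a", "a", "a", "b", "b"]
histories = defaultdict(RewardHistory)

batch_adv = batch_normalized_advantages(rewards)
history_adv = history_normalized_advantages(rewards, group_ids, histories)
```

With batch-level normalization, both samples from the minority group "b" receive negative advantages, so the signal separating the better "b" response from the worse one is swamped by the offset between the two groups. With per-group histories, the better "b" response gets a positive advantage and the worse one a negative advantage, which is the kind of within-group contrast the abstract describes.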
