Vector Policy Optimization: Training for Diversity Improves Test-Time Search

A new paper, “Vector Policy Optimization (VPO)”, proposes an RL algorithm that trains language models to produce diverse responses. The goal is to improve their performance under test-time search systems like AlphaEvolve, which select among rollouts using a variety of reward functions.
VPO differs from traditional approaches by training the model to output a set of solutions, each specializing in a different trade-off within a vector-valued reward space. This allows it to match or outperform existing scalar RL baselines on metrics that reward diverse outputs, such as pass@k and best@k.

– VPO changes how language models are trained, particularly for applications that need adaptable and diverse responses.
– The research suggests that optimizing for diversity could become a standard post-training objective as search systems like AlphaEvolve continue to develop.
– The work highlights the importance of training models not just to maximize a single scalar reward but to handle multiple or changing reward functions, which matters for real-world applications where conditions can shift.
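
To make the pass@k and best@k metrics mentioned above concrete, here is a minimal Python sketch. It is not the paper's method or code. `pass_at_k` uses the standard unbiased estimator commonly used in code-generation evaluation; `best_at_k`, `mean_best_at_k`, the toy reward vectors, and the weight vectors are hypothetical illustrations of how a search procedure might score a set of rollouts under several different scalarizations of a vector-valued reward.

```python
from math import comb
from typing import Sequence

Vector = Sequence[float]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_at_k(reward_vectors: Sequence[Vector], weights: Vector) -> float:
    """best@k under one scalarization: the highest weighted reward in the set."""
    return max(sum(w * r for w, r in zip(weights, vec)) for vec in reward_vectors)


def mean_best_at_k(reward_vectors: Sequence[Vector],
                   weight_vectors: Sequence[Vector]) -> float:
    """Average best@k over several reward weightings (hypothetical search settings)."""
    scores = [best_at_k(reward_vectors, w) for w in weight_vectors]
    return sum(scores) / len(scores)


# Toy data (assumed for illustration): two objectives, three rollouts per set.
DIVERSE = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]    # each rollout favors a different trade-off
COLLAPSED = [(0.6, 0.6), (0.6, 0.6), (0.6, 0.6)]  # three copies of one compromise answer
WEIGHTS = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]    # reward functions the search might apply

if __name__ == "__main__":
    print("pass@5 with 3 of 10 correct:", pass_at_k(10, 3, 5))
    print("diverse   mean best@3:", mean_best_at_k(DIVERSE, WEIGHTS))
    print("collapsed mean best@3:", mean_best_at_k(COLLAPSED, WEIGHTS))
```

On this toy data the diverse set scores higher whenever the search weights one objective heavily, while the collapsed set scores higher only under the balanced weighting. That is the intuition behind training for coverage of the reward space when the selection criterion is not known in advance.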
