Large Language Models (LLMs) have demonstrated remarkable abilities in generating human-like text, answering questions, and coding. However, they face hurdles requiring high reliability, safety, and ethical adherence. Reinforcement Learning from Human Feedback (RLHF), or Preference-based Reinforcement Learning (PbRL), emerges as a promising solution. This framework has shown significant success in fine-tuning LLMs to align with human preferences, enhancing their usefulness.
Existing RLHF approaches, like InstructGPT, rely on explicit or implicit reward models, e.g., the Bradley-Terry model. Recent research explores direct preference probabilities to better represent human preferences. Some researchers formulate RLHF as finding Nash equilibriums in constant-sum games, proposing mirror descent and Self-play Preference Optimization (SPO) methods. Direct Nash Optimization (DNO) was also introduced based on win rate gaps, yet its practical implementation still relies on iterative DPO frameworks.
Researchers from the University of California, Los Angeles and Carnegie Mellon University introduce a robust self-play framework, Self-Play Preference Optimization (SPPO), for language model alignment addressing RLHF challenges. It offers provable guarantees for solving two-player constant-sum games and scalability for large language models. In formulating RLHF as such a game, the objective is to identify the Nash equilibrium policy, ensuring consistently preferred responses. They propose an adaptive algorithm based on multiplicative weights, employing a self-play mechanism where the policy fine-tunes itself on synthetic data annotated by the preference model.
The self-play framework aims to solve two-player constant-sum games efficiently and at scale for large language models. It adopts an iterative framework based on multiplicative weight updates and a self-play mechanism. The algorithm asymptotically converges to the optimal policy, identifying the Nash equilibrium. Theoretical analysis ensures convergence, providing provable guarantees. Compared to existing methods like DPO and IPO, SPPO demonstrates improved convergence and addresses data sparsity issues efficiently.
The researchers evaluate models using GPT-4 for automatic evaluation, presenting results on AlpacaEval 2.0 and MT-Bench. SPPO models consistently improve across iterations, with SPPO Iter3 showing the highest win rate. Compared to DPO and IPO, SPPO achieves superior performance and effectively controls output length. Test-time reranking with the PairRM reward model consistently improves model performance without over-optimization. SPPO outperforms many state-of-the-art chatbots on AlpacaEval 2.0 and remains competitive with GPT-4 on MT-Bench.
To conclude, the paper introduces Self-Play Preference Optimization (SPPO), a robust method for fine-tuning LLMs using Human/AI Feedback. By employing self-play in a two-player game and a preference-based learning objective, SPPO significantly improves over existing methods like DPO and IPO across various benchmarks. Integrating a preference model and batched estimation, SPPO aligns LLMs closely with human preferences, addressing issues like “length bias” reward hacking. These findings suggest SPPO’s potential for enhancing generative AI system alignment, advocating for its broader adoption in LLMs and beyond.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.
If you like our work, you will love our newsletter..
Don’t Forget to join our 41k+ ML SubReddit