Proximal Policy Optimization (PPO), initially designed for continuous control tasks, has become widely used in reinforcement learning (RL), including for fine-tuning generative models. However, PPO’s effectiveness relies on multiple heuristics for stable convergence, such as value networks and clipping, which make its implementation sensitive and complex. RL has proven remarkably versatile in moving from continuous control to fine-tuning generative models, yet adapting PPO, whose heuristics were originally devised to stabilize the optimization of small two-layer networks, to fine-tune modern generative models with billions of parameters raises concerns: it requires storing multiple models in memory simultaneously, and PPO’s performance varies widely due to seemingly trivial implementation details. This raises the question: are there simpler algorithms that scale to modern RL applications?
Policy Gradient (PG) methods, known for their direct, gradient-based policy optimization, are pivotal in RL. They fall into two families: methods based on REINFORCE, which typically incorporate variance-reduction techniques, and adaptive PG techniques, which precondition the policy gradient to ensure stability and faster convergence. However, computing and inverting the Fisher Information Matrix in adaptive PG methods like TRPO poses computational challenges at scale, which has led to coarse approximations such as PPO.
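To illustrate why exact adaptive updates are expensive, here is a minimal NumPy sketch of the preconditioned update that methods like Natural Policy Gradient and TRPO approximate. The function and argument names and the learning rate are assumptions made for this sketch, not taken from any particular implementation.

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1):
    # Adaptive PG preconditions the policy gradient with the inverse
    # Fisher Information Matrix. Solving fisher @ direction = grad is the
    # step that becomes intractable when theta has billions of parameters,
    # which is why PPO resorts to coarse approximations instead.
    direction = np.linalg.solve(fisher, grad)
    return theta + lr * direction
```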
Researchers from Cornell, Princeton, and Carnegie Mellon University introduce REBEL: REgression to RElative REward Based RL. The algorithm reduces policy optimization to regressing the relative reward between two completions to a prompt, expressed directly in terms of the policy (a direct policy parameterization), enabling a strikingly lightweight implementation. Theoretical analysis reveals REBEL as a foundation for RL algorithms like Natural Policy Gradient, matching the strongest known theoretical guarantees for convergence and sample efficiency. REBEL also accommodates offline data and can address the intransitive preferences that are common in practice.
The researchers adopt the contextual bandit formulation of RL, which is particularly relevant for models like LLMs and diffusion models because their transitions are deterministic. Prompt-response pairs are considered together with a reward function that measures response quality. The KL-constrained RL problem is formulated to fine-tune the policy toward higher reward while staying close to a baseline policy. A closed-form solution to this relative entropy problem, derived in prior work, allows the reward to be expressed as a function of the policy. REBEL then iteratively updates the policy via a square-loss objective, using paired samples so that the prompt-dependent partition function cancels out. This core REBEL objective fits the relative reward between pairs of responses and, in doing so, solves the KL-constrained RL problem.
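To make the square-loss objective concrete, below is a minimal PyTorch-style sketch of the relative-reward regression loss described above. It assumes per-response log-probabilities (summed over tokens) under the current and previous policies and scalar rewards are already computed; the function and argument names and the hyperparameter `eta` are illustrative rather than taken from the authors' code.

```python
import torch

def rebel_loss(logp_new_y, logp_old_y, logp_new_yp, logp_old_yp,
               reward_y, reward_yp, eta=1.0):
    """Regress the policy's implied relative reward onto the observed one
    for a batch of response pairs (y, y') to the same prompts."""
    # Implied relative reward:
    # (1/eta) * [log pi_new(y|x)/pi_old(y|x) - log pi_new(y'|x)/pi_old(y'|x)]
    pred = ((logp_new_y - logp_old_y) - (logp_new_yp - logp_old_yp)) / eta
    # Observed relative reward r(x, y) - r(x, y'); the prompt-dependent
    # partition function cancels in this difference.
    target = reward_y - reward_yp
    return torch.mean((pred - target) ** 2)
```

In an iterative loop, one would sample two responses per prompt from the current policy, score them with the reward model, minimize this loss, and repeat with the updated policy taking the role of the previous one.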
A comparison of REBEL with SFT, PPO, and DPO for models trained with LoRA shows that REBEL achieves the highest RM score across all model sizes, albeit with a slightly larger KL divergence than PPO. Notably, REBEL achieves the highest win rate under GPT-4 when evaluated against human references, indicating the advantage of regressing relative rewards. There is a trade-off between reward model score and KL divergence: REBEL exhibits higher divergence than PPO but achieves larger RM scores, especially towards the end of training.
In conclusion, this research presents REBEL, a simplified RL algorithm that tackles the RL problem by solving a series of relative reward regression tasks on sequentially gathered datasets. Unlike policy gradient approaches, which often rely on additional networks and heuristics like clipping for optimization stability, REBEL focuses on driving down training error on a least squares problem, making it remarkably straightforward to implement and scale. Theoretically, REBEL aligns with the strongest guarantees available for RL algorithms in agnostic settings. In practice, REBEL demonstrates competitive or superior performance compared to more complex and resource-intensive methods across language modeling and guided image generation tasks.
Check out the Paper. All credit for this research goes to the researchers of this project.