This Paper from Google DeepMind Presents Conditioned Language Policies (CLP): A Machine Learning Framework for Finetuning Language Models on Multiple Objectives

Reinforcement Learning (RL) finetuning is an important step in training language models (LMs) to behave in specific ways and follow human etiquette. In today’s applications, RL finetuning involves multiple goals due to various human preferences and uses. The multi-objective finetuning (MOFT) is needed to train a multi-objective LM to overcome the limitations of single-objective finetuning (SOFT). For LMs, MOFT has been explored through prompt-based and parameter-based methods. Prompt-based methods finetune an LM by including the reward weightings in the prompt. However, this approach can be less effective in guiding the model and sensitive to how the weightings are presented. Further, zero-shot MOFT might perform badly on intermediate weightings, which are not encountered during training.

The two main techniques to approach multi-reward alignment (or MOFT) are prompt-based, and parameter-based conditioning. Prompt-based conditioning contains approaches like Personalized Soups (PS), which use custom prompts to personalize language models (LMs) based on binary weights for different rewards. Rewarded Soups (RS) offers a zero-shot method by averaging the parameters of LMs trained independently at inference time. A recent paper introduces embedding reward weightings as singular values within the AdaLoRA framework. For KL realignment, decoding time realignment linearly mixes logits between 𝜋ref and another LM learned through SOFT with the minimum KL weight.

A team from Google has proposed a general MOFT framework called Conditional Language Policy (CLP), that uses parameter-space conditioning and multi-task training. This method is more steerable than purely prompt-based techniques because it uses parameter-conditioning from RS. Moreover, CLP produces higher-quality responses than zero-shot methods like RS by finetuning on different reward weightings, while having the same or better steerability. The team conducted a series of experiments and found that CLP outperforms Pareto-dominates RS and is more controllable than prompt-based MOFT. It consistently maintains these advantages in various conditions, including different reward choices and model sizes.

The proposed method CLP, learns a set of parameters that can be processed into a conditioned language model (LM) for any given weighting across rewards and KL, using a parameter-averaging method. The learning algorithm samples a range of weightings to improve its Pareto-front for all weightings at once. This approach includes multi-task learning across different weightings, maximizing the MOFT objective. An automated evaluation with Gemini 1.0 Ultra shows that CLP is more adaptable and generates better responses than existing baselines. The team proposed a new theory showing that zero-shot methods can be nearly Pareto-optimal when optimal policies are aligned for individual reward.

The benchmarking results were obtained for the following setups: Single Reward, Multi KL Regularizer, Two Rewards, Fixed KL Regularizer, and Three Rewards, Fixed KL Regularizer. In the Single Reward, CLP is 2 times more computationally efficient than DeRa during inference because DeRa makes two LM calls per token. The multi-task training helps this method enhance over the zero-shot RS baseline regarding performance. Also, the full-CLP and attn-CLP maintain a more spread-out and steerable Pareto-front compared to logit-CLP and the prompting baseline. In sum, attn-CLP offers a good balance between Pareto-front and steerability while using fewer parameters than current baselines.

This Paper from Google DeepMind Presents Conditioned Language Policies (CLP): A Machine Learning Framework for Finetuning Language Models on Multiple Objectives 1

In this paper, a team from Google introduced Conditional Language Policy (CLP), a flexible framework for MOFT that uses multi-task training and efficient parameter finetuning to create adaptable language models (LMs) that can balance different individual rewards efficiently. The paper includes extensive benchmarking and ablation studies to understand the factors that help develop steerable LMs within the CLP framework. The team also proposed theoretical results to show the working of zero-shot approaches and the need for multi-task training for near-optimal behavior. Future research includes other conditioning mechanisms like soft tokens, automating the tuning of weight sampling distributions, and addressing non-linear reward scalarization.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..

Don’t Forget to join our 47k+ ML SubReddit

Find Upcoming AI Webinars here

Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others…

Source link

What's Hot

No Train, All Gain: Enhancing Deep Frozen Representations with Self-Supervised Gradients

Meta AI Researchers Introduce Mixture-of-Transformers (MoT): A Sparse Multi-Modal Transformer Architecture that Significantly Reduces Pretraining Computational Costs

A Practical Framework for Data Analysis: 6 Essential Principles | by Pararawendy Indarjo | Nov, 2024

This Paper from Google DeepMind Presents Conditioned Language Policies (CLP): A Machine Learning Framework for Finetuning Language Models on Multiple Objectives

Meta AI Researchers Introduce Mixture-of-Transformers (MoT): A Sparse Multi-Modal Transformer Architecture that Significantly Reduces Pretraining Computational Costs

Researchers from Snowflake and CMU Introduce SuffixDecoding: A Novel Model-Free Approach to Accelerating Large Language Model (LLM) Inference through Speculative Decoding

Nous Research Introduces Two New Projects: The Forge Reasoning API Beta and Nous Chat

Leave A Reply Cancel Reply

How ML AI Can Help Businesses Reduce Overhead Costs

How the AI Surge May Help Current WFH Employees

The ultimate contact center automation guide

Top 5AI Development Companies To Transform Your Business | by Amyra Sheldon

No Train, All Gain: Enhancing Deep Frozen Representations with Self-Supervised Gradients

Meta AI Researchers Introduce Mixture-of-Transformers (MoT): A Sparse Multi-Modal Transformer Architecture that Significantly Reduces Pretraining Computational Costs

A Practical Framework for Data Analysis: 6 Essential Principles | by Pararawendy Indarjo | Nov, 2024

How I Created a Data Science Project Following CRISP-DM Lifecycle | by Gustavo Santos | Nov, 2024

Our Picks

No Train, All Gain: Enhancing Deep Frozen Representations with Self-Supervised Gradients

Meta AI Researchers Introduce Mixture-of-Transformers (MoT): A Sparse Multi-Modal Transformer Architecture that Significantly Reduces Pretraining Computational Costs

A Practical Framework for Data Analysis: 6 Essential Principles | by Pararawendy Indarjo | Nov, 2024

What's Hot

This Paper from Google DeepMind Presents Conditioned Language Policies (CLP): A Machine Learning Framework for Finetuning Language Models on Multiple Objectives

Related Posts

Leave A Reply Cancel Reply