Meet LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models

The introduction of Pre-trained Language Models (PLMs) has transformed the field of Natural Language Processing. These models perform well across a wide range of language tasks, including Natural Language Understanding (NLU) and Natural Language Generation (NLG). They typically contain millions or even billions of parameters, and the resulting computational and memory requirements make them difficult to deploy, a challenge widely acknowledged by the research community.

In the paper, the authors introduce a quantization framework called LoRA-Fine-Tuning-aware Quantization (LoftQ). It is designed for pre-trained models that need to be both quantized and fine-tuned with LoRA. LoftQ combines quantization with low-rank approximation so that the two together approximate the original high-precision pre-trained weights.
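To make the idea of a joint approximation concrete, here is a minimal NumPy sketch. It assumes the approximation is built by alternating between quantizing the residual and fitting a low-rank correction with a truncated SVD. The function names (`fake_quantize`, `alternating_init`), the simple uniform quantizer used as a stand-in, and the default number of steps are illustrative assumptions, not code from the paper.

```python
import numpy as np


def fake_quantize(w, bits):
    """Stand-in quantizer (assumption): quantize to 2**bits uniform levels
    over [-absmax, absmax] and immediately dequantize back to floats."""
    levels = 2 ** bits
    absmax = np.max(np.abs(w))
    if absmax == 0:
        return np.zeros_like(w)
    codes = np.round((w / absmax + 1.0) / 2.0 * (levels - 1))
    return (codes / (levels - 1) * 2.0 - 1.0) * absmax


def alternating_init(w, rank, bits, steps=5):
    """Alternate between quantization and a rank-`rank` SVD fit.

    Returns (q, a, b) so that q + a @ b.T approximates w.
    """
    a = np.zeros((w.shape[0], rank))
    b = np.zeros((w.shape[1], rank))
    for _ in range(steps):
        # Quantize the part of w that the low-rank factors do not explain yet.
        q = fake_quantize(w - a @ b.T, bits)
        # Fit the low-rank factors to the quantization residual.
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        sqrt_s = np.sqrt(s[:rank])
        a = u[:, :rank] * sqrt_s
        b = vt[:rank].T * sqrt_s
    return q, a, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 48))
    q, a, b = alternating_init(w, rank=8, bits=2)
    print("quantization only :", np.linalg.norm(w - fake_quantize(w, 2)))
    print("quantized + low-rank:", np.linalg.norm(w - q - a @ b.T))
```

In this sketch the quantized matrix `q` would stay frozen, while `a` and `b` would serve as the starting point for the LoRA adapters, so fine-tuning begins closer to the original weights than with quantization alone.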


Figure: QLoRA performance at different bit widths. Left: QLoRA initialization of LLAMA-2-13b on WikiText-2. Right: QLoRA applied to LLAMA-2-13b on the WikiText-2 language modelling task. Lower perplexity indicates better performance.
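For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The short sketch below uses a hypothetical helper named `perplexity` and made-up probabilities purely for illustration; it shows why a model that assigns higher probability to the correct tokens gets a lower score.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


if __name__ == "__main__":
    # Illustrative numbers only: each list holds the log-probability a model
    # assigned to the correct next token at four positions.
    weaker = [math.log(0.25)] * 4
    stronger = [math.log(0.5)] * 4
    print(perplexity(weaker), perplexity(stronger))
```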

Quantization methods. The authors apply two quantization methods to show that LoftQ is compatible with different quantization functions:

• Uniform quantization is a classic quantization method. For N-bit quantization, it uniformly divides a continuous interval into 2^N categories and stores a local maximum absolute value for dequantization.

• NF4 and its 2-bit variant NF2 are quantization methods used in QLoRA. They assume that the high-precision values are drawn from a Gaussian distribution and map these values to discrete slots that have equal probability.
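The two schemes above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names are hypothetical, a whole array is treated as one block, and the NormalFloat levels are placed at midpoint quantiles of a standard Gaussian. The actual NF4 lookup table used in QLoRA is constructed somewhat differently, so treat this as a sketch of the idea rather than a reimplementation.

```python
from statistics import NormalDist

import numpy as np


def uniform_quantize(block, bits):
    """Map a block to 2**bits evenly spaced levels.

    Returns integer codes and the block's absolute maximum, the only
    extra value needed for dequantization.
    """
    absmax = float(np.max(np.abs(block)))
    levels = 2 ** bits
    if absmax == 0.0:
        return np.zeros(block.shape, dtype=np.int64), absmax
    codes = np.round((block / absmax + 1.0) / 2.0 * (levels - 1))
    return codes.astype(np.int64), absmax


def uniform_dequantize(codes, absmax, bits):
    """Invert uniform_quantize up to rounding error."""
    levels = 2 ** bits
    return (codes / (levels - 1) * 2.0 - 1.0) * absmax


def normal_float_levels(bits):
    """Levels at midpoint quantiles of N(0, 1), rescaled to [-1, 1].

    Each level represents an equal-probability slice of a Gaussian.
    """
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    quantiles = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return quantiles / np.max(np.abs(quantiles))


def normal_float_quantize(block, bits):
    """Assign each value to the nearest Gaussian-quantile level."""
    absmax = float(np.max(np.abs(block)))
    levels = normal_float_levels(bits)
    normalized = block / absmax if absmax > 0 else block
    codes = np.argmin(np.abs(normalized[..., None] - levels), axis=-1)
    return codes, absmax, levels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(1024)
    codes, absmax = uniform_quantize(w, bits=4)
    print("uniform error:", np.abs(w - uniform_dequantize(codes, absmax, 4)).mean())
    codes, absmax, levels = normal_float_quantize(w, bits=4)
    print("NF-style error:", np.abs(w - levels[codes] * absmax).mean())
```

Because the NormalFloat-style levels are denser near zero, where Gaussian-distributed weights concentrate, they spend the limited number of codes where most values actually fall, whereas uniform levels are spread evenly across the interval.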

The authors perform 2-bit and 4-bit quantization on all models, achieving compression ratios of 25-30% and 15-20% at the 4-bit and 2-bit levels, respectively. All experiments are conducted on NVIDIA A100 GPUs.

The framework is evaluated through extensive experiments on a range of downstream tasks, including NLU, question answering, summarization, and NLG. The results show that LoftQ consistently surpasses QLoRA across all precision levels. For example, with 4-bit quantization, it attains gains of 1.1 and 0.8 in ROUGE-1 on XSum and CNN/DailyMail, respectively. As NLP continues to advance, further innovations and optimizations are expected to help bridge the gap between the potential of PLMs and their practical deployment, benefiting a wide range of applications and users.

Check out the Paper. All credit for this research goes to the researchers on this project.


Janhavi Lande is an Engineering Physics graduate from IIT Guwahati, class of 2023. She is an upcoming data scientist and has been working in ML/AI research for the past two years. She is fascinated by this ever-changing world and its constant demand that people keep up with it. In her spare time she enjoys traveling, reading, and writing poems.

