A significant challenge in deploying large language models (LLMs) and large vision models (LVMs) is balancing low inference overhead with the ability to switch adapters rapidly. Traditional methods such as Low Rank Adaptation (LoRA) either fuse adapter parameters into the base model weights, losing rapid switching capability, or keep adapter parameters separate, incurring significant latency. Existing methods also struggle with concept loss when multiple adapters are used concurrently. Addressing these issues is critical for deploying AI models in resource-constrained environments such as mobile devices and for ensuring robust performance across diverse applications.
LoRA and its variants are the primary techniques used to adapt large generative models. LoRA is favored for its efficiency during training and inference, but because its low-rank update becomes a dense matrix once fused, it alters a large fraction of the base model’s weights, making rapid adapter switching costly in memory and latency. In unfused mode, LoRA incurs up to 30% higher inference latency. Furthermore, LoRA suffers from concept loss in multi-adapter settings, where different adapters overwrite each other’s influence and degrade the model’s output. Sparse adaptation techniques have been explored, but they often require complex implementations and do not fully address rapid switching and concept retention.
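To make the trade-off concrete, here is a minimal PyTorch sketch (illustrative only, not the paper’s code, with hypothetical rank and scaling values) of the two LoRA inference modes: the unfused path adds extra matrix multiplications to every forward pass, while fusing folds the dense low-rank product into the entire base weight matrix, which is what makes swapping adapters slow.

```python
# Minimal LoRA sketch (illustrative, not the paper's code); rank and alpha
# are hypothetical values.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.scale = alpha / rank
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        # Unfused mode: two extra matmuls per call, which is where the
        # reported inference-latency overhead comes from.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

    @torch.no_grad()
    def fuse(self):
        # Fused mode: B @ A is a dense (out_features x in_features) matrix,
        # so folding it in touches every entry of the base weight. Swapping
        # adapters then means rewriting the whole matrix.
        self.base.weight += (self.lora_B @ self.lora_A) * self.scale
```

Because `fuse()` rewrites the whole weight tensor, switching from one fused LoRA adapter to another means undoing and reapplying a dense update for every adapted layer.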
Researchers from Qualcomm AI propose Sparse High Rank Adapters (SHiRA), a highly sparse adapter framework that modifies only 1-2% of the base model’s weights. This framework enables rapid switching by minimizing the number of weights that need to be updated and mitigates concept loss through its sparse structure. The researchers leverage gradient masking to update only the most critical weights during training, maintaining high performance with minimal parameter changes. SHiRA’s design ensures that it remains lightweight and efficient, making it suitable for deployment on mobile devices and other resource-constrained environments.
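The idea can be illustrated with a short, hedged sketch, using standard PyTorch gradient hooks rather than Qualcomm’s actual implementation: a fixed binary mask zeroes out gradients everywhere except a small set of selected entries, so only that 1-2% of the weights ever changes during finetuning.

```python
# Gradient-masking sketch (an illustration under assumptions, not Qualcomm's
# implementation): only entries selected by a sparse mask receive updates.
import torch

def make_trainable_mask(param: torch.nn.Parameter, sparsity: float = 0.98):
    # Hypothetical selection rule: keep the largest-magnitude weights
    # trainable. SHiRA also studies random, gradient-magnitude, and
    # SNIP-based selection.
    k = max(1, int(param.numel() * (1.0 - sparsity)))
    threshold = param.detach().abs().flatten().topk(k).values.min()
    mask = (param.detach().abs() >= threshold).to(param.dtype)
    # Zero the gradient of frozen entries on every backward pass, so the
    # optimizer can only move the ~1-2% of weights where mask == 1.
    param.register_hook(lambda grad: grad * mask)
    return mask
```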
The SHiRA framework is implemented with a gradient masking technique in which a sparse mask determines which weights are trainable. The masks can be created in several ways, including random selection, weight magnitude, gradient magnitude, and SNIP (single-shot network pruning based on connection sensitivity). Adapters can be switched rapidly by storing only the non-zero weights and their indices and applying them with efficient scatter operations at inference time. The researchers also provide a memory- and latency-efficient implementation based on the Parameter-Efficient Fine-Tuning (PEFT) library, which reduces peak GPU memory usage by 16% compared to standard LoRA and trains at nearly the same speed.
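A minimal sketch of this switching mechanism, assuming plain PyTorch tensors rather than the PEFT integration itself, might look as follows: the adapter is stored as flat indices plus values, and an in-place scatter writes them into the fused weight.

```python
# Rapid-switching sketch (plain PyTorch, not the PEFT integration): the
# adapter is just the changed values plus their flat indices.
import torch

def extract_adapter(base_w: torch.Tensor, finetuned_w: torch.Tensor):
    delta = (finetuned_w - base_w).detach()
    idx = delta.flatten().nonzero(as_tuple=True)[0]          # ~1-2% of entries
    return idx, finetuned_w.detach().flatten()[idx].clone()  # indices + values

@torch.no_grad()
def switch_adapter(weight: torch.Tensor, idx: torch.Tensor, values: torch.Tensor):
    # In-place scatter rewrites only the sparse entries, so loading or
    # swapping an adapter never touches the dense bulk of the weight.
    weight.view(-1).scatter_(0, idx, values)
```

In practice one would also keep the base values at those indices so an adapter can be reverted before a different one is scattered in; either way, each swap touches only the sparse entries rather than the full tensor.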
SHiRA demonstrates superior performance in extensive experiments on both LLMs and LVMs. The approach consistently outperforms traditional LoRA methods, achieving up to 2.7% higher accuracy in commonsense reasoning tasks. Additionally, SHiRA maintains high image quality in style transfer tasks, effectively addressing the concept loss issues that plague LoRA. SHiRA achieves significantly higher HPSv2 scores on style transfer datasets, indicating superior image generation quality. By modifying only 1-2% of the base model’s weights, SHiRA ensures rapid adapter switching and minimal inference overhead, making it highly efficient and practical for deployment in resource-constrained environments such as mobile devices.
In conclusion, Sparse High Rank Adapters (SHiRA) represent a significant advancement in adapter techniques for AI models. SHiRA addresses critical challenges of rapid adapter switching and concept loss in multi-adapter settings while maintaining low inference overhead. By modifying only 1-2% of the base model’s weights, this approach offers a practical and efficient solution for deploying large models in resource-constrained environments, thus advancing the field of AI research and deployment.
Check out the Paper. All credit for this research goes to the researchers of this project.