Training a large language model requires significant computational resources, such as powerful GPUs, TPUs, and other specialized AI accelerators, which can be expensive to acquire and maintain. Gathering and preparing the vast amounts of data needed to train large language models is also costly and time-consuming, since high-quality, diverse, and representative datasets are essential for model performance.
Training large language models can take weeks or even months, depending on the model's size and complexity, and the resulting models remain computationally expensive at inference time. Sparsity is a natural approach to reducing this inference cost, but existing methods either require costly retraining or do not yield wall-clock time speedup on modern hardware. The researchers instead focus on contextual sparsity: small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input.
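To make the idea concrete, here is a minimal sketch (illustrative only, not the authors' code) of contextual sparsity in a ReLU MLP block: for a given input, only the neurons that actually activate contribute to the output, so computing just that input-dependent subset reproduces the dense result. The layer sizes here are arbitrary, and in a trained LLM the active fraction is typically far smaller than in this random example.

```python
# Minimal sketch of contextual sparsity in an MLP block (toy sizes, random weights).
import torch

torch.manual_seed(0)
d_model, d_ff = 64, 256                      # hypothetical layer sizes
W1 = torch.randn(d_ff, d_model) / d_model ** 0.5
W2 = torch.randn(d_model, d_ff) / d_ff ** 0.5
x = torch.randn(d_model)                     # one token's hidden state

# Dense MLP block: W2 @ relu(W1 @ x)
h = torch.relu(W1 @ x)
dense_out = W2 @ h

# Input-dependent (contextual) subset: keep only the neurons that fire for this x.
active = h > 0                               # which neurons are active depends on the input
sparse_out = W2[:, active] @ h[active]       # compute with the small subset only

print(f"active neurons: {active.sum().item()}/{d_ff}")
print("max |dense - sparse| =", (dense_out - sparse_out).abs().max().item())  # ~0
```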
They hypothesize that contextual sparsity exists and that, when it is accurately predicted, it can speed up LLM inference in wall-clock time without compromising the LLM's quality or in-context learning ability. They propose DEJAVU, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given the inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference.
Even if contextual sparsity exists, it is challenging to predict the sparsity for a given input in advance. It is also nontrivial to verify that such contextual sparsity exists in the first place, and naive verification can be prohibitively expensive; finally, it can be difficult to translate predicted sparsity into an end-to-end wall-clock time speedup. The team verified the existence of such sparsity with a simple approach. Contextual sparsity depends not only on individual input tokens but also on their interactions, so sparsity can be predicted accurately only from token embeddings that carry sufficient contextual information.
The contextual sparsity in the MLP block can be identified after computing the activations. However, this only demonstrates that contextual sparsity exists; it brings no efficiency benefit on its own. A fast and precise prediction is needed to exploit contextual sparsity for end-to-end efficiency.
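A rough sketch of this post-hoc verification, using assumed toy dimensions rather than a real model: the dense activations are computed in full, and only then can one record which neurons fired. This confirms the sparsity pattern exists but pays the full compute cost, which is exactly why a cheap predictor is needed.

```python
# Post-hoc verification of MLP contextual sparsity (toy sizes, random weights).
import torch

torch.manual_seed(0)
d_model, d_ff, n_inputs = 64, 256, 512
W1 = torch.randn(d_ff, d_model) / d_model ** 0.5
X = torch.randn(n_inputs, d_model)           # a batch of hidden states

H = torch.relu(X @ W1.T)                     # dense activations: full cost already paid
active_frac = (H > 0).float().mean().item()  # fraction of neurons used per input
print(f"average fraction of active MLP neurons: {active_frac:.2%}")
# In a trained LLM this fraction is reported to be small, but learning it only
# after computing H yields no wall-clock savings.
```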
DEJAVU uses lookahead predictors to side-step prediction costs. Given the input to the attention layer at block k, the predictor asynchronously estimates the contextual sparsity of the MLP at block k and passes that information to the MLP; given the input to the MLP at block k, it then predicts the sparsity of the attention heads at the next block. The researchers also claim that contextual sparsity can be accurately predicted with lightweight learning-based algorithms.
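The sketch below shows the general shape of such a lookahead predictor: a small learned classifier scores the MLP neurons from the layer input, and the main model gathers only the predicted-active rows and columns. The two-layer predictor form, the dimensions, and the top-k selection are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical DEJAVU-style sparsity predictor sketch (toy sizes, random weights).
import torch
import torch.nn as nn

d_model, d_ff, rank, k_keep = 64, 256, 16, 32

class SparsityPredictor(nn.Module):
    """Low-cost predictor: a narrow two-layer network scoring each MLP neuron."""
    def __init__(self, d_model, d_ff, rank):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_model, rank), nn.ReLU(),
                                  nn.Linear(rank, d_ff))

    def forward(self, x):
        return self.proj(x)                    # one score per MLP neuron

predictor = SparsityPredictor(d_model, d_ff, rank)
x = torch.randn(1, d_model)                    # input to block k (one token)

scores = predictor(x)                          # cheap relative to the full MLP
keep = scores.topk(k_keep, dim=-1).indices.squeeze(0)  # predicted-active neurons

# The main model then touches only these rows/columns of the MLP weights:
W1 = torch.randn(d_ff, d_model) / d_model ** 0.5
W2 = torch.randn(d_model, d_ff) / d_ff ** 0.5
h = torch.relu(x @ W1[keep].T)                 # (1, k_keep)
out = h @ W2[:, keep].T                        # approximate MLP output, (1, d_model)
print(out.shape)
```

In practice the predictor would be trained against the post-hoc activation labels from the verification step, so that its top-k selection matches the neurons the dense model would actually use.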
The researchers find that DEJAVU reduces token-generation latency by over 2x compared to the state-of-the-art FasterTransformer and by over 6x compared to the Hugging Face implementation, with no accuracy loss. The MLP sparse predictor introduces no accuracy loss on either zero-shot tasks or language modeling, and during training it reaches high validation accuracy.
Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project.
Arshad is an intern at MarktechPost. He is currently pursuing his Int. MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive technological advances, and he is passionate about understanding nature fundamentally with the help of tools like mathematical models, ML models, and AI.