State-of-the-art large language models (LLMs) are pre-trained with billions of parameters. While pre-trained LLMs can already perform many tasks, their performance improves significantly once they are fine-tuned.
Thanks to LoRA, fine-tuning costs can be dramatically reduced. LoRA adds low-rank tensors, i.e., a small number of parameters (millions), on top of the original parameters, which remain frozen. Only the parameters of the added tensors are trained during fine-tuning.
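To make the idea concrete, here is a minimal sketch of a LoRA-augmented linear layer in PyTorch. This is my own illustration, not the official implementation; the rank and alpha values are arbitrary choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer augmented with a trainable low-rank update (sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        # Low-rank factors: the effective weight is W + (alpha / rank) * B @ A
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as W
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: wrap a 4096x4096 projection; only the low-rank factors are trainable
layer = LoRALinear(nn.Linear(4096, 4096, bias=False))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 4096 = 65536
```

With a rank of 8, a 4096×4096 projection gains only about 65K trainable parameters, against the ~16.8M frozen parameters of the original matrix.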
LoRA still requires the full model to be loaded in memory. To reduce the memory cost and speed up fine-tuning, a recent approach proposes quantization-aware LoRA (QA-LoRA) fine-tuning.
In this article, I explain QA-LoRA and review its performance compared with previous work (especially QLoRA). I also show how to use QA-LoRA to fine-tune a quantization-aware LoRA adapter for Llama 2.
Fine-tuning a LoRA adapter on top of a quantized LLM is already possible with QLoRA. In my previous articles, I used it many times to fine-tune LLMs, for instance Llama 2 and GPT-NeoX, on my desktop computer or with the free instance of Google Colab.
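For reference, this is roughly how a QLoRA fine-tuning run is set up with Hugging Face transformers, bitsandbytes, and PEFT. This is a sketch: the model name, the targeted modules, and the LoRA hyperparameters are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; requires access to the gated repo

# Load the base model quantized to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach trainable LoRA adapters on top of the frozen, quantized weights
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```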
Before delving into QA-LoRA, it is useful to understand the current limits of QLoRA.
The NormalFloat4 (NF4) Quantization
LLM quantization algorithms usually quantize parameters to 4-bit precision using the INT4 data type. Computation with INT4 is increasingly well optimized on recent GPUs.
QLoRA doesn’t use INT4 by default but another data type called NormalFloat4 (NF4). You can see it as a compressed float number. According to the authors of QLoRA, NF4 is superior to INT4: LLMs quantized with NF4 achieve a lower perplexity.
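To see in what sense NF4 is a "compressed float", here is a sketch of block-wise NF4 quantization: each block of weights is scaled by its absolute maximum, then each value is mapped to the nearest of 16 fixed levels derived from the quantiles of a normal distribution. The levels below are, to the best of my knowledge, those hardcoded in bitsandbytes; the block size and bit packing are simplified.

```python
import torch

# The 16 NF4 levels: quantiles of a standard normal distribution,
# normalized to [-1, 1] (values as hardcoded in bitsandbytes).
NF4_LEVELS = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def quantize_nf4(block: torch.Tensor):
    """Quantize one block of weights to NF4 indices plus a per-block scale."""
    absmax = block.abs().max()       # per-block scale, stored in higher precision
    normalized = block / absmax      # now in [-1, 1]
    # Nearest-level search; a real kernel packs two 4-bit indices per byte
    indices = (normalized.unsqueeze(-1) - NF4_LEVELS).abs().argmin(dim=-1)
    return indices, absmax

def dequantize_nf4(indices: torch.Tensor, absmax: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate weights from NF4 indices and the block scale."""
    return NF4_LEVELS[indices] * absmax

# Round-trip a small block and measure the quantization error
w = torch.randn(64)
idx, scale = quantize_nf4(w)
print((w - dequantize_nf4(idx, scale)).abs().mean())
```

Because the levels are denser around zero, where most normally distributed weights fall, NF4 wastes fewer of its 16 codes than the uniformly spaced levels of INT4.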
However, NF4 computation is not optimal for fast inference. This is one of the reasons why…