Quantizing Large Language Models (LLMs) is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it has recently been directly integrated into the transformers library.
ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to new kernels, it’s optimized for (blazingly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility to how weights are stored.
In this article, we will see how to quantize base models in the EXL2 format and how to run them. As usual, the code is available on GitHub and Google Colab.
To start our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, which is why we will install it from source as follows:
git clone https://github.com/turboderp/exllamav2
pip install ./exllamav2
Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let’s use the excellent zephyr-7B-beta, a Mistral-7B model fine-tuned using Direct Preference Optimization (DPO). It claims to outperform Llama-2 70b chat on the MT bench, which is an impressive result for a model that is ten times smaller. You can try out the base Zephyr model using this space.
We download zephyr-7B-beta into a base_model directory, the name used in the rest of this article, with the following commands (this can take a while since the model is about 15 GB):
git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta base_model
GPTQ also requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download the test file as follows:
wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
Once it’s done, we can leverage the convert.py script provided by the ExLlamaV2 library. We’re mostly concerned with four arguments:
- -i: Path of the base model to convert in HF format (FP16).
- -o: Path of the working directory with temporary files and final output.
- -c: Path of the calibration dataset (in Parquet format).
- -b: Target average number of bits per weight (bpw). For example, 4.0 bpw will store weights in 4-bit precision on average.
The complete list of arguments is available on this page. Let’s start the quantization process using the convert.py script with the following arguments:
mkdir quant
python exllamav2/convert.py \
-i base_model \
-o quant \
-c wikitext-test.parquet \
-b 5.0
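If you prefer to launch the conversion from a Python script or a notebook cell, the same command can be assembled with the standard library. This is only a sketch: the helper name build_convert_command is hypothetical, and it relies solely on the four flags described above.

```python
import subprocess
import sys

def build_convert_command(model_dir, work_dir, calibration_file, bpw,
                          script="exllamav2/convert.py"):
    """Assemble the convert.py command line (hypothetical helper)."""
    return [
        sys.executable, script,
        "-i", str(model_dir),         # base model in HF format (FP16)
        "-o", str(work_dir),          # working directory and final output
        "-c", str(calibration_file),  # calibration dataset (Parquet)
        "-b", str(bpw),               # target average bits per weight
    ]

cmd = build_convert_command("base_model", "quant", "wikitext-test.parquet", 5.0)
print(" ".join(cmd))

# Uncomment to actually start the quantization (requires a GPU):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if your paths contain spaces.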
Note that you will need a GPU to quantize this model. The official documentation specifies that you need approximately 8 GB of VRAM for a 7B model, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.
Under the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the precision of the weights while minimizing the impact on the output. You can find more details about the GPTQ algorithm in this article.
So why are we using the “EXL2” format instead of the regular GPTQ format? EXL2 comes with a few new features:
- It supports different levels of quantization: it’s not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.
- It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.
ExLlamaV2 uses this additional flexibility during quantization. It tries different quantization parameters and measures the error they introduce. On top of trying to minimize the error, ExLlamaV2 also has to achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5 for example.
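To make this trade-off concrete, here is a toy sketch of the idea: choose one option per layer so that the total error is as low as possible while the average bpw stays within the target. It is not ExLlamaV2's actual optimizer. The error values are invented, the layers are assumed to have the same number of weights, and the search is a plain brute force that only works for tiny examples.

```python
from itertools import product

# Hypothetical measurements: for each layer, candidate (bpw, error) pairs.
# The numbers are invented for illustration, and the layers are assumed to
# hold the same number of weights, so a plain mean gives the average bpw.
measurements = {
    "layer_a": [(2.2, 0.004), (4.1, 0.002), (6.1, 0.001)],  # robust layer
    "layer_b": [(2.2, 0.030), (4.1, 0.010), (6.1, 0.002)],  # sensitive layer
}

def pick_options(measurements, target_bpw):
    """Brute-force the combination with the lowest total error within budget."""
    names = list(measurements)
    best_err, best_choice = None, None
    for combo in product(*(measurements[name] for name in names)):
        avg_bpw = sum(bpw for bpw, _ in combo) / len(combo)
        if avg_bpw > target_bpw:
            continue  # this combination exceeds the bit budget
        total_err = sum(err for _, err in combo)
        if best_err is None or total_err < best_err:
            best_err, best_choice = total_err, dict(zip(names, combo))
    return best_choice

choice = pick_options(measurements, target_bpw=4.2)
print(choice)
```

With a 4.2 bpw budget, this toy example gives 2.2 bpw to the robust layer and 6.1 bpw to the sensitive one, which yields a lower total error than storing both at 4.1 bpw.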
The benchmark of the different parameters tried by ExLlamaV2 is saved in the measurement.json file. The following JSON shows the measurement for one layer:
"key": "model.layers.0.self_attn.q_proj",
"numel": 16777216,
"options": [
{
"desc": "0.05:3b/0.95:2b 32g s4",
"bpw": 2.1878662109375,
"total_bits": 36706304.0,
"err": 0.011161142960190773,
"qparams": {
"group_size": 32,
"bits": [
3,
2
],
"bits_prop": [
0.05,
0.95
],
"scale_bits": 4
}
},
In this trial, ExLlamaV2 used 5% of 3-bit and 95% of 2-bit precision for an average value of 2.188 bpw and a group size of 32. This introduced a noticeable error that is taken into account to select the best parameters.
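The numbers in this entry can be checked with a few lines of arithmetic. The values below are copied from the JSON above; the interpretation of the gap between the reported bpw and the weight-only average as per-group metadata (such as scales) is my assumption, not something stated in the file.

```python
# Values copied from the measurement.json entry above.
numel = 16777216          # number of weights in q_proj (4096 x 4096)
total_bits = 36706304.0
bits = [3, 2]
bits_prop = [0.05, 0.95]

# The reported bpw is the total storage divided by the number of weights.
bpw = total_bits / numel

# Average precision of the weights alone, from the 5% / 95% mix.
weight_bits = sum(b * p for b, p in zip(bits, bits_prop))

# The remainder is presumably per-group metadata such as scales (assumption).
overhead = bpw - weight_bits

print(f"bpw={bpw:.4f}, weights only={weight_bits:.2f}, overhead={overhead:.4f}")
```

The weights alone average 2.05 bits, so about 0.14 bit per weight in the reported 2.188 bpw comes from something other than the quantized values themselves.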
Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy essential config files from the base_model directory to the new quant directory. Basically, we want every file except hidden files (.*) and safetensors files. Additionally, we don’t need the out_tensor directory that was created by ExLlamaV2 during quantization.
In bash, you can implement this as follows:
rm -rf quant/out_tensor
rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
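If rsync is not available on your machine, the same filtering can be approximated in Python with the standard library. This is a sketch and not a drop-in replacement for rsync: copy_config_files is a hypothetical helper that skips hidden entries and safetensors files, as described above.

```python
import shutil
from pathlib import Path

def copy_config_files(src, dst):
    """Copy everything except hidden entries and *.safetensors files."""
    src, dst = Path(src), Path(dst)
    copied = []
    for path in sorted(src.rglob("*")):
        rel = path.relative_to(src)
        if any(part.startswith(".") for part in rel.parts):
            continue  # hidden file, or inside a hidden directory such as .git
        if path.is_dir() or path.suffix == ".safetensors":
            continue  # directories are created on demand, weights are skipped
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(rel.as_posix())
    return copied

if Path("base_model").is_dir():
    # Remove the temporary tensors, then copy the config files.
    shutil.rmtree("quant/out_tensor", ignore_errors=True)
    print(copy_config_files("base_model", "quant"))
```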
Our EXL2 model is ready and we have several options to run it. The most straightforward method consists of using the test_inference.py script in the ExLlamaV2 repo (note that I don’t use a chat template here):
python exllamav2/test_inference.py -m quant/ -p "I have a dream"
The generation is very fast (56.44 tokens/second on a T4 GPU), even compared to other quantization techniques and tools like GGUF/llama.cpp or GPTQ. You can find an in-depth comparison between different solutions in this excellent article from oobabooga.
In my case, the LLM returned the following output:
-- Model: quant/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...
I have a dream. <|user|>
Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable!
Absolutely! Here's your updated speech:
Dear fellow citizens,
Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors
-- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)
Alternatively, you can use a chat version with the chatcode.py script for more flexibility:
python exllamav2/examples/chatcode.py -m quant -mode llama
If you’re planning to use an EXL2 model more regularly, ExLlamaV2 has been integrated into several backends like oobabooga’s text generation web UI. Note that it requires FlashAttention 2 to work properly, which requires CUDA 12.1 on Windows at the moment (something you can configure during the installation process).
Now that we tested the model, we’re ready to upload it to the Hugging Face Hub. You can change the name of your repo in the following code snippet and simply run it.
from huggingface_hub import notebook_login
from huggingface_hub import HfApi
# Log in to the Hugging Face Hub (asks for an access token)
notebook_login()
api = HfApi()
# Create the model repo, then upload the content of the quant directory
api.create_repo(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    repo_type="model"
)
api.upload_folder(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    folder_path="quant",
)
Great, the model can be found on the Hugging Face Hub. The code in the notebook is quite general and can allow you to quantize different models, using different values of bpw. This is ideal for creating models dedicated to your hardware.
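To choose a bpw value for your hardware, a back-of-the-envelope estimate of the weight size can help. The sketch below assumes roughly 7.24 billion parameters for a Mistral-7B-based model and only counts the weights: the cache and activations used during inference need additional VRAM, so treat the result as a lower bound.

```python
def estimate_weights_gb(num_params, bpw):
    """Approximate size of the weights in GB (10^9 bytes) at a given bpw."""
    return num_params * bpw / 8 / 1e9

# Assumption: roughly 7.24 billion parameters for a Mistral-7B-based model.
NUM_PARAMS = 7.24e9

for bpw in (16.0, 8.0, 5.0, 4.0, 3.0):
    print(f"{bpw:>4} bpw -> ~{estimate_weights_gb(NUM_PARAMS, bpw):.1f} GB")
```

At 16 bits per weight, this gives about 14.5 GB, in line with the size of the FP16 download, while the 5.0 bpw version needs about 4.5 GB for its weights.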
In this article, we presented ExLlamaV2, a powerful library to quantize LLMs. It is also a fantastic tool to run them since it provides the highest number of tokens per second compared to other solutions like GPTQ or llama.cpp. We applied it to the zephyr-7B-beta model to create a 5.0 bpw version of it, using the new EXL2 format. After quantization, we tested our model to see how it performs. Finally, it was uploaded to the Hugging Face Hub and can be found here.
If you’re interested in more technical content around LLMs, follow me on Medium.