Deployment of Large Language Models (LLMs)
We live in an amazing time for Large Language Models: models like ChatGPT, GPT-4, and Claude can perform an impressive range of tasks, and they are being used to deliver services more efficiently in practically every field, from education and healthcare to the arts and business. Over the past year, many strong open-source Large Language Models, such as Llama, Mistral, Falcon, and Gemma, have been released. These open-source LLMs are available for everyone to use, but deploying them can be very challenging: they can be slow and require a lot of GPU compute power to run in real time. Different tools and approaches have been created to simplify the deployment of Large Language Models.
Many deployment tools have been created for serving LLMs with faster inference, such as vLLM, CTranslate2, TensorRT-LLM, and llama.cpp. Quantization techniques are also used to shrink Large Language Models so that they fit on the available GPUs. In this article, I will explain how to deploy Large Language Models with vLLM and quantization.
Latency and Throughput
Some of the major factors that affect the speed of a Large Language Model are its GPU hardware requirements and its size. The larger the model, the more GPU compute power is required to run it. Common benchmark metrics used to measure the speed of a Large Language Model are latency and throughput.
Latency: This is the time required for a Large Language Model to generate a response. It is usually measured in seconds or milliseconds.
Throughput: This is the number of tokens a Large Language Model generates per second.
Install Required Packages
Below are the two required packages for running a Large Language Model: Hugging Face transformers and accelerate.
pip3 install transformers
pip3 install accelerate
What is Phi-2?
Phi-2 is a state-of-the-art foundation model from Microsoft with 2.7 billion parameters. It was pre-trained on a variety of data sources, ranging from code to textbooks. Learn more about Phi-2 on its Hugging Face model card.
Benchmarking LLM Latency and Throughput with Hugging Face Transformers
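A minimal sketch of such a benchmark, assuming the microsoft/phi-2 checkpoint and a CUDA-enabled GPU, could look like this:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Phi-2 and its tokenizer (assumed checkpoint: microsoft/phi-2)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

prompt = "Generate a python code that accepts a list of numbers and returns the sum."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response and measure the latency
start = time.time()
output = model.generate(**inputs, max_new_tokens=100)
latency = time.time() - start

# Throughput = number of generated tokens divided by the latency
generated_tokens = output.shape[1] - inputs["input_ids"].shape[1]
throughput = generated_tokens / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(tokenizer.decode(output[0], skip_special_tokens=True))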
Generated Output
Latency: 2.739394464492798 seconds
Throughput: 32.36171766303386 tokens/second
Generate a python code that accepts a list of numbers and returns the sum. [1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total
print(sum_list([1, 2, 3, 4, 5]))
Step By Step Code Breakdown
Line 6–10: Loaded the Phi-2 model and tokenized the prompt “Generate a python code that accepts a list of numbers and returns the sum.”
Line 12–18: Generated a response from the model and obtained the latency by measuring the time required to generate the response.
Line 21–23: Obtained the total length of tokens in the response generated, divided it by the latency and calculated the throughput.
This model was run on an A1000 (16GB GPU), and it achieves a latency of 2.7 seconds and a throughput of 32 tokens/second.
What is vLLM?
vLLM is an open-source library for serving Large Language Models at low latency and high throughput.
How vLLM works
The transformer is the building block of Large Language Models. The transformer network relies on the attention mechanism to study and understand the context of words. The attention computation involves matrices known as attention keys and values, and the memory used to store and access these keys and values during generation affects the speed of the model. vLLM introduced a new attention implementation called PagedAttention that efficiently manages the allocation of memory for the transformer’s attention keys and values during token generation. This memory efficiency is what allows vLLM to run Large Language Models at low latency and high throughput.
This is a high-level explanation of how vLLM works. To learn more in-depth technical details, visit the vLLM documentation.
Install vLLM
pip3 install vllm==0.3.3
Run Phi-2 with vLLM
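A minimal sketch of running Phi-2 with vLLM and measuring the same metrics, again assuming the microsoft/phi-2 checkpoint, could look like this:

import time
from vllm import LLM, SamplingParams

# Load Phi-2 with vLLM (assumed checkpoint: microsoft/phi-2)
llm = LLM(model="microsoft/phi-2")
prompt = "Generate a python code that accepts a list of numbers and returns the sum."
sampling_params = SamplingParams(temperature=0, max_tokens=100)

# Generate a response and measure the latency
start = time.time()
outputs = llm.generate([prompt], sampling_params)
latency = time.time() - start

# Throughput = number of generated tokens divided by the latency
completion = outputs[0].outputs[0]
throughput = len(completion.token_ids) / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(completion.text)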
Generated Output
Latency: 1.218436622619629 seconds
Throughput: 63.15334836428132 tokens/second
[1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total
numbers = [1, 2, 3, 4, 5]
print(sum_list(numbers))
Step By Step Code Breakdown
Line 1–3: Imported required packages from vLLM for running Phi-2.
Line 5–8: Loaded Phi-2 with vLLM, defined the prompt and set important parameters for running the model.
Line 10–16: Generated the model’s response using llm.generate and computed the latency.
Line 19–21: Obtained the total number of tokens generated in the response and divided it by the latency to get the throughput.
Line 23–24: Obtained the generated text.
I ran Phi-2 with vLLM on the same prompt, “Generate a python code that accepts a list of numbers and returns the sum.” On the same GPU, an A1000 (16GB GPU), vLLM produces a latency of 1.2 seconds and a throughput of 63 tokens/second, compared to Hugging Face transformers’ latency of 2.7 seconds and throughput of 32 tokens/second. Running a Large Language Model with vLLM produces the same accurate result as using Hugging Face, with much lower latency and higher throughput.
Note: The metrics (latency and throughput) I obtained for vLLM are estimated benchmarks of vLLM performance. The model’s generation speed depends on many factors, such as the length of the input prompt and the size of the GPU. According to the official vLLM report, running an LLM on a powerful GPU like the A100 in a production setting with vLLM achieves 24x higher throughput than Hugging Face Transformers.
Benchmarking Latency and Throughput in Real Time
The way I calculated the latency and throughput for running Phi-2 is experimental, and I did this to explain how vLLM accelerates a Large Language Model’s performance. In the real-world use case of LLMs, such as a chat-based system where the model outputs a token as it is generated, measuring the latency and throughput is more complex.
A chat-based system is based on streaming output tokens. Some of the major factors that affect the LLM metrics are Time to First Token (the time required for a model to generate the first token), Time Per Output Token (the time spent on each subsequent output token), the input sequence length, the total expected output tokens, and the model size. In a chat-based system, the latency is usually a combination of Time to First Token and Time Per Output Token multiplied by the total expected output tokens.
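As a rough worked example with purely hypothetical numbers (real values depend on the model, GPU, and prompt), suppose Time to First Token is 0.2 seconds, Time Per Output Token is 30 milliseconds, and the response is 200 tokens long:

# Hypothetical values, for illustration only
time_to_first_token = 0.2       # seconds
time_per_output_token = 0.03    # seconds per generated token
expected_output_tokens = 200

latency = time_to_first_token + time_per_output_token * expected_output_tokens
throughput = expected_output_tokens / latency
print(f"Estimated latency: {latency:.1f} seconds")              # 6.2 seconds
print(f"Estimated throughput: {throughput:.1f} tokens/second")  # about 32.3 tokens/second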
The longer the input sequence passed into a model, the slower the response. Some of the approaches used to run LLMs in real time involve batching users’ input requests or prompts so that inference is performed on them concurrently, which helps improve the throughput. Generally, using a powerful GPU and serving LLMs with efficient tools like vLLM improves both latency and throughput in real time.
Run the vLLM deployment on Google Colab
What is Quantization?
Quantization is the conversion of a machine learning model from a higher precision to a lower precision by shrinking the model’s weights into smaller bits, usually 8-bit or 4-bit. Deployment tools like vLLM are very useful for serving Large Language Models at very low latency and high throughput, but the model still has to fit into GPU memory. We are able to run Phi-2 with Hugging Face and vLLM conveniently on the T4 GPU on Google Colab because it is a smaller LLM with 2.7 billion parameters; a 7-billion-parameter model like Mistral 7B, on the other hand, cannot be run on Colab’s T4 with either Hugging Face or vLLM without quantization. When GPU memory is limited and we need to run a very large model, quantization is the best approach for loading LLMs on constrained devices.
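To see why model size matters, here is a rough back-of-the-envelope estimate of the memory needed just to hold the weights of a 7-billion-parameter model at different precisions (activations and the KV cache need additional memory on top of this):

# Approximate GPU memory required for the weights of a 7B-parameter model
params = 7_000_000_000
for precision, bytes_per_param in [("float16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.1f} GB")
# float16: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
# Only the quantized versions leave comfortable headroom on a 16GB T4.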
BitsandBytes
BitsandBytes is a Python library that provides custom quantization functions for shrinking a model’s weights to lower precision (8-bit and 4-bit).
Install BitsandBytes
pip3 install bitsandbytes
Quantization of Mistral 7B Model
Mistral 7B, a 7-billion-parameter model from Mistral AI, is one of the best state-of-the-art open-source Large Language Models. I will go through a step-by-step process of running Mistral 7B with different quantization techniques that can be run on the T4 GPU on Google Colab.
Quantization with 8-bit Precision: This is the conversion of a machine learning model’s weights to 8-bit precision. BitsandBytes has been integrated with Hugging Face transformers, so a language model can be loaded with the same Hugging Face code, with minor modifications for quantization.
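A minimal sketch of 8-bit loading, assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint, could look like this:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint

# Load the model's weights in 8-bit precision
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # automatically place layers on the available GPU memory
)
tokenizer = AutoTokenizer.from_pretrained(model_id)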
Line 1: Imported the packages needed for running the model, including the BitsAndBytesConfig class.
Line 3–4: Defined the quantization config and set the parameter load_in_8bit to True for loading the model’s weights in 8-bit precision.
Line 7–9: Passed the quantization config into the function for loading the model and set the parameter device_map to automatically allocate appropriate GPU memory for loading the model. Finally, loaded the tokenizer.
Quantization with 4-bit Precision: This is the conversion of a machine learning model’s weights to 4-bit precision.
The code for loading Mistral 7B in 4-bit precision is similar to that of 8-bit precision except for a few changes:
- changed load_in_8bit to load_in_4bit.
- A new parameter, bnb_4bit_compute_dtype, is introduced into the BitsAndBytesConfig to perform the model’s computations in bfloat16. bfloat16 is the data type used for computations on the quantized weights and gives faster inference; note that this compute-dtype setting applies to 4-bit loading. The resulting configuration is sketched below.
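A minimal sketch of this 4-bit configuration, using the same assumed checkpoint, could look like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # perform computations in bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)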
NF4 (4-bit Normal Float) and Double Quantization
NF4 (4-bit Normal Float), introduced in QLoRA, is an optimized quantization data type that yields better results than standard 4-bit quantization. It is combined with double quantization, in which the quantization constants produced by the first quantization step are themselves quantized a second time, further reducing the memory footprint of the model’s weights. According to the QLoRA paper, NF4 with double quantization does not suffer from a drop in accuracy performance. Read more in-depth technical details about NF4 and double quantization in the QLoRA paper.
Line 4–9: Extra parameters were set in the BitsAndBytesConfig:
- load_in_4bit: loading the model in 4-bit precision is set to True.
- bnb_4bit_quant_type: The type of quantization is set to nf4.
- bnb_4bit_use_double_quant: Double quantization is set to True.
- bnb_4bit_compute_dtype: the bfloat16 computation data type is used for faster inference.
Line 11–13: Loaded the model’s weights and tokenizer.
Full Code for Model Quantization
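A minimal end-to-end sketch, assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint, that loads the model with NF4 double quantization and generates a response could look like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # use the NF4 data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants as well
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute for faster inference
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Mistral Instruct expects the [INST] ... [/INST] chat format
prompt = "[INST] What is Natural Language Processing? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0]))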
Generated Output
<s> [INST] What is Natural Language Processing? [/INST] Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and
computer science that deals with the interaction between computers and human language. Its main objective is to read, decipher,
understand, and make sense of the human language in a valuable way. It can be used for various tasks such as speech recognition,
text-to-speech synthesis, sentiment analysis, machine translation, part-of-speech tagging, name entity recognition,
summarization, and question-answering systems. NLP technology allows machines to recognize, understand,
and respond to human language in a more natural and intuitive way, making interactions more accessible and efficient.</s>
Quantization is a very good approach for running very large Language Models on smaller GPUs and can be applied to any model, such as Llama 70B, Falcon 40B, and MPT-30B. According to the LLM.int8() paper, very large Language Models suffer less from accuracy drops when quantized than smaller ones. Quantization is best applied to very large Language Models; it does not work as well for smaller models because of the loss in accuracy performance.
Run Mistral 7B Quantization on Google Colab
Conclusion
In this article, I provided a step-by-step approach to measuring the speed performance of a Large Language Model, explained how vLLM works, and how it can be used to improve the latency and throughput of a Large Language Model. Finally, I explained quantization and how it is used to load Large Language Models on small-scale GPUs.
Reach out to me via:
Email: olafenwaayoola@gmail.com
Linkedin: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/
References