Disclosure: I am a Data Engineer with Singapore’s Government Technology Agency (GovTech) Data Science and Artificial Intelligence Division (DSAID). As one of the key developers working on Analytics.gov, I work with agencies across the entire public sector to develop Data Science and AI/ML capabilities for public good.
- Preamble
- Why use open-source models?
- Blockers for Hosting Open-source LLMs
- What is quantisation and how can it help?
- How do AWS Sagemaker Endpoints work?
- Hosting a Quantised Model in AG Sagemaker
- Benchmarks
- Conclusion
If you haven’t read our previous publications, you can peruse them here!
Analytics.gov (AG), developed by GovTech Singapore’s Data Science and Artificial Intelligence Division (DSAID), is a Central Machine Learning Operations (MLOps) platform that productionises ML and AI use cases for the Whole-of-Government (WOG). Hosted on Government Commercial Cloud (GCC) 2.0, it utilises best-practice network and security configurations to provide a safe and secure environment for all data science and AI needs. Through AG, government officers are able to access compute resources, managed AI services and other utilities directly from their government issued laptops without the need for managing or developing new infrastructure, thereby fast-tracking AI/ML initiatives across the whole of government.
AG provides custom functionalities to create and manage production-ready inference endpoints for quantised models through the capabilities offered by AWS Sagemaker Endpoints. With just a few lines of code, end users can quickly set up their own private inference endpoints for quantised models, reducing what could have taken days or weeks of work into mere minutes. This substantially lowers the barrier of entry for agencies across the whole of government to leverage the power of GenAI with greater efficiency and cost-effectiveness.
In this article, we will explore how AG enables government agencies to run LLMs efficiently and cost-effectively. Our goal is to demystify model quantisation, illustrate how we streamlined the process of hosting quantised open-source LLMs in AWS Sagemaker, and provide benchmarks to gauge the gains in performance and cost-efficiency.
For a brilliant read on Open LLMs, please view Sau Sheong’s publication here! (Note: it’s a Medium member-only story)
I highly recommend it, as it sheds light on hosting open-source LLMs as APIs, providing a great complement to this article.
Security & Sensitivity
Open-source models can be hosted privately on your own devices or cloud environments, meaning that queries to your model do not get sent to third-party providers. This is particularly crucial with government data, as a large majority of it contains sensitive information.
Controlled Output Generation
Usage of open-source models can be controlled at a more granular level. Closed-source models have to be accessed via commercial APIs, which abstract away complexity but reduce the degree of control over the model. Locally hosted open-source models allow for full control over output generation. This matters because many useful libraries, such as LMQL and Guidance, work better with locally hosted models.
Variety
As of writing, there are over 600k models on HuggingFace, ranging from models published by major players such as Meta and Google to variants published by individual contributors. Many variants are fine-tuned for specific purposes or tasks and can be used out of the box, so users can simply reuse them instead of fine-tuning their own.
For example, AI Singapore’s SEA-LION model is instruct-tuned for Southeast Asian (SEA) languages, with a training dataset spanning diverse languages from Malay to Thai. Using this model saves the effort of collecting large datasets in different languages and the computational cost of fine-tuning.
Language models come in many shapes and sizes, with popular models ranging from TinyLlama (1.1B) to the upcoming Llama-3 400B+. While Small Language Models (SLMs) like TinyLlama work well for smaller and more straightforward use cases, complex use cases usually require the “smarter” Large Language Models (LLMs). It goes without saying that all GenAI applications would benefit from the better output quality of larger LLMs; however, the extra size comes with extra tradeoffs.
To maximise the speed of inference, models have to be fully loaded in GPU memory as any movement between disk and GPU memory or CPU and GPU memory would introduce overheads that can substantially slow down inference speeds.
LLMs require massive amounts of memory to host: the bigger the LLM, the more GPU memory is required to host it. Most large models demand multiple GPUs to fully fit in memory, making hosting an extremely resource-intensive and expensive task.
Naturally, as the size of the model increases, more computation is required for each inference task. Consequently, the larger the LLMs, the lower the inference speed.
Just how big are these models?
The size of these LLMs can be estimated with the following formula: model size (in GB) ≈ parameter count (in billions) × bits per weight ÷ 8. (Note: this is a naive estimation, and model sizes are almost always slightly larger.)
Using the formula we can estimate the model size for some popular models:
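A quick sketch of these estimates in code (parameter counts are approximate):
# Naive size estimate: parameters (in billions) x bits per weight / 8 bits per byte = GB
def estimate_size_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * bits_per_weight / 8

models = {
    "Mistral 7B @ FP16": (7.24, 16),      # ~14.5 GB
    "Llama-3 70B @ BF16": (70.6, 16),     # ~141.2 GB
    "Llama-3 400B+ @ BF16": (400.0, 16),  # ~800 GB
}

for name, (params_b, bpw) in models.items():
    print(f"{name}: ~{estimate_size_gb(params_b, bpw):.1f} GB")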
Note: The formula merely estimates the model size; real-world GPU requirements will certainly be larger and will differ depending on other factors. (As you will see in the later section on benchmarks, the actual GPU requirements completely blow these estimates out of the water.) “BF16” stands for the brain float 16 number format, while “FP16” stands for floating point 16.
Meta’s upcoming Llama-3 400B+ will be one of the biggest open-source models available when it is released. We can estimate that this beast would be around 800 GB in size. For context, 800 GB would require at least 10 x A100 80GB GPU cards to host, even if we naively assume zero hosting overheads.
Another popular but more reasonably sized model, Llama-3 70B, published at BF16 or 16 bits per weight (bpw) precision, would still require about 141.2 GB of GPU memory to host for inference.
Why are Large GPU Memory Requirements an Issue?
As GPUs are currently in short supply and high demand, it is not easy to find multiple GPU chips for cheap. Hosting LLMs in their raw, unquantised format can thus be a very expensive endeavour that is only available to the privileged few who can afford it. This can be limiting for projects that require the wisdom of LLMs but are not valuable enough to warrant the use of multiple scarce and expensive GPUs.
Slower inference speeds from larger LLM sizes also result in:
- Worse user experience due to slow output.
- Reduced total possible throughput that can be extracted by downstream applications. For applications that are heavy on token usage such as text-summarisation or report generation, the reduced throughput can seriously hurt the viability of the application.
Slow inference and expensive costs are debilitating factors for production-grade use cases, hence each GenAI application will need to make the tradeoff between output quality, inference speed and cost.
What is Quantisation?
For a more rigorous explanation of quantisation, please refer to these two fantastic guides: https://www.tensorops.ai/post/what-are-quantized-llms and https://www.semianalysis.com/p/neural-network-quantization-and-number
For simplicity, the following section only refers to Post-Training Quantisation (PTQ).
In simple terms, in the domain of AI/ML, quantisation is a technique for reducing the size of a model. Under the hood, model weights are stored as numbers. Typically these weights are stored in number formats like floating point 16 (FP16) or brain float 16 (BF16), which, as the names suggest, take 16 bits to store a number.
Quantisation reduces the number of bits required to store each number, which shrinks the storage size of the model since fewer bits are used for each model weight.
However, using fewer bits per weight means the precision of the weights is reduced. This is why quantisation is aptly described in most articles as “reducing the precision of model weights”.
For visual learners here is π represented in different precisions:
You can try for yourself using this floating point calculator.
Note: Modern quantisation methods may use bespoke number formats rather than FP series to quantise models. These can go as low as 1 bit quantisation (Q1).
As seen in the table, the precision of π is reduced as the number of bits decreases. This affects not only the number of decimal places, but also the approximation of the number itself.
For example, 3.141592502593994 cannot be represented exactly in FP8, so it has to be rounded to the nearest value that FP8 can represent: 3.125. This rounding error is also known as floating point error.
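As a minimal illustration of this rounding in code (using numpy, which supports 16-bit floats but not FP8, so only the 16-bit case is shown):
import math

import numpy as np

# math.pi is a 64-bit (double precision) float
print(math.pi)                     # 3.141592653589793

# Casting down to lower-precision formats progressively loses accuracy
print(float(np.float32(math.pi)))  # 3.1415927410125732
print(float(np.float16(math.pi)))  # 3.140625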
How does it help?
As the number of bits per weight decreases, the total GPU memory requirement is also reduced. For instance, quantising from FP16 to 8 bits (Q8) halves the number of bits required to store each number, from 16 bits to 8 bits, reducing the size of the model by 50%.
To put this into an example, an unquantised FP16 Mistral 7B is estimated to be about 14.48 GB in size, while a Q8 Mistral 7B is only 7.24 GB. A Q4 Mistral 7B is a mere 3.62 GB, making it possible to load on some mobile devices.
Not only does the reduction in memory footprint lower the minimum compute requirements to host a model, it also improves inference speeds.
What’s the catch?
Of course, there is no free lunch in this world! The reduction in precision impacts the output quality of the model. Relating back to our earlier table of representations of π, a π represented in FP16 would probably be accurate enough to pass a math test, but an FP8 π will give you an F.
Luckily most LLMs are not too sensitive to reduction at higher precisions. As a general rule of thumb, 8-bit Quantisation or Q8 models are nearly as good as the raw ones. This is shown in the following benchmarks from “How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study”.
In short, this means that you can get a 50% reduction in model size for almost free just by quantising model weights to Q8.
For a 75% reduction in model size, i.e. Q4, the model is still decent when using smarter quantisation techniques like AWQ, albeit with a visible loss in quality.
Anything below Q4 and you may run into severe degradation of model output quality.
Do note that the effects of quantisation on model quality can vary from model to model. The best way to determine the right quantisation level is through your own usage and testing.
Which Quantisation Framework to Choose?
For more rigorous discourse on choosing Quantisation frameworks please see: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/ , https://www.reddit.com/r/LocalLLaMA/comments/1anb2fz/guide_to_choosing_quants_and_engines/
There are many quantisation frameworks available; some of the more popular ones are GGUF, GPTQ, EXL2 and AWQ. The following are my personal recommendations based on what I have observed in my own usage; what is best for you will depend on your use case, and your mileage may vary.
GGUF
Created by Georgi Gerganov with the goal of enabling LLM inference with minimal setup and state-of-the-art performance on any hardware locally or in the cloud, GGUF has become a mainstay for AI/ML enthusiasts looking to host LLMs due to its ease of use.
If you need to host models on commodity hardware or CPU-only systems, then GGUF is the most suitable, as it is the only framework with CPU hosting support. GGUF also allows you to run newer models on older GPUs. It is also the most stable framework, due to how it packages the model weights as a single file in a unified format. If you need to host a quantised model reliably on any machine, even your laptop, then GGUF is the way to go.
The caveat for GGUF is that its older quants (Qx_0) use simpler quantisation methods such as round-to-nearest (RTN) quantisation. This may reduce model output quality to some extent, though it is less of an issue at higher quantisation levels. Newer quantisation methods in GGUF (Qx_K or IQx_S) are better at preserving model quality at lower quantisation levels.
GPTQ, EXL2 and AWQ
GPTQ, EXL2 and AWQ are specialised for GPU usage, and they are all based on the GPTQ format. These frameworks tend to be much faster than GGUF as they are specially optimised for running on GPUs. EXL2 allows for mixing quantisation levels within a model. AWQ tends to have the best output quality as it uses even “smarter” quantisation techniques than GPTQ. Both EXL2 and AWQ attempt to reduce degradation at lower quantisation levels, while GPTQ tends to be the most widely supported by downstream inference engines.
In conclusion, choose GGUF for ease of hosting, EXL2 for mixed quantisation levels, AWQ for output quality and GPTQ if your choice of inference engine does not support the rest.
Now that we understand what quantisation is, how do we bring it to our users on AG’s AWS Sagemaker so that they can host their own production-ready model inference endpoints for their use cases?
What are Sagemaker Endpoints?
AWS Sagemaker Endpoints are the native tools within AWS Sagemaker to host model inference. Its advantages are:
- Easy to configure Auto Scaling: It only takes a few lines to add auto scaling to existing endpoints (see the sketch below).
- Zero Downtime Updates: Updates to Sagemaker Endpoints use blue/green deployment by default.
- Flexibility & Customisation: Sagemaker Endpoints are able to use customised containers.
- Access to AWS Services: Sagemaker Endpoints are able to access AWS services like S3 which can allow for more flexibility in adding additional steps to process inference requests.
This helps to save time and expertise for users who just want to deploy a model and not think about the engineering work required to manage it on a production scale, turning what could be days/weeks of work into mere minutes.
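As a rough sketch of the auto scaling point, scaling can be attached to an existing endpoint through the Application Auto Scaling API; the endpoint name, variant name and target value below are placeholders:
import boto3

autoscaling = boto3.client("application-autoscaling")

# The scalable target is the endpoint variant's desired instance count
resource_id = "endpoint/<endpoint-name>/variant/<variant-name>"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance, a common target-tracking setup
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)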
How does Sagemaker Endpoints work?
Under the hood, Sagemaker Endpoints utilise special inference containers based on the Sagemaker-Inference-Toolkit library for hosting model APIs. These containers provide a quick and easy way of running inference without needing to build your own container images, and support many different frameworks, from simple scikit-learn models using their scikit-learn container to complex LLMs (and their AWQ/GPTQ quantised variants) using the TensorRT-LLM container.
However, GGUF and EXL2 quants still require heavily customised inference frameworks. Thankfully, Sagemaker provides the flexibility to use custom containers, and Sagemaker Endpoints make it very simple to do so. There are only a few details to keep in mind to make this work:
- Container must listen on port 8080.
- Container must respond to /ping and /invocations.
- Container will be run with the ‘docker run <image> serve’ command, so containers are expected to use ENTRYPOINT instead of CMD.
- Model artifacts are brought into the ‘/opt/ml/model’ directory by specifying the S3 path to a tar.gz containing the model artifacts. This happens right before the runtime of the container.
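To make this contract concrete, here is a minimal sketch (not our actual container code) of a serving process that satisfies these requirements, assuming FastAPI and uvicorn are installed in the image and the image’s ENTRYPOINT launches this script:
# serve.py: minimal sketch of the Sagemaker Endpoint serving contract
import sys

from fastapi import FastAPI, Request
import uvicorn

app = FastAPI()

@app.get("/ping")
def ping():
    # Sagemaker calls this health check; an HTTP 200 marks the container as healthy
    return {"status": "ok"}

@app.post("/invocations")
async def invocations(request: Request):
    payload = await request.json()
    # Placeholder: run inference on the payload here and return the result
    return {"echo": payload}

if __name__ == "__main__":
    # Sagemaker starts the container with "docker run <image> serve",
    # so the ENTRYPOINT script receives "serve" as its first argument
    if len(sys.argv) > 1 and sys.argv[1] == "serve":
        uvicorn.run(app, host="0.0.0.0", port=8080)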
Customise for an open-source inference engine
The above diagram represents a container pre-packed with Sagemaker-Inference-Toolkit. To use our own serving engine, we can simply replace the pre-packed packages with our own custom packages.
For instance, one of the custom containers we curated enables users to host GGUF models using Abetlen’s Llama-cpp-python as the inference engine. This library is open-source and under the permissive MIT license.
In our Dockerfile, we only needed to write a few lines of code to conform to the Sagemaker Endpoint requirements (a simplified sketch of the engine side follows the list below):
- Change listening port to 8080
- Add routes for /ping and /invocations
- Run on ENTRYPOINT
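For illustration, here is a stripped-down sketch of how the inference engine side might be wired up inside such a container, assuming the llama-cpp-python package and the same MODEL and N_GPU_LAYERS environment variables used in the deployment example below; our actual container handles more configuration than this:
import os

from llama_cpp import Llama

# Sagemaker unpacks the model artifacts into /opt/ml/model before the container starts
llm = Llama(
    model_path=os.environ["MODEL"],  # e.g. a .gguf file under /opt/ml/model
    n_gpu_layers=int(os.environ.get("N_GPU_LAYERS", "0")),  # 999 offloads all layers to GPU
)

def generate(prompt: str) -> str:
    # create_completion mirrors the OpenAI-style completions payload
    result = llm.create_completion(prompt=prompt, max_tokens=256)
    return result["choices"][0]["text"]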
Using the custom containers, hosting a quantised LLM in AG’s Sagemaker environment is reduced down to a few lines of code.
# Code will vary depending on how you have curated your own custom container.
from sagemaker.model import Model

endpoint_name = "<Name of endpoint>"
image_uri = "<ECR Image URI to Llama-cpp-python Image>"
model_artifact_location = "<S3 Path to Model Artifacts>"
model_file_path_in_container = "<Path to model file>"
role = "<Sagemaker execution role ARN>"

# All other ENV variables defined in documentation
model_endpoint = Model(
    image_uri=image_uri,
    model_data=model_artifact_location,
    role=role,
    env={
        "MODEL": model_file_path_in_container,
        "N_GPU_LAYERS": "999",
        "INVOCATIONS_ROUTE": "/v1/completions"
    }
)

model_endpoint.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name=endpoint_name
)
That’s it, short and simple. With this, our users can focus on developing their LLM use cases without being encumbered by the complexity behind the scenes.
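Once deployed, the endpoint can be invoked like any other Sagemaker endpoint. Below is a minimal sketch using boto3, assuming the container forwards /invocations to an OpenAI-style completions route as configured via INVOCATIONS_ROUTE above; the prompt and parameters are illustrative:
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="<Name of endpoint>",
    ContentType="application/json",
    Body=json.dumps({
        "prompt": "Write a haiku about quantisation.",
        "max_tokens": 128,
    }),
)

# The response body is a stream containing the JSON returned by the container
print(json.loads(response["Body"].read()))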
The following are some benchmarks of the average number of tokens generated per second, based on single-query inference tested 5 times over 30 prompts, i.e. each candidate’s figure is an average of 150 tests. For all tests, we used the CodeLlama model, as it is available in many sizes, namely 7, 13, 34 and 70 billion parameters. We tested both quantised and unquantised models with different inference engines, using Transformers as the baseline since it is the typical way of running unquantised models.
The following are the specifications for the benchmarking:
Note: ExllamaV2 refers to the inference engine, while EXL2 is the quantisation format native to ExllamaV2; in this case, ExllamaV2 also supports inference for GPTQ. ExllamaV2 is only benchmarked with Q4_0, as some Q8_0 quants are not available on HuggingFace.
Unquantised via Transformers (Baseline)
BF16:
All multiples in the following tests are based on using Transformers as a baseline. For instance, the GPTQ 7b Q4_0 model has a “(3.42x)” multiple in the “Tokens per second” column, this means that GPTQ is 3.42 times as fast as the Transformers baseline for the 7b model.
GGUF via Llama-cpp-python
GGUF supports hosting on the older Nvidia T4s from the g4dn instance family, so we added extra tests that optimise for cost by using g4dn instance types where possible:
Q4_0
Q8_0
Using newer Nvidia A10g from the g5 instance family:
Q4_0
Q8_0
In every single case, GGUF runs the models either much more cheaply or at the same price but significantly faster. For instance, the Q8 13B model is 74% faster than the baseline at one-fifth the cost!
GPTQ — Via ExllamaV2
ExllamaV2 only supports hosting on the newer Nvidia A10G from the g5 instance family, and not the g4dn instance family.
Q4_0
GPTQ on ExllamaV2 takes the performance improvements to a whole new level, with more than triple the speed of the baseline for every model size quantised in Q4_0.
AWS Sagemaker Jumpstart
AWS also natively provides a service called JumpStart that allows deployment of pre-trained models with a few clicks. These AWS Sagemaker containers implement the Sagemaker Inference Toolkit and come with various inference engines pre-installed. In this case, it uses HuggingFace’s Text Generation Inference (TGI) framework as the inference engine.
BF16:
Notice how 13B is faster than 7B. This is because the TGI container is able to utilise more GPU memory to increase the speed of inference. On larger parameter sizes like 34B and 70B, using AWS Sagemaker Jumpstart with TGI containers can even outperform GPTQ on ExllamaV2.
Quantisation offers substantial benefits for LLMs as it reduces memory requirements for hosting them. The reduction in memory requirements increases inference speeds and reduces costs. Higher bit quantisation can be achieved with almost zero loss in output quality, substantial gains in speed and reduced cost — essentially a Pareto improvement over using unquantised LLMs.
With auxiliary functionalities provided by AG on top of AWS Sagemaker Endpoints, agencies across the entire public sector can easily access capabilities to create and manage production-ready quantised Open LLM APIs. By streamlining the process of deploying quantised large language models, AG significantly lowers the barrier of entry for producing efficient and cost-effective GenAI applications, allowing government agencies to focus on innovating and developing technology for public good.
Dovetailing with this, AG will continue to further its GenAI endeavours by providing access to closed-source models like Azure OpenAI and VertexAI’s Gemini via secured cross-cloud integration, alongside our existing services with AWS Bedrock. Through robust and comprehensive offerings, AG empowers users to rightsize models for their use cases, resulting in better, faster and cheaper GenAI applications in the public sector.
[1] Sau Sheong, Programming with AI — Open LLMs (2024), https://sausheong.com/programming-with-ai-open-llms-28091f77a088
[2] S. Stoelinga, Calculating GPU memory for serving LLMs (2023), https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/
[3] M.C. Neves, What are Quantized LLMs? (2023), https://www.tensorops.ai/post/what-are-quantized-llms
[4] D. Patel, Neural Network Quantization & Number Formats From First Principles (2024), https://www.semianalysis.com/p/neural-network-quantization-and-number
[5] W. Huang, How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study (2024), https://arxiv.org/pdf/2404.14047
[6] Oobabooga, A detailed comparison between GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit: perplexity, VRAM, speed, model size, and loading time. (N.A.), https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/
[7] Sgsdxzy, Guide to choosing quants and engines (2024), https://www.reddit.com/r/LocalLLaMA/comments/1anb2fz/guide_to_choosing_quants_and_engines/
[8] Amazon Web Services, Use Your Own Inference Code with Hosting Services (N.A.), https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html