I often see data scientists take an interest in the development of LLMs in terms of model architecture, training techniques, or data collection. However, I have noticed that, beyond the theoretical side, many people struggle to serve these models in a way that users can actually consume them.
In this brief tutorial, I want to show, in a very simple way, how you can serve an LLM, specifically Llama 3, using BentoML.
BentoML is an end-to-end solution for machine learning model serving. It enables Data Science teams to build production-ready model-serving endpoints, with DevOps best practices and performance optimization at every stage.
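To make this concrete before we dive in, here is a minimal sketch of what a BentoML service wrapping Llama 3 can look like. It assumes BentoML 1.2+ (which provides the `@bentoml.service` and `@bentoml.api` decorators) plus the `transformers` and `accelerate` libraries; the class name, model ID, and parameters are illustrative, not the exact service we build later.

```python
import bentoml
from transformers import pipeline

# Illustrative model ID; Llama 3 weights are gated on Hugging Face
# and require accepting Meta's license before downloading.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

@bentoml.service(
    resources={"gpu": 1},      # ask BentoML to schedule this service on a GPU
    traffic={"timeout": 300},  # text generation can be slow; allow long requests
)
class Llama3Service:
    def __init__(self) -> None:
        # Load the model once at service startup, not per request
        self.pipe = pipeline(
            "text-generation",
            model=MODEL_ID,
            device_map="auto",  # place weights on the available GPU
        )

    @bentoml.api
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # Run generation and return only the text of the first candidate
        result = self.pipe(prompt, max_new_tokens=max_tokens)
        return result[0]["generated_text"]
```

Saved as `service.py`, a service like this could be started locally with `bentoml serve service:Llama3Service`, which exposes `generate` as an HTTP endpoint.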
We need a GPU
As you know, in Deep Learning having the right hardware available is critical, and for very large models like LLMs it matters even more. Unfortunately, I don’t have any GPU 😔
That’s why I rely on external providers: I rent one of their machines and work there. For this article I chose Runpod, because I know their services and I think the price is affordable enough to follow along. But if you have GPUs available or want to…