You can find the code in this GitHub repo:
https://github.com/amirarsalan90/personal_llm_assistant
The main components of the app include:
llama-cpp-python is a Python binding for the great llama.cpp, which implements many large language models in C/C++. Because of its wide adoption by the open-source community, I decided to use it in this tutorial.
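To give a sense of what the binding looks like, here is a minimal sketch of loading a quantized GGUF model and running a chat completion. The model path is a placeholder you would swap for your own download, and n_gpu_layers=-1 (offload every layer to the GPU) assumes the GPU-enabled build described later in this section:

from llama_cpp import Llama

# Load a quantized GGUF model; the path is a placeholder for your own download.
# n_gpu_layers=-1 offloads all layers to the GPU (requires a GPU-enabled build).
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
)

# Ask for a chat-style completion and print the reply text.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])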
Note: I have tested this app on a system with an Nvidia RTX 4090 GPU.
First things first, let's create a new conda environment:
conda create --name assistant python=3.10
conda activate assistant
Next, we need to install llama-cpp-python. As mentioned in the llama-cpp-python documentation, llama.cpp supports a number of hardware acceleration backends to speed up inference. To leverage the GPU and run the LLM on it, we will build the package with cuBLAS support. I had some issues getting the model to offload onto the GPU, and I finally found this post on how to install it properly:
export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
pip install "llama-cpp-python[server]"
(The quotes prevent shells such as zsh from interpreting the square brackets as a glob pattern.)
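With the server extra installed, llama-cpp-python can also expose the model through an OpenAI-compatible HTTP API. As a quick sketch (again, the GGUF model path is a placeholder), the server can be started like this:

python -m llama_cpp.server --model ./models/mistral-7b-instruct.Q4_K_M.gguf --n_gpu_layers -1

Passing --n_gpu_layers -1 asks the server to offload all model layers to the GPU, which is the point of building with cuBLAS in the first place.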