A deep dive into stochastic decoding with temperature, top_p, top_k, and min_p
When you ask a Large Language Model (LLM) a question, the model outputs a probability for every possible token in its vocabulary.
After sampling a token from this probability distribution, we can append the selected token to our input prompt so that the LLM can output the probabilities for the next token.
This sampling process can be controlled by parameters such as the famous temperature and top_p.
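To make the loop described above concrete, here is a minimal sketch in plain Python. It is not how a real inference engine is implemented: the vocabulary and the `toy_logits` function are hypothetical stand-ins for an LLM forward pass, invented purely for illustration. The point is only the shape of the loop: get a probability for every token, sample one, append it, and repeat.

```python
import math
import random

# Hypothetical toy vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "mat", "."]


def toy_logits(tokens):
    """Stand-in for an LLM forward pass (an assumption, not a real model).

    Returns one raw score (logit) per vocabulary token, derived only from
    the current context length so that the example stays deterministic.
    """
    n = len(tokens)
    return [math.sin(n + i) for i in range(len(VOCAB))]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_tokens, max_new_tokens, rng):
    """Sample one token at a time and feed it back in as context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))
        # Draw a single token according to the probability distribution.
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


rng = random.Random(0)  # fixed seed so repeated runs give the same tokens
print(generate(["the"], 5, rng))
```

Parameters such as temperature and top_p act between the `toy_logits` and `rng.choices` steps: they reshape or truncate the distribution before a token is drawn.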
In this article, I will explain and visualize the sampling strategies that define the output behavior of LLMs. By understanding what these parameters do and setting them according to our use case, we can improve the output generated by LLMs.
For this article, I’ll use vLLM as the inference engine and Microsoft’s new Phi-3.5-mini-instruct model with AWQ quantization. To run this model locally, I’m using my laptop’s NVIDIA GeForce RTX 2060 GPU.
Table Of Contents
· Understanding Sampling With Logprobs
∘ LLM Decoding Theory
∘ Retrieving Logprobs With the OpenAI Python SDK
· Greedy Decoding
· Temperature
· Top-k Sampling
· Top-p Sampling
· Combining Top-p…