Larger language models typically deliver superior performance but at the cost of reduced inference speed. For example, Llama 2 70B significantly outperforms Llama 2 7B in downstream tasks, but its inference speed is approximately 10 times slower.
Many techniques, including adjustments to the decoding hyperparameters, can speed up inference for very large LLMs. Speculative decoding, in particular, can be very effective in many use cases.
Speculative decoding uses a small LLM to generate candidate tokens, which are then validated, or corrected if needed, by a much larger and more capable LLM. If the small LLM is accurate enough, speculative decoding can dramatically speed up inference.
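To give an early idea of what this looks like in code, here is a minimal sketch using Hugging Face Transformers, whose assisted generation feature implements speculative decoding when a draft model is passed through the `assistant_model` argument of `generate()`. The checkpoints below are only an example of a main/draft pair sharing the same tokenizer, not necessarily the configurations benchmarked in this article.

```python
# Minimal sketch (example setup, not the exact configuration benchmarked here):
# speculative decoding via the `assistant_model` argument of `generate()`.
# The main and draft models must share the same tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

main_name = "EleutherAI/pythia-1.4b"   # main (target) model
draft_name = "EleutherAI/pythia-160m"  # small draft model

tokenizer = AutoTokenizer.from_pretrained(main_name)
main_model = AutoModelForCausalLM.from_pretrained(main_name).to(device)
draft_model = AutoModelForCausalLM.from_pretrained(draft_name).to(device)

inputs = tokenizer("Speculative decoding is", return_tensors="pt").to(device)

# Passing the draft model as `assistant_model` enables assisted generation,
# the Transformers implementation of speculative decoding.
outputs = main_model.generate(**inputs, assistant_model=draft_model, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```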
In this article, I first explain how speculative decoding works. Then, I show how to run speculative decoding with different pairs of models involving Gemma, Mixtral-8x7B, Llama 2, and Pythia, all quantized. I benchmarked the inference throughput and memory consumption to highlight which configurations work best.
Speculative decoding was introduced by Google Research in this paper:
Fast Inference from Transformers via Speculative Decoding
It is a very simple and intuitive method. However, as we will see in detail in the next section, it can also be difficult to make it work well.
Speculative decoding runs two models during inference: the main model we want to use and a draft model. The draft model suggests tokens during inference. Then, the main model checks the suggested tokens and corrects them if necessary. In the end, the output of speculative decoding is the same as what the main model would have generated on its own.
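To make this draft-and-verify loop more concrete, below is a simplified, greedy-only sketch of the logic. The checkpoint names and the chunk size `k` are arbitrary assumptions, and this is only an illustration, not an optimized implementation; libraries such as Transformers implement this far more efficiently.

```python
# Simplified, greedy-only sketch of the draft-and-verify loop described above.
# Checkpoints are arbitrary examples of a main/draft pair sharing one tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
main_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b")
draft_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")

@torch.no_grad()
def speculative_greedy(prompt: str, max_new_tokens: int = 64, k: int = 4) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_len = ids.shape[1] + max_new_tokens
    while ids.shape[1] < target_len:
        # 1) The draft model cheaply proposes up to k tokens, one at a time.
        draft_ids = draft_model.generate(ids, max_new_tokens=k, do_sample=False,
                                         pad_token_id=tokenizer.eos_token_id)
        proposed = draft_ids[0, ids.shape[1]:]
        # 2) The main model scores the whole extended sequence in a single forward pass.
        logits = main_model(draft_ids).logits[0]
        # Greedy choice of the main model at each position that predicts a proposed token.
        main_choices = logits[ids.shape[1] - 1 : -1].argmax(dim=-1)
        # 3) Accept the longest prefix of proposed tokens the main model agrees with.
        n_accepted = 0
        for p, m in zip(proposed.tolist(), main_choices.tolist()):
            if p != m:
                break
            n_accepted += 1
        # 4) Append the accepted tokens plus one token chosen by the main model itself:
        #    a correction if a proposal was rejected, a free extra token otherwise.
        next_token = logits[ids.shape[1] - 1 + n_accepted].argmax().view(1, 1)
        ids = torch.cat([ids, proposed[:n_accepted].view(1, -1), next_token], dim=1)
    # With greedy decoding, this output matches what the main model alone would produce.
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(speculative_greedy("Speculative decoding is"))
```

The key point the sketch illustrates is that the main model verifies all the proposed tokens in a single forward pass, so each iteration costs roughly one main-model pass while potentially accepting several tokens at once.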
Here is an illustration of speculative decoding by Google Research:
This method can dramatically accelerate inference if: