EPFL researchers, in collaboration with Apple, have introduced a new approach to speculative sampling called Parallel Speculative Sampling (PaSS). This new approach allows for the drafting of multiple tokens simultaneously using a single model, combining the benefits of auto-regressive generation and speculative sampling. The PaSS method was evaluated on text and code completion tasks, exhibiting promising performance without compromising model quality. The team also explored the impact of the number of look-ahead embeddings on the approach, discovering an optimal number for achieving the best results.
Standard speculative sampling requires two models that share the same tokenizer. PaSS addresses this limitation by drafting multiple tokens in parallel with a single model. In comparative evaluations against auto-regressive generation and a baseline method, PaSS was faster, and testing on text and code completion tasks showed no loss in overall model quality. The study also explores how sampling schemes and the number of look-ahead embeddings affect PaSS performance.
Auto-regressive generation is a bottleneck for large language models: each generated token requires its own forward pass, which is costly in memory access and processing time. Speculative sampling offers a solution but requires two models with the same tokenizer, which introduces bottlenecks of its own. PaSS is an alternative that drafts multiple tokens with a single model, eliminating the need for a second model.
The proposed method relies on parallel decoding with a single model and involves two phases: drafting and validation. During the drafting phase, the model produces multiple tokens simultaneously. The first of these tokens is excluded from the draft so that the output still matches the model’s distribution if the draft is rejected. This approach speeds up generation while maintaining overall model quality.
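The article does not spell out how drafted tokens are validated. As a rough illustration only, the sketch below uses the acceptance rule from standard speculative sampling: accept a drafted token with probability min(1, p/q), and on the first rejection resample from the normalized residual max(0, p - q). Whether PaSS uses exactly this rule is an assumption here, and the function name `validate_draft` and its arguments are hypothetical, not taken from the paper.

```python
import random


def validate_draft(draft_tokens, q_dists, p_dists, rng):
    """Toy validation step in the style of standard speculative sampling.

    draft_tokens: drafted token ids, one per position.
    q_dists: per-position probability lists the draft was sampled from.
    p_dists: per-position probability lists from the validating forward pass.
    Returns the accepted prefix, plus one resampled token if a draft
    token was rejected. All names are illustrative assumptions.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q) and
        # discard the rest of the draft.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        weights = residual if sum(residual) > 0 else p
        accepted.append(rng.choices(range(len(p)), weights=weights, k=1)[0])
        break
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    same = [0.5, 0.3, 0.2]
    # When draft and target distributions agree, every token is accepted.
    print(validate_draft([0, 1, 2], [same] * 3, [same] * 3, rng))
```

When the two distributions agree at every position, the whole draft is kept, so one validation pass yields several tokens. The further they diverge, the earlier the draft is cut short and the closer the cost gets to one forward pass per token.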
PaSS sped up generation by up to 30% compared to auto-regressive generation, while keeping model performance within the margin of error. In comparisons with baselines using different sampling schemes, PaSS also generated tokens with lower variance and higher predictability. The number of look-ahead steps affected performance as well: running time decreased as steps were added, up to 6 look-ahead steps.
PaSS is a technique for generating text from language models that drafts tokens in parallel using fine-tuned look-ahead embeddings. Evaluations on text and code completion tasks showed that it generates tokens with low variance and high predictability. The researchers aim to improve performance further through better look-ahead tokens.
As future work, the researchers recommend exploring methods to enhance the quality of parallel generation with look-ahead tokens, which they consider a promising avenue for improving PaSS performance. They also emphasize the need to investigate further how the number of look-ahead steps affects PaSS, since too many steps might negate the approach’s benefits.
Check out the Paper. All credit for this research goes to the researchers of this project.
Hello, my name is Adnan Hassan. I am a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.