The proposal of the LLaMA suite [2] of large language models (LLMs) led to a surge in publications on the topic of open-source LLMs. In many cases, the goal of these works was to cheaply produce smaller, open-source LLMs (for research purposes) with quality comparable to proprietary models like ChatGPT and GPT-4. These models adopt an imitation strategy, which fine-tunes a base LLM over synthetic dialogue data from a more powerful LLM. Despite being cheap to train, these models seemed to perform comparably to proprietary LLMs like ChatGPT. As a result, the deep learning research community quickly adopted the view that open-source LLMs will rule the future, since reproducing open-source variants of proprietary models appeared to be both easy and cost-effective!
“Will the most powerful LLMs be closed-source or will they be freely distributed for anyone to use, modify, and extend?” — from [1]
Unfortunately, the preliminary evaluations performed on these models, which relied upon ratings provided by other LLMs (e.g., GPT-4) or human crowd workers, were somewhat cursory. Does the performance of imitation models actually match that of models like ChatGPT? To answer this question more rigorously, we will study recent research that analyzes whether imitation models truly remove the “moat” around proprietary LLMs. Interestingly, we will see that these cheap reproductions of powerful LLMs perform well in human evaluations due to their ability to learn the style of a powerful LLM. However, they lack factuality and perform poorly when subjected to broader and more targeted evaluations. In reality, imitation models do not perform nearly as well as proprietary models like ChatGPT.
“The premise of model imitation is that once a proprietary LM is made available via API, one can collect a dataset of API outputs and use it to fine-tune an open-source LM.” — from [1]
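To make this imitation recipe concrete, the sketch below shows one way it could be implemented with the OpenAI and Hugging Face libraries: query a proprietary model for responses to a small set of prompts, then fine-tune an open-source base model on the resulting (prompt, response) pairs. The model names, prompts, prompt template, and hyperparameters here are illustrative placeholders and are not the exact setup used in [1].

```python
from openai import OpenAI
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# --- Step 1: collect synthetic dialogue data from the proprietary "teacher" ---
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompts = [  # placeholder prompts; real imitation datasets use many thousands
    "Explain quicksort in two sentences.",
    "What causes the seasons on Earth?",
]

records = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4",  # the proprietary model being imitated
        messages=[{"role": "user", "content": prompt}],
    )
    records.append({"prompt": prompt, "response": response.choices[0].message.content})

# --- Step 2: fine-tune an open-source base LLM on the imitation data ---
base_model = "meta-llama/Llama-2-7b-hf"  # placeholder open-source base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

def to_features(example):
    # Concatenate the prompt and the teacher's response into one training sequence.
    text = (
        f"### Instruction:\n{example['prompt']}\n\n"
        f"### Response:\n{example['response']}"
    )
    return tokenizer(text, truncation=True, max_length=1024)

dataset = Dataset.from_list(records).map(to_features, remove_columns=["prompt", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="imitation-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
    ),
    train_dataset=dataset,
    # Standard causal-LM collator: labels are the input ids (next-token prediction).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The key point of this recipe is that the expensive part (producing high-quality responses) is outsourced to the proprietary model's API, while the fine-tuning step itself is a standard, relatively cheap next-token-prediction run over the collected dialogues.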