For a moment, imagine an airplane. What springs to mind? Now imagine a Boeing 737 and a V-22 Osprey. Both are aircraft designed to move cargo and people, yet they serve different purposes: one more general (commercial flights and freight), the other very specific (infiltration, exfiltration, and resupply missions for special operations forces). They look very different because they are built for different activities.
With the rise of LLMs, we have seen our first truly general-purpose ML models. Their generality helps us in so many ways:
- The same engineering team can now do sentiment analysis and structured data extraction
- Practitioners in many domains can share knowledge, making it possible for the whole industry to benefit from each other’s experience
- There is a wide range of industries and jobs where the same experience is useful
But as we see with aircraft, assessing generality is very different from assessing excellence at a particular task, and at the end of the day business value often comes from solving particular problems.
This is a good analogy for the difference between model and task evaluations. Model evals focus on overall, general assessment, while task evals focus on assessing performance on a particular task.
The term LLM evals is thrown around quite generally. OpenAI released some tooling to do LLM evals very early, for example. Most practitioners are more concerned with LLM task evals, but that distinction is not always clearly made.
What’s the Difference?
Model evals look at the “general fitness” of the model. How well does it do on a variety of tasks?
Task evals, on the other hand, are specifically designed to look at how well the model is suited for your particular application.
Someone who works out and is generally quite fit would likely fare poorly against a professional sumo wrestler in a real match; in the same way, model evals cannot substitute for task evals when it comes to assessing your particular needs.
Model evals are specifically meant for building and fine-tuning generalized models. They are based on a set of questions you ask a model and a set of ground-truth answers that you use to grade responses. Think of taking the SATs.
While every question in a model eval is different, each benchmark usually targets a general area of testing, a particular theme or skill. For example, HellaSwag performance has become a popular way to measure LLM quality.
The HellaSwag dataset consists of a collection of contexts and multiple-choice questions where each question has multiple potential completions. Only one of the completions is sensible or logically coherent, while the others are plausible but incorrect. These completions are designed to be challenging for AI models, requiring not just linguistic understanding but also common sense reasoning to choose the correct option.
Here is an example:
A tray of potatoes is loaded into the oven and removed. A large tray of cake is flipped over and placed on counter. a large tray of meat
A. is placed onto a baked potato
B. ls, and pickles are placed in the oven
C. is prepared then it is removed from the oven by a helper when done.
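If you want to inspect the benchmark yourself, the dataset is available on the Hugging Face Hub. Below is a minimal sketch using the datasets library; the dataset id and field names (ctx, endings, label) are assumptions about the current public copy and are worth verifying in your environment.

```python
# Minimal sketch: peek at a HellaSwag example with the Hugging Face `datasets`
# library. The dataset id and field names (ctx, endings, label) are assumptions
# about the current hub copy; adjust if your version differs.
from datasets import load_dataset

hellaswag = load_dataset("Rowan/hellaswag", split="validation")

example = hellaswag[0]
print(example["ctx"])                        # the context to be completed
for i, ending in enumerate(example["endings"]):
    print(f"{chr(ord('A') + i)}. {ending}")  # the candidate completions
print("correct option index:", example["label"])
```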
Another example is MMLU. MMLU features tasks that span multiple subjects, including science, literature, history, social science, mathematics, and professional domains like law and medicine. This diversity in subjects is intended to mimic the breadth of knowledge and understanding required by human learners, making it a good test of a model’s ability to handle multifaceted language understanding challenges.
Here are some examples — can you solve them?
For which of the following thermodynamic processes is the increase in the internal energy of an ideal gas equal to the heat added to the gas?
A. Constant Temperature
B. Constant Volume
C. Constant Pressure
D. Adiabatic
The Hugging Face Leaderboard is perhaps the best-known place to find such model evals. It tracks open source large language models across many model evaluation metrics, and it is a great place to start comparing open source LLMs on their performance across a variety of tasks.
Multimodal models require even more evals. The Gemini paper demonstrates that multimodality introduces a host of additional benchmarks, such as VQAv2, which tests the ability to understand and integrate visual information. These tests go beyond simple object recognition to interpreting actions and the relationships between objects.
Similarly, there are benchmarks for audio and video understanding and for integrating information across modalities.
The goal of these tests is to differentiate between two models or two different snapshots of the same model. Picking a model for your application is important, but it is something you do once or at most very infrequently.
The much more frequent problem is the one solved by task evaluations. The goal of task-based evaluations is to analyze the performance of your system on your particular task, typically using an LLM as a judge to answer questions like:
- Did your retrieval system fetch the right data?
- Are there hallucinations in your responses?
- Did the system answer important questions with relevant answers?
Some may feel a bit unsure about an LLM evaluating other LLMs, but we have humans evaluating other humans all the time.
The real distinction between model and task evaluations is that for a model eval we ask many different questions, but for a task eval the question stays the same and it is the data we change. For example, say you were operating a chatbot. You could use your task eval on hundreds of customer interactions and ask it, “Is there a hallucination here?” The question stays the same across all the conversations.
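As a minimal, runnable sketch of that pattern, the snippet below formats one fixed eval question against a couple of made-up conversations; in practice the formatted prompts would be sent to your eval LLM, as shown later.

```python
# The "same question, changing data" pattern: one fixed eval question,
# formatted against many conversations. The conversations are made-up
# examples standing in for your own customer interactions.
EVAL_TEMPLATE = (
    "Is there a hallucination in the following conversation?\n\n"
    "{conversation}\n\n"
    'Answer with a single word: "yes" or "no".'
)

conversations = [
    "User: When was my order shipped?\nBot: It shipped on March 3.",
    "User: Can I return this item?\nBot: Yes, within 30 days of delivery.",
]

# The eval question never changes; only the data substituted into it does.
prompts = [EVAL_TEMPLATE.format(conversation=c) for c in conversations]
print(prompts[0])
```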
There are several libraries aimed at helping practitioners build these evaluations: Ragas, Phoenix (full disclosure: the author leads the team that developed Phoenix), OpenAI, LlamaIndex.
How do they work?
A task eval grades the performance of every output from the application as a whole. Let's look at what it takes to put one together.
Establishing a benchmark
The foundation rests on establishing a robust benchmark. This starts with creating a golden dataset that accurately reflects the scenarios the LLM will encounter. This dataset should include ground truth labels — often derived from meticulous human review — to serve as a standard for comparison. Don’t worry, though, you can usually get away with dozens to hundreds of examples here. Selecting the right LLM for evaluation is also critical. While it may differ from the application’s primary LLM, it should align with goals of cost-efficiency and accuracy.
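As a rough illustration, a golden dataset can be as simple as a table of inputs, outputs, and human-reviewed labels. The sketch below assumes pandas and uses illustrative column names; nothing about them is required by any particular library.

```python
# A minimal golden dataset for a Q&A task: each row pairs an input and the
# application's output with a human-reviewed ground-truth label. Column names
# are illustrative only.
import pandas as pd

golden_dataset = pd.DataFrame(
    [
        {
            "input": "What is the capital of France?",
            "reference": "Paris is the capital and largest city of France.",
            "output": "The capital of France is Paris.",
            "label": "correct",
        },
        {
            "input": "What is the capital of France?",
            "reference": "Paris is the capital and largest city of France.",
            "output": "The capital of France is Lyon.",
            "label": "incorrect",
        },
        # ...typically dozens to hundreds of human-reviewed rows
    ]
)
```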
Crafting the evaluation template
The heart of the task evaluation process is the evaluation template. This template should clearly define the input (e.g., user queries and documents), the evaluation question (e.g., the relevance of the document to the query), and the expected output formats (binary or multi-class relevance). Adjustments to the template may be necessary to capture nuances specific to your application, ensuring it can accurately assess the LLM’s performance against the golden dataset.
Here is an example of a template to evaluate a Q&A task.
You are given a question, an answer and reference text. You must determine whether the given answer correctly answers the question based on the reference text. Here is the data:
[BEGIN DATA]
************
[QUESTION]: {input}
************
[REFERENCE]: {reference}
************
[ANSWER]: {output}
[END DATA]
Your response should be a single word, either "correct" or "incorrect", and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the answer.
Metrics and iteration
Running the eval across your golden dataset allows you to generate key metrics such as accuracy, precision, recall, and F1-score. These provide insight into the evaluation template’s effectiveness and highlight areas for improvement. Iteration is crucial; refining the template based on these metrics ensures the evaluation process remains aligned with the application’s goals without overfitting to the golden dataset.
In task evaluations, relying solely on overall accuracy is insufficient, since significant class imbalance is the norm rather than the exception. Precision and recall offer a more robust view of the LLM's performance, emphasizing the importance of identifying both relevant and irrelevant outcomes accurately. A balanced approach to metrics ensures that evaluations meaningfully contribute to enhancing the LLM application.
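As a sketch, with the golden labels and the eval LLM's labels collected as two parallel lists, scikit-learn computes these metrics directly; here the minority class ("incorrect") is treated as the positive class, since that is usually the one you care about catching.

```python
# Minimal sketch: score the eval template against the golden dataset.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

golden_labels = ["correct", "incorrect", "correct", "correct", "incorrect"]    # human ground truth
eval_labels   = ["correct", "incorrect", "correct", "incorrect", "incorrect"]  # eval LLM output

print("accuracy :", accuracy_score(golden_labels, eval_labels))
print("precision:", precision_score(golden_labels, eval_labels, pos_label="incorrect"))
print("recall   :", recall_score(golden_labels, eval_labels, pos_label="incorrect"))
print("f1       :", f1_score(golden_labels, eval_labels, pos_label="incorrect"))
```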
Application of LLM evaluations
Once an evaluation framework is in place, the next step is to apply these evaluations directly to your LLM application. This involves integrating the evaluation process into the application’s workflow, allowing for real-time assessment of the LLM’s responses to user inputs. This continuous feedback loop is invaluable for maintaining and improving the application’s relevance and accuracy over time.
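One sketch of what that integration can look like: wrap the application's response generation so every live answer also gets an eval label. generate_answer is a hypothetical stand-in for your application's own LLM call, judge_correctness is the judge sketched earlier, and the logging is a placeholder for whatever observability tooling you use.

```python
# Minimal sketch: evaluate every live response as part of the request path.
import logging

logger = logging.getLogger("llm_task_evals")

def generate_answer(question: str, reference: str) -> str:
    # Placeholder for your application's own LLM call.
    return "..."

def handle_user_query(question: str, reference: str) -> str:
    answer = generate_answer(question, reference)
    verdict = judge_correctness(question, reference, answer)  # eval runs on every response
    logger.info("eval_verdict=%s question=%r", verdict, question)
    return answer
```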
Evaluation across the system lifecycle
Effective task evaluations are not confined to a single stage but are integral throughout the LLM system's life cycle. From pre-production benchmarking and testing to ongoing performance assessments in production, LLM evaluation ensures the system remains responsive to user needs.
Example: is the model hallucinating?
Let’s look at a hallucination example in more detail.
Since hallucinations are a common problem for most practitioners, there are some benchmark datasets available. These are a great first step, but you will often need to have a customized dataset within your company.
The next important step is to develop the prompt template. Here again a good library can help you get started. We saw an example prompt template earlier; here is another, designed specifically for hallucinations. You may need to tweak it for your purposes.
In this task, you will be presented with a query, a reference text and an answer. The answer is generated to the question based on the reference text. The answer may contain false information. You must use the reference text to determine if the answer to the question contains false information, if the answer is a hallucination of facts. Your objective is to determine whether the answer text contains factual information and is not a hallucination. A 'hallucination' in this context refers to an answer that is not based on the reference text or assumes information that is not available in the reference text. Your response should be a single word: either "factual" or "hallucinated", and it should not include any other text or characters. "hallucinated" indicates that the answer provides factually inaccurate information to the query based on the reference text. "factual" indicates that the answer to the question is correct relative to the reference text, and does not contain made up information. Please read the query and reference text carefully before determining your response.
[BEGIN DATA]
************
[Query]: {input}
************
[Reference text]: {reference}
************
[Answer]: {output}
************
[END DATA]
Is the answer above factual or hallucinated based on the query and reference text?
Your response should be a single word: either "factual" or "hallucinated", and it should not include any other text or characters.
"hallucinated" indicates that the answer provides factually inaccurate information to the query based on the reference text.
"factual" indicates that the answer to the question is correct relative to the reference text, and does not contain made up information.
Please read the query and reference text carefully before determining your response.
Now you are ready to give your eval LLM the queries from your golden dataset and have it label hallucinations. When you look at the results, remember that class imbalance is expected, so track precision and recall rather than overall accuracy.
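As a sketch, and reusing the OpenAI judge pattern and a golden dataset with input, reference, and output columns like the one shown earlier, labeling every row might look like the following. HALLUCINATION_TEMPLATE is assumed to hold the template text above, and the judge model name is an assumption.

```python
# Minimal sketch: label every row of the golden dataset with the hallucination template above.
from openai import OpenAI

client = OpenAI()
HALLUCINATION_TEMPLATE = """..."""  # the hallucination template shown above

def label_hallucination(row) -> str:
    prompt = HALLUCINATION_TEMPLATE.format(
        input=row["input"], reference=row["reference"], output=row["output"]
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()  # "factual" or "hallucinated"

golden_dataset["eval_label"] = golden_dataset.apply(label_hallucination, axis=1)
```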
It is very useful to construct a confusion matrix and plot it visually. When you have such a plot, you can feel reassured about your LLM's performance. If the performance is not to your satisfaction, you can always optimize the prompt template.
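A sketch with scikit-learn and matplotlib, assuming the golden labels and eval labels from the previous step use the "factual"/"hallucinated" values:

```python
# Minimal sketch: plot the eval LLM's labels against the human ground truth.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(
    golden_dataset["label"],       # human ground truth ("factual" / "hallucinated")
    golden_dataset["eval_label"],  # labels produced by the eval LLM
    labels=["factual", "hallucinated"],
)
plt.title("Hallucination eval vs. ground truth")
plt.show()
```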
After the eval is built, you have a powerful tool that can label all your data with known precision and recall. You can use it to track hallucinations in your system during both development and production.
Let’s sum up the differences between task and model evaluations.
Ultimately, both model evaluations and task evaluations are important in putting together a functional LLM system. It is important to understand when and how to apply each. For most practitioners, the majority of their time will be spent on task evals, which provide a measure of system performance on a specific task.