When it comes to evaluating a RAG system without ground truth data, another approach is to create your own dataset. It sounds daunting, but there are several strategies to make this process easier, from finding similar datasets to leveraging human feedback and even synthetically generating data. Let’s break down how you can do it.
Finding Similar Datasets Online
This might seem obvious, and most people who have concluded that they don’t have a ground truth dataset have already exhausted this option. But it’s still worth mentioning that there might be datasets out there that are similar to what you need. Perhaps they’re in a different business domain from your use case, but they’re in the question-answer format you’re working with. Sites like Kaggle host a huge variety of public datasets, and you might be surprised at how many align with your problem space.
Manually Creating Ground Truth Data
If you can’t find exactly what you need online, you can always create ground truth data manually. This is where human-in-the-loop feedback comes in handy. Remember the domain expert feedback we talked about earlier? You can use that feedback to build your own mini-dataset.
By curating a collection of human-reviewed examples — where the relevance, correctness, and completeness of the results have been validated — you create a foundation for expanding your dataset for evaluation.
There is also a great article from Katherine Munro on an experimental approach to agile chatbot development.
Training an LLM as a Judge
Once you have your minimal ground truth dataset, you can take things a step further by training an LLM to act as a judge and evaluate your model’s outputs.
But before relying on an LLM to act as a judge, we first need to ensure that it’s rating our model outputs accurately, or at least reliably. Here’s how you can approach that:
- Build human-reviewed examples: Depending on your use case, 20 to 30 examples are usually enough to get a good sense of how reliable the LLM is in comparison. Refer to the previous section for which criteria to rate against and how to handle conflicting ratings.
- Create your LLM judge: Prompt an LLM to give ratings based on the same criteria that you handed to your domain experts, then compare how the LLM’s ratings align with the human ratings. Again, you can use metrics like the Pearson correlation coefficient to help evaluate (see the sketch after this list). A high correlation indicates that the LLM is performing well as a judge.
- Apply prompt engineering best practices: Prompt engineering can make or break this process. Techniques like pre-warming the LLM with context or providing a few examples (few-shot learning) can dramatically improve the model’s accuracy when judging.
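To make the comparison concrete, here’s a minimal sketch in Python. It assumes you’ve already collected ratings for the same 20 to 30 examples from both your domain experts and the LLM judge, on the same 1–5 scale; the numbers below are placeholders.

```python
# Compare LLM-judge ratings with human ratings using Pearson correlation.
# Assumes both sets of ratings cover the same examples on the same 1-5 scale.
from scipy.stats import pearsonr

# Placeholder ratings collected for the same human-reviewed examples.
human_ratings = [5, 4, 2, 5, 3, 1, 4, 4, 2, 5]
llm_ratings   = [5, 4, 3, 5, 3, 2, 4, 5, 2, 4]

correlation, p_value = pearsonr(human_ratings, llm_ratings)
print(f"Pearson correlation: {correlation:.2f} (p={p_value:.3f})")

# A high correlation (for example, above ~0.8) suggests the LLM judge tracks
# your domain experts closely enough to scale up the evaluation.
```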
Another way to boost the quality and quantity of your ground truth datasets is by segmenting your documents into topics or semantic groupings. Instead of looking at entire documents as a whole, break them down into smaller, more focused segments.
For example, let’s say you have a document (documentId: 123) that mentions:
“After launching product ABC, company XYZ saw a 10% increase in revenue for 2024 Q1.”
This one sentence contains two distinct pieces of information:
- Launching product ABC
- A 10% increase in revenue for 2024 Q1
Now, you can augment each topic into its own query and context. For example:
- Query 1: “What product did company XYZ launch?”
- Context 1: “Launching product ABC”
- Query 2: “What was the change in revenue for Q1 2024?”
- Context 2: “Company XYZ saw a 10% increase in revenue for Q1 2024”
By breaking the data into specific topics like this, you not only create more data points for training but also make your dataset more precise and focused. Plus, if you want to trace each query back to the original document for reliability, you can easily add metadata to each context segment. For instance:
- Query 1: “What product did company XYZ launch?”
- Context 1: “Launching product ABC (documentId: 123)”
- Query 2: “What was the change in revenue for Q1 2024?”
- Context 2: “Company XYZ saw a 10% increase in revenue for Q1 2024 (documentId: 123)”
This way, each segment is tied back to its source, making your dataset even more useful for evaluation and training.
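For illustration, here’s one possible way to store these segmented records in Python. The schema (query, context, documentId) is just a convenient shape for this example, not a required format.

```python
# Build segmented query/context pairs with traceability metadata.
# Each record keeps a documentId so an answer can be traced back to its source.
segments = [
    {
        "query": "What product did company XYZ launch?",
        "context": "Launching product ABC",
        "documentId": "123",
    },
    {
        "query": "What was the change in revenue for Q1 2024?",
        "context": "Company XYZ saw a 10% increase in revenue for Q1 2024",
        "documentId": "123",
    },
]

# During evaluation, the documentId lets you check whether the retriever
# actually pulled the document the expected answer came from.
for record in segments:
    print(f"{record['query']} -> source document {record['documentId']}")
```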
If all else fails, or if you need more data than you can gather manually, synthetic data generation can be a game-changer. Using techniques like data augmentation or even GPT models, you can create new data points based on your existing examples. For instance, you can take a base set of queries and contexts and tweak them slightly to create variations.
For example, starting with the query:
- “What product did company XYZ launch?”
You could synthetically generate variations like:
- “Which product was introduced by company XYZ?”
- “What was the name of the product launched by company XYZ?”
This can help you build a much larger dataset without the manual overhead of writing new examples from scratch.
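As a sketch of what that can look like, the snippet below assumes an OpenAI-style chat-completions client and a placeholder model name; any LLM you already have access to will do the job, and the output parsing is intentionally naive.

```python
# Generate paraphrased query variations with an LLM.
# Assumes the openai Python client (1.x) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def generate_variations(query: str, n: int = 3) -> list[str]:
    """Ask the model for n reworded versions of the same query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name; swap in whatever you use
        messages=[
            {"role": "system", "content": "You rephrase questions without changing their meaning."},
            {"role": "user", "content": f"Rewrite this question {n} different ways, one per line:\n{query}"},
        ],
    )
    # Naive parsing: one variation per non-empty line of the reply.
    return [
        line.strip("- ").strip()
        for line in response.choices[0].message.content.splitlines()
        if line.strip()
    ]

print(generate_variations("What product did company XYZ launch?"))
```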
There are also frameworks that can automate the process of generating synthetic data for you that we’ll explore in the last section.
Now that you’ve gathered or created your dataset, it’s time to dive into the evaluation phase. Evaluating a RAG model involves two key areas: retrieval and generation. Both are important, and understanding how to assess each will help you fine-tune your model to better meet your needs.
Evaluating Retrieval: How Relevant is the Retrieved Data?
The retrieval step in RAG is crucial — if your model can’t pull the right information, it’s going to struggle with generating accurate responses. Here are two key metrics you’ll want to focus on:
- Context Relevancy: This measures how well the retrieved context aligns with the query. Essentially, you’re asking: Is this information actually relevant to the question being asked? You can use your dataset to calculate relevance scores, either by human judgment or by comparing similarity metrics (like cosine similarity) between the query and the retrieved document (see the sketch after this list).
- Context Recall: Context recall focuses on how much relevant information was retrieved. It’s possible that the right document was pulled, but only part of the necessary information was included. To evaluate recall, you need to check whether the context your model pulled contains all the key pieces of information to fully answer the query. Ideally, you want high recall: your retrieval should grab the information you need and nothing critical should be left behind.
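Here’s a minimal sketch of the similarity-based approach to context relevancy. It uses sentence-transformers as an example embedding model; any embedding model, including the one your retriever already uses, works the same way.

```python
# Score context relevancy as cosine similarity between the query and each
# retrieved context. Requires the sentence-transformers package.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

query = "What was the change in revenue for Q1 2024?"
retrieved_contexts = [
    "Company XYZ saw a 10% increase in revenue for Q1 2024",
    "Company XYZ opened a new office in Berlin",
]

query_emb = model.encode(query, convert_to_tensor=True)
context_embs = model.encode(retrieved_contexts, convert_to_tensor=True)

scores = util.cos_sim(query_emb, context_embs)[0]
for context, score in zip(retrieved_contexts, scores):
    print(f"{float(score):.2f}  {context}")
# Contexts scoring below a threshold you pick empirically (say, 0.5) can be
# flagged as irrelevant and spot-checked by hand.
```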
Evaluating Generation: Is the Response Both Accurate and Useful?
Once the right information is retrieved, the next step is generating a response that not only answers the query but does so faithfully and clearly. Here are two critical aspects to evaluate:
- Faithfulness: This measures whether the generated response accurately reflects the retrieved context. Essentially, you want to avoid hallucinations — where the model makes up information that wasn’t in the retrieved data. Faithfulness is about ensuring that the answer is grounded in the facts presented by the documents your model retrieved (a simple LLM-based spot check is sketched after this list).
- Answer Relevancy: This refers to how well the generated answer matches the query. Even if the information is faithful to the retrieved context, it still needs to be relevant to the question being asked. You don’t want your model to pull out correct information that doesn’t quite answer the user’s question.
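One lightweight way to spot-check faithfulness is to reuse the LLM-as-a-judge idea: show a model the retrieved context and the generated answer and ask whether the answer is fully supported. The sketch below assumes an OpenAI-style client and a placeholder model name; in practice you’d want a graded score and a few-shot prompt rather than a bare yes/no.

```python
# A simple LLM-based faithfulness check: ask a judge model whether every claim
# in the generated answer is supported by the retrieved context.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FAITHFULNESS_PROMPT = """Context:
{context}

Answer:
{answer}

Is every factual claim in the answer supported by the context?
Reply with only "yes" or "no"."""

def is_faithful(context: str, answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": FAITHFULNESS_PROMPT.format(context=context, answer=answer)}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

context = "Company XYZ saw a 10% increase in revenue for Q1 2024."
print(is_faithful(context, "Revenue grew 10% in Q1 2024."))  # expected: True
print(is_faithful(context, "Revenue grew 25% in Q1 2024."))  # expected: False
```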
Doing a Weighted Evaluation
Once you’ve assessed both retrieval and generation, you can go a step further by combining these evaluations in a weighted way. Maybe you care more about relevancy than recall, or perhaps faithfulness is your top priority. You can assign different weights to each metric depending on your specific use case.
For example:
- Retrieval: 60% context relevancy + 40% context recall
- Generation: 70% faithfulness + 30% answer relevancy
This kind of weighted evaluation gives you flexibility in prioritizing what matters most for your application. If your model needs to be 100% factually accurate (like in legal or medical contexts), you may put more weight on faithfulness. On the other hand, if completeness is more important, you might focus on recall.
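In code, the combination is just a weighted average. The metric scores below are made-up numbers purely to show the calculation, using the example weights from above.

```python
# Combine per-metric scores into weighted retrieval and generation scores.
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metric scores; weights are expected to sum to 1.0."""
    return sum(scores[name] * weight for name, weight in weights.items())

retrieval = weighted_score(
    {"context_relevancy": 0.82, "context_recall": 0.70},  # example metric scores
    {"context_relevancy": 0.6, "context_recall": 0.4},    # 60% / 40% split
)
generation = weighted_score(
    {"faithfulness": 0.95, "answer_relevancy": 0.78},
    {"faithfulness": 0.7, "answer_relevancy": 0.3},        # 70% / 30% split
)
print(f"retrieval={retrieval:.2f}, generation={generation:.2f}")
```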
If creating your own evaluation system feels overwhelming, don’t worry — there are some great existing frameworks that have already done a lot of the heavy lifting for you. These frameworks come with built-in metrics designed specifically to evaluate RAG systems, making it easier to assess retrieval and generation performance. Let’s look at a few of the most helpful ones.
RAGAS is a purpose-built framework designed to assess the performance of RAG models. It includes metrics that evaluate both retrieval and generation, offering a comprehensive way to measure how well your system is doing at each step. It also offers synthetic test data generation by employing an evolutionary generation paradigm.
Inspired by works like Evol-Instruct, Ragas achieves this by employing an evolutionary generation paradigm, where questions with different characteristics such as reasoning, conditioning, multi-context, and more are systematically crafted from the provided set of documents. — RAGAS documentation
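As a rough idea of what using RAGAS looks like, here’s a sketch based on the 0.1-style API. The library evolves quickly, so check the current documentation for exact imports and column names; the LLM-based metrics also expect an LLM API key (such as OPENAI_API_KEY) to be configured.

```python
# Evaluate a small RAG dataset with RAGAS (0.1-style API; verify against current docs).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

data = {
    "question": ["What product did company XYZ launch?"],
    "answer": ["Company XYZ launched product ABC."],
    "contexts": [[
        "After launching product ABC, company XYZ saw a 10% increase in revenue for 2024 Q1."
    ]],
    "ground_truth": ["Product ABC"],  # column names vary between RAGAS versions
}

dataset = Dataset.from_dict(data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)  # per-metric scores for the dataset
```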
ARES is another powerful tool that combines synthetic data generation with LLM-based evaluation. ARES uses synthetic data — data generated by AI models rather than collected from real-world interactions — to build a dataset that can be used to test and refine your RAG system.
The framework also includes an LLM judge, which, as we discussed earlier, can help evaluate model outputs by comparing them to human annotations or other reference data.
Even without ground truth data, these strategies can help you effectively evaluate a RAG system. Whether you’re using vector similarity thresholds, multiple LLMs, LLM-as-a-judge, retrieval metrics, or frameworks, each approach gives you a way to measure performance and improve your model’s results. The key is finding what works best for your specific needs — and not being afraid to tweak things along the way. 🙂