In the previous article, we discussed how to do Topic Modelling using ChatGPT and got excellent results. The task was to look at customer reviews for hotel chains and define the main topics mentioned in the reviews.
In the previous iteration, we used the standard ChatGPT completions API and sent raw prompts ourselves. Such an approach works well for ad-hoc analytical research.
However, if your team is actively using and monitoring customer reviews, it’s worth considering some automation. Good automation will not only give you an autonomous pipeline, but it will also be more convenient (even team members unfamiliar with LLMs and coding will be able to access this data) and more cost-effective (you send all texts to the LLM and pay only once).
If we are building a sustainable, production-ready service, it’s worth leveraging existing frameworks to reduce the amount of glue code and get a more modular solution (so that we can easily switch, for example, from one LLM to another).
In this article, I would like to tell you about one of the most popular frameworks for LLM applications — LangChain. Also, we will look in detail at how to evaluate your model’s performance, since it’s a crucial step for business applications.
Revising initial approach
First, let’s revise our previous approach for ad-hoc Topic Modelling with ChatGPT.
Step 1: Get a representative sample.
We want to determine the list of topics we will use for our markup. The most straightforward way is to send all reviews and ask LLM to define the list of 20–30 topics mentioned in our reviews. Unfortunately, we won’t be able to do it since it won’t fit the context size. We could use a map-reduce approach, but it could be costly. That’s why we would like to define a representative sample.
For this, we built a BERTopic topic model and got the most representative reviews for each topic.
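As a reminder, a minimal sketch of this step might look like the following (a rough illustration rather than the exact code from the previous article; reviews here is assumed to be a list of customer review texts, and nr_topics is an arbitrary choice).
from bertopic import BERTopic

# Fit a topic model on the reviews and keep the most representative documents per topic.
topic_model = BERTopic(nr_topics=30, verbose=True)
topics, probs = topic_model.fit_transform(reviews)

# get_representative_docs() returns a dict: topic id -> its most representative reviews
representative_docs = topic_model.get_representative_docs()
sample_reviews = [doc for topic_docs in representative_docs.values() for doc in topic_docs]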
Step 2: Determine the list of topics we will use for markup.
The next step is to pass all the selected texts to ChatGPT and ask it to define a list of topics mentioned in these reviews. Then, we can use these topics for later markup.
Step 3: Doing topics’ markup in batches.
The last step is the most straightforward — we can send customer reviews in batches that fit the context size and ask LLM to return topics for each customer review.
Finally, with these three steps, we could determine the list of relevant topics for our texts and classify them all.
It works perfectly for one-time research. However, we are missing some bits for an excellent production-ready solution.
From ad-hoc to production
Let’s discuss what improvements we could make to our initial ad-hoc approach.
- In the previous approach, we have a static list of topics. But in real-life applications, new topics might arise over time, for example, if you launch a new feature. So, we need a feedback loop to update the list of topics we are using. The easiest way to do it is to capture the reviews without any assigned topics and regularly run topic modelling on them (see the sketch after this list).
- If we are doing one-time research, we can validate the results of the topics’ assignments manually. But for the process that is running in production, we need to think about a continuous evaluation.
- If we are building a pipeline for customer review analysis, we should consider more potential use cases and store other related information we might need. For example, it’s helpful to store translated versions of customer reviews so that our colleagues don’t have to use Google Translate all the time. Also, sentiment and other features (for example, products mentioned in the customer review) might be valuable for analysis and filtering.
- The LLM industry is progressing quite quickly right now, and everything is changing all the time. It’s worth considering a modular approach where we can quickly iterate and try new approaches over time without rewriting the whole service from scratch.
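Coming back to the feedback loop from the first point, the capture step could be as simple as the sketch below (assuming assigned_comments is a hypothetical list of dicts with 'review' and 'topics' keys produced by our pipeline).
# Hypothetical feedback loop: collect the reviews that got no topics assigned,
# so that we can periodically re-run topic modelling on them and spot new topics.
unassigned_reviews = [
    rec['review'] for rec in assigned_comments
    if len(rec['topics']) == 0
]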
We have a lot of ideas on what to do with our topic modelling service. But let’s focus on the main parts: modular approach instead of API calls and evaluation. The LangChain framework will help us with both topics, so let’s learn more about it.
LangChain is a framework for building applications powered by Language Models. Here are the main components of LangChain:
- Schema covers the most basic classes, such as Documents, Chat Messages and Texts.
- Models. LangChain provides access to LLMs, Chat Models and Text Embedding models that you can easily use in your applications and switch between if needed. It goes without saying that it supports popular models such as ChatGPT, Anthropic’s Claude and Llama.
- Prompts is functionality that helps you work with prompts, including prompt templates, output parsers and example selectors for few-shot prompting.
- Chains are the core of LangChain (as you might guess by the name). Chains help you to build a sequence of blocks that will be executed. You can truly appreciate this functionality if you’re building a complex application.
- Indexes: document loaders, text splitters, vector stores and retrievers. This module provides tools that help LLMs to interact with your documents. This functionality would be valuable if you’re building a Q&A use case. We won’t be using this functionality much in our example today.
- Memory. LangChain provides a whole set of methods to manage and limit memory. This functionality is primarily needed for ChatBot scenarios.
- Agents are one of the latest and most powerful features. If you are a heavy ChatGPT user, you must have heard about plugins. The idea is the same: you can empower the LLM with a set of custom or predefined tools (like Google Search or Wikipedia), and then the agent can use them while answering your questions. In this setup, the LLM acts as a reasoning engine and decides what it needs to do to achieve the result and when it has the final answer it can share. It’s exciting functionality, so it’s definitely worth a separate discussion.
So, LangChain can help us build modular applications and switch between different components (for example, from ChatGPT to Anthropic or from CSV as data input to Snowflake DB). LangChain has more than 190 integrations, so it can save you quite a lot of time.
Also, we could reuse ready-made chains for some use cases instead of starting from scratch.
When calling ChatGPT API manually, we have to manage quite a lot of Python glue code to make it work. It’s not a problem when you’re working on a small, straightforward task, but it might become unmanageable when you need to build something more complex and convoluted. In such cases, LangChain may help you eliminate this glue code and create more easy-to-maintain modular code.
However, LangChain has its own limitations:
- It’s primarily focused on OpenAI models, so it might not work so smoothly with on-premise open-source models.
- The flip side of convenience is that it’s not easy to understand what’s going on under the hood and when and how the ChatGPT API you’re paying for is executed. You can use debug mode, but you need to specify it and go through the complete logs for a clearer view.
- Despite pretty good documentation, I struggle from time to time to find answers to my questions. There are not many other tutorials and resources on the internet apart from the official documentation; quite frequently, you see only the official pages in Google.
- The LangChain library is progressing a lot, and the team constantly ships new features. So, the library is not mature yet, and you might have to move away from functionality you’re using. For example, the SequentialChain class is now considered legacy and might be deprecated in the future, since LCEL has been introduced — we will talk about it in more detail later on.
We’ve gotten a bird’s-eye overview of LangChain functionality, but practice makes perfect. Let’s start using LangChain.
Let’s refactor the topic assignment since it will be the most common operation in our regular process, and it will help us understand how to use LangChain in practice.
First of all, we need to install the package.
!pip install --upgrade langchain
Loading documents
To work with the customers’ reviews, we first need to load them. For that, we could use Document Loaders. In our case, customer reviews are stored as a set of .txt files in a Directory, but you can effortlessly load docs from third-party tools. For example, there’s an integration with Snowflake.
We will use DirectoryLoader to load all files in the directory, since we have separate files for different hotels. For each file, we will specify TextLoader as a loader (by default, a loader for unstructured documents is used). Our files are encoded in ISO-8859-1, so the default call returns an error. However, LangChain can automatically detect the encoding used. With such a setup, it works OK.
from langchain.document_loaders import TextLoader, DirectoryLoader

text_loader_kwargs = {'autodetect_encoding': True}
loader = DirectoryLoader('./hotels/london', show_progress=True,
                         loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
docs = loader.load()
len(docs)
82
Splitting documents
Now, we would like to split our documents. We know that each file consists of a set of customer comments delimited by \n. Since our case is very straightforward, we will use the most basic CharacterTextSplitter, which splits documents by character. When working with real documents (whole long texts instead of independent short comments), it’s better to use the recursive split by character, since it allows you to split documents into chunks in a smarter way.
However, LangChain is more suited for fuzzy text splitting. So, I had to hack it a bit to make it work the way I wanted.
How it works:
- You specify chunk_size and chunk_overlap, and the splitter tries to make the minimal number of splits so that each chunk is smaller than chunk_size. If it fails to create a small enough chunk, it prints a message to the Jupyter Notebook output.
- If you specify too big a chunk_size, not all comments will be separated.
- If you specify too small a chunk_size, you will get print statements for each comment in your output, leading to the Notebook reloading. Unfortunately, I couldn’t find any parameters to switch it off.
To overcome this problem, I specified length_function as a constant equal to chunk_size. Then I got just a standard split by character. LangChain provides enough flexibility to do what you want, but only in a somewhat hacky way.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
separator = "\n",
chunk_size = 1,
chunk_overlap = 0,
length_function = lambda x: 1, # usually len is used
is_separator_regex = False
)
split_docs = text_splitter.split_documents(docs)
len(split_docs)
12890
Also, let’s add the document ID to the metadata — we will use it later.
for i in range(len(split_docs)):
    split_docs[i].metadata['id'] = i
The advantage of using Documents is that each one now automatically has its data source in the metadata, and we can filter data by it. For example, we can filter only the comments related to Travelodge Hotel.
list(filter(
lambda x: 'travelodge' in x.metadata['source'],
split_docs
))
Next, we need a model. As we discussed earlier, in LangChain there are LLMs and Chat Models. The main difference is that LLMs take a text and return a text, while Chat Models are more suitable for conversational use cases and can take a set of messages as input. In our case, we will use the Chat Model for OpenAI since we would like to pass system messages as well.
from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(temperature=0.0, model="gpt-3.5-turbo",
                  openai_api_key="your_key")
Prompts
Let’s move on to the most important part — our prompt. In LangChain, there’s a concept of Prompt Templates. They help to reuse prompts parametrised by variables. It’s helpful since, in real-life applications, prompts might be very detailed and sophisticated. So, prompt templates can be a useful high-level abstraction that would help you to manage your code effectively.
Since we are going to use the Chat Model, we will need ChatPromptTemplate.
But before jumping into prompts, let’s briefly discuss a helpful feature — an output parser. Surprisingly, they can help us to create an effective prompt. We can define the desired output, generate an output parser and then use the parser to create instructions for the prompt.
Let’s define what we would like to see in the output. First, we would like to be able to pass a list of customer reviews to the prompt to process them in batches, so in the result, we would like to get a list with the following parameters:
- id to identify documents,
- list of topics from the predefined list (we will be using the list from our previous iteration),
- sentiment (negative, neutral or positive).
Let’s specify our output parser. Since we need a pretty complex JSON structure, we will use Pydantic Output Parser instead of the most commonly used Structured Output Parser.
For that, we need to create a class inherited from BaseModel and specify all the fields we need with names and descriptions (so that the LLM could understand what we expect in the response).
from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field
from typing import List

class CustomerCommentData(BaseModel):
    doc_id: int = Field(description="doc_id from the input parameters")
    topics: List[str] = Field(description="List of the relevant topics \
        for the customer review. Please, include only topics from \
        the provided list.")
    sentiment: str = Field(description="sentiment of the comment (positive, neutral or negative)")

output_parser = PydanticOutputParser(pydantic_object=CustomerCommentData)
Now, we can use this parser to generate formatting instructions for our prompt. That’s a fantastic case where you can rely on prompting best practices and spend less time on prompt engineering.
format_instructions = output_parser.get_format_instructions()
print(format_instructions)
Then, it’s time to move on to our prompt. We took a batch of comments and formatted them into the expected format. Then, we created a prompt message with a bunch of variables: topics_descr_list, format_instructions and input_data. After that, we created chat prompt messages consisting of a constant system message and a prompt message. The last step is to format the chat prompt messages with actual values.
from langchain.prompts import ChatPromptTemplate

docs_batch_data = []
for rec in docs_batch:
    docs_batch_data.append(
        {
            'id': rec.metadata['id'],
            'review': rec.page_content
        }
    )
topic_assignment_msg = '''
Below is a list of customer reviews in JSON format with the following keys:
1. doc_id - identifier for the review
2. review - text of customer review
Please, analyse provided reviews and identify the main topics and sentiment. Include only topics from the provided below list.
List of topics with descriptions (delimited with ":"):
{topics_descr_list}
Output format:
{format_instructions}
Customer reviews:
```
{input_data}
```
'''
topic_assignment_template = ChatPromptTemplate.from_messages([
("system", "You're a helpful assistant. Your task is to analyse hotel reviews."),
("human", topic_assignment_msg)
])
topics_list = '\n'.join(
map(lambda x: '%s: %s' % (x['topic_name'], x['topic_description']),
topics))
messages = topic_assignment_template.format_messages(
topics_descr_list = topics_list,
format_instructions = format_instructions,
input_data = json.dumps(docs_batch_data)
)
Now, we can pass these formatted messages to LLM and see a response.
response = chat(messages)
type(response.content)
# str

print(response.content)
We got the response as a string object, but we could leverage our parser and get a list of CustomerCommentData class objects as a result.
response_dict = list(map(lambda x: output_parser.parse(x),
response.content.split('\n')))
response_dict
So, we’ve leveraged LangChain and some of its features and have already built a slightly smarter solution that can assign topics to comments in batches (which saves us some costs) and define not only topics but also sentiment.
So far, we’ve built only single LLM calls without any relations and sequencing. However, in real life, we often want to split our tasks into multiple steps. For that, we can use Chains. Chain is the fundamental building block for LangChain.
LLMChain
The most basic type of chain is an LLMChain. It is a combination of LLM and prompt.
So we can rewrite our logic into a chain. This code will give us absolutely the same result as before, but it’s pretty convenient to have one method that defines it all.
from langchain.chains import LLMChain

topic_assignment_chain = LLMChain(llm=chat, prompt=topic_assignment_template)
response = topic_assignment_chain.run(
topics_descr_list = topics_list,
format_instructions = format_instructions,
input_data = json.dumps(docs_batch_data)
)
Sequential Chains
LLM chain is very basic. The power of chains is in building more complex logic. Let’s try to create something more advanced.
The idea of sequential chains is to use the output of one chain as the input for another.
For defining chains, we will be using LCEL (LangChain Expression Language). This new language was introduced just a couple of months ago, and now all the old approaches with SimpleSequentialChain or SequentialChain are considered legacy. So, it’s worth spending some time understanding the LCEL concept.
Let’s rewrite the previous chain in LCEL.
chain = topic_assignment_template | chat
response = chain.invoke(
{
'topics_descr_list': topics_list,
'format_instructions': format_instructions,
'input_data': json.dumps(docs_batch_data)
}
)
If you want to learn it first-hand, I suggest you watch this video about LCEL from the LangChain team.
Using sequential chains
In some cases, it might be helpful to have several sequential calls so that the output of one chain is used in the other ones.
In our case, we can first translate reviews into English and then do topic modelling and sentiment analysis.
from langchain.schema import StrOutputParser
from operator import itemgetter

# translation
translate_msg = '''
Below is a list of customer reviews in JSON format with the following keys:
1. doc_id - identifier for the review
2. review - text of customer review
Please, translate review into English and return the same JSON back. Please, return in the output ONLY valid JSON without any other information.
Customer reviews:
```
{input_data}
```
'''
translate_template = ChatPromptTemplate.from_messages([
("system", "You're an API, so you return only valid JSON without any comments."),
("human", translate_msg)
])
# topic assignment & sentiment analysis
topic_assignment_msg = '''
Below is a list of customer reviews in JSON format with the following keys:
1. doc_id - identifier for the review
2. review - text of customer review
Please, analyse provided reviews and identify the main topics and sentiment. Include only topics from the provided below list.
List of topics with descriptions (delimited with ":"):
{topics_descr_list}
Output format:
{format_instructions}
Customer reviews:
```
{translated_data}
```
'''
topic_assignment_template = ChatPromptTemplate.from_messages([
("system", "You're a helpful assistant. Your task is to analyse hotel reviews."),
("human", topic_assignment_msg)
])
# defining chains
translate_chain = translate_template | chat | StrOutputParser()

topic_assignment_chain = (
    {'topics_descr_list': itemgetter('topics_descr_list'),
     'translated_data': translate_chain,
     'format_instructions': itemgetter('format_instructions')}
    | topic_assignment_template | chat
)
# execution
response = topic_assignment_chain.invoke(
{
'topics_descr_list': topics_list,
'format_instructions': format_instructions,
'input_data': json.dumps(docs_batch_data)
}
)
We similarly defined prompt templates for translation and topic assignment. Then, we determined the translation chain. The only new thing here is the usage of StrOutputParser(), which converts response objects into strings (no rocket science).
Then, we defined the full chain, specifying the input parameters, the prompt template and the LLM. For the input parameters, we take translated_data from the output of translate_chain and the other parameters from the invoke input using the itemgetter function.
However, in our case, such a combined chain might not be so convenient, since we would also like to save the output of the first chain to keep the translated values.
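One possible workaround (a sketch, assuming the RunnablePassthrough.assign helper available in recent LangChain versions) is to add the intermediate results as extra keys instead of piping them straight through, so both the translation and the topic assignment end up in the output.
from langchain.schema.runnable import RunnablePassthrough

# Sketch: keep the translated reviews alongside the topic assignment results.
chain_with_translation = (
    RunnablePassthrough.assign(translated_data=translate_chain)
    | RunnablePassthrough.assign(
        topics_output=topic_assignment_template | chat | StrOutputParser()
    )
)

response = chain_with_translation.invoke(
    {
        'topics_descr_list': topics_list,
        'format_instructions': format_instructions,
        'input_data': json.dumps(docs_batch_data)
    }
)
# response is a dict containing both 'translated_data' and 'topics_output'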
With chains, everything becomes a bit more convoluted, so we might need some debugging capabilities. There are two options for debugging.
The first one is that you can switch on debugging locally.
import langchain
langchain.debug = True
The other option is to use the LangChain platform — LangSmith. However, it’s still in beta-tester mode, so you might need to wait to get access.
Routing
One of the most complex cases of chains is routing when you use different prompts for different use cases. For example, we could save different customer review parameters depending on the sentiment:
- If the comment is negative, we will store the list of problems mentioned by the customer.
- Otherwise, we will get the list of good points from the review.
To use a routing chain, we will need to pass comments one by one instead of batching them as we did before.
So, at a high level, our chain will first determine the comment’s sentiment and then route the comment to the appropriate branch.
First, we need to define the main chain that determines the sentiment. This chain consists of a prompt, the LLM and the already familiar StrOutputParser().
sentiment_msg = '''
Given the customer comment below please classify whether it's negative. If it's negative, return "negative", otherwise return "positive".
Do not respond with more than one word.
Customer comment:
```
{input_data}
```
'''
sentiment_template = ChatPromptTemplate.from_messages([
("system", "You're an assistant. Your task is to markup sentiment for hotel reviews."),
("human", sentiment_msg)
])
sentiment_chain = sentiment_template | chat | StrOutputParser()
For positive reviews, we will ask the model to extract good points, while for negative ones — problems. So, we will need two different chains.
We will use the same Pydantic output parsers as before to specify the intended output format and generate instructions.
We used partial_variables on top of the general topic assignment prompt message to specify different format instructions for positive and negative cases.
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate

# defining structure for positive and negative cases
class PositiveCustomerCommentData(BaseModel):
    topics: List[str] = Field(description="List of the relevant topics for the customer review. Please, include only topics from the provided list.")
    advantages: List[str] = Field(description="List of the good points that the customer mentioned")
    sentiment: str = Field(description="sentiment of the comment (positive, neutral or negative)")

class NegativeCustomerCommentData(BaseModel):
    topics: List[str] = Field(description="List of the relevant topics for the customer review. Please, include only topics from the provided list.")
    problems: List[str] = Field(description="List of the problems that the customer mentioned.")
    sentiment: str = Field(description="sentiment of the comment (positive, neutral or negative)")
# defining output parsers and generating instructions
positive_output_parser = PydanticOutputParser(pydantic_object=PositiveCustomerCommentData)
positive_format_instructions = positive_output_parser.get_format_instructions()
negative_output_parser = PydanticOutputParser(pydantic_object=NegativeCustomerCommentData)
negative_format_instructions = negative_output_parser.get_format_instructions()
general_topic_assignment_msg = '''
Below is a customer review delimited by ```.
Please, analyse the provided review and identify the main topics and sentiment. Include only topics from the provided below list.
List of topics with descriptions (delimited with ":"):
{topics_descr_list}
Output format:
{format_instructions}
Customer reviews:
```
{input_data}
```
'''
# defining prompt templates
positive_topic_assignment_template = ChatPromptTemplate(
messages=[
SystemMessagePromptTemplate.from_template("You're a helpful assistant. Your task is to analyse hotel reviews."),
HumanMessagePromptTemplate.from_template(general_topic_assignment_msg)
],
input_variables=["topics_descr_list", "input_data"],
partial_variables={"format_instructions": positive_format_instructions} )
negative_topic_assignment_template = ChatPromptTemplate(
messages=[
SystemMessagePromptTemplate.from_template("You're a helpful assistant. Your task is to analyse hotel reviews."),
HumanMessagePromptTemplate.from_template(general_topic_assignment_msg)
],
input_variables=["topics_descr_list", "input_data"],
partial_variables={"format_instructions": negative_format_instructions} )
So, now we just need to build the full chain. The main logic is defined using RunnableBranch and a condition based on sentiment, the output of sentiment_chain.
from langchain.schema.runnable import RunnableBranch

branch = RunnableBranch(
(lambda x: "negative" in x["sentiment"].lower(), negative_chain),
positive_chain
)
full_route_chain = {
"sentiment": sentiment_chain,
"input_data": lambda x: x["input_data"],
"topics_descr_list": lambda x: x["topics_descr_list"]
} | branch
full_route_chain.invoke({'input_data': review,
'topics_descr_list': topics_list})
It works pretty well and returns different objects depending on the sentiment.
We’ve looked in detail at a modular approach to Topic Modelling using LangChain and at ways to introduce more complex logic. Now, it’s time to move on to the second part and discuss how we can assess the model’s performance.
The crucial part of any system running in production is evaluation. When we have an LLM model running in production, we want to ensure quality and keep an eye on it over time.
In many cases, you can use not only human-in-the-loop (when people check the model results for a small sample over time to control performance) but also leverage an LLM for this task. It could be a good idea to use a more complex model for runtime checks. For example, we used ChatGPT 3.5 for our topic assignments, but we could use GPT-4 for evaluation (similar to supervision in real life, when you ask a more senior colleague for a code review).
LangChain can help us with this task as well since it provides some tools to evaluate results:
- String Evaluators help to evaluate results from your model. There is quite a broad set of tools, from validating the format to assessing correctness based on provided context or reference. We will talk about these methods in detail below.
- The other class of evaluators are Comparison evaluators. They will be handy if you want to assess the performance of 2 different LLM models (A/B testing use case). We won’t go into their details today.
Exact match
The most straightforward approach is to compare the model’s output to the correct answer (e.g. from experts or a training set) using an exact match. For that, we could use ExactMatchStringEvaluator, for example, to assess the performance of our sentiment analysis. In this case, we don’t need LLMs.
from langchain.evaluation import ExactMatchStringEvaluator

evaluator = ExactMatchStringEvaluator(
ignore_case=True,
ignore_numbers=True,
ignore_punctuation=True,
)
evaluator.evaluate_strings(
prediction="positive.",
reference="Positive"
)
# {'score': 1}
evaluator.evaluate_strings(
prediction="negative",
reference="Positive"
)
# {'score': 0}
You can build your own custom String Evaluator or match output to a regular expression.
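For example, a regex-based check might look like the sketch below (assuming the RegexMatchStringEvaluator available in recent LangChain versions).
from langchain.evaluation import RegexMatchStringEvaluator

# Sketch: check that the predicted sentiment matches one of the allowed values.
regex_evaluator = RegexMatchStringEvaluator()
regex_evaluator.evaluate_strings(
    prediction="positive",
    reference="positive|neutral|negative"
)
# {'score': 1} means the prediction matches the pattern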
Also, there are helpful tools to validate structured output, whether the output is a valid JSON, has the expected structure and is close to the reference by distance. You can find more details about it in the documentation.
Embeddings distance evaluation
The other handy approach is to look at the distance between embeddings. You will get a score as a result: the lower the score, the better, since the answers are closer to each other. For example, we can compare the good points found using Euclidean distance.
from langchain.evaluation import load_evaluator
from langchain.evaluation import EmbeddingDistance

evaluator = load_evaluator(
"embedding_distance", distance_metric=EmbeddingDistance.EUCLIDEAN
)
evaluator.evaluate_strings(
prediction="well designed rooms, clean, great location",
reference="well designed rooms, clean, great location, good atmosphere"
)
{'score': 0.20732719121627757}
We got a distance of 0.2. However, the results of such an evaluation might be more difficult to interpret, since you will need to look at your data distributions and define thresholds.
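If you want to define such a threshold, one possible approach (a sketch, assuming labelled_pairs is a hypothetical small sample of (prediction, reference, is_good) tuples that you labelled manually) is to compare the score distributions for good and bad answers and pick a cut-off between them.
# Sketch: look at distance distributions for manually labelled good and bad answers.
good_scores, bad_scores = [], []
for prediction, reference, is_good in labelled_pairs:
    score = evaluator.evaluate_strings(
        prediction=prediction,
        reference=reference
    )['score']
    (good_scores if is_good else bad_scores).append(score)

print(sorted(good_scores))  # distances for good answers (expected to be lower)
print(sorted(bad_scores))   # distances for bad answers (expected to be higher)
With that caveat in mind, let’s move on to approaches based on LLMs since we will be able to interpret their results effortlessly.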
Criteria evaluation
You can use LangChain to validate LLM’s answer against some rubric or criteria. There’s a list of predefined criteria, or you can create a custom one.
from langchain.evaluation import Criteria
list(Criteria)
[<Criteria.CONCISENESS: 'conciseness'>,
<Criteria.RELEVANCE: 'relevance'>,
<Criteria.CORRECTNESS: 'correctness'>,
<Criteria.COHERENCE: 'coherence'>,
<Criteria.HARMFULNESS: 'harmfulness'>,
<Criteria.MALICIOUSNESS: 'maliciousness'>,
<Criteria.HELPFULNESS: 'helpfulness'>,
<Criteria.CONTROVERSIALITY: 'controversiality'>,
<Criteria.MISOGYNY: 'misogyny'>,
<Criteria.CRIMINALITY: 'criminality'>,
<Criteria.INSENSITIVITY: 'insensitivity'>,
<Criteria.DEPTH: 'depth'>,
<Criteria.CREATIVITY: 'creativity'>,
<Criteria.DETAIL: 'detail'>]
Some of them don’t require a reference (for example, harmfulness or conciseness). But for correctness, you need to know the answer.
Let’s try to use it for our data.
evaluator = load_evaluator("criteria", criteria="conciseness")
eval_result = evaluator.evaluate_strings(
prediction="well designed rooms, clean, great location",
input="List the good points that customer mentioned",
)
As a result, we got the answer (whether the results fit the specified criterion) and chain-of-thought reasoning so that we could understand the logic behind the result and potentially tweak the prompt.
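The evaluator typically returns a dictionary with the verdict, a numeric score and the reasoning (field names as in recent LangChain versions), so you can inspect it like this.
# Sketch: the criteria evaluator returns the verdict, a score and the reasoning.
print(eval_result['value'])      # 'Y' or 'N' verdict on the criterion
print(eval_result['score'])      # 1 if the criterion is met, 0 otherwise
print(eval_result['reasoning'])  # chain-of-thought explanation from the LLM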
If you’re interested in how it works, you could switch on langchain.debug = True and see the prompt sent to the LLM.
Let’s look at the correctness criterion. To assess it, we need to provide a reference (the correct answer).
evaluator = load_evaluator("labeled_criteria", criteria="correctness")

eval_result = evaluator.evaluate_strings(
prediction="well designed rooms, clean, great location",
input="List the good points that customer mentioned",
reference="well designed rooms, clean, great location, good atmosphere",
)
You can even create your own custom criteria, for example, whether multiple points are mentioned in the answer.
custom_criterion = {"multiple": "Does the output contain multiple points?"}

evaluator = load_evaluator("criteria", criteria=custom_criterion)
eval_result = evaluator.evaluate_strings(
prediction="well designed rooms, clean, great location",
input="List the good points that customer mentioned",
)
Scoring evaluation
With criteria evaluation, we get only a Yes or No answer, but in many cases, it is not enough. For example, in our case, the prediction mentions 3 out of 4 points, which is a good result, but we got "N" when evaluating it for correctness. So, using this approach, the answers "well-designed rooms, clean, great location" and "fast internet" would be equal in terms of our metrics, which won’t give us enough information to understand the model’s performance.
There’s another closely related technique, scoring, where you ask the LLM to provide a score in the output, which might help to get more granular results. Let’s try it.
from langchain.chat_models import ChatOpenAI

accuracy_criteria = {
"accuracy": """
Score 1: The answer doesn't mention any relevant points.
Score 3: The answer mentions only few of relevant points but have major inaccuracies or includes several not relevant options.
Score 5: The answer has moderate quantity of relevant options but might have inaccuracies or wrong points.
Score 7: The answer aligns with the reference and shows most of relevant points and don't have completely wrong options mentioned.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
}
evaluator = load_evaluator(
"labeled_score_string",
criteria=accuracy_criteria,
llm=ChatOpenAI(model="gpt-4"),
)
eval_result = evaluator.evaluate_strings(
prediction="well designed rooms, clean, great location",
input="""Below is a customer review delimited by ```. Provide the list the good points that customer mentioned in the customer review.
Customer review:
```
Small but well designed rooms, clean, great location, good atmosphere. I would stay there again. Continental breakfast is weak but ok.
```
""",
reference="well designed rooms, clean, great location, good atmosphere"
)
We got seven as a score, which looks pretty valid. To see the actual prompt used, you can switch on langchain.debug = True again.
However, I would treat scores from LLMs with a pinch of salt. Remember, it’s not a regression function, and scores might be pretty subjective.
We’ve been using the scoring model with a reference. But in many cases, we might not have the correct answers, or it could be expensive to get them. You can use the scoring evaluator even without reference scores, asking the model to assess the answer. It’s worth using GPT-4 to be more confident in the results.
accuracy_criteria = {
    "recall": "The assistant's answer should include all points mentioned in the question. If information is missing, score the answer lower.",
    "precision": "The assistant's answer should not have any points not present in the question."
}

evaluator = load_evaluator("score_string", criteria=accuracy_criteria,
                           llm=ChatOpenAI(model="gpt-4"))
eval_result = evaluator.evaluate_strings(
prediction="well designed rooms, clean, great location",
input="""Below is a customer review delimited by ```. Provide the list the good points that customer mentioned in the customer review.
Customer review:
```
Small but well designed rooms, clean, great location, good atmosphere. I would stay there again. Continental breakfast is weak but ok.
```
"""
)
We got a pretty close score to the previous one.
We’ve looked at quite a lot of possible ways to validate your output, so I hope you are now ready to test your models’ results.
In this article, we’ve discussed some nuances we need to take into account if we want to use LLMs for production processes.
- We’ve looked at the use of the LangChain framework to make our solution more modular so that we could easily iterate and use new approaches (for example, switching from one LLM to another). Also, frameworks usually help to make our code easier to maintain.
- The other big topic we’ve discussed is the different tools we have to assess the model’s performance. If we are using LLMs in production, we need to have some constant monitoring in place to ensure the quality of our service, and it’s worth spending some time to create an evaluation pipeline based on LLMs or human-in-the-loop.
Thank you very much for reading this article. I hope it was insightful for you. If you have any follow-up questions or comments, please leave them in the comments section.
Ganesan, Kavita and Zhai, ChengXiang. (2011). OpinRank Review Dataset.
UCI Machine Learning Repository. https://doi.org/10.24432/C5QW4W
This article is based on information from the course “LangChain for LLM Application Development” by DeepLearning.AI and LangChain.