The P&F data science team faces a challenge: they must weigh each expert's opinion equally, but they cannot satisfy everyone. Instead of relying on the experts' subjective opinions, they decide to evaluate the chatbot on historical customer questions. Experts no longer need to come up with questions to test the chatbot, and the evaluation moves closer to real-world conditions. After all, the initial reason for involving experts was that they understand real customer questions better than the P&F data science team does.
It turns out that commonly asked questions at P&F concern paper clip technical instructions: customers want to know the detailed technical specifications of the paper clips. P&F has thousands of different paper clip types, so it takes customer support a long time to answer these questions.
Following test-driven development principles, the data science team creates a dataset from the conversation history that pairs each customer question with the customer support reply.
With a dataset of questions and answers, P&F can test and evaluate the chatbot’s performance retrospectively. They add a new column, “Chatbot reply”, and store the chatbot’s reply to each question in it.
We can have the experts and GPT-4 evaluate the quality of the chatbot’s replies. The ultimate goal is to automate the chatbot accuracy evaluation by utilizing GPT-4. This is possible if experts and GPT-4 evaluate the replies similarly.
Experts create a new Excel sheet with each expert’s evaluation, and the data science team adds the GPT-4 evaluation.
The experts sometimes disagree on how to evaluate the same chatbot reply. GPT-4’s evaluations are similar to the experts’ majority vote, which indicates that the evaluation could be automated with GPT-4. However, each expert’s opinion is valuable, and it’s important to address the conflicting evaluation preferences among the experts.
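The comparison between GPT-4 and the expert majority can be sketched in a few lines of Python. This is a minimal illustration, not P&F's actual tooling: the label values, the tie-breaking rule, the function names, and the sample evaluations are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label among the experts.

    Assumption: a tie is resolved conservatively as "incorrect".
    """
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "incorrect"
    return ranked[0][0]


def agreement_rate(expert_votes, gpt4_labels):
    """Share of replies where GPT-4 matches the expert majority vote."""
    majority = [majority_vote(votes) for votes in expert_votes]
    matches = sum(m == g for m, g in zip(majority, gpt4_labels))
    return matches / len(gpt4_labels)


# Hypothetical evaluations of four chatbot replies by three experts.
expert_votes = [
    ["correct", "correct", "incorrect"],
    ["incorrect", "incorrect", "incorrect"],
    ["correct", "incorrect", "correct"],
    ["incorrect", "correct", "incorrect"],
]
# Hypothetical GPT-4 evaluations of the same four replies.
gpt4_labels = ["correct", "incorrect", "correct", "correct"]

rate = agreement_rate(expert_votes, gpt4_labels)
print(f"GPT-4 agrees with the expert majority on {rate:.0%} of replies")
```

A high agreement rate supports automating the evaluation, while the rows where the experts themselves split are the ones worth discussing with them directly.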
P&F organizes a workshop with the experts to create gold standard responses for the historical question dataset, along with evaluation best practice guidelines to which all experts agree.
With the insights from the workshop, the data science team can create a more detailed evaluation prompt for GPT-4 that covers edge cases (e.g. “chatbot should not ask to raise support tickets”). Now the experts can spend their time improving the paper clip documentation and defining best practices, instead of performing laborious chatbot evaluations.
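One way to assemble such an evaluation prompt is sketched below. The function name, the prompt wording, and the sample question and answers are hypothetical; only the support-ticket guideline comes from the workshop example above. Sending the prompt to GPT-4 is left out, since it depends on the client library in use.

```python
def build_evaluation_prompt(question, gold_answer, chatbot_reply, guidelines):
    """Combine the workshop guidelines and one dataset row into a prompt."""
    rules = "\n".join(f"- {rule}" for rule in guidelines)
    return (
        "You are evaluating a customer support chatbot.\n"
        "Judge the chatbot reply against the gold standard answer.\n"
        f"Guidelines:\n{rules}\n\n"
        f"Customer question: {question}\n"
        f"Gold standard answer: {gold_answer}\n"
        f"Chatbot reply: {chatbot_reply}\n\n"
        "Answer with exactly one word: correct or incorrect."
    )


# Guidelines agreed on in the workshop (only the first is from the text).
guidelines = [
    "The chatbot should not ask to raise support tickets.",
    "The reply must match the gold standard on all technical values.",
]

# Hypothetical dataset row.
prompt = build_evaluation_prompt(
    question="What is the wire diameter of this paper clip?",
    gold_answer="The wire diameter is listed in the product data sheet.",
    chatbot_reply="Please raise a support ticket for this question.",
    guidelines=guidelines,
)
print(prompt)
```

Keeping the guidelines in a plain list makes it easy to add a new edge case after each expert review without rewriting the rest of the prompt.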
By measuring the percentage of correct chatbot replies, P&F can decide whether they want to deploy the chatbot to the support channel. They find the accuracy acceptable and deploy the chatbot.
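The deployment decision reduces to a simple calculation over the evaluation labels. The sketch below assumes binary "correct"/"incorrect" labels and a 90% threshold; both are illustrative choices, and the acceptable accuracy is a business decision that the text does not specify.

```python
DEPLOY_THRESHOLD = 0.9  # hypothetical value, not stated by P&F


def accuracy(labels):
    """Fraction of chatbot replies evaluated as correct."""
    if not labels:
        raise ValueError("no evaluations to score")
    return sum(label == "correct" for label in labels) / len(labels)


def should_deploy(labels, threshold=DEPLOY_THRESHOLD):
    """Deploy only if the measured accuracy reaches the threshold."""
    return accuracy(labels) >= threshold


# Hypothetical evaluation results for ten historical questions.
labels = ["correct"] * 9 + ["incorrect"]
print(f"accuracy={accuracy(labels):.0%}, deploy={should_deploy(labels)}")
```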
Finally, it’s time to save all the chatbot responses and calculate how well the chatbot resolves real customer inquiries. Because the customer can respond directly to the chatbot, it is also important to record the customer’s response in order to understand the customer’s sentiment.
The same evaluation workflow can be used to measure the chatbot’s success factually, even without ground truth replies. However, customers now get their initial reply from a chatbot, and we do not know whether they like it. We should investigate how customers react to the chatbot’s replies. We can automatically detect negative sentiment in the customers’ replies and assign customer support specialists to handle angry customers.
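The routing step could look like the sketch below. The keyword check is only a stand-in for a real sentiment model (for example, a GPT-4 prompt or a trained classifier), and the marker list, field names, and sample conversations are all hypothetical.

```python
# Stand-in markers; a real system would use a sentiment model instead.
NEGATIVE_MARKERS = ("useless", "angry", "terrible", "not helpful")


def is_negative(reply):
    """Toy sentiment check: flags replies containing a negative marker."""
    text = reply.lower()
    return any(marker in text for marker in NEGATIVE_MARKERS)


def conversations_to_escalate(conversations, classifier=is_negative):
    """Return the ids of conversations a support specialist should take over."""
    return [
        conv["id"]
        for conv in conversations
        if classifier(conv["customer_reply"])
    ]


# Hypothetical customer reactions to chatbot replies.
conversations = [
    {"id": 1, "customer_reply": "Thanks, that is exactly what I needed."},
    {"id": 2, "customer_reply": "This is useless, I asked about the clip size."},
    {"id": 3, "customer_reply": "Not helpful at all."},
]

print(conversations_to_escalate(conversations))
```

Passing the classifier as a parameter keeps the routing logic unchanged when the keyword check is later replaced with a proper sentiment model.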