Who is Evelyn Hartwell?
Evelyn Hartwell is an American author, speaker, and life coach…
Evelyn Hartwell is a Canadian ballerina and the founding Artistic Director…
Evelyn Hartwell is an American actress known for her roles in the…
No, Evelyn Hartwell is not a con artist with multiple false identities, living a deceptive triple life with various professions. In fact, she doesn’t exist at all, but the model, instead of telling me that it doesn’t know, starts making up facts. We are dealing with an LLM hallucination.
Long, detailed outputs can seem very convincing, even when they are fictional. Does this mean we cannot trust chatbots and have to manually fact-check every output? Fortunately, with the right safeguards, there are ways to make chatbots less likely to produce fabricated content.
For the outputs above, I set a higher temperature of 0.7, allowing the LLM to vary the structure of its sentences so that each generation is not word-for-word identical. The differences between outputs should be only stylistic, not factual.
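As a minimal sketch of how such samples might be drawn (assuming the OpenAI Python client; the model name, prompt, and number of samples here are my own placeholders, not taken from the text above):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "Who is Evelyn Hartwell?"

# Request several completions for the same prompt at temperature 0.7,
# so each sample is free to phrase its answer differently.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
    n=3,  # number of sampled outputs to compare
)

samples = [choice.message.content for choice in response.choices]
for i, text in enumerate(samples):
    print(f"--- sample {i} ---\n{text}\n")
```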
This simple idea underpins a sample-based hallucination detection mechanism: if the LLM’s outputs to the same prompt contradict each other, they are likely hallucinations; if they entail one another, the information is likely factual. [2]
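One way to operationalize the contradiction check is with an off-the-shelf natural language inference (NLI) model. The sketch below uses the Hugging Face transformers pipeline with the microsoft/deberta-large-mnli checkpoint; both are my own choices for illustration, not necessarily how [2] implements it:

```python
from transformers import pipeline

# An MNLI-trained classifier scores a premise/hypothesis pair as
# CONTRADICTION, NEUTRAL, or ENTAILMENT.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

sentence = "Evelyn Hartwell is an American author, speaker, and life coach."
sample = "Evelyn Hartwell is a Canadian ballerina and the founding Artistic Director."

# Treat the sampled output as the premise and the original sentence as the hypothesis.
scores = nli({"text": sample, "text_pair": sentence}, top_k=None)
print(scores)  # a high CONTRADICTION score suggests the sentence may be hallucinated
```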
For this type of evaluation, we only require the text outputs of the LLM, so it is known as black-box evaluation. And because it needs no external knowledge base, it is also called zero-resource evaluation. [5]
Let’s start with a very basic way of measuring similarity. We will compute the pairwise cosine similarity between the embedded sentences. We normalize the embeddings because we want to focus only on each vector’s direction, not its magnitude. The function below takes as input the originally generated sentence called output and a…
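A minimal sketch of such a function, assuming sentence-transformers for the embeddings; the function name, the embedding model, and the second argument (a list of sampled outputs) are my assumptions here:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works; all-MiniLM-L6-v2 is just a lightweight default.
model = SentenceTransformer("all-MiniLM-L6-v2")

def get_cos_sim(output: str, samples: list[str]) -> float:
    """Average cosine similarity between the original output and each sampled output."""
    embeddings = model.encode([output] + samples)
    # Normalize so the dot product depends only on direction, not magnitude.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    output_vec, sample_vecs = embeddings[0], embeddings[1:]
    cos_sims = sample_vecs @ output_vec
    return float(cos_sims.mean())
```

A low average similarity would indicate that the samples diverge from the original output, which, under the idea above, hints at hallucination.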