Given the rapid advancements in LLM “chains”, “agents”, chatbots and other applications of text-generative AI, evaluating the performance of language models is crucial for understanding their capabilities and limitations. It is especially important to be able to adapt those metrics to your business goals.
While standard metrics like perplexity, BLEU scores and sentence distance provide a general indication of model performance, in my experience they often fall short of capturing the nuances and specific requirements of real-world applications.
Take a simple RAG question-answering application, for example. When building such a system, the factors of the so-called “RAG Triad”, context relevance, groundedness in facts, and language consistency between the query and response, matter just as much as surface quality. Standard metrics simply cannot capture these nuanced aspects effectively, as the sketch below illustrates.
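Here is a minimal sketch of that failure mode, assuming `sacrebleu` is installed and using made-up warranty sentences purely for illustration: an answer that contradicts the reference on the key fact can still out-score a correct paraphrase, because BLEU only sees n-gram overlap.

```python
# Sketch: n-gram overlap misses groundedness (illustrative strings, not real data).
import sacrebleu

reference = "The warranty covers manufacturing defects for 24 months."

# Correct fact, different wording vs. wrong fact, near-identical wording.
grounded = "Manufacturing defects are covered for two years under the warranty."
ungrounded = "The warranty covers manufacturing defects for 12 months."

for label, hypothesis in [("grounded", grounded), ("ungrounded", ungrounded)]:
    score = sacrebleu.sentence_bleu(hypothesis, [reference]).score
    print(f"{label:>10}: BLEU = {score:.1f}")

# The ungrounded answer scores far higher: it shares long surface n-grams with
# the reference, even though it gets the one fact that matters wrong.
```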
This is where LLM-based “blackbox” metrics come in handy. While the idea can sound naive at first, the concept behind them is quite compelling: these metrics utilise the power of large language models themselves to evaluate the quality of generated responses.
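To make this concrete, here is a minimal LLM-as-judge sketch using the OpenAI Python client. The model name, the 1–5 grading rubric, and the `groundedness_score` helper are illustrative assumptions, not a fixed standard; in practice you would tune the prompt and scale to your own business goals.

```python
# Sketch of a "blackbox" groundedness metric: ask an LLM to grade an answer
# against the retrieved context (prompt and scale are assumptions).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """\
You are grading a RAG answer. Given the retrieved context, the question and
the answer, rate GROUNDEDNESS from 1 (unsupported by the context) to 5
(fully supported by the context). Reply with the number only.

Context: {context}
Question: {question}
Answer: {answer}
"""

def groundedness_score(context: str, question: str, answer: str) -> int:
    """Ask the judge model for a 1-5 groundedness rating."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        temperature=0,        # keep grading as deterministic as the API allows
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                context=context, question=question, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())
```

The same pattern extends to the other legs of the RAG Triad: swap the rubric in the prompt to grade context relevance or query-response consistency instead.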