Open Language Models
As open-source language models become more readily available, it is easy to get lost in all the options.
How do we determine their performance and compare them? And how can we confidently say that one model is better than another?
This article provides some answers by presenting training and evaluation metrics, along with general and task-specific benchmarks, to give you a clear picture of your model’s performance.
If you missed it, take a look at the first article in the Open Language Models series:
Language models define a probability distribution over a vocabulary of words to select the most likely next word in a sequence. Given a text, a language model assigns a probability to each word in its vocabulary, and the most likely one is selected.
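To make this concrete, here is a minimal sketch of that selection step using a toy bigram-style distribution. The words and counts are entirely made up for illustration; a real model would produce these probabilities from learned parameters.

```python
# Hypothetical counts of words observed after "the" — purely illustrative.
counts_after_the = {"cat": 10, "dog": 7, "car": 3}

# Normalize the counts into a probability distribution over the vocabulary.
total = sum(counts_after_the.values())
probs = {word: count / total for word, count in counts_after_the.items()}

# The most likely next word is the one with the highest probability.
next_word = max(probs, key=probs.get)
print(probs)      # {'cat': 0.5, 'dog': 0.35, 'car': 0.15}
print(next_word)  # cat
```

The same logic applies to a neural language model: it outputs a probability for every token in its vocabulary, and greedy decoding simply picks the argmax.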
Perplexity measures how well a language model can predict the next word in a given sequence. As a training metric, it shows how well the model has learned its training set.
We won’t go into the mathematical details but intuitively, minimizing perplexity means maximizing the predicted probability.
In other words, the best model is the one that is not surprised when it sees new text because it’s expecting it — meaning it already predicted well which words come next in the sequence.
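Without diving into the full derivation, the intuition can be sketched in a few lines: perplexity is the exponential of the average negative log-likelihood the model assigns to the actual next tokens. The probabilities below are hypothetical values for illustration.

```python
import math

# Hypothetical probabilities a model assigned to the true next token
# at each position of a held-out sentence (illustrative values only).
token_probs = [0.2, 0.5, 0.9, 0.4]

# Perplexity = exp(average negative log-likelihood).
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(round(perplexity, 3))  # 2.296
```

A model that assigned probability 1.0 to every correct token would reach the minimum perplexity of 1, which is why minimizing perplexity amounts to maximizing the predicted probability.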
While perplexity is helpful, it doesn’t consider the meaning behind the words or the context in which they are used. It is also influenced by how we tokenize our data: different language models with varying vocabularies and tokenization techniques can produce varying perplexity scores, making direct comparisons less meaningful.
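The tokenization caveat is easy to demonstrate with a toy calculation: if two models assign the same total log-probability to the same text but split it into different numbers of tokens, their per-token perplexities diverge. The numbers here are hypothetical.

```python
import math

# Same text, same overall model score (hypothetical), but tokenized
# into different numbers of units — e.g. word-level vs subword-level.
total_log_prob = -12.0

for n_tokens in (6, 10):
    # Per-token perplexity depends on how many tokens the text was split into.
    ppl = math.exp(-total_log_prob / n_tokens)
    print(n_tokens, "tokens -> perplexity", round(ppl, 2))  # 7.39 vs 3.32
```

The finer-grained tokenization looks "better" even though nothing about the model's actual quality changed, which is why comparing perplexities across models with different tokenizers is misleading.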
Perplexity is a useful but limited metric. We use it primarily to track progress during a model’s training or to compare…