In the rapidly evolving field of natural language processing, Transformers have emerged as the dominant models, demonstrating remarkable performance across a wide range of sequence modelling tasks, including part-of-speech tagging, named entity recognition, and chunking. Before the Transformer era, Conditional Random Fields (CRFs) were the go-to tool for sequence modelling — specifically linear-chain CRFs, which model the labels of a sequence as a chain, while CRFs more generally can be defined over arbitrary graph structures.
This article will be broken down as follows:
- Introduction
- Emission and Transition scores
- Loss function
- Efficient estimation of the partition function through the Forward Algorithm
- Viterbi Algorithm
- Full LSTM-CRF code
- Drawbacks and Conclusions
The implementation of CRFs in this article is based on this excellent tutorial. Please note that it is definitely not the most efficient implementation out there, and it also lacks batching capability; however, it is relatively simple to read and understand, and since the aim of this tutorial is to get our heads around the internal workings of CRFs, it is perfectly suitable for us.
In sequence tagging problems, we deal with a sequence of input elements, such as the words within a sentence, where each element corresponds to a specific label or category. The primary objective is to assign the correct label to each individual element. Within the LSTM-CRF model we can identify two key components for doing this: emission and transition probabilities. Note that we will actually work with scores in log space instead of probabilities, for numerical stability:
- Emission scores relate to the likelihood of observing a particular label for a given input element. In the context of named entity recognition, for example, each word in a sequence is assigned one of three labels: beginning of an entity (B), intermediate word of an entity (I), or a word outside any entity (O). Emission probabilities quantify the probability of a specific word being associated with a particular label. This is expressed mathematically as P(y_i | x_i), where y_i denotes the label and x_i represents the input element (the word). A sketch of how these scores are computed follows below.
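To make this concrete, here is a minimal sketch of how emission scores are typically produced in an LSTM-CRF: a bidirectional LSTM encodes the sentence, and a linear layer maps each hidden state to one score per tag. The dimensions, vocabulary, and tag set below are illustrative assumptions, not the article's actual setup (the full code appears later):

```python
# A minimal sketch of emission-score computation (illustrative values,
# not the article's full implementation).
import torch
import torch.nn as nn

EMBEDDING_DIM, HIDDEN_DIM = 5, 4               # assumed toy dimensions
tag_to_ix = {"B": 0, "I": 1, "O": 2}           # BIO tag set from the text
vocab = {"hugh": 0, "jackman": 1, "is": 2, "great": 3}  # toy vocabulary

embedding = nn.Embedding(len(vocab), EMBEDDING_DIM)
# Bidirectional LSTM: each direction gets half the hidden size,
# so the concatenated output per token has size HIDDEN_DIM.
lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM // 2, bidirectional=True)
hidden2tag = nn.Linear(HIDDEN_DIM, len(tag_to_ix))  # one score per tag

sentence = torch.tensor([vocab[w] for w in ["hugh", "jackman", "is", "great"]])
embeds = embedding(sentence).unsqueeze(1)      # (seq_len, batch=1, emb_dim)
lstm_out, _ = lstm(embeds)                     # (seq_len, 1, HIDDEN_DIM)
emissions = hidden2tag(lstm_out.squeeze(1))    # (seq_len, num_tags)
print(emissions.shape)                         # torch.Size([4, 3])
```

Each row of `emissions` holds one (log-space) score per tag for the corresponding word; the CRF layer will later combine these with transition scores to score entire label sequences.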