How effectively do different approaches to record linkage use information in the records to make predictions?
A pervasive data quality problem is to have multiple different records that refer to the same entity but no unique identifier that ties these entities together.
In the absence of a unique identifier such as a Social Security number, we can use a combination of individually non-unique variables such as name, gender and date of birth to identify individuals.
To get the best accuracy in record linkage, we need a model that wrings as much information from this input data as possible.
This article describes the three types of information that are most important in making an accurate prediction, and how all three are leveraged by the Fellegi-Sunter model as used in Splink.
It also describes how some alternative record linkage approaches throw away some of this information, leaving accuracy on the table.
The three types of information
Broadly, there are three categories of information that are relevant when trying to predict whether a pair of records match:
- Similarity of the pair of records
- Frequency of values in the overall dataset, and more broadly measuring how common different scenarios are
- Data quality of the overall dataset
Let’s look at each in turn.
1. Similarity of the pairwise record comparison: Fuzzy matching
The most obvious way to predict whether two records represent the same entity is to measure whether the columns contain the same or similar information.
The similarity of each column can be measured quantitatively using fuzzy matching functions like Levenshtein or Jaro-Winker for text, or numeric differences such as absolute or percentage difference.
For example, Hammond
vs Hamond
has a Jaro-Winkler similarity of 0.97 (1.0 is a perfect score). It’s probably a typo.
These measures could be assigned weights, and summed together to compute a total similarity score.
The approach is sometimes known as fuzzy matching, and it is an important part of an accurate linkage model.
However using this approach alone has major drawback: the weights are arbitrary:
- The importance of different fields has to be guessed at by the user. For example, what weight should be assigned to a match on age? How does this compare to a match on first name? How should we decide on the size of punitive weights when information does not matches?
- The relationship between the strength of prediction and each fuzzy matching metric has to be guessed by the user, as opposed to being estimated. For example, how much should our prediction change if the first name is a Jaro-Winkler 0.9 fuzzy match as opposed to an exact match? Should it change by the same amount if the Jaro-Winkler score reduces to 0.8?
2. Frequency of values in the overall dataset, or more broadly measuring how common different scenarios are
We can improve on fuzzy matching by accounting for the frequency of values in the overall dataset (sometimes known as ‘term frequencies’).
For example, John
vs John
, and Joss
vs Joss
are both exact matches so have the same similarity score, but the later is stronger evidence of a match than the former, because Joss
is an unusual name.
The relative term frequencies of John
v Joss
provide a data-driven estimate of the relative importance of these different names, which can be used to inform the weights.
This concept can be extended to encompass similar records that are not an exact match. Weights can derived from an estimate of how common it is to observe fuzzy matches across the dataset. For example, if it’s really common to see fuzzy matches on first name at a Jaro-Winkler score of 0.7, even amongst non-matching records, then if we observe such a match, it doesn’t offer much evidence in favour of a match. In probabilistic linkage, this information is captured in parameters known as the u
probabilities, which is described in more detail here.
3. Data quality of the overall dataset: measuring the importance of non-matching information
We’ve seen that fuzzy matching and term frequency based approaches can allow us to score the similarity between records, and even, to some extent, weight the importance of matches on different columns.
However, none of these techniques help quantify the relative importance of non-matches to the predicted match probability.
Probabilistic methods explicitly estimate the relative importance of these scenarios by estimating data quality. In probabilistic linkage, this information is captured in the m
probabilities, which are defined more precisely here.
For example, if the data quality in the gender variable is extremely high, then a non-match on gender would be strong evidence against the two records being a true match.
Conversely, if records have been observed over a number of years, a non-match on age wouldn’t be strong evidence of the two records being a match.
Probabilistic linkage
Much of the power of probabilistic models comes from combining all three sources of information in a way which is not possible in other models.
Not only is all of this information be incorporated in the prediction, the partial match weights in the Fellegi-Sunter model enable the relative importance of the different types of information to be estimated from the data itself, and hence weighted together correctly to optimise accuracy.
Conversely, fuzzy matching techniques often use arbitrary weights, and cannot fully incorporate information from all three sources. Term frequency approaches lack the ability to use information about data quality to negatively weight non-matching information, or a mechanism to appropriately weight fuzzy matches.
The author is the developer of Splink, a free and open source Python package for probabilistic linkage at scale.