Originally posted: 2024-01-18.

Evaluating 1 billion record comparisons to deduplicate 7 million records in two minutes

Data deduplication is a ubiquitous problem that results from multiple records being collected for the same entity:

| first_name | surname | dob        | city       | source_data_system |
|------------|---------|------------|------------|--------------------|
| lucas      | smith   | 1984-01-02 | London     | sales              |
| lucas      | smyth   | 1984-07-02 | Manchester | sales              |
| lucas      | smyth   | 1984-07-02 |            | marketing          |

The lack of a unique identifier means that there's no easy way of knowing whether these records refer to the same person.[1]

Historically, deduplicating large datasets has been very time consuming, with runtimes of many hours, and severe limits on the size of datasets that can be worked with.

Splink is a free, open source Python library to address this problem. It's designed for use on very large datasets, so speed is imperative. It uses DuckDB as its default backend to achieve fast parallelised execution.

This blog post presents the results of benchmarking Splink. It shows it is able to deduplicate a 7 million record dataset in just over 2 minutes, and at a cost of less than $1.00 using AWS EC2.

This summary chart shows how runtime varies by the spec of the machine used, with the lowest-spec machine comparable to a laptop:

To the best of my knowledge, these results show that it is the fastest free tool for accurately deduplicating large datasets - by at least an order of magnitude.[2]

To test the performance of Splink, a dataset of people was collected from the Wikidata Query Service.

Each record was corrupted to generate a varying number of duplicates according to a Zipf distribution. The median number of duplicates was 8, and the count ranged from 0 to 35.
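The corruption code itself isn't reproduced in this post. As a purely illustrative sketch of the duplicate-count step, drawing a number of duplicates per entity from a Zipf-like distribution could look like the following (the exponent, cap and seed are placeholders, not the values used to build the benchmark dataset):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Draw a duplicate count for each source entity from a Zipf
# distribution, capping the tail so no entity receives an extreme
# number of corrupted copies. Each entity's record would then be
# duplicated and corrupted that many times. The parameters here are
# illustrative only and won't reproduce the stated median of 8.
num_entities = 1_000_000
dup_counts = np.clip(rng.zipf(a=1.2, size=num_entities), 0, 35)

print(np.median(dup_counts), dup_counts.max())
```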

For example, the rows corresponding to Wikidata ID Q101637549 look like this:

| first_name | middle_name | last_name | dob        | birth_lat | birth_lng | occupation          |
|------------|-------------|-----------|------------|-----------|-----------|---------------------|
| josef      |             | heřman    | 1845-01-13 | 49.74     | 13.38     | high school teacher |
| josef      |             | heřman    | 1845-01-13 |           |           | high school teacher |
| josef      |             | heřjan    | 1845-01-13 | 49.96     | 13.35     | high school teacher |
| joseph     |             | hermann   | 1845-01-13 | 49.74     | 13.38     | travel writer       |
| jowef      |             | heřman    | 1845-01-01 | 49.74     | 13.38     | high school teacher |

We can see various types of error across the fields, missing data, and no simple rule that would confirm all these records pertain to the same entity.

A Splink model was developed that is representative of the complexity of a typical large-scale linkage job:[3]

  • A wide range of blocking rules is used to create record comparisons, ensuring high recall of true matches. The blocking rules are relatively loose and create 1bn comparisons in total.[4]
  • Term frequency adjustments are made on columns with skew in value frequencies.
  • A range of fuzzy matching scenarios are evaluated using functions such as Jaro-Winkler and Levenshtein.

The code for estimating (training) the Splink model, and then using it to predict which records match, is as follows:

```python
import pandas as pd

from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    distance_in_km_at_thresholds,
    exact_match,
    jaro_winkler_at_thresholds,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = pd.read_parquet("7m_prepareed.parquet")

splink_settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["last_name", "occupation"]),
        block_on(["first_name", "last_name"]),
        block_on(["first_name", "middle_name"]),
        block_on(["middle_name", "last_name"]),
        block_on(["first_name", "dob"]),
        block_on(["first_name", "middle_name"]),
        block_on(["last_name", "birth_lat"]),
        block_on(["first_name", "birth_lng"]),
        block_on(["middle_name", "occupation"]),
    ],
    "comparisons": [
        jaro_winkler_at_thresholds(
            "first_name", [0.9, 0.7], term_frequency_adjustments=True
        ),
        jaro_winkler_at_thresholds("middle_name", [0.9]),
        jaro_winkler_at_thresholds(
            "last_name", [0.9, 0.7], term_frequency_adjustments=True
        ),
        levenshtein_at_thresholds(
            "dob", [1, 2, 4], term_frequency_adjustments=True
        ),
        distance_in_km_at_thresholds("birth_lat", "birth_lng", [10, 100]),
        exact_match("occupation", term_frequency_adjustments=True),
    ],
    "retain_intermediate_calculation_columns": False,
    "retain_matching_columns": False,
    "max_iterations": 20,
    "em_convergence": 0.001,
}

linker = DuckDBLinker(df, splink_settings)

# Model training: Estimate the parameters of the model
linker.estimate_probability_two_random_records_match(
    [
        block_on(["first_name", "last_name", "dob"]),
    ],
    recall=0.8,
)

linker.estimate_u_using_random_sampling(max_pairs=1e9)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["first_name", "last_name", "occupation"], salting_partitions=2),
    estimate_without_term_frequencies=True,
)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["dob", "middle_name"], salting_partitions=2),
    estimate_without_term_frequencies=True,
)

# Inference: Predict which pairs are matching entities
df_predict = linker.predict(threshold_match_probability=0.9)

# Clustering: Compute an estimated cluster id
# that ties together records of the same entity
df_clusters = linker.cluster_pairwise_predictions_at_threshold(
    df_predict=df_predict, threshold_match_probability=0.9
)
```
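The prediction and clustering steps return Splink dataframe wrappers around tables in the backend database rather than pandas objects. As a usage sketch (not part of the benchmarked script, and assuming Splink 3's as_pandas_dataframe method and its default output column names), the results can be inspected like this:

```python
# df_predict holds one row per scored pairwise comparison above the
# threshold; df_clusters holds one row per input record with an
# estimated cluster_id. Pull back a small sample for inspection.
pairwise_sample = df_predict.as_pandas_dataframe(limit=5)
cluster_sample = df_clusters.as_pandas_dataframe(limit=5)

print(pairwise_sample[["match_probability", "unique_id_l", "unique_id_r"]])
print(cluster_sample[["cluster_id", "unique_id"]])
```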

Benchmarking was conducted for both model training (parameter estimation) and inference (making predictions).

The following chart shows runtime for each Splink function, broken down by EC2 instance type. Note that a logarithmic scale is used so that all values are legible.

The next chart shows total CPU and memory usage. The CPU metric is averaged across cores, so 100% means all cores are being fully used. Click on the bar chart at the top to select the instance type.

We can see that DuckDB parallelises workloads very well. Memory usage stays low on the larger instance types.

Not only can we perform deduplication quickly on a high-spec machine, but the full set of benchmarks also completed successfully on a machine comparable to a laptop (a c6g.2xlarge, which has 8 vCPU and 16GiB of memory) in around 25 minutes.[5]

A chart of runtimes for three other popular open source record linkage packages can be found on page 362 of this paper from the authors of the fastLink R package.

The chart shows that, of the three, fastLink has by far the best performance, at 400 minutes to deduplicate 300,000 records on an 8-core machine, with runtime increasing approximately quadratically with the number of input records.

Splink is able to deduplicate the same number of records in less than a minute on a comparable machine, suggesting it's at least two orders of magnitude faster.

A caveat is that these runtimes are not strictly comparable because the input datasets are not the same. However, the differences are so large that it's almost certain Splink is a great deal faster.

The low memory usage on large EC2 instance types suggests there's scope to scale to substantially larger datasets using DuckDB.

For the purpose of this post, I chose to make a relatively large number of comparisons per input record (around 100) to demonstrate the speed of Splink. In real record linkage scenarios, especially those where we have more columns of data about the entities, it's often possible to use tighter blocking rules, resulting in fewer record comparisons per input record.
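Before committing to a set of blocking rules, it's possible to check how many comparisons a candidate rule will generate. A minimal sketch, assuming Splink 3's count_num_comparisons_from_blocking_rule method and passing each rule as raw SQL, is:

```python
# Hedged sketch: count the comparisons each candidate blocking rule
# would generate, so overly loose rules can be spotted before running
# a full prediction. Tighter rules (more columns) produce far fewer
# comparisons per input record.
loose = "l.first_name = r.first_name and l.last_name = r.last_name"
tight = (
    "l.first_name = r.first_name "
    "and l.last_name = r.last_name "
    "and l.dob = r.dob"
)

print(linker.count_num_comparisons_from_blocking_rule(loose))
print(linker.count_num_comparisons_from_blocking_rule(tight))
```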

The low memory usage and the scope for tighter blocking both suggest that DuckDB could be used to link datasets of many tens of millions of records or more.

Splink also offers big data backends such as Spark and Athena for linking truly enormous datasets.

This post considers runtime only. Accuracy statistics would only be valid on the specific dataset used for benchmarking, and I do not have comparative accuracy statistics for other libraries.

However, there are theoretical and empirical reasons to believe Splink offers accuracy comparable to or better than alternatives:

  • Splink is able to effectively harness most of the information in the input data in making a prediction - as explained in this post.
  • An independent empirical study has found Splink's accuracy compares favourably to alternatives.
  • Splink offers a range of configuration options to optimise accuracy which are detailed further in this post.

I hope to write another post at some point looking at comparative accuracy in further detail.

Splink supports various SQL backends, so the same job was run on the Apache Spark backend to compare runtimes.[6]
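The Spark run reuses the same settings structure; a minimal sketch of swapping backends, assuming the Splink 3 Spark API (and noting that the comparison and blocking functions in the settings dictionary must be rebuilt from the splink.spark modules rather than splink.duckdb), looks something like this:

```python
from pyspark.sql import SparkSession
from splink.spark.linker import SparkLinker

# Build a local Spark session; the driver memory value is illustrative,
# not the configuration used in the benchmark.
spark = (
    SparkSession.builder.master("local[*]")
    .config("spark.driver.memory", "12g")
    .getOrCreate()
)

df_spark = spark.read.parquet("7m_prepareed.parquet")

# splink_settings is assumed here to have been rebuilt with the Spark
# versions of block_on and the comparison functions (splink.spark.*).
linker = SparkLinker(df_spark, splink_settings)
df_predict = linker.predict(threshold_match_probability=0.9)
```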

We can see that Spark takes about 60% longer than DuckDB for estimating the u parameters, but for inference (making predictions) it takes 8.7 times longer.

The following chart shows CPU usage, and goes some way to explaining the very long runtime of the predict() step.

We can see that whilst the predict step parallelises well at first, there are straggler tasks on some CPUs that take a long time to complete.

A rough estimate suggests that if the problem of straggler tasks could be addressed and the CPUs could be pinned at 100%, Spark would still take around twice as long as DuckDB.[7]
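One lever that might help here is salting, which splits the comparisons generated by a blocking rule across more Spark tasks (it is aimed at the Spark backend). The script above already applies salting_partitions=2 to the EM training rules; a hedged sketch of applying heavier salting to the widest prediction blocking rules (the partition counts are guesses, not tuned values) would be:

```python
# Hedged sketch: salt the loosest prediction blocking rules so their
# comparisons are spread over more Spark tasks, reducing the chance
# that a single long-running task holds up the whole stage.
# For the Spark run, block_on would come from
# splink.spark.blocking_rule_library.
splink_settings["blocking_rules_to_generate_predictions"] = [
    block_on(["last_name", "occupation"], salting_partitions=8),
    block_on(["first_name", "last_name"], salting_partitions=8),
    block_on(["first_name", "dob"], salting_partitions=4),
    # ... remaining rules as in the main script
]
```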

Overall, it's clear that DuckDB is both intrinsically faster for data linking workloads, and also does a better job of parallelisation across all cores out of the box, resulting in substantially shorter runtimes.

Benchmarking was conducted on a variety of AWS EC2 instances using pytest-benchmark. Code is available in the following two repositories:
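The benchmark harness itself lives in those repositories; as a rough illustration of the approach, a stripped-down pytest-benchmark sketch for timing the predict step (the module and fixture setup here are hypothetical, not taken from the benchmark code) might look like:

```python
# test_predict_benchmark.py -- run with: pytest --benchmark-only
import pandas as pd

from splink.duckdb.linker import DuckDBLinker

# Hypothetical module holding the settings dictionary shown earlier.
from benchmark_settings import splink_settings


def test_predict_runtime(benchmark):
    df = pd.read_parquet("7m_prepareed.parquet")
    linker = DuckDBLinker(df, splink_settings)
    # In the real benchmark, the training steps run before predict;
    # they are omitted here to keep the sketch short.

    # A single round/iteration: predicting on 7m records is far too
    # slow for pytest-benchmark's default repeated-calibration runs.
    benchmark.pedantic(
        linker.predict,
        kwargs={"threshold_match_probability": 0.9},
        rounds=1,
        iterations=1,
    )
```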

  1. This problem is also sometimes referred to as entity resolution, or record linkage.

  2. Please contact me if you know otherwise!

  3. It would be possible to achieve much faster runtimes by reducing complexity, at the expense of accuracy - for instance, by reducing the use of fuzzy-match functions like Jaro-Winkler, and avoiding term frequency adjustments. However, this would not be representative of a typical real-world use case.

  4. It's important to emphasise the number of comparisons (1 billion) as well as the number of input rows (7 million) because performance is primarily driven by the former. Furthermore, it would be possible to 'cheat' and get better runtimes simply by using stricter blocking, at the cost of lower recall. See here for further details.

  5. Out of memory errors were encountered on a c6g.xlarge (4vCPU/8GiB memory) machine.

  6. The job was run in local mode on a c6gd.4xlarge instance, which has an SSD attached. For Spark, this is important because some intermediate results are written to disk. Note no results are present for the cluster step because the instance ran out of disk space before it could complete.

  7. With experimentation, the straggler tasks could probably be largely avoided, but it would take several runs to find the optimal configuration, and requires significant expertise in both Spark and Splink.