Note
This page was generated from docs/guides/data_deduplication.ipynb.
Data deduplication
Introduction
This example shows how to find records in a dataset that belong to the same entity. In this case, we deduplicate a dataset with records of persons: we link records within the dataset based on attributes like first name, surname, sex, date of birth, place and address. The data used in this example is part of Febrl and is fictitious.
First, start by importing the `recordlinkage` module. The submodule `recordlinkage.datasets` contains several datasets that can be used for testing. For this example, we use Febrl dataset 1. This dataset contains 1000 records, of which 500 are originals and 500 are duplicates, with exactly one duplicate per original record. The dataset can be loaded with the function `load_febrl1`.
[1]:
import recordlinkage
from recordlinkage.datasets import load_febrl1
The dataset is loaded with the following code. The returned dataset is of type `pandas.DataFrame`. This makes it easy to manipulate the data if desired. For details about data manipulation with `pandas`, see the comprehensive documentation at http://pandas.pydata.org/.
[2]:
dfA = load_febrl1()
dfA
[2]:
| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-223-org | NaN | waller | 6 | tullaroop street | willaroo | st james | 4011 | wa | 19081209 | 6988048 |
| rec-122-org | lachlan | berry | 69 | giblin street | killarney | bittern | 4814 | qld | 19990219 | 7364009 |
| rec-373-org | deakin | sondergeld | 48 | goldfinch circuit | kooltuo | canterbury | 2776 | vic | 19600210 | 2635962 |
| rec-10-dup-0 | kayla | harrington | NaN | maltby circuit | coaling | coolaroo | 3465 | nsw | 19150612 | 9004242 |
| rec-227-org | luke | purdon | 23 | ramsay place | mirani | garbutt | 2260 | vic | 19831024 | 8099933 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| rec-188-dup-0 | stephanie | geu | 28 | bainton crescent | masonic memorial village | maryborough | 2541 | sa | 19421008 | 3997529 |
| rec-334-dup-0 | nicholas | NaN | 289 | britten-jonues drive | jabaru court | paddington | 2000 | vic | 19970422 | 5062738 |
| rec-469-dup-0 | lachlan | katsiavos | 29 | paul coe cdrescent | NaN | casual | 2913 | nsw | 19380406 | 4112327 |
| rec-350-dup-0 | monique | gergely | 21 | harwoos court | hyberni a park | sherwood | 2207 | nsw | 19790807 | 7375144 |
| rec-212-org | NaN | mcveigh | 45 | bougainville street | kimberley | ourimbah | 6060 | wa | 19360219 | 8243761 |
1000 rows × 10 columns
Make record pairs
It is intuitive to start by comparing each record in DataFrame `dfA` with all other records in `dfA`. In other words, we want to make record pairs. Each record pair should contain two different records of `dfA`. This process of making record pairs is also called "indexing". With the `recordlinkage` module, indexing is easy. First, create a `recordlinkage.Index` object and call its `.full()` method. The object then generates a full index on an `.index(...)` call. For deduplication of a single DataFrame, that one DataFrame is sufficient as input argument.
[3]:
indexer = recordlinkage.Index()
indexer.full()
candidate_links = indexer.index(dfA)
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
With the method `index`, all possible (and unique) record pairs are made. The method returns a `pandas.MultiIndex`. The number of pairs is equal to the number of records in `dfA` choose 2.
[4]:
print(len(dfA), len(candidate_links))
# (1000*1000-1000)/2 = 499500
1000 499500
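The "n choose 2" arithmetic behind this count can be checked with a small stdlib-only sketch. The record ids below are made up for illustration, not taken from the dataset:

```python
from itertools import combinations
from math import comb

# Hypothetical record ids standing in for dfA's index.
records = ["rec-0", "rec-1", "rec-2", "rec-3"]

# A full index is conceptually every unordered pair of distinct records.
pairs = list(combinations(records, 2))
print(len(pairs))  # 4 choose 2 = 6

# The same formula explains the 499500 candidate pairs above.
print(comb(1000, 2) == (1000 * 1000 - 1000) // 2)  # True
```

This is only a conceptual model; the real `index` call returns a `pandas.MultiIndex` rather than a Python list.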
Many of these record pairs do not belong to the same person. The `recordlinkage` toolkit has some more advanced indexing methods to reduce the number of record pairs. Obvious non-matches are left out of the index. Note that if a matching record pair is not included in the index, it cannot be matched anymore.

One of the best-known indexing methods is called blocking. This method includes only record pairs that are identical on one or more stored attributes of the person (or entity in general). The blocking method can be used in the `recordlinkage` module.
[5]:
indexer = recordlinkage.Index()
indexer.block("given_name")
candidate_links = indexer.index(dfA)
len(candidate_links)
[5]:
2082
The argument "given_name" is the blocking variable. This variable has to be the name of a column in `dfA`. It is possible to pass a list of column names to block on multiple variables. Blocking on multiple variables reduces the number of record pairs even further.
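The idea behind blocking can be sketched with plain Python: group records by the blocking key, then form pairs only within each group. The toy records and key choice below are illustrative, not part of the Febrl data:

```python
from collections import defaultdict
from itertools import combinations

# Toy records: (record id, given_name, state); values are made up.
records = [
    ("r1", "lachlan", "qld"),
    ("r2", "lachlan", "qld"),
    ("r3", "lachlan", "vic"),
    ("r4", "monique", "qld"),
]

# Group records by the blocking key (given_name, state).
blocks = defaultdict(list)
for rec_id, name, state in records:
    blocks[(name, state)].append(rec_id)

# Only pairs within the same block become candidate pairs.
candidates = [pair for group in blocks.values() for pair in combinations(group, 2)]
print(candidates)  # [('r1', 'r2')] -- a full index would have produced 6 pairs
```

Blocking on two variables here keeps one candidate pair out of six, which is exactly why it scales so much better than a full index.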
Another implemented indexing method is Sorted Neighbourhood Indexing (`recordlinkage.index.sortedneighbourhood`). This method is very useful when there are many misspellings in the strings used for indexing. In fact, sorted neighbourhood indexing is a generalisation of blocking. See the documentation for details about sorted neighbourhood indexing.
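A minimal sketch of the sorted-neighbourhood idea, again with made-up records: sort the records by the indexing key, then pair each record with its neighbours inside a sliding window. Misspelled variants sort next to each other, so they still become candidates even though exact blocking would separate them:

```python
# Toy sorting keys with a deliberate misspelling (values are illustrative).
keys = {"r1": "jonathan", "r2": "jonathon", "r3": "lachlan", "r4": "monique"}

window = 3  # each record is paired with records up to window-1 positions ahead
ordered = sorted(keys, key=keys.get)  # sort record ids by their key string

pairs = set()
for i in range(len(ordered)):
    for j in range(i + 1, min(i + window, len(ordered))):
        pairs.add((ordered[i], ordered[j]))

# "jonathan"/"jonathon" sort adjacently, so the misspelled pair survives.
print(("r1", "r2") in pairs)  # True
```

With a window of 1 this degenerates to no pairs at all; the library's implementation combines the window with the sorting key in a more refined way, so treat this purely as intuition.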
Compare records
Each record pair is a candidate match. To classify the candidate record pairs into matches and non-matches, compare the records on all attributes both records have in common. The `recordlinkage` module has a class named `Compare` that is used to compare the records. The following code shows how to compare attributes.
[6]:
compare_cl = recordlinkage.Compare()
compare_cl.exact("given_name", "given_name", label="given_name")
compare_cl.string(
"surname", "surname", method="jarowinkler", threshold=0.85, label="surname"
)
compare_cl.exact("date_of_birth", "date_of_birth", label="date_of_birth")
compare_cl.exact("suburb", "suburb", label="suburb")
compare_cl.exact("state", "state", label="state")
compare_cl.string("address_1", "address_1", threshold=0.85, label="address_1")
features = compare_cl.compute(candidate_links, dfA)
The comparing of record pairs starts when the `compute` method is called. All attribute comparisons are stored in a DataFrame with the features as columns and the record pairs as rows. The first 10 comparison vectors are:
[7]:
features.head(10)
[7]:
| rec_id_1 | rec_id_2 | given_name | surname | date_of_birth | suburb | state | address_1 |
|---|---|---|---|---|---|---|---|
| rec-183-dup-0 | rec-122-org | 1 | 0.0 | 0 | 0 | 0 | 0.0 |
| rec-248-org | rec-122-org | 1 | 0.0 | 0 | 0 | 1 | 0.0 |
| rec-248-org | rec-183-dup-0 | 1 | 0.0 | 0 | 0 | 0 | 0.0 |
| rec-122-dup-0 | rec-122-org | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
| rec-122-dup-0 | rec-183-dup-0 | 1 | 0.0 | 0 | 0 | 0 | 0.0 |
| rec-122-dup-0 | rec-248-org | 1 | 0.0 | 0 | 0 | 1 | 0.0 |
| rec-469-org | rec-122-org | 1 | 0.0 | 0 | 0 | 0 | 0.0 |
| rec-469-org | rec-183-dup-0 | 1 | 0.0 | 0 | 0 | 1 | 0.0 |
| rec-469-org | rec-248-org | 1 | 0.0 | 0 | 0 | 0 | 0.0 |
| rec-469-org | rec-122-dup-0 | 1 | 0.0 | 0 | 0 | 0 | 0.0 |
[8]:
features.describe()
[8]:
| | given_name | surname | date_of_birth | suburb | state | address_1 |
|---|---|---|---|---|---|---|
| count | 2082.0 | 2082.000000 | 2082.000000 | 2082.000000 | 2082.000000 | 2082.000000 |
| mean | 1.0 | 0.144092 | 0.139289 | 0.108549 | 0.327089 | 0.133045 |
| std | 0.0 | 0.351268 | 0.346331 | 0.311148 | 0.469263 | 0.339705 |
| min | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| max | 1.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
The last step is to decide which records belong to the same person. In this example, we keep it simple:
[9]:
features.sum(axis=1).value_counts().sort_index(ascending=False)
[9]:
6.0 142
5.0 145
4.0 30
3.0 9
2.0 376
1.0 1380
dtype: int64
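The simple sum-and-threshold rule used below can be illustrated without pandas. The comparison vectors here are hand-picked to mirror rows from the feature table above; the exact values are illustrative:

```python
# Toy comparison vectors (one per candidate pair): six 0/1-style feature
# scores per pair, like the given_name .. address_1 columns above.
features = {
    ("rec-122-dup-0", "rec-122-org"): [1, 1.0, 1, 1, 1, 1.0],
    ("rec-469-org", "rec-122-org"):   [1, 0.0, 0, 0, 0, 0.0],
    ("rec-298-dup-0", "rec-298-org"): [1, 1.0, 1, 1, 1, 0.0],
}

# Classify a pair as a match when more than 3 of the 6 comparisons agree,
# the same rule as features.sum(axis=1) > 3 below.
matches = [pair for pair, vec in features.items() if sum(vec) > 3]
print(matches)
# [('rec-122-dup-0', 'rec-122-org'), ('rec-298-dup-0', 'rec-298-org')]
```

Summing features weights every attribute equally; the library also offers supervised and unsupervised classifiers when that is too crude.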
[10]:
matches = features[features.sum(axis=1) > 3]
matches
[10]:
| rec_id_1 | rec_id_2 | given_name | surname | date_of_birth | suburb | state | address_1 |
|---|---|---|---|---|---|---|---|
| rec-122-dup-0 | rec-122-org | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
| rec-183-org | rec-183-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
| rec-248-dup-0 | rec-248-org | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
| rec-373-dup-0 | rec-373-org | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
| rec-10-org | rec-10-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| rec-184-dup-0 | rec-184-org | 1 | 1.0 | 1 | 0 | 1 | 1.0 |
| rec-252-org | rec-252-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
| rec-48-dup-0 | rec-48-org | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
| rec-298-dup-0 | rec-298-org | 1 | 1.0 | 1 | 1 | 1 | 0.0 |
| rec-282-org | rec-282-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 0.0 |
317 rows × 6 columns
Full code
[11]:
import recordlinkage
from recordlinkage.datasets import load_febrl1
dfA = load_febrl1()
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on="given_name")
candidate_links = indexer.index(dfA)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.exact("given_name", "given_name", label="given_name")
compare_cl.string(
"surname", "surname", method="jarowinkler", threshold=0.85, label="surname"
)
compare_cl.exact("date_of_birth", "date_of_birth", label="date_of_birth")
compare_cl.exact("suburb", "suburb", label="suburb")
compare_cl.exact("state", "state", label="state")
compare_cl.string("address_1", "address_1", threshold=0.85, label="address_1")
features = compare_cl.compute(candidate_links, dfA)
# Classification step
matches = features[features.sum(axis=1) > 3]
print(len(matches))
317