Annotation
Manually labeled record pairs are useful for training and validation. Training data is usually not available in record linkage applications because it is highly dataset- and sample-specific. The Python Record Linkage Toolkit comes with a browser-based user interface for manually classifying record pairs. A hosted version of RecordLinkage ANNOTATOR can be found on GitHub.
Generate annotation file
The RecordLinkage ANNOTATOR software requires a structured annotation file. The required schema of the annotation file is open. The function recordlinkage.write_annotation_file() can be used to render and save an annotation file. The function can be used for both linking and deduplication purposes.
- recordlinkage.write_annotation_file(fp, pairs, df_a, df_b=None, dataset_a_name=None, dataset_b_name=None, *args, **kwargs)
Render and export annotation file.
This function renders an annotation object and stores it in a JSON file. The function is a wrapper around the AnnotationWrapper class.
- Parameters:
fp (str) – The path to the annotation file.
pairs (pandas.MultiIndex) – The record pairs to annotate.
df_a (pandas.DataFrame) – The data frame with full record information for the pairs.
df_b (pandas.DataFrame) – In case of data linkage, this is the second data frame. Default None.
dataset_a_name (str) – The name of the first data frame.
dataset_b_name (str) – In case of data linkage, the name of the second data frame. Default None.
Linking
This is a simple example of the code to render an annotation file for linking records:
import recordlinkage as rl
from recordlinkage.index import Block
from recordlinkage.datasets import load_febrl4
df_a, df_b = load_febrl4()
blocker = Block("surname", "surname")
pairs = blocker.index(df_a, df_b)
rl.write_annotation_file(
"annotation_demo_linking.json",
pairs[0:50],
df_a,
df_b,
dataset_a_name="Febrl4 A",
dataset_b_name="Febrl4 B"
)
Deduplication
This is a simple example of the code to render an annotation file for duplicate detection:
import recordlinkage as rl
from recordlinkage.index import Block
from recordlinkage.datasets import load_febrl1
df_a = load_febrl1()
blocker = Block("surname", "surname")
pairs = blocker.index(df_a)
rl.write_annotation_file(
"annotation_demo_dedup.json",
pairs[0:50],
df_a,
dataset_a_name="Febrl1 A"
)
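Both examples above pass pairs[0:50], which takes the first 50 candidate pairs in index order. If a random subset is preferred for labeling, it could be drawn along the lines of the sketch below. The helper sample_pairs and the small hand-made index are hypothetical and not part of the toolkit; in practice the pairs would come from blocker.index(...).

```python
import random

import pandas as pd


def sample_pairs(pairs, n, seed=0):
    """Return up to n randomly chosen pairs from a pandas.MultiIndex.

    Hypothetical helper for illustration; not part of recordlinkage.
    """
    n = min(n, len(pairs))
    # Draw integer positions without replacement, then keep index order.
    positions = sorted(random.Random(seed).sample(range(len(pairs)), n))
    return pairs.take(positions)


# Made-up candidate pairs standing in for the output of blocker.index(...).
pairs = pd.MultiIndex.from_product(
    [["a1", "a2", "a3"], ["b1", "b2"]], names=["id_a", "id_b"]
)

subset = sample_pairs(pairs, 4)
print(subset)
```

The resulting subset is still a pandas.MultiIndex, so it can be passed as the pairs argument of write_annotation_file in place of pairs[0:50].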
Manual labeling
Go to RecordLinkage ANNOTATOR or start the server yourself.
Choose the annotation file on the landing screen or use the drag and drop functionality. A new screen shows the first record pair to label. Start labeling the data manually. Use the Match button for record pairs belonging to the same entity. Use the Distinct button for record pairs belonging to different entities. After all record pairs are labeled by hand, the result can be saved to a file.
Export/read annotation file
After labeling all record pairs, you can export the annotation file to a JSON file. Use the function recordlinkage.read_annotation_file() to read the results.
import recordlinkage as rl
result = rl.read_annotation_file('my_annotation.json')
print(result.links)
The function recordlinkage.read_annotation_file() reads the file and returns a recordlinkage.annotation.AnnotationResult object. This object has links and distinct attributes, each of which returns a pandas.MultiIndex object.
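Since the labeled pairs are meant for training and validation, one possible next step is to combine the links and distinct pairs into a single 0/1 label vector. The sketch below assumes both attributes are non-empty pandas.MultiIndex objects. The helper labels_from_annotation and the record identifiers are made up for illustration and are not part of the toolkit.

```python
import pandas as pd

# Hypothetical stand-ins for result.links and result.distinct.
links = pd.MultiIndex.from_tuples(
    [("a-1", "b-1"), ("a-2", "b-2")], names=["id_a", "id_b"]
)
distinct = pd.MultiIndex.from_tuples([("a-1", "b-2")], names=["id_a", "id_b"])


def labels_from_annotation(links, distinct):
    """Return a Series with 1 for matches and 0 for distinct pairs.

    Hypothetical helper for illustration; not part of recordlinkage.
    """
    matches = pd.Series(1, index=links, dtype="int64")
    non_matches = pd.Series(0, index=distinct, dtype="int64")
    return pd.concat([matches, non_matches]).rename("is_match")


y = labels_from_annotation(links, distinct)
print(y)
```

The Series is indexed by record pair, so it can be aligned with a table of comparison features computed for the same pairs.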
- recordlinkage.read_annotation_file(fp)
Read annotation file.
This function can be used to read the annotation file and extract the results like the linked pairs and distinct pairs.
- Parameters:
fp (str) – The path to the annotation file.
- Returns:
AnnotationResult – An AnnotationResult object.
Example
Read the links from an annotation file:
>>> annotation = read_annotation_file("result.json")
>>> print(annotation.links)
- class recordlinkage.annotation.AnnotationResult(pairs=None, version=1)
Result of (manual) annotation.
- property links
Return the links.
- Returns:
pandas.MultiIndex – The links stored in a pandas MultiIndex.
- property distinct
Return the distinct pairs.
- Returns:
pandas.MultiIndex – The distinct pairs stored in a pandas MultiIndex.
- property unknown
Return the unknown or unlabelled pairs.
- Returns:
pandas.MultiIndex – The unknown or unlabelled pairs stored in a pandas MultiIndex.
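The three properties can be combined to check how far a labeling session has progressed. The sketch below is an illustration only: annotation_summary is a hypothetical helper, the hand-made indexes stand in for the links, distinct and unknown properties, and the None check is a defensive assumption rather than documented behaviour.

```python
import pandas as pd


def annotation_summary(links, distinct, unknown):
    """Count pairs per label and compute the fraction already labeled.

    Hypothetical helper for illustration; not part of recordlinkage.
    """

    def size(index):
        # Treat a missing index as empty (defensive assumption).
        return 0 if index is None else len(index)

    counts = {
        "links": size(links),
        "distinct": size(distinct),
        "unknown": size(unknown),
    }
    total = sum(counts.values())
    labeled = counts["links"] + counts["distinct"]
    counts["labeled_fraction"] = labeled / total if total else 0.0
    return counts


# Made-up pairs standing in for the properties of an AnnotationResult.
links = pd.MultiIndex.from_tuples([("a-1", "b-1"), ("a-2", "b-2")])
distinct = pd.MultiIndex.from_tuples([("a-1", "b-2")])
unknown = pd.MultiIndex.from_tuples([("a-3", "b-4")])

summary = annotation_summary(links, distinct, unknown)
print(summary)
```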