clean: datasets with duplicates and no modification. Datasets are named clean_x_y_i, where x is the dataset size, y is the percentage of duplicates, and i is the dataset identifier (assuming two data sources).
dirty_l_m_n: datasets with corrupted duplicates, generated by Febrl with some post-processing to separate originals and duplicates (l: the maximum number of duplicates for each original record; m: the maximum number of modifications in a field; n: the maximum number of modifications in a record).
dirty_typo: datasets with corrupted duplicates. Only the 'Surname' field values are modified. The modification types are insertion, deletion and substitution, chosen with equal probability; the error positions are selected at random.
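To make the dirty_typo description concrete, here is a minimal sketch of how a single typo could be injected into a surname. It is not the script that produced these datasets and it does not use Febrl. The function name `corrupt_surname`, the choice of exactly one edit per value, and the lowercase ASCII alphabet are all assumptions made for illustration.

```python
import random
import string


def corrupt_surname(surname, rng):
    """Apply one random edit to a surname (hypothetical helper).

    The edit type (insertion, deletion, substitution) is drawn with equal
    probability, and the position is drawn uniformly at random.
    Assumption: one edit per value, letters drawn from a-z.
    """
    if not surname:
        return surname  # nothing to corrupt

    op = rng.choice(["insert", "delete", "substitute"])
    letter = rng.choice(string.ascii_lowercase)

    if op == "insert":
        # An insertion can land before, between, or after existing characters.
        pos = rng.randrange(len(surname) + 1)
        return surname[:pos] + letter + surname[pos:]

    pos = rng.randrange(len(surname))
    if op == "delete":
        return surname[:pos] + surname[pos + 1:]
    # Substitution; may occasionally pick the same letter and leave the value unchanged.
    return surname[:pos] + letter + surname[pos + 1:]


rng = random.Random(42)  # fixed seed so a test set can be regenerated
duplicates = [corrupt_surname(name, rng) for name in ["smith", "nguyen", "oconnor"]]
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps generation reproducible, which matters if the same corrupted set has to be shared between the two data sources.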
Yangfeng suggested looking at Febrl to generate data with perturbations.
Manual - http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/
Additional test sets:
Aha! Link: https://csiro.aha.io/features/ANONLINK-76