LinkOrgs: An R package for linking records on organizations using half-a-billion open-collaborated records from LinkedIn

What is LinkOrgs? | Installation | Tutorial | Comparison with Ground Truth | References | Documentation

March 2, 2024: LinkOrgs has been rebuilt with a JAX machine learning backend for improved computational efficiency, and we're currently adding locality-sensitive hashing from zoomerjoin for better handling of massive merge tasks (> 1 million observations in each data corpus). The rebuild is experimental, so let us know if you come across any issues.

NB: You can also check out a new, easier-to-use implementation here.

Installation

The most recent version of LinkOrgs can be installed directly from the repository using the devtools package:

# install package 
devtools::install_github("cjerzak/LinkOrgs-software/LinkOrgs")

The machine-learning-based algorithm, accessible via the algorithm = "ml" option, relies on tensorflow and Rtensorflow. For installation details, see https://tensorflow.rstudio.com/installation/. The network-based linkage approaches (algorithm = "bipartite" and algorithm = "markov") do not require these packages. To set up the machine learning backend, you can call:

# install ML backend  
LinkOrgs::BuildBackend(conda_env = "LinkOrgsEnv", conda = "auto")

Note that most package options require Internet access in order to download the saved machine learning model parameters and LinkedIn-based network information.

Tutorial

After installing the package, let's get some experience with it in an example.

# load in package 
library(LinkOrgs)

# set up synthetic data for the merge 
x_orgnames <- c("apple","oracle","enron inc.","mcdonalds corporation")
y_orgnames <- c("apple corp","oracle inc","enron","mcdonalds")
x <- data.frame("orgnames_x"=x_orgnames)
y <- data.frame("orgnames_y"=y_orgnames)

After creating these synthetic datasets, we're now ready to merge them. We can do this in a number of ways. See the paper listed in the References section for information about which approach may be most useful for your merge task.

First, we'll try a merge using parallelized fast fuzzy matching via LinkOrgs::LinkOrgs. A key hyperparameter is AveMatchNumberPerAlias, which controls the average number of matches per alias. In practice, we calibrate this with an initial random sampling step, so the exact matched dataset size won't be a perfect multiple of AveMatchNumberPerAlias. Here, we set AveMatchNumberPerAlias = 4 so that, for illustration purposes, all observations in this small four-record dataset are potentially matched against all others.

# perform merge using (parallelized) fast fuzzy matching
# LinkOrgs::LinkOrgs can be readily used for non-organizational name matches 
# when doing pure parallelized fuzzy matching 
z_linked_fuzzy <- LinkOrgs::LinkOrgs(x  = x,
                        y =  y,
                        by.x = "orgnames_x",
                        by.y = "orgnames_y",
                        algorithm = "fuzzy", 
                        DistanceMeasure = "jaccard", 
                        AveMatchNumberPerAlias = 4)
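To build intuition for what a Jaccard distance between names captures, here is a minimal base-R sketch that compares every name in x against every name in y using character bigrams. It is an illustration only: the helper names qgrams and jaccard_dist are hypothetical, the choice of bigrams is an assumption, and none of this is meant to reproduce the package's internal (parallelized) implementation.

```r
# Illustrative only: character-bigram Jaccard distance in base R.
# qgrams() and jaccard_dist() are hypothetical helpers, not LinkOrgs functions.
qgrams <- function(s, q = 2) {
  s <- tolower(s)
  n <- nchar(s)
  if (n < q) return(s)                      # too short to split: use the string itself
  unique(substring(s, 1:(n - q + 1), q:n))  # all overlapping q-character windows
}

jaccard_dist <- function(a, b, q = 2) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  1 - length(intersect(ga, gb)) / length(union(ga, gb))
}

x_orgnames <- c("apple", "oracle", "enron inc.", "mcdonalds corporation")
y_orgnames <- c("apple corp", "oracle inc", "enron", "mcdonalds")

# all-pairs distance matrix (rows: names in x, columns: names in y)
dist_mat <- outer(x_orgnames, y_orgnames,
                  Vectorize(function(a, b) jaccard_dist(a, b)))
dimnames(dist_mat) <- list(x_orgnames, y_orgnames)

# closest y-name for each x-name
best_match <- y_orgnames[apply(dist_mat, 1, which.min)]
data.frame(orgnames_x = x_orgnames, orgnames_y = best_match)
```

On this toy data, the nearest y-name for each x-name is the intended partner, which is why a pure string-distance approach suffices here; it cannot link aliases that share few characters.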

Next, we'll try using some of the LinkedIn-calibrated approaches using LinkOrgs::LinkOrgs:

# perform merge using bipartite network approach
z_linked_bipartite <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10,
                     algorithm = "bipartite", 
                     DistanceMeasure = "jaccard")
                     
# perform merge using markov network approach
z_linked_markov <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10,
                     algorithm = "markov", 
                     DistanceMeasure = "jaccard")


# Build backend for ML model (do this only once)
# try(LinkOrgs::BuildBackend( conda_env = "LinkOrgsEnv", conda = "auto" ),T)

# if conda = "auto" fails, try to specify the path to the correct conda 
# LinkOrgs::BuildBackend( conda_env = "LinkOrgsEnv", conda = "/Users/cjerzak/miniforge3/bin/python" )
                     
# perform merge using a machine learning approach
z_linked_ml <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10, 
                     conda_env = "LinkOrgsEnv",  
                     algorithm = "ml", ml_version = "v4")
# note: if your ML backend lives in a different conda environment, change "LinkOrgsEnv" to that environment's name
                     
# perform merge using combined network + machine learning approach
z_linked_combined <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10, 
                     AveMatchNumberPerAlias_network = 1, 
                     algorithm = "markov",
                     conda_env = "LinkOrgsEnv", 
                     DistanceMeasure = "ml", ml_version = "v4")
# note: if your ML backend lives in a different conda environment, change "LinkOrgsEnv" to that environment's name

Comparison of Results with Ground Truth

Using the package, we can also assess performance against a ground-truth merged dataset (if available):

# (After running the above code)
z_true <- data.frame("orgnames_x"=x_orgnames, "orgnames_y"=y_orgnames)

# Get performance matrix 
# z can be any of the linked datasets created above (here, the fuzzy-matched one)
PerformanceMatrix <- AssessMatchPerformance(x  = x, 
                                            y =  y, 
                                            by.x = "orgnames_x", 
                                            by.y = "orgnames_y", 
                                            z = z_linked_fuzzy, 
                                            z_true = z_true)
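As a rough cross-check on what such an assessment measures, the following base-R sketch computes pair-level precision and recall by hand. The helper pair_metrics and the dataset z_demo are hypothetical and made up for illustration. The sketch assumes the linked data frame keeps the by.x and by.y columns, and it is not a description of how AssessMatchPerformance works internally.

```r
# Hypothetical helper (not part of LinkOrgs): compares proposed pairs to true pairs.
pair_metrics <- function(z, z_true, by.x, by.y) {
  # represent each (x, y) pair as a single string key
  key <- function(d) paste(d[[by.x]], d[[by.y]], sep = " || ")
  found <- unique(key(z))
  truth <- unique(key(z_true))
  tp <- length(intersect(found, truth))     # correctly proposed pairs
  c(true_positives = tp,
    precision = tp / length(found),         # share of proposed pairs that are correct
    recall = tp / length(truth))            # share of true pairs that were recovered
}

z_true <- data.frame(
  orgnames_x = c("apple", "oracle", "enron inc.", "mcdonalds corporation"),
  orgnames_y = c("apple corp", "oracle inc", "enron", "mcdonalds"))

# made-up linked result: three correct pairs, one wrong pair, one missed pair
z_demo <- data.frame(
  orgnames_x = c("apple", "oracle", "enron inc.", "apple"),
  orgnames_y = c("apple corp", "oracle inc", "enron", "oracle inc"))

metrics <- pair_metrics(z_demo, z_true, "orgnames_x", "orgnames_y")
print(metrics)  # 3 true positives; precision and recall are both 0.75
```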

Improvements & Future Development Plan

We're always looking to improve the software in terms of ease-of-use and its capabilities. If you have any suggestions/feedback, or need further assistance in getting the package working for your analysis, please email [email protected].

In future releases, we will be expanding the merge capabilities (currently, we only allow inner joins [equivalent to setting all = F in the merge function from base R]; future releases will allow more complex inner, left, right, and outer joins).
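Until then, one possible workaround is to merge the linked result back onto the original data with base R. The sketch below uses a made-up inner-join-style result, z_inner, and assumes the linked output keeps the orgnames_x column; it only illustrates the base-R merge step, not any LinkOrgs functionality.

```r
x <- data.frame(
  orgnames_x = c("apple", "oracle", "enron inc.", "mcdonalds corporation"))

# made-up inner-join-style result in which one record of x found no match
z_inner <- data.frame(
  orgnames_x = c("apple", "oracle", "enron inc."),
  orgnames_y = c("apple corp", "oracle inc", "enron"))

# left join: keep every row of x, filling unmatched rows with NA
z_left <- merge(x, z_inner, by = "orgnames_x", all.x = TRUE)
z_left
```

Setting all.y = TRUE or all = TRUE in the same call would give the analogous right or outer join.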

Acknowledgments

We thank Beniamino Green, Kosuke Imai, Gary King, Xiang Zhou, and members of the Imai Research Workshop for valuable feedback. We would also like to thank Gil Tamir and Xiaolong Yang for excellent research assistance.

License

MIT License.

References

Brian Libgober, Connor T. Jerzak. "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records." Political Science Research and Methods, 2024. [PDF] [Data]

@article{libgober2024linking,
  title={Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records},
  author={Libgober, Brian and Jerzak, Connor T.},
  journal={Political Science Research and Methods},
  year={2024},
  pages={},
  publisher={Cambridge University Press}
}

Related work

Green, Beniamino. "Zoomerjoin: Superlatively-Fast Fuzzy Joins." Journal of Open Source Software 8:89 5693-5698, 2023. [PDF]