This repo contains code and data for the paper 'Machine-Guided Polymer Knowledge Extraction Using Natural Language Processing: The Example of Named Entity Normalization' [1].
Requirements:
- Python 3.6+
- PyTorch (version 1.5.0)
- scikit-learn (version 0.22.1)
- spaCy (version 2.1.6)
You can install all required Python packages from the provided env.yml file by running: conda env create -f env.yml
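After creating the environment, it can be useful to confirm that the installed packages match the pinned versions. The following is a minimal sketch, not part of this repo: the function names `installed_version` and `check_environment` are hypothetical, and the distribution names (`torch`, `scikit-learn`, `spacy`) are assumed to be the ones used by the package index.

```python
# Pinned versions copied from the requirements listed in this README.
PINNED = {"torch": "1.5.0", "scikit-learn": "0.22.1", "spacy": "2.1.6"}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        from importlib import metadata  # standard library on Python 3.8+
    except ImportError:
        metadata = None
    if metadata is not None:
        try:
            return metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            return None
    # Fallback for Python 3.6/3.7, where importlib.metadata is unavailable.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(dist_name).version
    except pkg_resources.DistributionNotFound:
        return None


def check_environment(pinned=PINNED, lookup=installed_version):
    """Map each mismatched package to an (expected, found) pair."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Some builds append a local suffix such as "+cu101"; ignore it here.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_environment().items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result from `check_environment()` means every pinned package was found at the expected version.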
The normalization code is adapted from https://github.com/iesl/expLinkage [2]. Major changes include the addition of a parameterized cosine distance metric and of a test mode that predicts clusters for zero-shot data. The following commands can be used to replicate the experiments in the paper.
To train the supervised clustering model in our paper:
python src/trainer/train_vect_data.py --config="src/utils/Config.py" --mode="train" --resultDir="/path/to/output_dir" --clusterFile="data/input_data/fastText/labeled_polymer_clusters.tsv"
To train the baseline model described in our paper:
python src/baseline/baseline_train.py --labeled_file="data/input_data/fastText/labeled_polymer_clusters_with_name.tsv" --use_labels=True --output_dir="path/to/output_dir"
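For readers unfamiliar with the parameterized cosine distance mentioned above, here is one possible form, sketched in plain Python. This is an illustration under assumptions, not the metric implemented in this repo: it assumes the parameters are a per-dimension weight vector applied to both inputs before the usual cosine distance is computed. The function name `parameterized_cosine_distance` and the `weights` argument are hypothetical; the paper [1] and the source code define the actual metric.

```python
import math


def parameterized_cosine_distance(x, y, weights):
    """Cosine distance between x and y after per-dimension scaling.

    `weights` stands in for learned parameters (an assumption made for
    this sketch; the trained model may parameterize the metric differently).
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    # Scale each dimension of both vectors by its weight.
    wx = [w * a for w, a in zip(weights, x)]
    wy = [w * b for w, b in zip(weights, y)]
    dot = sum(a * b for a, b in zip(wx, wy))
    norm = math.sqrt(sum(a * a for a in wx)) * math.sqrt(sum(b * b for b in wy))
    if norm == 0.0:
        # Convention chosen for this sketch: a zero vector is maximally distant.
        return 1.0
    return 1.0 - dot / norm
```

With all weights equal to 1 this reduces to the ordinary cosine distance. Setting a weight to 0 removes that dimension from the comparison, which is how a learned weighting can emphasize the embedding dimensions that matter for grouping polymer names.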
[1] Shetty, Pranav, and Rampi Ramprasad. "Machine-Guided Polymer Knowledge Extraction Using Natural Language Processing: The Example of Named Entity Normalization." Journal of Chemical Information and Modeling (2021).
[2] Yadav, Nishant, et al. "Supervised hierarchical clustering with exponential linkage." International Conference on Machine Learning. PMLR, 2019.