GroC: a PyTorch implementation of the grounded compositional output model for adaptive language modeling presented at EMNLP 2020 [1]. The model has a fully compositional output embedding layer that can optionally be further grounded in information from a structured lexicon (WordNet), namely semantically related words and free-text definitions. It can be applied both to conventional language modeling and to challenging cross-domain settings with an open vocabulary.

@inproceedings{pappas-etal-2020-grounded,
    title = "Grounded Compositional Outputs for Adaptive Language Modeling",
    author = "Pappas, Nikolaos and Mulcaire, Phoebe and Smith, Noah A.",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.96",
    pages = "1252--1267",
    year = "2020"
}

Installation

Note that this repository is based on the codebase of awd-lstm [2]; for a general-purpose language model, consider using that library directly. Before running the code, make sure you have created the conda environment:

conda env create -f environment.yml

Data

To obtain the datasets for conventional language modeling, please follow the instructions from [2]; for example, the datasets can be obtained by running this script.

Example:

echo "- Downloading Penn Treebank (PTB)"
wget --quiet --continue http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
tar -xzf simple-examples.tgz
mkdir -p penn
cd penn
mv ../simple-examples/data/ptb.train.txt train.txt
mv ../simple-examples/data/ptb.test.txt test.txt
mv ../simple-examples/data/ptb.valid.txt valid.txt
cd ..
rm -rf simple-examples/
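
After the download step, it can help to confirm that the three splits landed where `--data penn` will look for them. The following is a minimal sketch and not part of the repository: the helper name `summarize_corpus` is hypothetical, and it assumes whitespace-tokenized text files named `train.txt`, `valid.txt`, and `test.txt`, as created by the commands above.

```python
from pathlib import Path


def summarize_corpus(data_dir):
    """Return {split: (num_lines, num_tokens)} for the three expected splits.

    Hypothetical helper (not in the repository). Raises FileNotFoundError
    if any of train.txt, valid.txt, or test.txt is missing.
    """
    stats = {}
    for split in ("train", "valid", "test"):
        path = Path(data_dir) / f"{split}.txt"
        if not path.is_file():
            raise FileNotFoundError(f"missing split file: {path}")
        num_lines = 0
        num_tokens = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                num_lines += 1
                # Tokens are assumed to be separated by whitespace.
                num_tokens += len(line.split())
        stats[split] = (num_lines, num_tokens)
    return stats
```

Calling `summarize_corpus("penn")` from the directory that contains `penn/` returns the line and token counts per split, or fails early if a file was not moved into place.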

Training

Below are a few example commands for training the baseline language model or GroC with different options.

Baseline (tied)

python -W ignore main.py --data penn --dropouti 0.4 --dropouth 0.25 --seed 28 --batch_size 20 --epoch 1000 \
--save tied --cuda --cuda_device 0

GroC (char)

python -W ignore main.py --data penn --dropouti 0.4 --dropouth 0.25 --seed 28 --batch_size 20 --epoch 1000 \
--save groc_char --char_emb --char_update_ratio 0.3 --cuda --cuda_device 0

GroC (char, rel, def)

python -W ignore  main.py --data penn --dropouti 0.4 --dropouth 0.25 --seed 28 --batch_size 20 \
--epoch 1000 --save groc_full --char_emb --rel_emb --def_emb --char_update_ratio 0.3 --cuda --cuda_device 0

GroC for adaptation (char, rel, def, deep residual net, bias estimator)

python -W ignore main.py --data penn --dropouti 0.4 --dropouth 0.25 --seed 28 --batch_size 20 \
--epoch 1000 --save groc_full --char_emb --predict_bias --joint_emb 400 --joint_emb_depth 4 --joint_dropout 0.6 \
--joint_locked_dropout --joint_emb_activation Sigmoid --char_update_ratio 0.3 --cuda --cuda_device 0
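
The commands above share most of their options, and hand-edited backslash continuations are easy to break. As a minimal sketch (not part of the repository), the argument list can be assembled in Python instead. The helper name `build_train_command` and the `COMMON_ARGS` dictionary are hypothetical; every flag and value is copied from the commands above.

```python
import shlex

# Options shared by every training command shown above.
COMMON_ARGS = {
    "--data": "penn",
    "--dropouti": "0.4",
    "--dropouth": "0.25",
    "--seed": "28",
    "--batch_size": "20",
    "--epoch": "1000",
}


def build_train_command(save_name, extra_args=(), cuda_device=0):
    """Return the argv list for one training run (hypothetical helper)."""
    argv = ["python", "-W", "ignore", "main.py"]
    for flag, value in COMMON_ARGS.items():
        argv += [flag, value]
    argv += ["--save", save_name]
    argv += list(extra_args)  # model-specific options, e.g. --char_emb
    argv += ["--cuda", "--cuda_device", str(cuda_device)]
    return argv


# Rebuild the "GroC (char)" command and print it as a single shell line.
groc_char = build_train_command(
    "groc_char", ["--char_emb", "--char_update_ratio", "0.3"]
)
print(shlex.join(groc_char))
```

The resulting list can be passed directly to `subprocess.run(groc_char, check=True)`, which avoids shell line continuations altogether.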

We also make the pretrained models and configurations from our experiments on conventional and cross-domain language modeling publicly available: Pretrained GroC models (Google Drive).

Evaluation

Download the News Crawl data from http://statmt.org/wmt14/translation-task.html (under “Monolingual language model training data”). Place the downloaded files in a directory called “raw” and run scripts/create-data.sh to recreate our splits.

python evaluate.py --test_data data/news2007_train.news2008_test/ --save saved-models/[our model] --cuda \
--cuda_device 0 --seed 1234 --adapt_method [change_vocab|interpolate_neural|interpolate_unigram]
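
The `interpolate_*` option names suggest that a second distribution is mixed into the model's predictions. The sketch below shows only the generic idea of linearly interpolating a model distribution with a unigram distribution estimated from target-domain text. It is an assumption made for illustration, not the code in `evaluate.py`: the function names and the mixing weight `lam` are hypothetical.

```python
from collections import Counter


def unigram_distribution(tokens):
    """Relative-frequency estimate of p(word) from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(p_model, p_unigram, lam):
    """Return (1 - lam) * p_model + lam * p_unigram over the union vocabulary.

    Words missing from one distribution get probability 0 under it.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocab = set(p_model) | set(p_unigram)
    return {
        word: (1.0 - lam) * p_model.get(word, 0.0)
        + lam * p_unigram.get(word, 0.0)
        for word in vocab
    }


# Toy example: "dog" is unseen by the model but frequent in the new domain.
p_model = {"the": 0.5, "cat": 0.5}
p_unigram = unigram_distribution(["the", "dog", "dog", "the"])
mixed = interpolate(p_model, p_unigram, lam=0.5)
print(sorted(mixed.items()))  # [('cat', 0.25), ('dog', 0.25), ('the', 0.5)]
```

Because both inputs sum to one, the mixture does too, and a word that the model never predicted can still receive probability mass from the target-domain counts.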

References

[1] Nikolaos Pappas, Phoebe Mulcaire, Noah A. Smith, Grounded Compositional Outputs for Adaptive Language Modeling, Conference on Empirical Methods in Natural Language Processing, 2020
[2] Stephen Merity, Nitish Shirish Keskar, Richard Socher, Regularizing and Optimizing LSTM Language Models, Sixth International Conference on Learning Representations, Vancouver, Canada, 2018

Contact

For questions and requests please email: npappas@cs.washington.edu, pmulcaire@cs.washington.edu.
