vsc

Source code for the paper "Verb Sense Clustering using Contextualized Word Representations for Semantic Frame Induction", accepted to Findings of ACL-IJCNLP 2021.

Installation

# Before installation, upgrade pip and setuptools.
$ pip install -U pip setuptools

# Install other dependencies.
$ pip install -r requirements.txt

# Install the vsc package.
$ pip install .
# Or if you want to install it in editable mode:
$ pip install -e .

Usage

All scripts for running the source code are in script/. Each script is named (directory name)_(file name).sh after the source file it runs.
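As a small illustration of this naming convention, the sketch below builds the script path for a given source directory and file. The helper `script_path` is hypothetical and not part of the repository; it only assumes that the convention holds for the file in question.

```python
from pathlib import PurePosixPath


def script_path(directory: str, file_name: str) -> str:
    """Return script/(directory name)_(file name).sh for a source file."""
    # Drop the .py extension, if any, to get the bare file name.
    stem = PurePosixPath(file_name).stem
    return str(PurePosixPath("script") / f"{directory}_{stem}.sh")


print(script_path("preprocessing", "extract_exemplars_framenet.py"))
```

For `preprocessing/extract_exemplars_framenet.py` this yields a path that matches the `script/preprocessing_extract_exemplars_*.sh` pattern mentioned in the preprocessing step.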

Before you start, download the annotated data: FrameNet and PropBank. If you use the source code as is, pay attention to the file names. FrameNet and PropBank can also be downloaded through the NLTK library, but those versions differ from the ones our code uses and require careful preprocessing. To run the scripts unmodified, put the data in data/raw, for example data/raw/fndata-1.7 or data/raw/ontonotes.

To experiment with ELMo, you will also need the pre-trained model files (Original) from this site. To run the scripts unmodified, put the files in data/raw/elmo. The expected file names are elmo_2x4096_512_2048cnn_2xhighway_options.json and elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5.
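Before running the scripts, it can help to confirm that the inputs are where the scripts look for them. The following is a minimal sketch, not part of the repository: the function `missing_inputs` is hypothetical, and the list of paths simply repeats the examples named in this README, so adjust it if your FrameNet or PropBank directories are named differently.

```python
from pathlib import Path

# Paths named in this README, relative to data/raw (assumed, adjust as needed).
EXPECTED = [
    "fndata-1.7",
    "ontonotes",
    "elmo/elmo_2x4096_512_2048cnn_2xhighway_options.json",
    "elmo/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5",
]


def missing_inputs(raw_dir="data/raw"):
    """Return the expected paths that do not exist under raw_dir."""
    root = Path(raw_dir)
    return [str(root / rel) for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    for path in missing_inputs():
        print("missing:", path)
```

Run it from the repository root; no output means every listed path was found.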

1. Preprocessing (preprocessing/)

Extract examples, lexical units, frames, etc. from the XML files for FrameNet (extract_exemplars_framenet.py) and PropBank (extract_exemplars_propbank.py). See script/preprocessing_extract_exemplars_*.sh for how to run them.

In addition, the frame-to-frame relation data used in the experiments is extracted from the FrameNet XML file (make_relation_list.py).

2. Experiment on Frame Distinction (experiment_frame_distinction/)

First, make the dataset for this experiment (make_dataset.py). Next, obtain the contextualized word embeddings of the target verbs (get_embeddings.py); using a GPU is recommended for this step. Then, perform frame distinction by clustering the embeddings (verb_sense_clustering.py).
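To make the clustering step concrete, here is a minimal, dependency-free sketch of clustering embedding vectors. This README does not say which algorithm verb_sense_clustering.py uses, so the sketch uses plain average-linkage agglomerative clustering as a stand-in. The function `average_linkage` and the toy 2-D vectors are illustrative assumptions, not the repository's code or data.

```python
import math


def average_linkage(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain; return labels."""
    clusters = [[i] for i in range(len(vectors))]

    def distance(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        p, q = min(pairs, key=lambda pq: distance(clusters[pq[0]], clusters[pq[1]]))
        merged = clusters.pop(q)  # q > p, so index p stays valid
        clusters[p].extend(merged)

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 2-D stand-ins for the contextualized embeddings of one verb.
toy = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
print(average_linkage(toy, n_clusters=2))  # [0, 0, 1, 1]
```

In this toy case, instances that receive the same label would be treated as evoking the same frame.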

You can aggregate the results by FrameNet frame-to-frame relation (aggregate_relations.py). You can also visualize the contextualized word embeddings of a target verb in two dimensions (visualize_embeddings.py).

3. Experiment on Frame Number Estimation (experiment_frame_number_estimation/)

First, make the dataset for this experiment (make_dataset.py). Next, obtain the contextualized word embeddings of the target verbs (get_embeddings.py); using a GPU is recommended for this step. Then, estimate the number of frames by clustering the embeddings (verb_sense_clustering.py).
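One simple way to turn clustering into a frame-number estimate is to stop merging once the remaining clusters are far apart and count what is left. The sketch below shows that idea only. The stopping criterion, the `threshold` value, and the function `estimate_cluster_count` are assumptions for illustration, since this README does not describe how verb_sense_clustering.py chooses the number of clusters.

```python
import math


def estimate_cluster_count(vectors, threshold):
    """Merge closest clusters until none are within threshold; count the rest."""
    clusters = [[v] for v in vectors]

    def distance(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(x, y) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        p, q = min(pairs, key=lambda pq: distance(clusters[pq[0]], clusters[pq[1]]))
        if distance(clusters[p], clusters[q]) > threshold:
            break  # the closest pair is already too far apart to merge
        merged = clusters.pop(q)  # q > p, so index p stays valid
        clusters[p].extend(merged)
    return len(clusters)


# Toy 2-D stand-ins for embeddings of one verb: two tight pairs and one outlier.
toy = [(0.0, 0.0), (0.2, 0.0), (4.0, 4.0), (4.2, 4.0), (9.0, 0.0)]
print(estimate_cluster_count(toy, threshold=1.0))  # 3
```

The estimate depends directly on the threshold, so in practice it would need to be tuned on held-out verbs.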

Citation

Please cite our paper if this source code is helpful in your work.

@inproceedings{yamada-etal-2021-verb,
    title = "Verb Sense Clustering using Contextualized Word Representations for Semantic Frame Induction",
    author = "Yamada, Kosuke  and
      Sasano, Ryohei  and
      Takeda, Koichi",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    year = "2021",
    url = "https://aclanthology.org/2021.findings-acl.381",
    pages = "4353--4362",
}