Examining whether pre-trained language models have an understanding of structural alternations
Dependencies are managed using conda. To set up the conda environment for the framework, issue the following command from within the `structural-alternations` directory:

`conda env create -f environment.yaml`

Once the environment has been created, activate it with `conda activate salts`.
There are two main types of experiments that can be run using the `structural-alternations` framework: non-tuning and tuning experiments. Within tuning experiments, there are two sub-types: new argument and new verb experiments. (There are also some additional scripts, `cls_emb.py` and `check_args.py`, that aid in the setup of new verb experiments; these are discussed later.)
Non-tuning experiments involve taking an off-the-shelf pre-trained model and examining its logit distributions on masked language modeling (MLM) tasks for a variety of pre-determined tokens or token groups, allowing you to examine things like entropy or token-group confidence in particular positions in testing data for pre-trained BERT models. (Note that due to recent updates, non-tuning experiments may no longer work out of the box. This will be worked on later.)
Tuning experiments allow you to take a small set of tuning data and fine-tune a pre-trained model by introducing nonce words into the model's vocabulary. You can then test how the model performs on MLM tasks with respect to its predictions about how these nonce tokens are used.
New argument experiments are a sub-type of tuning experiments that introduce new argument nouns into a model's vocabulary.
New verb experiments introduce a novel verb into the model's vocabulary. Unlike new argument experiments, predictions are not collected on the novel word; instead, predictions are collected regarding the arguments of the novel verb. In other words, you can examine what the model has learned about the new verb by examining its predictions regarding its possible arguments in various structures pre- and post-tuning.
Configuration is handled using Hydra. Default values are specified in `.yaml` files located in the `conf` directory (and its subdirectories). When running from the command line, default values are overridden using `key=value` syntax. Additional explanation of how to flexibly specify parameters using Hydra can be found at hydra.cc.

The name of each option and its default value are listed below as `name (default)`, with explanations.
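As a rough illustration of how these defaults and overrides fit together, here is a minimal sketch of a Hydra primary config for tuning; the exact contents of the project's `tune.yaml` are an assumption, but the config-group names match the options described below.

```yaml
# Hypothetical sketch of a Hydra primary config (e.g. conf/tune.yaml).
# The defaults list picks one .yaml file from each config group in conf/.
defaults:
  - model: distilbert                    # i.e. conf/model/distilbert.yaml
  - tuning: dative_DO_give_active        # i.e. conf/tuning/dative_DO_give_active.yaml
  - override hydra/job_logging: custom   # name of the utf-8 logger config is assumed

dev: []
dev_exclude: []
n: 0

# Any value can be overridden from the command line with key=value syntax,
# e.g. model=bert, tuning=untuned, or hyperparameters.lr=0.0001.
```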
The following options, with defaults specified in `tune.yaml`, configure fine-tuning with `tune.py`:

- `model` (`distilbert`): which pretrained model to use. This should correspond to the name of a `.yaml` file in `conf/model`. This `.yaml` file should contain the following (a sketch of such a file appears after this list):
  - `string_id`: the huggingface string id of the model
  - `friendly_name`: whatever you'd like to use as a non-string-id name for the model
  - `model_kwargs`: a dict of kwargs to pass to huggingface transformers' `AutoModelForMaskedLM.from_pretrained`
  - `tokenizer_kwargs`: a dict of kwargs to pass to huggingface transformers' `AutoTokenizer.from_pretrained`
- `tuning` (`dative_DO_give_active`): which tuning data to use. This should correspond to the name of a `.yaml` file in `conf/tuning`. The contents of this `.yaml` file will differ depending on whether you're running a new argument or a new verb experiment, and are detailed below.
- `override hydra/job_logging`: this points to a custom logger that allows the use of utf-8 in log files (instead of just ASCII). You shouldn't need to touch this.
- `dev` (`[]`): a list of names of `.yaml` files in `conf/tuning` to use as dev sets during fine-tuning experiments. Average loss across all dev sets is used to determine early stopping. In addition to any dev sets provided, the training set with dropout disabled and novel token masking enabled is always used as a dev set. A special option for `dev`, `best_matches`, uses the filename of the current tuning file as a base, and finds all tuning files that differ from it in one string when split by underscores (e.g. `tuning=dative_DO_give_active dev=best_matches` would use `dative_DO_send_active`, `dative_DO_give_passive`, `dative_PD_give_active`, and `dative_DO_mail_active` as dev sets).
- `dev_exclude` (`[]`): a list of strings used to exclude dev sets when using `dev=best_matches`. A dev set containing any string in `dev_exclude` will not be included in the dev sets, even if it would be selected by the `best_matches` criterion.
- `n` (`0`): how many models to fine-tune using the specified options with different seeds for the randomly initialized novel token embeddings. Only used with Hydra's `-m`/`--multirun` option.
- `hyperparameters`: a dict containing the following hyperparameters (see the sketch after this list).
  - `lr` (`0.001`): the learning rate. If using gradual or complete unfreezing, it is recommended to set this to `0.0001` instead of the default; if unfreezing only the embeddings of the novel tokens, the default works better.
  - `max_epochs` (`70`): the maximum number of epochs to fine-tune for.
  - `min_epochs` (= `max_epochs`): the minimum number of epochs to fine-tune for.
  - `patience` (= `max_epochs`): how many epochs to continue fine-tuning with no improvement on average loss across the dev sets.
  - `delta` (`0`): how much of an improvement on average dev loss is sufficient to allow training to continue. `0` means any improvement, no matter how small, resets the patience counter.
  - `masked_tuning_style` (`always`): how to mask the novel tokens. Possible options are `always`, `bert`, `roberta`, or `none`.
    - `always`: always mask the novel tokens.
    - `bert`: before fine-tuning, decide to mask 80% of the novel tokens in the fine-tuning data, leave 10% intact, and replace 10% with a random token from the model's vocabulary.
    - `roberta`: like `bert`, but rerun the decision about what to do with each novel token every epoch.
    - `none`: do not mask the novel tokens.
  - `strip_punct` (`false`): whether to remove most punctuation from the fine-tuning data. Punctuation that is not stripped is `[]<>,`.
  - `unfreezing` (`none`): one of `gradual{int}`, `mixout{float}`, `{int}`, `complete`, or `none`. When using `unfreezing=none`, only the weights of the novel tokens are updated, and only those are saved. Because this requires much less space than saving the full model, weights are saved for every epoch. When using any other option, only the full model checkpoint with the lowest average dev loss will be saved.
    - `gradual{int}`: gradual unfreezing unfreezes one layer of the model at a time, starting from the highest-numbered layer and proceeding backward until all layers are unfrozen. `{int}` should be replaced with an integer specifying how many epochs to wait between unfreezing one layer and the previous layer. If no integer is provided, the default is 1 epoch (i.e., `hyperparameters.unfreezing=gradual` is equivalent to `hyperparameters.unfreezing=gradual1`).
    - `mixout{float}`: mixout unfreezing completely unfreezes the model, but replaces dropout layers with mixout layers. Mixout layers randomly replace a parameter with the original model's parameter with a probability of `{float}`.
    - `{int}`: an integer specifying the highest layer of the model that should remain frozen. E.g., `unfreezing=6` means that layers 0-6 are frozen, and layers 7+ are unfrozen.
    - `complete`: unfreeze all model parameters, including word embeddings. Note that the previous options do not unfreeze word embeddings, only hidden layers.
    - `none`: do not unfreeze any model parameters except for the embeddings of the novel tokens (which must be unfrozen for any learning to take place).
  - `mask_args` (`false`): whether to mask argument positions in new verb experiments, separately from whether to mask the novel verb (which is set by `hyperparameters.masked_tuning_style`). Only used for new verb experiments.
  - `use_kl_baseline_loss` (`false`): whether to use a loss term that combines the default cross-entropy loss with a loss based on the KL divergence between the predictions of the model being fine-tuned and the predictions of the pre-fine-tuning version of that model. Only used if setting `hyperparameters.unfreezing` to anything other than `none`.
- `kl_loss_params`: if `hyperparameters.use_kl_baseline_loss` is set to `true`, these control how that loss is calculated.
  - `dataset` (`datamaker/datasets/miniboki-2022-04-01_22-58-30/miniboki`): a directory containing a dataset in huggingface's datasets format with sentences to use to compute the KL divergence term. The term compares the distribution of the model being fine-tuned to a baseline, non-fine-tuned version of the model, to minimize the divergence between the predictions of the two. The default is a dataset of 10,000 sentences constructed to mimic BERT's pretraining dataset, using current data from Wikipedia and BookCorpus in the same ratio as they occurred in BERT's pretraining data.
  - `n_examples_per_step` (`100`): how many randomly chosen examples from the dataset to use when calculating the KL loss term every epoch.
  - `scaleby` (`0.5`): a multiplier for the KL divergence loss term that controls how much to weight it relative to the default cross-entropy loss.
  - `masking` (`none`): how to mask inputs when calculating the KL divergence loss. When using `kl_loss_params.masking=none`, KL divergence is calculated based on predictions for the entire sentence (i.e., input token sequence); with other values, it is calculated based only on the mask tokens, to mimic BERT's pretraining objective.
    - `always`: randomly choose 15% of tokens in the input sentences and mask them.
    - `bert`: randomly choose 15% of tokens in the input sentences. Of those, mask 80%, do nothing to 10%, and replace 10% with a random token from the model's vocabulary (not including the novel tokens).
    - `none`: do not mask any tokens.
- `debug` (`false`): whether to log predictions for sample sentences every epoch during fine-tuning.
- `use_gpu` (`false`): whether to use GPU support. If a GPU is not available, this will automatically be set to `false`. If you save a model fine-tuned using a GPU, you will still be able to load it for evaluation on a CPU.
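To make these options concrete, below is a hedged sketch of (1) a model config of the kind `conf/model` expects and (2) the `hyperparameters` and `kl_loss_params` blocks with the documented defaults. Field names follow the descriptions above; specific values such as the huggingface string id, and the exact nesting of `strip_punct` and `mask_args`, are assumptions.

```yaml
# (1) Hypothetical conf/model/distilbert.yaml
string_id: distilbert-base-uncased    # huggingface model id (assumed value)
friendly_name: distilbert
model_kwargs: {}                      # passed to AutoModelForMaskedLM.from_pretrained
tokenizer_kwargs: {}                  # passed to AutoTokenizer.from_pretrained
---
# (2) Hypothetical hyperparameters and kl_loss_params blocks, using the
#     defaults documented above.
hyperparameters:
  lr: 0.001                   # use 0.0001 with gradual/complete unfreezing
  max_epochs: 70
  min_epochs: 70              # defaults to max_epochs
  patience: 70                # defaults to max_epochs
  delta: 0
  masked_tuning_style: always # always | bert | roberta | none
  strip_punct: false
  unfreezing: none            # none | complete | gradual{int} | mixout{float} | {int}
  mask_args: false            # new verb experiments only (placement assumed)
  use_kl_baseline_loss: false

kl_loss_params:
  dataset: datamaker/datasets/miniboki-2022-04-01_22-58-30/miniboki
  n_examples_per_step: 100
  scaleby: 0.5
  masking: none               # none | always | bert
```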
The following options configure evaluation with `eval.py`:

- `data` (`syn_give_give_ext`): which data to use for evaluation. This should correspond to the name of a file in `conf/data`, which should include the following information (a sketch appears after this list).
  - `name`: the name of the dataset (including the file extension). This should correspond to the name of a file in `./data`. A dataset consists of rows with lists of sentences separated by ` , ` (a space, followed by a comma, followed by a space). If we think of this like a CSV, rows correspond to different examples of sentence types, with each column corresponding to a single sentence type.
  - `description`: a description of the dataset.
  - `sentence_types`: a list of the sentence types in the dataset, with one for each column.
  - `eval_groups`: a dict mapping a label to the novel tokens.
  - `to_mask`: a list containing the novel tokens.

  For new argument experiments only:
  - `masked_token_targets`: a dict mapping each novel token to a list of existing tokens to compare it to. Used to get tSNEs and cosine similarities between each novel token and its targets, to compare the learned embeddings to the embeddings of the existing tokens.
  - `masked_token_target_labels`: a dict mapping each novel token to a label for the target group it is being compared to.

  For new verb experiments only:
  - `added_args`: a list of dicts specifying additional in-group but out-of-training arguments to include during evaluation. The dict key should correspond to an arg group from the tuning file, and its value should be a dict mapping argument types to a list of strings of additional arguments to include in that group during evaluation.
  - `prediction_sentences`: sentences to log and save full model predictions for during evaluation.
- `override hydra/job_logging`: this points to a custom logger that allows the use of utf-8 in log files (instead of just ASCII). You shouldn't need to touch this.
- `criteria` (`all`): a comma-separated list of strings passed as a single string. To be included in evaluation, a model's checkpoint directory must contain all of these strings. `all` is a special value meaning no exclusions.
- `create_plots` (`true`): whether to create plots of tSNEs, cosine similarities, and odds ratios. You can skip plot creation to save time and just get the CSVs.
- `epoch` (`best_mean`): which epoch to evaluate the model at. If using any `hyperparameters.unfreezing` other than `none`, this can only be `0` or `best_mean`. If using `hyperparameters.unfreezing=none`, other options are available. Pass an integer to evaluate the model at that epoch. Pass `best_mean` to evaluate the model at the state with the lowest average dev loss. Pass `best_sumsq` to evaluate the model at an epoch where at least one dev set is at its lowest loss and the difference between performance on the dev sets is minimized.
- `topk_mask_token_predictions` (): how many of the top predictions to get for masked tokens in `data.prediction_sentences`.
- `k` (`50`): find and save the `k` subword tokens with the most similar embeddings to the novel embeddings (using cosine similarity).
- `num_tsne_words` (`500`): plot the first two tSNE components of the first n tokens in the model vocabulary and the novel tokens (to compare the learned representations of the novel tokens to the learned representations of existing tokens).
- `comparison_dataset` (`datamaker/datasets/miniboki-2022-04-01_22-58-30/miniboki`): if provided, compare the fine-tuned model's predictions to those of the same model pre-fine-tuning on this dataset. The default is as described above in the `tune.yaml` option `kl_loss_params.dataset`.
- `comparison_n_exs` (`100`): how many sentences to draw from the dataset to calculate KL divergence on.
- `comparison_masking` (`none`): how to mask tokens in sentences during comparison to the model baselines. Options are the same as those described for the `tune.yaml` option `kl_loss_params.masking`.
- `dir` (): a directory containing subdirectories (arbitrarily nested) with model checkpoints to evaluate. All subdirectories of `dir` containing valid model checkpoints will be evaluated.
- `summarize` (`false`): whether to produce summaries when evaluating multiple models in the same run. Summarization averages, across models, the predictions for the most similar tokens, the cosine similarities for target tokens, and the odds ratios. For odds ratio comparisons, each model's predictions for each evaluation token group are reduced to a single point representing the mean of that group.
- `debug` (`false`): whether to log model predictions for sample sentences in addition to the `data.prediction_sentences`.
- `use_gpu` (`false`): whether to use GPU support.
- `rerun` (`false`): whether to rerun evaluations on directories already containing the expected number of results files.
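Below is a hedged sketch of what a data config in `conf/data` might contain for a new argument experiment. The field names follow the descriptions above, but the dataset name, sentence types, and novel token spellings are invented for illustration.

```yaml
# Hypothetical conf/data config for a new argument experiment.
name: my_eval_data.txt            # a file in ./data (name and extension assumed)
description: example evaluation data for two novel argument nouns
sentence_types:                   # one entry per column of the data file
  - active
  - passive
eval_groups:
  novel_nouns: [thax, gorx]       # label -> novel tokens (spellings assumed)
to_mask:
  - thax
  - gorx

# New argument experiments only:
masked_token_targets:
  thax: [woman, lawyer, student]  # existing tokens to compare the learned embedding to
  gorx: [file, computer, letter]
masked_token_target_labels:
  thax: people
  gorx: things

# New verb experiments would instead specify added_args and
# prediction_sentences, as described above.
```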
For both new argument and new verb experiments, the following should be specified in a tuning config file.

- `name`: the name of the tuning dataset.
- `reference_sentence_type`: the name of the reference sentence type, which is the kind of sentence included in the fine-tuning dataset. This should match whatever you call the same sentence type in the evaluation data file; plots are drawn to compare each other sentence type to the reference sentence type.
- `exp_type`: for new argument experiments, `newarg`; for new verb experiments, `newverb`.
- `to_mask`: a list of the novel tokens to mask in the input sentences.
- `data`: a list of sentences to use as fine-tuning data.
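A hedged sketch of a new argument tuning config with these fields might look like the following; the name, sentences, and novel token spellings are invented for illustration.

```yaml
# Hypothetical conf/tuning config for a new argument experiment.
name: my_newarg_tuning
reference_sentence_type: active   # should match the name used in the evaluation data
exp_type: newarg
to_mask:
  - thax
  - gorx
data:
  - the thax gave the gorx to the teacher .
  - the thax sent the gorx to the lawyer .
```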
In addition, new verb experiments should specify the following options.

- `num_words`: used only by `check_args.py`, not during fine-tuning. Specifies how many of the best generated predictions to display for each argument type.
- `which_args`: which set of arguments to use during fine-tuning. This should correspond to the name of an option specified in the file, or `model`, which uses the arguments corresponding to the model's friendly name.
- `check_args_data`: a list of sentences to use when generating sets of arguments using `check_args.py`. Not used during fine-tuning.
Argument sets are specified as a dictionary mapping an argument label to a list of arguments to substitute for that label during fine-tuning. You can also specify a specific random seed to use when initializing the new verb token for the model; this is useful when using the arguments generated by `check_args.py`, as it ensures that the random state used to generate the unbiased arguments matches the one used during fine-tuning. The random seed is used when `tuning.which_args` is set to `model`, to the actual model's name (e.g., `model=bert tuning.which_args=bert`), or to `best_average` or `most_similar`.
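For new verb experiments, the additional options and an argument-set block might look roughly like this sketch. The argument-type labels, nouns, seed, and exact nesting are placeholders rather than a verified file; in practice, `check_args.py` can generate such blocks for you.

```yaml
# Hypothetical additions to a conf/tuning config for a new verb experiment.
exp_type: newverb
num_words: 6                    # only used by check_args.py
which_args: model               # or the name of one of the argument sets below

check_args_data:                # used by check_args.py only, not during fine-tuning
  - the [subj] blorked the [obj] .   # placeholder labels and novel verb are illustrative

# Argument sets: each argument label maps to the nouns substituted for it
# during fine-tuning. A seed can be recorded so the novel verb's random
# initialization matches the state used when check_args.py generated the set.
bert:                           # an argument set named after a model's friendly_name
  seed: 42                      # placeholder value; exact placement is an assumption
  '[subj]': [student, doctor, artist]
  '[obj]': [report, ticket, package]
```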
`check_args.py` provides an interface for generating sets of argument nouns that start off unbiased toward a particular structural position. Simply choosing nouns like `woman` and `lawyer` to be subjects and `file` and `computer` to be objects would make the results of new verb experiments harder to evaluate: does the model do well because it generically predicts the former two nouns to be more likely in subject positions than in object positions (and vice versa for the latter two), or did it learn this from the fine-tuning data? By starting off with nouns that are unbiased toward particular argument positions, it is easier to be sure that any improvement could not be due to pre-existing knowledge of specific tokens, but instead must reflect generalizations across structural contexts. All candidate arguments must be tokenized as a single token in all models.
The following are the configuration options for `check_args.py`, specified in `check_args.yaml`.
- `dataset_loc` (`conf/subtlex_freqs_formatted.csv.gz`): the location of a dataset containing words along with frequency information to pull candidate words from. Currently only SUBTLEX is supported, though it would be easy to add support for other datasets.
- `tunings` (`[]`): a list of strings corresponding to tuning files in `conf/tuning`. Data (including the `check_args_data`) from these tuning files is used to determine which arguments are unbiased toward specific structural positions.
- `target_freq` (`any`): the target frequency for candidate nouns. `any` means all nouns, regardless of frequency of occurrence, are possible candidates.
- `range` (`1000`): a noun's frequency must be within +/- this number of the target frequency to be a possible candidate. Only used if `target_freq` is not set to `any`.
- `min_length` (`4`): the minimum acceptable length in characters for a candidate noun. Changing this can help avoid unwanted things like acronyms and abbreviations.
- `strip_punct` (`false`): whether to remove punctuation from sentences when checking argument bias.
- `patterns`: a dict mapping argument types to a list of indices. When outputting an `args.yaml` file containing argument sets for each model, the top least-biased `tuning.num_words` × (number of argument types) arguments will be separated into each argument list according to the indices here.
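A hedged sketch of `check_args.yaml` with the documented defaults might look as follows; the `tunings` entry and the shape of the `patterns` block are illustrative assumptions.

```yaml
# check_args.py configuration, shown with the documented defaults.
dataset_loc: conf/subtlex_freqs_formatted.csv.gz
tunings:
  - newverb_transitive_ext   # a tuning file in conf/tuning (choice is illustrative)
target_freq: any
range: 1000                  # only used when target_freq is not "any"
min_length: 4
strip_punct: false
patterns:                    # argument type -> indices into the ranked candidate list
  '[subj]': [0, 2, 4]
  '[obj]': [1, 3, 5]
```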
The output of `check_args.py` includes model predictions for each argument (the log odds ratio of the noun occurring in one argument position compared to every other argument position), a correlation heatmap comparing predictions across each pair of models, and a file `args.yaml` containing argument sets formatted so that they can be copy-pasted into a tuning configuration file and used for fine-tuning. Arguments are considered less biased if the average log odds ratio of their occurring in the position of one argument type versus another is closer to 0.
`cls_emb.py` fits a Support Vector Machine to the embeddings of a model's tokens to determine whether they are linearly separable. Originally, this was included because we wanted to ensure that it was possible to linearly separate the subject and object nouns generated by `check_args.py`. However, this is not really needed in practice: as it turns out, it is quite trivial to linearly separate two groups of six embeddings in a 768-dimensional embedding space.
- `model` (`distilbert`): which model's embeddings to classify.
- `tuning` (`newverb_transitive_ext`): which tuning file to use to find sets of argument tokens to classify. You can set `tuning.which_args` to change which set of arguments' embeddings to fit the SVM to.
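As a small sketch, the corresponding `cls_emb.yaml` presumably just selects these two config groups (the exact file contents are an assumption):

```yaml
# Hypothetical cls_emb.yaml
defaults:
  - model: distilbert
  - tuning: newverb_transitive_ext
```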
In order to tune a new model, run `python tune.py`. To override the default choice for any option, specify it as a `key=value` argument to the tuning script, such as:

`python tune.py model=bert tuning=untuned`

This will pull a vanilla BERT model from HuggingFace, do nothing with it (i.e., tune it on an empty dataset), and save the randomly initialized weights of any tokens marked to be masked to the outputs directory.

When you actually tune a model, a file containing the weights of each token in the tuning configuration's `to_mask` list for each epoch (including the randomly initialized weights) will be saved in the outputs directory. In addition, a CSV file containing various training metrics for each epoch will be saved in the outputs directory, along with a PDF containing plots of these metrics during fine-tuning.
In order to evaluate a single model's performance, run `python eval.py`. To override the default choices for `dir` and evaluation `data`, specify them as `key=value` arguments to the evaluation script, as above. This outputs CSVs and a pickle file containing a dataframe that reports the evaluation results in a subdirectory of each model's checkpoint dir, as well as various plots.
Outputs include files with the following:
- the model's predictions regarding the log odds ratio of each token in its expected position compared to unexpected positions. Plots comparing these for the reference sentence type against every other sentence type are output as well.
- accuracy information for each sentence type compared to the reference sentence type. Accuracy is defined as having a log odds ratio >0 for a novel token in the expected position compared to unexpected positions.
- the most similar tokens to the novel tokens after fine-tuning, using cosine similarity as a measure. The cosine similarities of the tokens in the target groups are also included. If there are at least two novel tokens, plots of these are included as well.
- the first two tSNE components of the novel tokens, the first `num_tsne_words` tokens in the model vocabulary, and the target tokens. Plots of these are included as well.
- if a comparison dataset is provided, information about the KL divergences for the `comparison_n_exs` examples, along with a histogram.
- saved model outputs for the prediction sentences, as well as a CSV with information about these.
If `summarize=true`, files summarizing this information across multiple models will be output in a subdirectory of `dir`. The cosine similarity summary reports the average cosine similarity between the novel tokens and the target tokens across models. The log odds ratios summary condenses the models' predictions to a single point for each novel token type, with an accuracy file generated from these averages showing the proportion of models whose means for each token type are accurate. Plots of these are included as well.