Code, training data, and models to accompany the paper: Masis, Neal, Green, and O'Connor. "Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties." Field Matters Workshop at COLING, 2022.
Contact: Tessa Masis ([email protected]), Brendan O'Connor ([email protected])
-
data/
documentation.md
: dataset documentationCGEdit/
AAE.tsv
,IndE.tsv
: training sets for AAE and IndE generated via CGEdit method
CGEdit-ManualGen/
AAE.tsv
,IndE.tsv
: training sets for AAE and IndE generated via both ManualGen and CGEdit
-
code/
train.py
: code to fine-tune BERT-variant modeleval.py
: code to evaluate fine-tuned modelpreprocessCORAAL.py
: code used to preprocess CORAAL transcript files for extrinsic evaluation in the paper (see Section 6); note that only interviewee speech files were used for our evaluation, not interviewer speech files- Note that the above scripts may require modifications in order to run on your computer
tutorial.ipynb
: copy of the tutorial walking through how to use our fine-tuned models (see below, section "Using our models")
Run the train script with the contrast set generation method ('CGEdit' or 'CGEdit-ManualGen') as the first argument and the language ('AAE' or 'IndE') as the second argument. For example:
python train.py CGEdit-ManualGen AAE
The eval script will print a prediction in [0, 1] for each linguistic feature, for each test example.
Run the eval script with the contrast set generation method used for training ('CGEdit' or 'CGEdit-ManualGen') as the first argument, the language ('AAE' or 'IndE') as the second argument, and the test set filename as the third argument (not included in this repo). For example:
python eval.py CGEdit-ManualGen AAE testFileName
To access our fine-tuned model trained on the data in CGEdit-ManualGen/AAE.tsv
for 17 African American English features, please see the Google Colab notebook here (or see code/tutorial.ipynb
in this repo). This tutorial will walk you through how to access and use the model for linguistic feature detection.
Please contact us if you would like to access our model fine-tuned on the data in CGEdit-ManualGen/IndE.tsv
for 10 Indian English features.