A transformation is most effective when it either reveals potential failures in a model or acts as a data augmenter that generates more training data.
To evaluate how good a transformation is, you can simply call evaluate.py in the following manner:
python evaluate.py -t ButterFingersPerturbation
Depending on the interface of the transformation, evaluate.py transforms every example of a pre-defined dataset and evaluates how well the model performs on the transformed examples. The default datasets and models are mentioned here. Each dataset and model combination is mapped to a task. The first task you specify in the tasks field is used by default.
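The idea behind this evaluation can be illustrated with a minimal sketch. It is not the code in evaluate.py: `toy_model`, `toy_typo`, `accuracy`, and `evaluate_transformation` are hypothetical stand-ins, assumed here only to show how accuracy on the original and the transformed examples is compared.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label)


def accuracy(model: Callable[[str], int], data: List[Example]) -> float:
    """Percentage of examples the model labels correctly."""
    correct = sum(1 for text, label in data if model(text) == label)
    return 100.0 * correct / len(data)


def evaluate_transformation(model, transform, data):
    """Return (original accuracy, accuracy on transformed data, difference)."""
    original = accuracy(model, data)
    # Transform only the text; the label is assumed to stay valid.
    transformed = [(transform(text), label) for text, label in data]
    perturbed = accuracy(model, transformed)
    return original, perturbed, perturbed - original


# Hypothetical stand-ins for a real model and a real transformation.
def toy_model(text: str) -> int:
    return 1 if "good" in text.lower() else 0


def toy_typo(text: str) -> str:
    # Imitates a keyboard slip by replacing "oo" with "po".
    return text.replace("oo", "po")


data = [("a good movie", 1), ("really good", 1), ("a dull film", 0), ("bad plot", 0)]
result = evaluate_transformation(toy_model, toy_typo, data)
print(result)
```

A drop in accuracy on the transformed examples suggests the transformation exposes a weakness in the model, which is what the leaderboard entries report.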
The task (-task), dataset (-d) and model (-m) can be overridden in the following way.
python evaluate.py -t ButterFingersPerturbation -task "TEXT_CLASSIFICATION" -m "aychang/roberta-base-imdb" -d "imdb"
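To run the same override for several transformations, the command lines can be assembled in a small script. This is only a sketch: `build_command` is a hypothetical helper, it uses just the flags shown above (-t, -task, -m, -d), and it assumes each listed transformation supports the chosen task.

```python
import shlex


def build_command(transformation, task=None, model=None, dataset=None):
    """Assemble an evaluate.py command line from the flags shown above."""
    cmd = ["python", "evaluate.py", "-t", transformation]
    if task is not None:
        cmd += ["-task", task]
    if model is not None:
        cmd += ["-m", model]
    if dataset is not None:
        cmd += ["-d", dataset]
    return cmd


commands = [
    build_command(
        name,
        task="TEXT_CLASSIFICATION",
        model="aychang/roberta-base-imdb",
        dataset="imdb",
    )
    for name in ["ButterFingersPerturbation", "BackTranslation"]
]

# Print the commands instead of running them, so they can be reviewed first.
for cmd in commands:
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` from the repository root to actually launch the evaluation.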
Note that some of the evaluate_* functionality may not work, owing to the variety of dataset and model formats. We've tried to mitigate this by using models and datasets from HuggingFace. If you wish to evaluate on models and datasets apart from those mentioned here, you are welcome to do so. Do mention in your README how they turned out!
Here, we provide a leaderboard for each default task, obtained by executing the transformations on typical models for that task. If you would like to join the leaderboard party, we encourage you to submit pull requests!
Transformation | roberta-base-SST-2 | bert-base-uncased-QQP | roberta-large-mnli | roberta-base-imdb |
---|---|---|---|---|
BackTranslation | 94.0->91.0 (-3.0) | 92.0->90.0 (-2.0) | 91.0->87.0 (-4.0) | 95.0->92.0 (-3.0) |
ButterFingersPerturbation | 94.0->89.0 (-5.0) | 92.0->89.0 (-3.0) | 91.0->88.0 (-3.0) | 95.0->93.0 (-2.0) |
ChangePersonNamedEntities | 94.0->94.0 (0.0) | 92.0->92.0 (0.0) | 91.0->89.0 (-2.0) | 95.0->95.0 (0.0) |
CloseHomophonesSwap | 94.0->91.0 (-3.0) | 92.0->88.0 (-4.0) | 91.0->89.0 (-2.0) | 95.0->96.0 (1.0) |
DiscourseMarkerSubstitution | 94.0->94.0 (0.0) | 92.0->92.0 (0.0) | 91.0->91.0 (0.0) | 95.0->95.0 (0.0) |
MixedLanguagePerturbation | 94.0->90.0 (-4.0) | 92.0->86.0 (-6.0) | 91.0->86.0 (-5.0) | 95.0->91.0 (-4.0) |
PunctuationWithRules | 94.0->94.0 (0.0) | 92.0->92.0 (0.0) | 91.0->91.0 (0.0) | 95.0->90.0 (-5.0) |
ReplaceNumericalValues | 94.0->94.0 (0.0) | 92.0->92.0 (0.0) | 91.0->90.0 (-1.0) | 95.0->95.0 (0.0) |
SentenceReordering | 94.0->95.0 (1.0) | 92.0->93.0 (1.0) | nan | 95.0->94.0 (-1.0) |
Default models and datasets:
- SST-2: textattack/roberta-base-SST-2
- IMDB: textattack/roberta-base-imdb
- QQP: textattack/bert-base-uncased-QQP
- MNLI: roberta-large-mnli
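Each cell in the leaderboard above reads as original score, score after the transformation, and the difference in parentheses. The sketch below shows one way to produce and parse that cell format when preparing a pull request. `format_cell` and `parse_cell` are hypothetical helpers and not part of the repository.

```python
import re

# Matches cells such as "94.0->91.0 (-3.0)".
CELL = re.compile(r"(-?\d+(?:\.\d+)?)->(-?\d+(?:\.\d+)?) \((-?\d+(?:\.\d+)?)\)")


def format_cell(original, perturbed):
    """Render two scores in the leaderboard's 'a->b (delta)' style."""
    # Adding 0.0 turns a rounded -0.0 into 0.0 so unchanged scores print as (0.0).
    delta = round(perturbed - original, 1) + 0.0
    return f"{original:.1f}->{perturbed:.1f} ({delta:.1f})"


def parse_cell(cell):
    """Return (original, perturbed, delta) as floats, or None for cells like 'nan'."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


print(format_cell(94.0, 91.0))
print(parse_cell("95.0->96.0 (1.0)"))
```

Parsing a cell and checking that the third value equals the second minus the first is a quick consistency check before submitting new rows.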
Transformation | deepset/roberta-base-squad2 | bert-large-uncased-whole-word-masking-finetuned-squad |
---|---|---|
RedundantContextForQa | 5.6 | -1.9 |