Evaluation

A transformation is most effective when it can either reveal potential failures in a model or act as a data augmenter, generating additional training data.
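As a concrete, purely illustrative example, the following self-contained sketch implements a butter-fingers-style typo perturbation. It does not use the repository's actual transformation interface; it is just the idea in miniature:

```python
import random

# Abridged keyboard-neighbour map; a real transformation would cover the full layout.
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp", "u": "yhji",
}

def butter_fingers(text: str, prob: float = 0.1, seed: int = 0) -> str:
    """Randomly replace characters with adjacent keys to simulate typos."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbours = QWERTY_NEIGHBOURS.get(ch.lower())
        out.append(rng.choice(neighbours) if neighbours and rng.random() < prob else ch)
    return "".join(out)

# As a stress test: compare model predictions on text vs. butter_fingers(text).
# As an augmenter: add (butter_fingers(text), label) pairs to the training set.
print(butter_fingers("the movie was surprisingly good"))
```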


Evaluation Guidelines and Scripts

To evaluate how effective a transformation is, simply call evaluate.py as follows:

python evaluate.py -t ButterFingersPerturbation 

Depending on the interface of the transformation, evaluate.py transforms every example of a predefined dataset and evaluates how well the model performs on these new examples. The default datasets and models are mentioned here; these dataset and model combinations are mapped to each task. The first task you specify in the transformation's tasks field is used by default. The task (-task), dataset (-d), and model (-m) can be overridden in the following way:

python evaluate.py -t ButterFingersPerturbation -task "TEXT_CLASSIFICATION" -m "aychang/roberta-base-imdb" -d "imdb"
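Conceptually, the loop inside evaluate.py for text classification looks roughly like the sketch below. This is a simplification rather than the script itself; it reuses the butter_fingers sketch above, and the test-slice size and label handling are assumptions:

```python
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("imdb", split="test[:100]")   # small slice for speed
clf = pipeline("text-classification", model="aychang/roberta-base-imdb")
# Assumes the model's label order matches the dataset's integer labels.
id2label = clf.model.config.id2label

def accuracy(texts, labels):
    preds = clf(texts, truncation=True)
    return sum(p["label"] == id2label[l] for p, l in zip(preds, labels)) / len(labels)

texts, labels = dataset["text"], dataset["label"]
perturbed = [butter_fingers(t) for t in texts]       # any transformation fits here

before, after = accuracy(texts, labels), accuracy(perturbed, labels)
print(f"{before:.1%} -> {after:.1%} ({after - before:+.1%})")   # e.g. 95.0% -> 93.0% (-2.0%)
```

This mirrors the before -> after (delta) format of the leaderboard cells below.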

Note that it's highly possible that some of the evaluate_* functionality won't work, owing to the variety of dataset and model formats. We've tried to mitigate this by using commonly used HuggingFace models and datasets. If you wish to evaluate on models and datasets apart from those mentioned here, you are welcome to do so. Do mention in your README how they turned out!
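For instance, a run against a different sentiment model and dataset (the names below are merely illustrative, not part of the defaults) could look like:

python evaluate.py -t ButterFingersPerturbation -task "TEXT_CLASSIFICATION" -m "textattack/bert-base-uncased-rotten-tomatoes" -d "rotten_tomatoes"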

Leaderboard

Here we provide a leaderboard for each default task, obtained by executing transformations on typical models for that task. Each cell shows the model's score before and after the transformation, with the change in parentheses. If you would like to join the leaderboard party, we encourage you to submit pull requests!

Text Classification

| Transformation | roberta-base-SST-2 | bert-base-uncased-QQP | roberta-large-mnli | roberta-base-imdb |
| --- | --- | --- | --- | --- |
| BackTranslation | 94.0->91.0 (-3.0) | 92.0->90.0 (-2.0) | 91.0->87.0 (-4.0) | 95.0->92.0 (-3.0) |
| ButterFingersPerturbation | 94.0->89.0 (-5.0) | 92.0->89.0 (-3.0) | 91.0->88.0 (-3.0) | 95.0->93.0 (-2.0) |
| ChangePersonNamedEntities | 94.0->94.0 (0.0) | 92.0->92.0 (0.0) | 91.0->89.0 (-2.0) | 95.0->95.0 (0.0) |
| CloseHomophonesSwap | 94.0->91.0 (-3.0) | 92.0->88.0 (-4.0) | 91.0->89.0 (-2.0) | 95.0->96.0 (1.0) |
| DiscourseMarkerSubstitution | 94.0->94.0 (0.0) | 92.0->92.0 (0.0) | 91.0->91.0 (0.0) | 95.0->95.0 (0.0) |
| MixedLanguagePerturbation | 94.0->90.0 (-4.0) | 92.0->86.0 (-6.0) | 91.0->86.0 (-5.0) | 95.0->91.0 (-4.0) |
| PunctuationWithRules | 94.0->94.0 (0.0) | 92.0->92.0 (0.0) | 91.0->91.0 (0.0) | 95.0->90.0 (-5.0) |
| ReplaceNumericalValues | 94.0->94.0 (0.0) | 92.0->92.0 (0.0) | 91.0->90.0 (-1.0) | 95.0->95.0 (0.0) |
| SentenceReordering | 94.0->95.0 (1.0) | 92.0->93.0 (1.0) | nan | 95.0->94.0 (-1.0) |

Default models and datasets:

Text-to-Text Generation

Text Tagging

Dialog Action to Text

Table-to-Text

RDF-to-Text

RDF-to-RDF

Question Answering

| Transformation | deepset/roberta-base-squad2 | bert-large-uncased-whole-word-masking-finetuned-squad |
| --- | --- | --- |
| RedundantContextForQa | 5.6 | -1.9 |

Question Generation

AMR-to-Text

End-to-End Task