This repository provides code for the SemEval-2020 Task 11 competition (Detection of Propaganda Techniques in News Articles).
The competition webpage: https://propaganda.qcri.org/semeval2020-task11/
The architecture of the models is described in our paper "aschern at SemEval-2020 Task 11: It Takes Three to Tango: RoBERTa, CRF, and Transfer Learning".
Install the requirements:
    pip install -r ./requirements.txt
- configs: YAML configs for the system
- datasets: contains the task datasets, which can be downloaded from the team competition webpage
- results: the folder for submissions
- span_identification: code for the SI task
  - ner: pytorch-transformers RoBERTa model with CRF (end-to-end)
  - dataset: the scripts for loading and preprocessing the source dataset
  - submission: the scripts for obtaining and evaluating results
- technique_classification: code for the TC task (the folder has the same structure as span_identification)
- tools: tools provided by the competition organizers; contain useful functions for reading datasets and evaluating submissions
- visualization_example: example of visualization of results for both tasks
All commands are run from the root directory of the repository.
- Configure the configs/si_config.yml file if needed. data_dir is the path to the cache of the original train/eval sub-datasets and their BIO versions. Besides the config, arguments can also be specified on the command line.
- Split the dataset for local evaluation (with --overwrite_cache, existing files are replaced). This produces files with BIO-format span tagging (B-PROP, I-PROP, O) in your --data_dir.
    python -m span_identification --config configs/si_config.yml --split_dataset --overwrite_cache
- Train and evaluate the model (the model parameters are specified in the config; you need to change the paths). The use of CRF is controlled by the --use_crf flag. For the first run you can use --model_name_or_path roberta-large.
    python -m span_identification --config configs/si_config.yml --do_train --do_eval
- Apply the trained model to the test_file (in BIO format) specified in the config. The file is created from the test_data_folder folder if it is missing or if the --overwrite_cache flag is specified.
    python -m span_identification --config configs/si_config.yml --do_predict
- Create the submission file output_file in the result folder. Spans are obtained from the token-labeling result files listed in predicted_labels_files. At the aggregation stage, the span prediction results are simply joined.
    python -m span_identification --config configs/si_config.yml --create_submission_file
- If you have the correct markup in the test_file or a gold --gold_annot_file (source competition format), you can run the competition evaluation script.
    python -m span_identification --config configs/si_config.yml --do_eval_spans
- Use visualization_example/visualization.ipynb if you want to visualize labels.
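The BIO tagging produced by the dataset split step, and the reverse step of obtaining spans from token labels when creating a submission, can be illustrated with a minimal sketch. It is not the repository's implementation: the whitespace tokenizer, the function names (tokenize_with_offsets, spans_to_bio, bio_to_spans) and the example sentence are all assumptions made for illustration, and the actual code works on RoBERTa tokens.

```python
import re


def tokenize_with_offsets(text):
    """Hypothetical tokenizer: split on whitespace, keep character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(text, spans):
    """Tag each token B-PROP / I-PROP / O given character spans [start, end)."""
    tagged = []
    prev_span = None
    for token, start, end in tokenize_with_offsets(text):
        # A token is assigned to the first span it overlaps, if any.
        span = next((s for s in spans if start < s[1] and end > s[0]), None)
        if span is None:
            tag = "O"
        elif span == prev_span:
            tag = "I-PROP"  # continues the span of the previous token
        else:
            tag = "B-PROP"  # first token of a new span
        tagged.append((token, start, end, tag))
        prev_span = span
    return tagged


def bio_to_spans(tagged):
    """Recover character spans from (token, start, end, tag) tuples."""
    spans = []
    current = None
    for _, start, end, tag in tagged:
        if tag == "B-PROP" or (tag == "I-PROP" and current is None):
            if current is not None:
                spans.append(tuple(current))
            current = [start, end]
        elif tag == "I-PROP":
            current[1] = end  # extend the open span
        else:
            if current is not None:
                spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


if __name__ == "__main__":
    text = "They are enemies of the people and must be stopped"
    start = text.index("enemies")
    end = text.index("people") + len("people")
    tagged = spans_to_bio(text, [(start, end)])
    for token, _, _, tag in tagged:
        print(token, tag)
    print(bio_to_spans(tagged))
```

Keeping the character offsets next to each token is what makes the round trip possible: the submission format is span-based, so predicted token labels have to be mapped back to character positions.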
Here you need almost the same commands and settings as in the SI task.
- Configure the configs/tc_config.yml file if needed.
- Split the dataset for local evaluation.
    python -m technique_classification --config configs/tc_config.yml --split_dataset --overwrite_cache
- Train and evaluate the model. We used two setups, with and without the flags --join_embeddings --use_length (the setup with the flags gives our RoBERTa-Joined). For the first run you can use --model_name_or_path roberta-large.
    python -m technique_classification --config configs/tc_config.yml --do_train --do_eval
  or, distributed:
    CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 technique_classification --config configs/tc_config.yml --do_train --do_eval
- Apply the trained model to the test_file specified in the config. The file is created from the test_data_folder folder and the test_template_labels_path file if it is missing or if the --overwrite_cache flag is specified.
    python -m technique_classification --config configs/tc_config.yml --do_predict --join_embeddings --use_length
- Create the submission file output_file. It will combine predictions from the list predicted_logits_files with coefficients specified in --weights (optional) and apply some post-processing.
    python -m technique_classification --config configs/tc_config.yml --create_submission_file
- If you have the correct markup in the test_file or a gold --test_labels_path (source competition format), you can check your accuracy (micro F1-score) and the per-class F1-scores.
    python -m technique_classification --config configs/tc_config.yml --eval_submission
- Use visualization_example/visualization.ipynb if you want to visualize labels.
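The combination of several prediction files with optional weights, as in the submission step above, can be sketched as a weighted sum of per-class logits followed by an argmax. This is only an illustration under stated assumptions: the function names (combine_logits, predict_labels), the in-memory list format and the numbers are hypothetical, the repository may combine the predictions differently, and its post-processing is not reproduced here.

```python
def combine_logits(logits_per_model, weights=None):
    """Weighted sum of logits.

    logits_per_model: list of models, each a list of per-example logit rows.
    weights: one coefficient per model; defaults to a plain average (assumption).
    """
    n_models = len(logits_per_model)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    if len(weights) != n_models:
        raise ValueError("expected one weight per model")
    n_examples = len(logits_per_model[0])
    combined = []
    for i in range(n_examples):
        rows = [model[i] for model in logits_per_model]
        # zip(*rows) groups the logits of all models for one class.
        combined.append(
            [sum(w * v for w, v in zip(weights, column)) for column in zip(*rows)]
        )
    return combined


def predict_labels(combined, label_names):
    """Pick the label with the highest combined logit for each example."""
    return [
        label_names[max(range(len(row)), key=row.__getitem__)] for row in combined
    ]


if __name__ == "__main__":
    # Illustrative two-class setup; the real task has more technique labels.
    labels = ["Loaded_Language", "Doubt"]
    model_a = [[2.0, 0.0], [0.0, 1.0]]
    model_b = [[0.0, 1.0], [0.0, 3.0]]
    print(predict_labels(combine_logits([model_a, model_b]), labels))
    print(predict_labels(combine_logits([model_a, model_b], [0.25, 0.75]), labels))
```

In this toy input the two models disagree on the first example, so changing the weights changes its predicted label, which is the effect the --weights option is meant to control.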
Our pretrained RoBERTa-CRF (SI task) and RoBERTa-Joined (TC task) models are available on Google Drive.