firojalam/COVID-19-tweets-for-check-worthiness

COVID-19 Infodemic Twitter Dataset

This repository contains a dataset of tweets annotated with fine-grained labels related to disinformation about COVID-19. The labels answer seven different questions that are of interest to journalists, fact-checkers, social media platforms, policymakers, and society as a whole. There are annotations for Arabic and English.

To label the dataset, we prepared comprehensive annotation guidelines [1], which may also be useful for similar tasks in other domains. Moreover, we launched an annotation platform for labeling tweets, where anyone can contribute and help increase the size of the dataset. We will update the dataset here periodically.

Help the community to label more data

We also invite you to join us to label tweets related to COVID-19 disinformation.

To annotate, we recommend that you register with Micromappers and then log in. However, you can also annotate without any registration.

  1. Go to either of the following links:
     • English
     • Arabic
     Then click either Start Contributing Now or Contribute. This leads to a page with annotation instructions. Scroll down and click Start contributing.
  2. You can now start annotating.

An example of the annotation page looks as follows: Example

Questions with Labels

Below is the list of questions and their possible labels (answers). See the papers listed below or the Micromappers links above for the detailed definitions in the annotation guidelines.

1. Does the tweet contain a verifiable factual claim?
Labels:

  • YES: if it contains a verifiable factual claim;
  • NO: if it does not contain a verifiable factual claim;
  • Don’t know or can’t judge: the content of the tweet does not have enough information to make a judgment. We recommend using this label when the content of the tweet is not understandable at all, for example when it uses a language other than English or references that are difficult to understand.

2. To what extent does the tweet appear to contain false information?
Labels:

  1. NO, definitely contains no false information
  2. NO, probably contains no false information
  3. Not sure
  4. YES, probably contains false information
  5. YES, definitely contains false information

3. Will the tweet’s claim have an effect on or be of interest to the general public?
Labels:

  1. NO, definitely not of interest
  2. NO, probably not of interest
  3. Not sure
  4. YES, probably of interest
  5. YES, definitely of interest

4. To what extent does the tweet appear to be harmful to society, person(s), company(s) or product(s)?
Labels:

  1. NO, definitely not harmful
  2. NO, probably not harmful
  3. Not sure
  4. YES, probably harmful
  5. YES, definitely harmful

5. Do you think that a professional fact-checker should verify the claim in the tweet?
Labels:

  1. NO, no need to check
  2. NO, too trivial to check
  3. YES, not urgent
  4. YES, very urgent
  5. Not sure

6. Is the tweet harmful for society and why?
Labels:

  1. NO, not harmful
  2. NO, joke or sarcasm
  3. Not sure
  4. YES, panic
  5. YES, xenophobic, racist, prejudices, or hate-speech
  6. YES, bad cure
  7. YES, rumor or conspiracy
  8. YES, other

7. Do you think that this tweet should get the attention of a government entity?
Labels:

  1. NO, not interesting
  2. Not sure
  3. YES, categorized as in question 6
  4. YES, other
  5. YES, blame authorities
  6. YES, contains advice
  7. YES, calls for action
  8. YES, discusses action taken
  9. YES, discusses cure
  10. YES, asks question

List of Versions

===================
v1.0 [2020/05/01]: initial distribution of the annotated dataset

  • English data: 504 tweets
  • Arabic data: 218 tweets

Contents of the Distribution

===============================================

Directory Structure

=======================

The distribution contains this Readme.txt file and the following two sub-directories:

  1. "English": This directory contains one tab-separated values (TSV) file and one JSON file. The TSV file stores the ground-truth annotations for the aforementioned tasks; its format is described in detail below. Each line in the JSON file holds the data for a single tweet, stored in JSON format (as downloaded from Twitter).

  2. "Arabic": Like "English", this directory contains one TSV file and one JSON file in the same format.
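
Since each line of the JSON file holds one tweet, the file can be read line by line. The following is a minimal sketch, not part of the distribution: the in-memory sample and the `id_str` and `text` fields are assumptions for illustration, and the actual fields depend on what Twitter returned at download time.

```python
import io
import json


def read_tweets(lines):
    """Yield one parsed tweet object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-line sample standing in for the JSON file
# (field names are assumptions, not confirmed by this README).
sample = io.StringIO(
    '{"id_str": "1", "text": "first tweet"}\n'
    '{"id_str": "2", "text": "second tweet"}\n'
)
tweets = list(read_tweets(sample))
```

To read a real file, pass an open file handle (opened with `encoding="utf-8"`) instead of the sample.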

Format of the TSV files under the "annotations" directory

Each TSV file in this directory contains the following columns, separated by a tab:

  • tweet_id: corresponds to the actual tweet id from Twitter.
  • tweet_text: corresponds to the original text of a given tweet as downloaded from Twitter.
  • q*_label (columns 3-9): correspond to the labels for questions 1 to 7.

Note that there are NA (i.e., null) entries in the TSV files, which simply indicate "not applicable" cases. We assign NA to questions 2 to 5 when question 1 is labeled as NO.
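
The column layout and the NA convention can be handled with Python's standard `csv` module. This is a sketch under stated assumptions: it assumes the files have a header row, that the question columns are named `q1_label` to `q7_label`, that missing values are the literal string `NA`, and that label values look like those in the Statistics section. The sample rows are made up for illustration.

```python
import csv
import io

# Assumed column names, following the q*_label pattern described above.
QUESTION_COLUMNS = [f"q{i}_label" for i in range(1, 8)]


def read_annotations(handle):
    """Read tab-separated annotations, mapping 'NA' labels to None."""
    rows = []
    # QUOTE_NONE keeps quote characters inside tweet text untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for col in QUESTION_COLUMNS:
            if row[col] == "NA":
                row[col] = None
        rows.append(row)
    return rows


# Hypothetical sample: the second tweet has no factual claim (Q1 = no),
# so questions 2 to 5 are NA.
sample = io.StringIO(
    "tweet_id\ttweet_text\t" + "\t".join(QUESTION_COLUMNS) + "\n"
    "1\tsome claim\tyes\t2_no_probably_contains_no_false_info\t"
    "4_yes_probably_of_interest\t1_no_definitely_not_harmful\t"
    "yes_not_urgent\tno_not_harmful\tno_not_interesting\n"
    "2\tjust chatting\tno\tNA\tNA\tNA\tNA\tno_not_harmful\tno_not_interesting\n"
)
rows = read_annotations(sample)
```

Mapping NA to `None` makes it easy to skip non-applicable rows when training or evaluating a model for questions 2 to 5.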

Examples

============

Please don't take hydroxychloroquine (Plaquenil) plus Azithromycin for #COVID19 UNLESS your doctor prescribes it. Both drugs affect the QT interval of your heart and can lead to arrhythmias and sudden death, especially if you are taking other meds or have a heart condition.
Labels:

  1. Q1: Yes;
  2. Q2: NO: probably contains no false info
  3. Q3: YES: definitely of interest
  4. Q4: NO: probably not harmful
  5. Q5: YES:very-urgent
  6. Q6: NO:not-harmful
  7. Q7: YES:discusses_cure

BREAKING: @MBuhari’s Chief Of Staff, Abba Kyari, Reportedly Sick, Suspected Of Contracting #Coronavirus | Sahara Reporters A top government source told SR on Monday that Kyari has been seriously “down” since returning from a trip abroad. READ MORE: https://t.co/Acy5NcbMzQ https://t.co/kStp4cmFlr.
Labels:

  1. Q1: Yes;
  2. Q2: NO: probably contains no false info
  3. Q3: YES: definitely of interest
  4. Q4: NO: definitely not harmful
  5. Q5: YES:not-urgent
  6. Q6: YES:rumor
  7. Q7: YES:classified_as_in_question_6

Statistics

=============
Some statistics about the dataset

English tweets:

  1. Q1 = 504 labeled tweets
     • no 209
     • yes 295
  2. Q2 = 295 labeled tweets
     • 1_no_definitely_contains_no_false_info 47
     • 2_no_probably_contains_no_false_info 171
     • 3_not_sure 40
     • 4_yes_probably_contains_false_info 25
     • 5_yes_definitely_contains_false_info 12
  3. Q3 = 295 labeled tweets
     • 1_no_definitely_not_of_interest 9
     • 2_no_probably_not_of_interest 44
     • 3_not_sure 7
     • 4_yes_probably_of_interest 177
     • 5_yes_definitely_of_interest 58
  4. Q4 = 295 labeled tweets
     • 1_no_definitely_not_harmful 106
     • 2_no_probably_not_harmful 66
     • 3_not_sure 2
     • 4_yes_probably_harmful 67
     • 5_yes_definitely_harmful 54
  5. Q5 = 295 labeled tweets
     • no_no_need_to_check 77
     • no_too_trivial_to_check 57
     • yes_not_urgent 112
     • yes_very_urgent 49
  6. Q6 = 504 labeled tweets
     • no_joke_or_sarcasm 62
     • no_not_harmful 333
     • not_sure 2
     • yes_bad_cure 3
     • yes_other 25
     • yes_panic 23
     • yes_rumor_conspiracy 42
     • yes_xenophobic_racist_prejudices_or_hate_speech 14
  7. Q7 = 504 labeled tweets
     • no_not_interesting 319
     • not_sure 6
     • yes_asks_question 2
     • yes_blame_authorities 81
     • yes_calls_for_action 8
     • yes_classified_as_in_question_6 34
     • yes_contains_advice 9
     • yes_discusses_action_taken 12
     • yes_discusses_cure 5
     • yes_other 28

Arabic tweets:

  1. Q1 = 218 labeled tweets
     • no 78
     • yes 140
  2. Q2 = 140 labeled tweets
     • 1_no_definitely_contains_no_false_info 31
     • 2_no_probably_contains_no_false_info 62
     • 3_not_sure 5
     • 4_yes_probably_contains_false_info 40
     • 5_yes_definitely_contains_false_info 2
  3. Q3 = 140 labeled tweets
     • 1_no_definitely_not_of_interest 1
     • 2_no_probably_not_of_interest 5
     • 3_not_sure 9
     • 4_yes_probably_of_interest 76
     • 5_yes_definitely_of_interest 49
  4. Q4 = 140 labeled tweets
     • 1_no_definitely_not_harmful 68
     • 2_no_probably_not_harmful 21
     • 3_not_sure 3
     • 4_yes_probably_harmful 46
     • 5_yes_definitely_harmful 2
  5. Q5 = 140 labeled tweets
     • no_no_need_to_check 22
     • no_too_trivial_to_check 55
     • yes_not_urgent 48
     • yes_very_urgent 15
  6. Q6 = 218 labeled tweets
     • no_joke_or_sarcasm 2
     • no_not_harmful 159
     • yes_bad_cure 1
     • yes_other 5
     • yes_panic 12
     • yes_rumor_conspiracy 33
     • yes_xenophobic_racist_prejudices_or_hate_speech 6
  7. Q7 = 218 labeled tweets
     • no_not_interesting 163
     • yes_blame_authorities 13
     • yes_calls_for_action 1
     • yes_classified_as_in_question_6 30
     • yes_contains_advice 1
     • yes_discusses_cure 6
     • yes_other 4
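
Because questions 2 to 5 are only labeled when question 1 is answered "yes", their totals should equal the Q1 "yes" count for each language. The sketch below checks this using the counts copied from the Statistics section; the function name `mismatches` is just an illustrative helper, not part of the distribution.

```python
# Counts copied from the Statistics section above.
q1_yes = {"English": 295, "Arabic": 140}
q2_to_q5 = {
    "English": {
        "Q2": [47, 171, 40, 25, 12],
        "Q3": [9, 44, 7, 177, 58],
        "Q4": [106, 66, 2, 67, 54],
        "Q5": [77, 57, 112, 49],
    },
    "Arabic": {
        "Q2": [31, 62, 5, 40, 2],
        "Q3": [1, 5, 9, 76, 49],
        "Q4": [68, 21, 3, 46, 2],
        "Q5": [22, 55, 48, 15],
    },
}


def mismatches(yes_counts, per_question):
    """Return (language, question, total) triples whose total differs
    from that language's Q1 'yes' count."""
    found = []
    for language, questions in per_question.items():
        for question, counts in questions.items():
            total = sum(counts)
            if total != yes_counts[language]:
                found.append((language, question, total))
    return found


print(mismatches(q1_yes, q2_to_q5))  # → []
```

The same check can be rerun against counts computed from a newer version of the TSV files to catch rows where the NA convention was not followed.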

Download

To download the dataset, please fill out this form.

Publications:

Please cite the following papers if you use the data or the annotation guidelines:

  1. Firoj Alam, Fahim Dalvi, Shaden Shaar, Nadir Durrani, Hamdy Mubarak, Alex Nikolov, Giovanni Da San Martino, Ahmed Abdelali, Hassan Sajjad, Kareem Darwish, Preslav Nakov, "Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms", Proceedings of the International AAAI Conference on Web and Social Media (Vol. 15, pp. 913-922), 2021. download.
  2. Firoj Alam and Shaden Shaar and Fahim Dalvi and Hassan Sajjad and Alex Nikolov and Hamdy Mubarak and Giovanni Da San Martino and Ahmed Abdelali and Nadir Durrani and Kareem Darwish and Abdulaziz Al-Homaid and Wajdi Zaghouani and Tommaso Caselli and Gijs Danoe and Friso Stolk and Britt Bruntink and Preslav Nakov, "Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society", Findings of EMNLP 2021, download.
@InProceedings{alam2020call2arms,
  title		= {Fighting the {COVID}-19 Infodemic in Social Media: A
		  Holistic Perspective and a Call to Arms},
  author	= {Alam, Firoj and Dalvi, Fahim and Shaar, Shaden and
		  Durrani, Nadir and Mubarak, Hamdy and Nikolov, Alex and {Da
		  San Martino}, Giovanni and Abdelali, Ahmed and Sajjad,
		  Hassan and Darwish, Kareem and Nakov, Preslav},
  year		= {2021},
  pages		= {913-922},
  month	= {May},
  volume	= {15},
  booktitle	= {Proceedings of the International {AAAI} Conference on Web
		  and Social Media},
  series	= {ICWSM~'21},
  url		= {https://ojs.aaai.org/index.php/ICWSM/article/view/18114}
}
@inproceedings{alam2020fighting,
    title={Fighting the {COVID}-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society},
    author={Firoj Alam and Shaden Shaar and Fahim Dalvi and Hassan Sajjad and Alex Nikolov and Hamdy Mubarak and Giovanni Da San Martino and Ahmed Abdelali and Nadir Durrani and Kareem Darwish and Abdulaziz Al-Homaid and Wajdi Zaghouani and Tommaso Caselli and Gijs Danoe and Friso Stolk and Britt Bruntink and Preslav Nakov},
    booktitle = {Findings of EMNLP 2021},
    year={2021},
}

Credits

  • Firoj Alam, Qatar Computing Research Institute, HBKU
  • Shaden Shaar, Qatar Computing Research Institute, HBKU
  • Alex Nikolov, Sofia University
  • Hamdy Mubarak, Qatar Computing Research Institute, HBKU
  • Giovanni Da San Martino, Qatar Computing Research Institute, HBKU
  • Ahmed Abdelali, Qatar Computing Research Institute, HBKU
  • Fahim Dalvi, Qatar Computing Research Institute, HBKU
  • Nadir Durrani, Qatar Computing Research Institute, HBKU
  • Hassan Sajjad, Qatar Computing Research Institute, HBKU
  • Kareem Darwish, Qatar Computing Research Institute, HBKU
  • Preslav Nakov, Qatar Computing Research Institute, HBKU

Licensing

This dataset is free for general research use.

Contact

Please contact [email protected]

Acknowledgment

Thanks to QCRI's Crisis Computing team for giving us access to Micromappers.
