This page collects natural language understanding (NLU) datasets proposed in 2018.

Dataset | task | style | size | source | venue | web | misc | similar datasets |
---|---|---|---|---|---|---|---|---|
CoQA | RC | free form (+no ans) | 127k | various articles | TACL? | url | conversational questions | QuAC |
QuAC | RC | extraction (+no ans) | 100k | Wikipedia | EMNLP2018 | url | conversational questions | CoQA |
HotpotQA | RC | extraction | 113k | Wikipedia | EMNLP2018 | url | multi-hop reasoning | QAngaroo |
SWAG | QA | multiple choice | 113k | video caption | EMNLP2018 | url | situational commonsense reasoning | |
DNC | NLI | textual entailment | 570k | NLP tasks | EMNLP2018 | url | diverse NLI | SNLI, MultiNLI |
OpenBookQA | QA | multiple choice | 6k | science facts | EMNLP2018 | url | external knowledge | ARC |
RecipeQA | RC+ | various | 36k | recipe | EMNLP2018 | url | multimodal comprehension | TextbookQA, FigureQA |
CLOTH | RC | cloze | 99k | English exams | EMNLP2018 | url | | RACE |
DuoRC | RC | extraction | 186k | movie plot | ACL2018 | url | | NarrativeQA |
SQuAD2.0 | RC | extraction (+no ans) | 150k | Wikipedia | ACL2018 | url | no answer: 50k | NewsQA |
CliCR | RC | cloze | 100k | clinical case text | NAACL2018 | url | | |
FEVER | NLI? | fact verification | 185k | Wikipedia | NAACL2018 | url | | |
MultiRC | RC | multiple choice | 6k+ | various articles | NAACL2018 | url | multiple sentence reasoning | MCTest |
ProPara | RC | various | 2k | procedural text | NAACL2018 | url | | bAbI, SCoNE |
ARC | RC | multiple choice | 8k | science exam | ? | url | easy 5197, challenge 2590 | |

TODO:
- Interpretation of Natural Language Rules in Conversational Machine Reading (Saeidi+ 2018, EMNLP)
- Multi-Relational Question Answering from Narratives: Machine Reading and Reasoning in Simulated Worlds (Labutov+ 2018, ACL)
- Event2Mind: Commonsense Inference on Events, Intents, and Reactions (Rashkin+ 2018, ACL)
- Modeling Naive Psychology of Characters in Simple Commonsense Stories (Rashkin+ 2018, ACL)
- emrQA: A Large Corpus for Question Answering on Electronic Medical Records (Pampari+ 2018, EMNLP)

Note:
- QA = question answering; RC = reading comprehension (question answering over a given context); NLI = natural language inference, also known as recognizing textual entailment
- (+no ans) = the dataset also contains unanswerable questions