Question Answering

SQuAD

The SQuAD dataset is distributed under the CC BY-SA 4.0 license.

Run one of the following commands to download SQuAD:

python3 prepare_squad.py --version 1.1 # SQuAD 1.1
python3 prepare_squad.py --version 2.0 # SQuAD 2.0

For every dataset we support, we also provide a command-line toolkit for downloading it:

nlp_data prepare_squad --version 1.1
nlp_data prepare_squad --version 2.0

The directory structure of the SQuAD dataset will be as follows, where version can be 1.1 or 2.0:

squad
├── train-v{version}.json
├── dev-v{version}.json
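Once the files are downloaded, the nested SQuAD JSON can be flattened into one record per question. The following is a minimal sketch, not part of the download scripts: `iter_squad_examples` is a hypothetical helper name, and the field names (`data`, `paragraphs`, `qas`, `answers`, `is_impossible`) are those of the public SQuAD v1.1/v2.0 file format.

```python
import json
from typing import Dict, Iterator


def iter_squad_examples(path: str) -> Iterator[Dict]:
    """Yield one flat record per question from a SQuAD v1.1 or v2.0 JSON file.

    Hypothetical helper for illustration; it is not provided by prepare_squad.py.
    """
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "title": article.get("title", ""),
                    "context": paragraph["context"],
                    "question": qa["question"],
                    # Each answer also carries an "answer_start" character offset.
                    "answers": [a["text"] for a in qa["answers"]],
                    # "is_impossible" only exists in v2.0; default to False for v1.1.
                    "is_impossible": qa.get("is_impossible", False),
                }
```

For example, `next(iter_squad_examples("squad/dev-v2.0.json"))` would return the first question of the development set, assuming the dataset was downloaded into the `squad` directory shown above.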

SearchQA

In accordance with its BSD-3-Clause License, we have uploaded SearchQA to our S3 bucket and provide a link to download the processed txt files. To download the raw and split files, which were collected through web search using the scraper from the GitHub repository, please check out the Google Drive link.

Download the SearchQA dataset with the Python script or the command-line toolkit:

python3 prepare_searchqa.py

# Or download with the command-line toolkit
nlp_data prepare_searchqa

The directory structure of the SearchQA dataset will be as follows:

searchqa
├── train.txt
├── val.txt
├── test.txt

TriviaQA

TriviaQA is an open-domain QA dataset. See the official GitHub repository for more useful scripts.

Run one of the following commands to download TriviaQA:

python3 prepare_triviaqa.py --type rc         # Download TriviaQA version 1.0 for RC (2.5G)
python3 prepare_triviaqa.py --type unfiltered # Download unfiltered TriviaQA version 1.0 (604M)

# Or download with the command-line toolkit
nlp_data prepare_triviaqa --type rc
nlp_data prepare_triviaqa --type unfiltered

The directory structure of the TriviaQA dataset (rc and unfiltered) will be as follows:

triviaqa
├── triviaqa-rc
    ├── qa
        ├── verified-web-dev.json
        ├── web-dev.json
        ├── web-train.json
        ├── web-test-without-answers.json
        ├── verified-wikipedia-dev.json
        ├── wikipedia-test-without-answers.json
        ├── wikipedia-dev.json
        ├── wikipedia-train.json
    ├── evidence
        ├── web
        ├── wikipedia
├── triviaqa-unfiltered
    ├── unfiltered-web-train.json
    ├── unfiltered-web-dev.json
    ├── unfiltered-web-test-without-answers.json
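The question files listed above can be read with the standard `json` module. The sketch below is illustrative only: `iter_triviaqa_questions` is a hypothetical helper, and it assumes the TriviaQA 1.0 layout in which each file holds a top-level `Data` list whose items have `QuestionId`, `Question`, and (except in the `*-without-answers` files) an `Answer` object with `Value` and `Aliases`. Check a downloaded file before relying on these field names.

```python
import json
from typing import Dict, Iterator


def iter_triviaqa_questions(path: str) -> Iterator[Dict]:
    """Yield question records from a TriviaQA question file.

    Hypothetical helper; the field names are assumptions about the
    TriviaQA 1.0 JSON layout and are not defined by prepare_triviaqa.py.
    """
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    for item in dataset["Data"]:
        # Test files ship without answers, so "Answer" may be missing.
        answer = item.get("Answer") or {}
        yield {
            "id": item["QuestionId"],
            "question": item["Question"],
            "answer": answer.get("Value"),
            "aliases": answer.get("Aliases", []),
        }
```

With the layout above, `iter_triviaqa_questions("triviaqa/triviaqa-rc/qa/web-dev.json")` would iterate over the web development questions. The evidence documents under `evidence` are stored separately and are not loaded by this sketch.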

HotpotQA

HotpotQA is distributed under the CC BY-SA 4.0 license. We only provide a download script (run with the following command); please check out the GitHub repository for details on preprocessing and evaluation.

python3 prepare_hotpotqa.py

# Or download with the command-line toolkit
nlp_data prepare_hotpotqa

The directory structure of the HotpotQA dataset will be as follows:

hotpotqa
├── hotpot_train_v1.1.json
├── hotpot_dev_fullwiki_v1.json
├── hotpot_dev_distractor_v1.json
├── hotpot_test_fullwiki_v1.json
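Because preprocessing is left to the upstream repository, a quick way to inspect a downloaded file is to load it and resolve the supporting facts to sentence text. This is a minimal sketch under assumptions: `load_hotpotqa` and `supporting_sentences` are hypothetical helpers, and the code assumes each file is a JSON list whose items contain `context` as `[title, [sentences]]` pairs and `supporting_facts` as `[title, sentence_index]` pairs (the test file is expected to omit the latter).

```python
import json
from typing import Dict, List


def load_hotpotqa(path: str) -> List[Dict]:
    """Load a HotpotQA JSON file, assumed to be a list of example dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def supporting_sentences(example: Dict) -> List[str]:
    """Map (title, sentence index) supporting facts to the sentence text.

    Hypothetical helper; returns an empty list when the example has no
    "supporting_facts" key, and skips indices outside the paragraph.
    """
    context = {title: sentences for title, sentences in example["context"]}
    found = []
    for title, sent_id in example.get("supporting_facts", []):
        sentences = context.get(title, [])
        if 0 <= sent_id < len(sentences):
            found.append(sentences[sent_id])
    return found
```

For instance, `supporting_sentences(load_hotpotqa("hotpotqa/hotpot_dev_distractor_v1.json")[0])` would list the gold supporting sentences of the first development example, given the directory layout above.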

NaturalQuestions

NaturalQuestions is an open-domain QA dataset that contains questions from real users. For more details about this dataset, see https://ai.google.com/research/NaturalQuestions

Run the following command to download NaturalQuestions and extract the gz files:

# Download NaturalQuestions simplified version 1.0 (5.4G)
python3 prepare_naturalquestions.py --extract

# Or download with the command-line toolkit
nlp_data prepare_naturalquestions --extract

If you do not want to extract the gz files, just run:

python3 prepare_naturalquestions.py

# Or download with the command-line toolkit
nlp_data prepare_naturalquestions

The directory structure of the NaturalQuestions dataset will be as follows:

NaturalQuestions
├── v1.0-simplified_simplified-nq-train.jsonl
├── nq-dev-all.jsonl
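The `.jsonl` files hold one JSON object per line and are several gigabytes in size, so it is usually better to stream them than to load them whole. The sketch below shows one way to do that; `iter_jsonl` is a hypothetical helper, not part of the download script. It also accepts `.gz` paths, on the assumption that files left unextracted keep a `.gz` suffix.

```python
import gzip
import json
from typing import Dict, Iterator, Optional


def iter_jsonl(path: str, limit: Optional[int] = None) -> Iterator[Dict]:
    """Stream records from a JSON Lines file, optionally gzip-compressed.

    Hypothetical helper: picks gzip.open for paths ending in ".gz" and the
    built-in open otherwise, and stops after `limit` records if given.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        yielded = 0
        for line in f:
            if limit is not None and yielded >= limit:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            yield json.loads(line)
            yielded += 1
```

For example, `for ex in iter_jsonl("NaturalQuestions/nq-dev-all.jsonl", limit=5): print(ex.keys())` prints the keys of the first five development records without reading the rest of the file, which is a quick way to confirm the available fields.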