# Open Domain Question Answering Skill on Wikipedia

## Task definition

Open Domain Question Answering (ODQA) is the task of finding an exact answer to an arbitrary question in Wikipedia articles. Thus, given only a question, the system outputs the best answer it can find:

Question:

What is the name of Darth Vader's son?

Answer:

Luke Skywalker
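
For example, once the pretrained English ODQA model is set up (see Running ODQA below), it could be queried from Python roughly as follows. This is a minimal sketch: it assumes the `build_model` helper available in recent DeepPavlov releases, and the exact entry point may differ in older versions; the documented way to run the skill is the CLI shown below.

```python
# Minimal sketch: querying the pretrained English ODQA pipeline from Python.
# Assumes the `build_model` helper of recent DeepPavlov releases; the CLI
# commands in "Running ODQA" below are the documented entry point.
from deeppavlov import build_model

odqa = build_model("deeppavlov/configs/odqa/en_odqa_infer_prod.json", download=True)

answers = odqa(["What is the name of Darth Vader's son?"])
print(answers)  # expected to contain "Luke Skywalker"
```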

## Languages

DeepPavlov provides pretrained ODQA models for English and Russian.

## Models

The architecture of the ODQA skill is modular and consists of two models: a ranker and a reader. The ranker is based on DrQA, proposed by Facebook Research (Reading Wikipedia to Answer Open-Domain Questions), and the reader is based on R-NET, proposed by Microsoft Research Asia ("R-NET: Machine Reading Comprehension with Self-matching Networks"), and on its implementation by Wenxuan Zhou.
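
Conceptually the two models are chained: the ranker retrieves the Wikipedia articles most relevant to a question, and the reader extracts an answer span from them. The sketch below illustrates this flow; the function and attribute names are illustrative placeholders, not the actual DeepPavlov classes.

```python
# Schematic ODQA flow: rank articles first, then read an answer span.
# `ranker` and `reader` stand in for the TF-IDF ranker and the R-NET reader
# described above; their real interfaces are defined by the configs.
def answer(question, ranker, reader, top_n=5):
    # 1. Ranker: score Wikipedia articles against the question and keep
    #    the `top_n` best-matching ones.
    articles = ranker.retrieve(question, top_n=top_n)

    # 2. Reader: run machine reading comprehension over each candidate
    #    article and return the highest-scoring answer span.
    spans = [reader.extract_span(question, text) for text in articles]
    return max(spans, key=lambda span: span.score)
```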

## Running ODQA

TensorFlow 1.4.0 with GPU support is required to run this model.

### Training

The ODQA ranker and the ODQA reader are trained separately. Warning: training the ranker on English Wikipedia requires 16 GB of RAM. Run the following to fit the ranker:

python -m deeppavlov.deep train deeppavlov/configs/odqa/en_ranker_prod.json
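
The ranker can also be fitted from Python rather than the CLI. The snippet below is a sketch that assumes the `train_model` helper exposed by recent DeepPavlov releases; older releases rely on the `deeppavlov.deep` CLI shown above.

```python
# Sketch: fitting the English ranker from Python instead of the CLI.
# Assumes the `train_model` helper of recent DeepPavlov releases.
from deeppavlov import train_model

ranker = train_model("deeppavlov/configs/odqa/en_ranker_prod.json")
```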

Read about training the reader in our separate reader tutorial.

### Interacting

ODQA, the reader, and the ranker can each be interacted with separately. Warning: interacting with the ranker or with ODQA on English Wikipedia requires 16 GB of RAM. Run the following to interact with ODQA:

python -m deeppavlov.deep interact deeppavlov/configs/odqa/en_odqa_infer_prod.json

Run the following to interact with the ranker:

python -m deeppavlov.deep interact deeppavlov/configs/odqa/en_ranker_prod.json

Read about interacting with the reader in our separate reader tutorial.
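
The ranker can likewise be queried programmatically to see which articles it retrieves for a question. This is a sketch under the same assumption as above (the `build_model` helper); per the ranker configuration described below, the output is Wikipedia article ids with relevance scores for each input question.

```python
# Sketch: querying the TF-IDF ranker on its own.
# Assumes the `build_model` helper of recent DeepPavlov releases.
from deeppavlov import build_model

ranker = build_model("deeppavlov/configs/odqa/en_ranker_prod.json", download=True)

# Output: Wikipedia article ids (and scores) for each question in the batch.
print(ranker(["What is the name of Darth Vader's son?"]))
```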

## Configuration

The ODQA configs are intended for inference only. The ranker config should be used for ranker training, and the reader config should be used for reader training.

### Ranker

The ranker config for English can be found at `deeppavlov/configs/odqa/en_ranker_prod.json`.

The ranker config for Russian can be found at `deeppavlov/configs/odqa/ru_ranker_prod.json`.

* `dataset_iterator` - downloads the Wikipedia DB and creates batches for ranker fitting
  * `data_dir` - a directory to download the DB to
  * `data_url` - a URL to download the Wikipedia DB from
  * `shuffle` - whether to shuffle when iterating over the DB
* `chainer` - pipeline manager
  * `in` - pipeline input data (questions)
  * `out` - pipeline output data (Wikipedia article ids and the articles' scores)
* `tfidf_ranker` - the ranker class
  * `in` - ranker input data (questions)
  * `out` - ranker output data (Wikipedia article ids)
  * `fit_on_batch` - fit the ranker on batches of Wikipedia articles
  * `vectorizer` - a vectorizer class
    * `fit_on_batch` - fit the vectorizer on batches of Wikipedia articles
    * `save_path` - a path to serialize the vectorizer to
    * `load_path` - a path to load the vectorizer from
    * `tokenizer` - a tokenizer class
      * `lemmas` - whether to lemmatize tokens
      * `ngram_range` - the ngram range for vectorizer features
* `train` - parameters for vectorizer fitting
  * `validate_best` - ignored; may be any value
  * `test_best` - ignored; may be any value
  * `batch_size` - how many Wikipedia articles the dataset iterator returns in a single batch
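
Taken together, the parameters above describe a config of roughly the following shape. The skeleton below mirrors the nesting of the list and is shown as an equivalent Python dict purely for illustration; all values are placeholders, and the shipped JSON config may organize the components differently (e.g. inside the chainer's pipeline).

```python
# Skeletal shape of the ranker config, mirroring the parameter list above.
# All values are illustrative placeholders, not the shipped settings.
ranker_config_sketch = {
    "dataset_iterator": {
        "data_dir": "download/odqa",        # where the Wikipedia DB is stored
        "data_url": "<wikipedia-db-url>",   # placeholder URL
        "shuffle": False,
    },
    "chainer": {
        "in": ["questions"],
        "out": ["article_ids", "article_scores"],
    },
    "tfidf_ranker": {
        "in": ["questions"],
        "out": ["article_ids"],
        "fit_on_batch": ["wiki_articles"],
        "vectorizer": {
            "fit_on_batch": ["wiki_articles"],
            "save_path": "vectorizer.npz",
            "load_path": "vectorizer.npz",
            "tokenizer": {"lemmas": True, "ngram_range": [1, 2]},
        },
    },
    "train": {"validate_best": False, "test_best": False, "batch_size": 1000},
}
```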

### ODQA

The default ODQA config for English is `deeppavlov/configs/odqa/en_odqa_infer_prod.json`.

The default ODQA config for Russian is `deeppavlov/configs/odqa/ru_odqa_infer_prod.json`.

For the components of the ODQA config, refer to the ranker config and the reader config descriptions. However, the main inputs and outputs are worth explaining:

* `chainer` - pipeline manager
  * `in` - pipeline input data (questions)
  * `out` - pipeline output data (answers)

## Pretrained models

Wikipedia data and pretrained ODQA models are downloaded to `deeppavlov/download/odqa` by default.

### enwiki.db

The `enwiki.db` SQLite database contains 5,159,530 Wikipedia articles and is built with the following steps:

  1. Download a Wikipedia dump file. We took the latest enwiki dump (from 2018-02-11).
  2. Unpack and extract the articles with WikiExtractor (with the `--json`, `--no-templates`, and `--filter_disambig_pages` options).
  3. Build the database with the help of the DrQA script.
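
Once built, the database can be inspected with Python's standard sqlite3 module. This is a sketch, assuming the DrQA schema of a single `documents` table with `id` and `text` columns:

```python
# Sketch: peeking into enwiki.db, assuming the DrQA `documents(id, text)` schema.
import sqlite3

conn = sqlite3.connect("enwiki.db")
cursor = conn.cursor()

# Count the stored articles (5159530 for this build).
cursor.execute("SELECT COUNT(*) FROM documents")
print(cursor.fetchone()[0])

# Fetch one article id and the beginning of its text.
cursor.execute("SELECT id, text FROM documents LIMIT 1")
doc_id, text = cursor.fetchone()
print(doc_id, text[:200])

conn.close()
```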

### enwiki_tfidf_matrix.npz

`enwiki_tfidf_matrix.npz` is a full Wikipedia tf-idf matrix of size `hash_size x number_of_documents`, i.e. `2**24 x 5,159,530`. The matrix is built with the `deeppavlov/models/vectorizers/hashing_tfidf_vectorizer.HashingTfidfVectorizer` class.
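
The hashing trick is what keeps the matrix height fixed at 2^24 regardless of vocabulary size: every token (or ngram) is hashed into one of 2^24 rows. The snippet below is a rough sketch of that idea using scipy, counting raw term frequencies only; it is not the DeepPavlov `HashingTfidfVectorizer` implementation, which also applies idf weighting and a stable hash function.

```python
# Rough sketch of a hashed term-document count matrix (tf part only).
# Illustrates the 2**24-row hashing idea, not the DeepPavlov implementation.
import scipy.sparse as sp

HASH_SIZE = 2 ** 24

def hashed_count_matrix(documents):
    """Build a sparse (HASH_SIZE x n_docs) matrix of hashed token counts."""
    rows, cols, data = [], [], []
    for doc_idx, text in enumerate(documents):
        for token in text.lower().split():
            rows.append(hash(token) % HASH_SIZE)  # bucket the token into a fixed-size row
            cols.append(doc_idx)
            data.append(1)
    # Duplicate (row, col) pairs are summed, yielding per-document term counts.
    return sp.csr_matrix((data, (rows, cols)), shape=(HASH_SIZE, len(documents)))

matrix = hashed_count_matrix(["Luke Skywalker is the son of Darth Vader",
                              "Wikipedia is a free online encyclopedia"])
print(matrix.shape)  # (16777216, 2)
```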

### ruwiki.db

The `ruwiki.db` SQLite database contains 1,463,888 Wikipedia articles and is built with the following steps:

  1. Download a Wikipedia dump file. We took the latest ruwiki dump (from 2018-04-01).
  2. Unpack and extract the articles with WikiExtractor (with the `--json`, `--no-templates`, and `--filter_disambig_pages` options).
  3. Build the database with the help of the DrQA script.

### ruwiki_tfidf_matrix.npz

`ruwiki_tfidf_matrix.npz` is a full Wikipedia tf-idf matrix of size `hash_size x number_of_documents`, i.e. `2**24 x 1,463,888`. The matrix is built with the `deeppavlov/models/vectorizers/hashing_tfidf_vectorizer.HashingTfidfVectorizer` class.

## References

  1. https://github.com/facebookresearch/DrQA
  2. https://github.com/HKUST-KnowComp/R-Net