Open Domain Question Answering (ODQA) is the task of finding an exact answer to an arbitrary question in Wikipedia articles. Given only a question, the system outputs the best answer it can find:
Question:
What is the name of Darth Vader's son?
Answer:
Luke Skywalker
DeepPavlov provides pretrained ODQA models for English and Russian.
The architecture of the ODQA skill is modular and consists of two models: a ranker and a reader. The ranker is based on DrQA, proposed by Facebook Research ("Reading Wikipedia to Answer Open-Domain Questions"), and the reader is based on R-NET, proposed by Microsoft Research Asia ("R-NET: Machine Reading Comprehension with Self-matching Networks") and on its implementation by Wenxuan Zhou.
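To make the data flow concrete, below is a minimal sketch of the two-stage pipeline. It is not DeepPavlov's actual API: rank_articles and extract_answer are hypothetical placeholders for the tf-idf ranker and the R-NET reader described above.

from typing import List, Tuple

def rank_articles(question: str, top_n: int = 5) -> List[str]:
    # Hypothetical ranker stub: return the texts of the top-n relevant Wikipedia articles.
    raise NotImplementedError

def extract_answer(question: str, context: str) -> Tuple[str, float]:
    # Hypothetical reader stub: return an answer span and its confidence score.
    raise NotImplementedError

def odqa(question: str) -> str:
    # Rank articles first, then read each candidate and keep the best-scoring span.
    candidates = rank_articles(question)
    answers = [extract_answer(question, context) for context in candidates]
    best_answer, _ = max(answers, key=lambda pair: pair[1])
    return best_answer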
TensorFlow 1.4.0 with GPU support is required to run this model.
The ODQA ranker and ODQA reader should be trained separately. Warning: training the ranker on English Wikipedia requires 16 GB RAM. Run the following to fit the ranker:
python -m deeppavlov.deep train deeppavlov/configs/odqa/en_ranker_prod.json
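If you prefer to train from Python instead of the command line, a minimal sketch follows; it assumes the train_model helper exported by newer DeepPavlov releases and reuses the config path from the command above.

# A minimal sketch, assuming the train_model helper from newer DeepPavlov releases.
from deeppavlov import train_model

# Fit the ranker (tf-idf vectorizer) on Wikipedia using the same config as above.
ranker = train_model("deeppavlov/configs/odqa/en_ranker_prod.json")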
Read about training the reader in our separate reader tutorial.
You can interact with ODQA, the ranker and the reader separately. Warning: interacting with the ranker and ODQA on English Wikipedia requires 16 GB RAM. Run the following to interact with ODQA:
python -m deeppavlov.deep interact deeppavlov/configs/odqa/en_odqa_infer_prod.json
Run the following to interact with the ranker:
python -m deeppavlov.deep interact deeppavlov/configs/odqa/en_ranker_prod.json
Read about interacting with the reader in our separate reader tutorial.
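For ranker interaction from Python, a minimal sketch is shown below; it assumes the build_model helper exported by newer DeepPavlov releases and the ranker config path from above.

# A minimal sketch, assuming the build_model helper from newer DeepPavlov releases.
from deeppavlov import build_model

# The ranker maps a batch of questions to Wikipedia article ids (and their scores).
ranker = build_model("deeppavlov/configs/odqa/en_ranker_prod.json")
print(ranker(["What is the name of Darth Vader's son?"]))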
The ODQA configs are intended for inference only. The ranker config should be used for ranker training and the reader config for reader training.
The ranker config for the English language can be found at deeppavlov/configs/odqa/en_ranker_prod.json
The ranker config for the Russian language can be found at deeppavlov/configs/odqa/ru_ranker_prod.json
- dataset_iterator - downloads the Wikipedia DB and creates batches for ranker fitting
  - data_dir - a directory to download the DB to
  - data_url - a URL to download the Wikipedia DB from
  - shuffle - whether to perform shuffling when iterating over the DB
- chainer - pipeline manager
  - in - pipeline input data (questions)
  - out - pipeline output data (Wikipedia article ids and scores of the articles)
- tfidf_ranker - the ranker class
  - in - ranker input data (questions)
  - out - ranker output data (Wikipedia article ids)
  - fit_on_batch - fit the ranker on batches of Wikipedia articles
- vectorizer - a vectorizer class
  - fit_on_batch - fit the vectorizer on batches of Wikipedia articles
  - save_path - a path to serialize the vectorizer to
  - load_path - a path to load the vectorizer from
- tokenizer - a tokenizer class
  - lemmas - whether to lemmatize tokens or not
  - ngram_range - ngram range for vectorizer features
- train - parameters for vectorizer fitting
  - validate_best - ignored, any value
  - test_best - ignored, any value
  - batch_size - how many Wikipedia articles the dataset iterator should return in a single batch
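The grouping above can be checked directly against the config file itself; the snippet below only loads and inspects it with the standard json module and assumes it is run from the DeepPavlov repository root.

# Inspect the actual ranker config; assumes the DeepPavlov repository root as working directory.
import json

with open("deeppavlov/configs/odqa/en_ranker_prod.json") as f:
    config = json.load(f)

print(list(config.keys()))                                # e.g. dataset_iterator, chainer, train
print(config["chainer"]["in"], config["chainer"]["out"])  # questions -> article ids and scores
print(config["train"].get("batch_size"))                  # articles per batch during fitting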
The default ODQA config for the English language is deeppavlov/configs/odqa/en_odqa_infer_prod.json
The default ODQA config for the Russian language is deeppavlov/configs/odqa/ru_odqa_infer_prod.json
For details on the components of the ODQA config, refer to the ranker config and the reader config respectively. However, the main inputs and outputs are worth explaining:
- chainer - pipeline manager
  - in - pipeline input data (questions)
  - out - pipeline output data (answers)
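End-to-end, this means a batch of questions goes in and a batch of answers comes out; a minimal sketch, again assuming the build_model helper from newer DeepPavlov releases and the ODQA config path from above:

# A minimal sketch, assuming the build_model helper from newer DeepPavlov releases.
from deeppavlov import build_model

odqa = build_model("deeppavlov/configs/odqa/en_odqa_infer_prod.json")
questions = ["What is the name of Darth Vader's son?"]
answers = odqa(questions)          # one answer per input question
print(list(zip(questions, answers)))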
Wikipedia data and pretrained ODQA models are downloaded to deeppavlov/download/odqa by default.
The enwiki.db SQLite database contains 5,159,530 Wikipedia articles and is built with the following steps:
- Download a Wikipedia dump file. We took the latest enwiki dump (from 2018-02-11).
- Unpack and extract the articles with WikiExtractor (with the --json, --no-templates and --filter_disambig_pages options).
- Build a database with the help of the DrQA script.
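To peek inside the resulting database, the sketch below uses the standard sqlite3 module; it assumes the DrQA schema (a documents table with id and text columns) and the default download location mentioned above.

# A minimal sketch, assuming the DrQA schema (documents table with id and text columns).
import sqlite3

conn = sqlite3.connect("deeppavlov/download/odqa/enwiki.db")
cursor = conn.cursor()

cursor.execute("SELECT COUNT(*) FROM documents")
print(cursor.fetchone()[0])                      # number of stored articles

cursor.execute("SELECT id, text FROM documents LIMIT 1")
doc_id, text = cursor.fetchone()
print(doc_id, text[:200])                        # an article id and a snippet of its text
conn.close()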
enwiki_tfidf_matrix.npz is a full Wikipedia tf-idf matrix of size hash_size x number of documents, which is 2**24 x 5,159,530. The matrix is built with the deeppavlov/models/vectorizers/hashing_tfidf_vectorizer.HashingTfidfVectorizer class.
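For intuition about what a hashed tf-idf matrix of this shape is, here is a rough analogy built with scikit-learn rather than DeepPavlov's HashingTfidfVectorizer; the toy two-document corpus stands in for the full set of Wikipedia articles.

# A rough analogy using scikit-learn, not DeepPavlov's HashingTfidfVectorizer.
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

docs = [
    "Luke Skywalker is the son of Darth Vader.",
    "Wikipedia is a free online encyclopedia.",
]

# Hash token counts into a fixed space of 2**24 features, then apply tf-idf weighting.
counts = HashingVectorizer(n_features=2**24, alternate_sign=False).transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

# Transposed, this matches the hash_size x number-of-documents layout described above.
print(tfidf.T.shape)   # (16777216, 2)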
The ruwiki.db SQLite database contains 1,463,888 Wikipedia articles and is built with the following steps:
- Download a Wikipedia dump file. We took the latest ruwiki dump (from 2018-04-01).
- Unpack and extract the articles with WikiExtractor (with the --json, --no-templates and --filter_disambig_pages options).
- Build a database with the help of the DrQA script.
ruwiki_tfidf_matrix.npz is a full Wikipedia tf-idf matrix of size hash_size x number of documents, which is 2**24 x 1,463,888. The matrix is built with the deeppavlov/models/vectorizers/hashing_tfidf_vectorizer.HashingTfidfVectorizer class.