The first part of the task was to find the 5 most relevant paragraphs for a given question, and then to extract the piece of text that best answered it, if an answer was present.

Approach:
- Preprocessed the data to remove problems in the dataset, such as columns shifted to the right or rows broken off mid-way.
- Used FAISS to retrieve the top-k paragraphs matching the question, since it provides fast similarity search over dense vectors.
- Used the cross-encoder ms-marco-MiniLM-L-6-v2 to re-rank the top-k paragraphs by how well each one fits the question in context (a sketch of the retrieval + re-ranking stage follows this list).
- Used an ELECTRA model fine-tuned on SQuAD 2.0 to extract answers from the selected paragraph. This was chosen because a large part of the corpus was taken from the SQuAD dataset, so the model had already been trained on much of it (a sketch of this stage appears after the summary below).
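Below is a minimal sketch of how the retrieval and re-ranking stage can be wired together. The paragraph list, the bi-encoder model all-MiniLM-L6-v2, and the example question are illustrative assumptions, not taken from the repository; only the cross-encoder name comes from the approach above.

```python
# Minimal sketch of the retrieval + re-ranking stage.
# Assumptions: `paragraphs` holds the preprocessed corpus; the bi-encoder
# "all-MiniLM-L6-v2" is an illustrative choice of sentence embedder.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

paragraphs = [
    "FAISS is a library for efficient similarity search of dense vectors.",
    "ELECTRA is a transformer pre-trained as a discriminator.",
    "SQuAD 2.0 adds unanswerable questions to the original SQuAD dataset.",
    # ... the real corpus paragraphs go here
]

# 1. Embed every paragraph and build a FAISS index (inner product on
#    L2-normalised vectors == cosine similarity).
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = bi_encoder.encode(paragraphs, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# 2. Retrieve the top-k candidate paragraphs for a question.
def retrieve(question, k=3):
    q = bi_encoder.encode([question], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return [paragraphs[i] for i in ids[0] if i >= 0]

# 3. Re-rank the candidates with the cross-encoder ms-marco-MiniLM-L-6-v2.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question, candidates):
    scores = reranker.predict([(question, p) for p in candidates])
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order]

question = "What is FAISS used for?"  # placeholder question
top_paragraphs = rerank(question, retrieve(question, k=3))
print(top_paragraphs[0])
```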
Together, these stages provide fast search and answering of queries across the large corpus.
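For the answer-extraction stage, here is a minimal sketch using the Hugging Face question-answering pipeline. The checkpoint deepset/electra-base-squad2 is assumed as one publicly available ELECTRA model fine-tuned on SQuAD 2.0; the example question and paragraph are placeholders.

```python
# Minimal sketch of the answer-extraction stage.
# Assumption: "deepset/electra-base-squad2" is one publicly available ELECTRA
# checkpoint fine-tuned on SQuAD 2.0; the exact checkpoint used may differ.
from transformers import pipeline

qa_model = pipeline("question-answering", model="deepset/electra-base-squad2")

def extract_answer(question: str, paragraph: str):
    # SQuAD 2.0-style models can also decide the paragraph contains no answer;
    # handle_impossible_answer=True lets the pipeline return an empty span then.
    result = qa_model(
        question=question,
        context=paragraph,
        handle_impossible_answer=True,
    )
    return result["answer"], result["score"]

answer, score = extract_answer(
    "What is FAISS used for?",  # placeholder question
    "FAISS is a library for efficient similarity search of dense vectors.",
)
print(answer, score)
```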
Results:
The FAISS + cross-encoder combination gives clearly strong results in finding the top paragraphs.
Final question answering from the retrieved paragraph:
Fine-tuning the models on the current data would have produced much better results, but due to time and resource constraints, I wasn't able to do so.
How to run:
- Update the path variables in para_finder.ipynb, comment out the third-to-last cell, and pass the query as run_search(str(query_text), num_results_to_print), where query_text is the question and num_results_to_print is the number of top paragraphs needed (see the example call after this list).
- Run final_integrated to get the answer. Pretrained weights for final_integrated: https://drive.google.com/file/d/1ubh0X_o1sdgmZyIdqiuFXsZEd726QyDA/view?usp=sharing (quantized.pt)
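A hypothetical example of the retrieval call; the question text and the result count are placeholders:

```python
query_text = "When was the company founded?"  # placeholder question
num_results_to_print = 5                      # number of top paragraphs to print
run_search(str(query_text), num_results_to_print)
```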