This project develops an information retrieval system to identify relevant COVID-19 documents. It aims to aid public understanding, support clinicians, and contribute to global pandemic research.
- CORD-19: Round 5 dataset from Kaggle, comprising 128K research papers as JSON files with accompanying metadata.
- Data Pre-processing: Transform raw documents into a searchable format.
- Document Indexing: Utilize Elasticsearch for efficient indexing and retrieval.
- Document Retrieval: Implement query schemes for optimal document fetching.
- Re-ranking: Apply BERT-based models to enhance document relevance ranking.
- Evaluation: Scored retrieval runs with `trec_eval`.
- Precision improved significantly across the query combinations and re-ranking techniques tested.
- The COVID-SciBERT model gave the best results in the re-ranking phase.
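
As an illustration of the evaluation step, `trec_eval` reads a run file in which each line has six whitespace-separated fields: query id, the literal `Q0`, document id, rank, score, and a run tag. The sketch below formats a ranked list that way and computes precision at k against a set of relevant document ids. The function names, the `demo-run` tag, and the toy ids are hypothetical and not taken from the project.

```python
from typing import List, Set, Tuple


def format_run_lines(
    qid: str, scored: List[Tuple[str, float]], run_tag: str = "demo-run"
) -> List[str]:
    """Format (doc_id, score) pairs as TREC run lines: qid Q0 docid rank score tag."""
    # Sort by descending score so that rank 1 is the highest-scoring document.
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [
        f"{qid} Q0 {doc_id} {rank} {score:.4f} {run_tag}"
        for rank, (doc_id, score) in enumerate(ordered, start=1)
    ]


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant (divides by k)."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


if __name__ == "__main__":
    # Hypothetical scores for a single query.
    scored = [("doc_b", 0.4), ("doc_a", 0.9), ("doc_c", 0.1)]
    for line in format_run_lines("1", scored):
        print(line)
    print(precision_at_k(["doc_a", "doc_b", "doc_c"], {"doc_a", "doc_c"}, k=2))
```

A run file written this way could then be scored with a command of the form `trec_eval qrels.txt run.txt`, where both file names are placeholders.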
Instructions for setting up and running the system are provided, covering each step from data pre-processing to document re-ranking.
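
The steps from pre-processing to re-ranking can be sketched as follows. This is a minimal illustration, not the project's code. It assumes the CORD-19 JSON layout with `paper_id`, `metadata.title`, `abstract`, and `body_text` fields, builds an Elasticsearch `multi_match` query body as a plain dictionary, and re-ranks with a pluggable scoring function standing in for a BERT-based model. The function names, the index field names (`title`, `abstract`, `body`), and the title boost are all hypothetical.

```python
from typing import Callable, Dict, List


def flatten_paper(paper: dict) -> Dict[str, str]:
    """Pre-processing: collapse one parsed paper into a flat, indexable record."""
    return {
        "id": paper.get("paper_id", ""),
        "title": paper.get("metadata", {}).get("title", ""),
        "abstract": " ".join(p.get("text", "") for p in paper.get("abstract", [])),
        "body": " ".join(p.get("text", "") for p in paper.get("body_text", [])),
    }


def build_query(text: str, size: int = 100) -> dict:
    """Retrieval: a multi_match query body over the flattened fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # Assumed weighting: title matches count twice as much.
                "fields": ["title^2", "abstract", "body"],
            }
        },
    }


def rerank(
    query: str,
    hits: List[Dict[str, str]],
    score_fn: Callable[[str, str], float],
) -> List[Dict[str, str]]:
    """Re-ranking: reorder first-stage hits by a query-document relevance score."""
    return sorted(hits, key=lambda h: score_fn(query, h["abstract"]), reverse=True)


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a BERT-based scorer: fraction of query terms in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0
```

In a real run, the dictionary from `build_query` would be sent to Elasticsearch as the search body, and `overlap_score` would be replaced by a model-based scorer such as the COVID-SciBERT re-ranker mentioned above.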
Refer to the LICENSE file for licensing details.
For further information and detailed methodology, refer to the full report.