A Final Year Project from The Chinese University of Hong Kong: a Retrieval-Augmented Generation (RAG) pipeline that answers complex scientific multiple-choice (MC) questions using DeBERTaV3 and Mistral 7B. Placed 16th out of 2,665 teams in a Kaggle competition with a MAP@3 of 0.919, and received the highest grade of A in both terms.
- Enhanced Context Generation:
  - Section-level retrieval from the CirrusSearch Wikipedia dataset to avoid information loss.
  - Dual embedding models (`e5-large-v2` and `bge-large-en-v1.5`) for diverse context retrieval.
- Model Architecture:
  - DeBERTaV3 fine-tuned with LoRA for efficient training.
  - Mistral 7B (QLoRA) applied to the bottom 10% hardest questions, with choice-swapping inference.
- Dataset:
  - 60,000 MC questions (generated via GPT-3.5, SciQ, and EduQG).
  - Improved Wikipedia datasets (science-focused and general articles).
- Ensemble Methods:
  - Average ensemble of predictions from dual-context DeBERTa models.
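The average ensemble above can be sketched as follows. This is a minimal illustration, assuming each DeBERTa run emits one logit per answer choice; the function names are ours, not the repository's:

```python
# Sketch: average the softmax probabilities from two DeBERTa runs
# (one per retrieval context), then rank the answer choices for MAP@3.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_top3(logits_a: np.ndarray, logits_b: np.ndarray) -> np.ndarray:
    """Average per-context probabilities, return top-3 choice indices per question."""
    probs = (softmax(logits_a) + softmax(logits_b)) / 2.0
    return np.argsort(-probs, axis=-1)[:, :3]
```

Averaging probabilities rather than raw logits keeps the two runs on a comparable scale even when their logit magnitudes differ.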
```
├── data/                  # Processed datasets (MC questions, Wikipedia sections)
├── models/                # Fine-tuned DeBERTaV3 and Mistral 7B checkpoints
├── src/
│   ├── data_generation/   # GPT-3.5 MC question generation pipeline
│   ├── context_retrieval/ # FAISS similarity search (cosine similarity) for context retrieval
│   ├── training/          # LoRA/QLoRA fine-tuning scripts
│   └── inference/         # Ensemble and Mistral 7B inference pipelines
├── img/
└── README.md
```
| Model | Configuration | MAP@3 |
|---|---|---|
| DeBERTaV3 (Ensemble) | Dual-context + 60k data | 0.914 |
| Mistral 7B + DeBERTa | Mistral on the bottom 10% hardest questions | 0.919 |
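MAP@3, the competition metric, credits each question by the reciprocal rank of the correct choice within the top three predictions. A minimal reference implementation (our own sketch, not the official scorer):

```python
# Sketch of MAP@3: 1/rank if the correct answer appears in the top 3, else 0,
# averaged over all questions.
def map_at_3(predictions: list[list[str]], labels: list[str]) -> float:
    """predictions[i] is a ranked list of choices; labels[i] is the answer."""
    score = 0.0
    for preds, label in zip(predictions, labels):
        for rank, p in enumerate(preds[:3], start=1):
            if p == label:
                score += 1.0 / rank
                break
    return score / len(labels)
```

Under this metric, 0.919 means the correct choice was almost always ranked first or second.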
- Datasets: SciQ, EduQG, Kaggle Wikipedia
- Models: DeBERTaV3, Mistral 7B
- Tools: PyTorch, Hugging Face, LoRA, QLoRA, FAISS, Sentence-BERT, NumPy, Pandas