Implementation of an Information Retrieval system based on Stanford's CS 276 course.
This repository contains the following components:
- BSBI: Block Sort-Based Indexing for efficient document and query indexing.
- Spelling Corrector: A probability-based model for suggesting query spelling corrections.
To use the modules, install this repo from GitHub via pip/poetry/uv:
pip install git+https://github.com/jlondonobo/information-retrieval
Or clone the repository and install it locally:
git clone https://github.com/jlondonobo/information-retrieval
pip install ./information-retrieval
For examples of how to use the modules, see the notebooks directory.
The BSBI module expects an input dataset in the following format:
- Space-separated tokens.
- Block size of ~1000 documents.
- Input data structure:
Each folder contains approximately 1000 documents.
data/
├── 0/
│   ├── document1.txt
│   ├── document2.txt
│   └── ...
├── 1/
│   ├── document1001.txt
│   ├── document1002.txt
│   └── ...
└── 2/
    ├── document2001.txt
    ├── document2002.txt
    └── ...
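As a quick sanity check before indexing, the layout above can be walked with the standard library alone. This is a minimal sketch, not part of this repository's API: `iter_blocks` and `read_tokens` are hypothetical helper names, and the sketch assumes block folders have purely numeric names and documents are UTF-8 text files.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterator


def iter_blocks(data_dir: str) -> Iterator[tuple[str, list[Path]]]:
    """Yield (block_name, document paths) for each numbered block folder.

    Hypothetical helper for illustration; not part of this repository.
    """
    root = Path(data_dir)
    # Sort block folders numerically so that "10" comes after "2".
    block_dirs = sorted(
        (p for p in root.iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )
    for block_dir in block_dirs:
        # Documents are sorted lexicographically by path, only to make
        # the iteration order deterministic.
        docs = sorted(p for p in block_dir.iterdir() if p.is_file())
        yield block_dir.name, docs


def read_tokens(doc_path: Path) -> list[str]:
    """Split a document on whitespace (assumes space-separated tokens)."""
    return doc_path.read_text(encoding="utf-8").split()
```

Calling `iter_blocks("data")` and checking `len(docs)` for each block gives a quick way to confirm that every folder holds roughly 1000 documents.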
This project is licensed under the MIT license.