Assignment 1
*** NOTE ***
- Make sure you are connected to the internet. (If you are running on the CSE server, run the authenticator.py file to bypass the firewall.)
- Make sure all three files are downloaded into the utils folder; their combined size should be around 581 MB. Otherwise, download them from this Google Drive link - https://drive.google.com/drive/folders/1aT76iVKRdgBf6n9vYB6vatdkcCjySafF?usp=sharing
- Put the english-corpora folder inside the data folder (download from here - https://www.cse.iitk.ac.in/users/arnabb/ir/english/)
This folder contains the following directories and files:
- data - contains all the data used in this assignment
  a. english-corpora - the English corpus folder, containing all documents
  b. query.txt - contains a set of 20 queries in the format described in the assignment manual
  c. query_ground_truth.txt - contains the list of 10 relevant documents in QRels format
- utils - contains all the necessary files that take time to build, such as the postings list, document-wise vocab, and doc vectors (a loading sketch is given after this list)
  a. vocab_doc_wise_stemming.npy - a dictionary where keys are docIDs and values are the vocab present in that document
  b. postings_list.pkl - the postings lists, a dictionary where each key is a vocab term and each value is a linked list whose nodes hold two fields: the docID and the frequency of that term in the given docID
  c. sparse_matrix_doc_vectors.npz - the doc vectors file required by the TF-IDF system. For fast query processing, I have created a doc_vectors matrix of shape (doc_size, total vocab) and saved it in sparse_matrix_doc_vectors.npz. Rows represent the documents and columns represent all the vocab present across all documents. The doc vectors form a sparse matrix; to save space, I used scipy.sparse to store them.
- output - a folder that contains all the generated outputs for all 3 IR systems.
- q1.py - the file for the first question in the assignment, which mainly does text preprocessing and cleaning.
- q2.py - contains the functions required for question 2, i.e., all the IR systems and their helper functions.
- main.py - the main file, which ties the whole assignment together.
- run.sh - the file that contains all the variable parameters described in the section below.
- Makefile - there are two targets in the Makefile, "install" and "run"
  a. make install - installs all the required packages and downloads the drive files (please follow the drive link if you're not able to download via make install)
  b. make run - runs the whole assignment
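For reference, here is a minimal sketch of how these prebuilt files could be loaded, assuming vocab_doc_wise_stemming.npy was written with np.save and the doc vectors with scipy.sparse.save_npz (unpickling postings_list.pkl additionally requires the linked-list node class used when it was built):

```python
import pickle

import numpy as np
from scipy.sparse import load_npz

# Document-wise vocab: a dict saved with np.save, so allow_pickle + .item() recovers it
vocab_doc_wise = np.load('./utils/vocab_doc_wise_stemming.npy', allow_pickle=True).item()

# Postings list: dict of term -> linked list of (docID, freq) nodes.
# Unpickling assumes the node class definition is importable, as it was when the file was built.
with open('./utils/postings_list.pkl', 'rb') as f:
    postings_list = pickle.load(f)

# Doc vectors: sparse matrix of shape (doc_size, total vocab)
doc_vectors = load_npz('./utils/sparse_matrix_doc_vectors.npz')
print(doc_vectors.shape)
```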
*** run.sh is the top-level script that runs the entire assignment. ***
To run the entire assignment, go to the home directory where this README file is located and use the make run command (see the Makefile entry above).
These are the variables that I'm passing as arguments to the program. [ change accordingly ]
- top_k = 5 - top k documents will be retrieved
- corpus = "./data/english-corpora/" - Corpus path
- vocab_path = './utils/vocab_doc_wise_stemming.npy' - The path to the document-wise vocab file (in numpy format), a dictionary where keys are docIDs and values are the vocab present in that document. I have already generated it, and it is required by all 3 IR systems. It takes time to build, so if you still want to build your own, set the vocab_flag variable to 1; otherwise use my generated vocab_doc_wise_stemming.npy present in the utils folder.
- postings_path="./utils/postings_list.pkl" - The postings list that I have already generated, required by all 3 IR systems. This also takes time to build; if you want to build it yourself, set postings_flag=1.
- doc_vector_path='./utils/sparse_matrix_doc_vectors.npz' - The doc vectors file required by the TF-IDF system. This also takes time to build; if you want to build it yourself, set vector_flag=1.
- query_file='./data/query.txt' - The query file path; change this if you want to try my assignment on a different set of queries.
- out_folder='./output/' - This is the output folder path.
- vocab_flag=0 - set to 1 to build your own vocab_doc_wise_stemming
- postings_flag=0 - set to 1 to build your own postings lists
- vector_flag=0 - set to 1 to build your own document vectors
- The text is split by '\t' first.
- Remove extra spaces
- Tokenize the strings
- Remove punctuation from tokenized words
- Remove numbers
- Remove double quotation marks from tokens
- Replace URLs with a url tag
- Remove accents from strings using decode (e.g. 'A°')
- Split camelCase words into 'camel' and 'Case'
- Remove numbers from tokens (a sketch of these cleaning steps follows this list)
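A minimal sketch of these cleaning steps, purely illustrative: the function below is hypothetical and is not the actual implementation in q1.py.

```python
import re
import string
import unicodedata

def preprocess(text):
    # Split on tabs first, then collapse extra whitespace
    parts = text.split('\t')
    text = ' '.join(' '.join(parts).split())

    # Replace URLs with a generic tag
    text = re.sub(r'https?://\S+|www\.\S+', 'url', text)

    # Strip accents, e.g. 'café' -> 'cafe'
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

    # Split camelCase words, e.g. 'camelCase' -> 'camel Case'
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)

    tokens = text.split()
    cleaned = []
    for tok in tokens:
        tok = tok.strip(string.punctuation).replace('"', '')  # punctuation and double quotes
        tok = re.sub(r'\d+', '', tok)                         # drop digits
        if tok:
            cleaned.append(tok.lower())
    return cleaned

print(preprocess('Visit https://example.com\t"camelCase tokens"  42 times'))
# -> ['visit', 'url', 'camel', 'case', 'tokens', 'times']
```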
This system takes 0.0032 sec on average to process one query.
- Tokenize the query
- Convert the infix query expression to a postfix expression using a stack-based approach
  a. Check whether the given expression is balanced
  b. Check whether there are any extra parentheses in the expression
- Only two binary operators, & (and) and | (or), are processed in the query, along with ~ (negation), with higher precedence given to & over |
- Use the Snowball stemmer to find the stem of each word in the given query
- Generate a binary vector whose length equals the number of documents, and handle the negation sign while processing
- Find the documents containing a query word using the find_matched_doc function, which returns a binary vector indicating which documents contain that word
- Remove stop words from the query (a sketch of the postfix conversion and evaluation is given below)
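A minimal sketch of the infix-to-postfix conversion and binary-vector evaluation described above. The toy postings dictionary and the simplified find_matched_doc below are stand-ins for the real ones in q2.py.

```python
import numpy as np

NUM_DOCS = 5  # toy collection size; the real value is the corpus size

# Toy postings used as a stand-in for the real index
toy_postings = {'cat': [0, 2], 'dog': [1, 2, 4]}

def find_matched_doc(term):
    """Return a binary vector marking the documents that contain `term`."""
    vec = np.zeros(NUM_DOCS, dtype=bool)
    vec[toy_postings.get(term, [])] = True
    return vec

PRECEDENCE = {'~': 3, '&': 2, '|': 1}

def infix_to_postfix(tokens):
    """Shunting-yard conversion; assumes the expression is already balanced."""
    out, stack = [], []
    for tok in tokens:
        if tok == '(':
            stack.append(tok)
        elif tok == ')':
            while stack and stack[-1] != '(':
                out.append(stack.pop())
            stack.pop()  # discard '('
        elif tok in PRECEDENCE:
            # '~' is a right-associative unary operator; '&' and '|' are left-associative
            while (stack and stack[-1] != '(' and
                   (PRECEDENCE.get(stack[-1], 0) > PRECEDENCE[tok] or
                    (PRECEDENCE.get(stack[-1], 0) == PRECEDENCE[tok] and tok != '~'))):
                out.append(stack.pop())
            stack.append(tok)
        else:
            out.append(tok)  # query term
    return out + stack[::-1]

def evaluate(postfix):
    """Evaluate a postfix boolean query over binary document vectors."""
    stack = []
    for tok in postfix:
        if tok == '~':
            stack.append(~stack.pop())
        elif tok in ('&', '|'):
            b, a = stack.pop(), stack.pop()
            stack.append(a & b if tok == '&' else a | b)
        else:
            stack.append(find_matched_doc(tok))
    return stack.pop()

matches = evaluate(infix_to_postfix(['cat', '&', '~', 'dog']))
print(np.where(matches)[0])  # docIDs matching "cat AND NOT dog" -> [0]
```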
This system takes 22 sec on average to process one query.
- Tokenize the query, remove the stopwords, and remove non-ASCII characters
- Build the query vector, where each dimension represents the frequency of a token in the query
- For fast query processing, I have previously created a doc_vectors matrix of shape (doc_size, total vocab) and saved it in sparse_matrix_doc_vectors.npz
- In the doc vectors, rows represent all the documents and columns represent all the vocab present across all documents
- The doc vectors form a sparse matrix; to save space, I used scipy.sparse to store them.
- Return the top-k documents only.
- I also tried to prepare champion lists, i.e., for each vocab term, a ranked list of the documents containing it. However, this was taking too much space (more than 24 GB) and more than 24 hrs to prepare, so I have only provided the code for it.
- While calculating the similarity score, we don't need to normalize the query vector: dividing every document's score by the same query norm does not change the ranking. (A sketch of the scoring step is given below.)
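A minimal sketch of the scoring step, assuming the stored doc vectors already carry TF-IDF weights and that a mapping from vocab terms to column indices is available (the term_to_index name and helper below are hypothetical):

```python
import numpy as np
from scipy.sparse import load_npz

# Prebuilt doc vectors: shape (doc_size, total vocab)
doc_vectors = load_npz('./utils/sparse_matrix_doc_vectors.npz')

def retrieve_top_k(query_tokens, term_to_index, k=5):
    """Score documents by the dot product of the query term-frequency vector
    with each (TF-IDF weighted) doc vector, and return the top-k docIDs."""
    q = np.zeros(doc_vectors.shape[1])
    for tok in query_tokens:
        idx = term_to_index.get(tok)
        if idx is not None:
            q[idx] += 1.0  # raw term frequency; query normalization is not needed for ranking

    scores = doc_vectors @ q              # one sparse matrix-vector product per query
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest-scoring documents
    return top_k, scores[top_k]
```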
This system takes 0.018 sec on average to process one query.
- Tokenize the query into tokens, remove the stop words, and remove any non-ASCII characters
- Get the local weight using the modified term frequency formula
$$\frac{(k_1+1)\,tf_{d}}{k_1\left(1-b+b\frac{L_d}{L_{avg}}\right) + tf_{d}}$$
- Get the global weight using the inverse document frequency formula below, since no priors are given
$$\log \frac{n}{df_t}$$
- Get the RSV_d score using the formula below and, based on this score, select the top k documents (a scoring sketch follows the formula)
$$RSV_d = \sum_{t \in q} \left(\log \frac{n}{df_t}\right) \cdot \frac{(k_1+1)\,tf_{d}}{k_1\left(1-b+b\frac{L_d}{L_{avg}}\right) + tf_{d}}$$
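A minimal sketch of this scoring, assuming the common default parameters k1 = 1.2 and b = 0.75 (the actual values used in q2.py are not stated here) and plain Python dicts for term frequencies and document lengths:

```python
import math

def bm25_scores(query_terms, term_freqs, doc_lengths, df, n_docs, k1=1.2, b=0.75):
    """Compute RSV_d for every document.

    term_freqs[doc_id][term] -> tf of `term` in that document
    doc_lengths[doc_id]      -> document length L_d
    df[term]                 -> number of documents containing `term`
    """
    l_avg = sum(doc_lengths.values()) / len(doc_lengths)
    scores = {}
    for doc_id, tfs in term_freqs.items():
        rsv = 0.0
        for t in query_terms:
            tf = tfs.get(t, 0)
            if tf == 0 or t not in df:
                continue
            idf = math.log(n_docs / df[t])                               # global weight
            denom = k1 * (1 - b + b * doc_lengths[doc_id] / l_avg) + tf  # length-normalized tf
            rsv += idf * (k1 + 1) * tf / denom
        scores[doc_id] = rsv
    return scores

# Toy example: two documents, query "cat dog"
tfs = {0: {'cat': 2, 'dog': 1}, 1: {'dog': 3}}
lengths = {0: 10, 1: 12}
scores = bm25_scores(['cat', 'dog'], tfs, lengths, df={'cat': 1, 'dog': 2}, n_docs=2)
print(sorted(scores, key=scores.get, reverse=True))  # docIDs ranked by RSV_d
```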