v0.2.1: Bugs and Datasets Fixed and Minor Updates

thakur-nandan released this 19 Jul 16:07


1. New script to utilize docT5query in parallel with multiple GPUs!

Thanks to @joshdevins, we have a new script that uses multiple GPUs in parallel to speed up generating multiple queries per passage with a question generation model. Check it out [here].
You can now also pass a custom device for question generation when no CUDA-recognizable device is present.

2. PQ Hashing with OPQ Rotation and Scalar Quantizer from Faiss!

You can now apply OPQ rotation before PQ hashing, and use a Scalar Quantizer for fp16 Faiss search instead of the original fp32.
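
To give a feel for the fp16 versus fp32 trade-off, here is a minimal standard-library sketch. It does not use Faiss or BEIR: `to_fp16` and `dot` are hypothetical helpers, and the vectors are made-up values. It only shows that half-precision storage takes half the bytes per component and changes a dot-product score by a small rounding error.

```python
import struct


def to_fp16(vector):
    """Round each component to IEEE half precision and return Python floats."""
    fmt = f"<{len(vector)}e"  # 'e' is the 2-byte half-precision format
    return list(struct.unpack(fmt, struct.pack(fmt, *vector)))


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Made-up example vectors (assumption, not taken from any dataset).
doc = [0.12345678, -0.5, 0.3333333, 0.9]
query = [0.25, 0.5, -0.75, 1.0]

doc_fp16 = to_fp16(doc)

bytes_fp32 = struct.calcsize(f"<{len(doc)}f")  # 4 bytes per component
bytes_fp16 = struct.calcsize(f"<{len(doc)}e")  # 2 bytes per component

# Reference score uses Python floats (double precision) as the baseline.
score_error = abs(dot(query, doc) - dot(query, doc_fp16))
```

The memory saving is exact (half the bytes), while the score error depends on the data; whether that error is acceptable for retrieval quality should be checked on your own evaluation set.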

3. Top-k Accuracy Metric, which is commonly used in the DPR repository by Facebook!

The DPR repository evaluates retrieval models using top-k retriever accuracy. You can now compute top-k accuracy with the BEIR repository:

top_k_accuracy = retriever.evaluate_custom(qrels, results, retriever.k_values, metric="top_k_accuracy")
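
For intuition, the metric can be sketched in plain Python as the fraction of queries with at least one relevant document among the top k results. This is an illustrative re-implementation under that assumption, not the BEIR source; the function name `top_k_accuracy_sketch`, the `Accuracy@k` key format, and the toy data are hypothetical.

```python
def top_k_accuracy_sketch(qrels, results, k_values):
    """Fraction of queries with at least one relevant doc in the top k.

    qrels:   {query_id: {doc_id: relevance_label}}
    results: {query_id: {doc_id: retrieval_score}}
    """
    totals = {f"Accuracy@{k}": 0.0 for k in k_values}
    if not results or not k_values:
        return totals

    k_max = max(k_values)
    for query_id, doc_scores in results.items():
        # Rank documents by descending score and keep only what we need.
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k_max]
        relevant = {
            doc_id
            for doc_id, label in qrels.get(query_id, {}).items()
            if label > 0
        }
        for k in k_values:
            if relevant.intersection(ranked[:k]):
                totals[f"Accuracy@{k}"] += 1.0

    return {name: total / len(results) for name, total in totals.items()}


# Toy data (made up): q1 is answered at rank 1, q2 only at rank 2.
toy_qrels = {"q1": {"d1": 1}, "q2": {"d3": 1}}
toy_results = {
    "q1": {"d1": 0.9, "d2": 0.5},
    "q2": {"d1": 0.8, "d3": 0.7},
}
toy_scores = top_k_accuracy_sketch(toy_qrels, toy_results, [1, 2])
```

Unlike recall, this counts a query as a hit as soon as any one relevant document appears in the top k, no matter how many relevant documents exist.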

4. Sorting of corpus documents by text length before encoding using a dense retriever!

We now sort the corpus documents by length, longest first. This has two advantages:
1. Why sort? Texts of similar length are encoded in the same batch, which speeds up corpus encoding.
2. Why sort longest to shortest? The maximum GPU memory requirement is reached at the start, so if an OOM error occurs, it occurs at the beginning.
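
The idea can be sketched as follows. This is a minimal illustration, assuming a corpus shaped like `{doc_id: {"title": ..., "text": ...}}` and measuring length in characters; the helper names `sort_corpus_ids_by_length` and `make_batches` are hypothetical and not part of the BEIR API.

```python
def sort_corpus_ids_by_length(corpus):
    """Return doc ids ordered from longest to shortest (title + text)."""
    return sorted(
        corpus,
        key=lambda doc_id: len(
            corpus[doc_id].get("title", "") + corpus[doc_id].get("text", "")
        ),
        reverse=True,
    )


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Toy corpus (made up) with clearly different document lengths.
toy_corpus = {
    "short": {"title": "", "text": "tiny"},
    "long": {"title": "A title", "text": "a much longer passage of text " * 4},
    "medium": {"title": "", "text": "a medium sized passage"},
}

ordered_ids = sort_corpus_ids_by_length(toy_corpus)
id_batches = make_batches(ordered_ids, batch_size=2)
```

Because the first batch holds the longest documents, it needs the most padding-free memory, so an encoder that survives the first batch is unlikely to run out of memory later.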

5. FEVER dataset training qrels, problems with doc-ids with special characters now fixed!

There were issues with the training qrels in the FEVER dataset. Doc-ids with special characters, e.g. Zlatan_Ibrahimović or Beyoncé, had the wrong special characters in qrels/train.tsv. I fixed these manually, and no similar issues remain in the corpus.
New md5hash for the fever.zip dataset: 5a818580227bfb4b35bb6fa46d9b6c03.
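
To check that a downloaded copy matches the new hash, a small standard-library sketch like the one below can help. The helper names `md5_of_file` and `matches_expected` are hypothetical, and the local path you pass in is your own; only the hash value comes from this release.

```python
import hashlib

# MD5 hash of fever.zip as stated in this release.
EXPECTED_FEVER_MD5 = "5a818580227bfb4b35bb6fa46d9b6c03"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_FEVER_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("fever.zip")` returning `False` would suggest you still have the older archive and should download it again.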
