Please first follow the Pyserini BM25 retrieval guide to obtain our reranking candidates.
Then, download the file containing the training triples and uncompress it:

```bash
wget https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.2.tsv.gz -P collections/msmarco-passage/
gzip -d collections/msmarco-passage/qidpidtriples.train.full.2.tsv.gz
```

Next, we use `collections/msmarco-ltr-passage/` as the working directory for the preprocessed data:

```bash
mkdir collections/msmarco-ltr-passage/
```
```bash
python scripts/ltr_msmarco/convert_queries.py \
  --input collections/msmarco-passage/queries.eval.small.tsv \
  --output collections/msmarco-ltr-passage/queries.eval.small.json

python scripts/ltr_msmarco/convert_queries.py \
  --input collections/msmarco-passage/queries.dev.small.tsv \
  --output collections/msmarco-ltr-passage/queries.dev.small.json

python scripts/ltr_msmarco/convert_queries.py \
  --input collections/msmarco-passage/queries.train.tsv \
  --output collections/msmarco-ltr-passage/queries.train.json
```
The above scripts convert each query into a JSON object with `text`, `text_unlemm`, `raw`, and `text_bert_tok` fields. The first two invocations take about a minute each; the third takes considerably longer (~1.5 h).
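To make the four fields concrete, here is a minimal sketch of what one converted query record looks like. The record contents below are illustrative assumptions (the script's actual tokenization will differ); only the field names come from the description above.

```python
import json

# Hypothetical example of one converted query; the tokenized values are
# made up for illustration, not the real output of convert_queries.py.
record = {
    "id": "1048585",
    "text": "what be paula deen brother",            # lemmatized text
    "text_unlemm": "what is paula deen brother",     # unlemmatized text
    "raw": "what is paula deen's brother",           # original query string
    "text_bert_tok": "what is paula deen ' s brother",  # BERT tokens, joined
}

# Records are typically serialized one JSON object per line.
line = json.dumps(record)
parsed = json.loads(line)
print(sorted(parsed))  # shows the field names, including the four above
```

Each downstream component reads the field it needs: IBM translation features use the (un)lemmatized text, while BERT-derived features use `text_bert_tok`.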
Run the following command to download the pre-built index into the cache:

```bash
python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('msmarco-passage-ltr')"
```
Note that you can also build the index from scratch by following this guide.
Next, download the pretrained IBM models:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/model-ltr-ibm.tar.gz -P collections/msmarco-ltr-passage/
tar -xzvf collections/msmarco-ltr-passage/model-ltr-ibm.tar.gz -C collections/msmarco-ltr-passage/
```
Now train the LTR model:

```bash
python scripts/ltr_msmarco/train_ltr_model.py \
  --index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3
```
Compare the metrics printed at the end of your run with the values below for a quick sanity check:

```
recall@10:0.48367956064947465
recall@20:0.5796442215854822
recall@50:0.683966093600764
recall@100:0.7545964660936009
recall@200:0.8033428844317098
recall@500:0.8454512893982808
recall@1000:0.8573424068767909
Total training time: XXXX s
Done!
```
Note that the numbers may vary due to the randomness of LambdaRank; as long as your outputs are close to these values, training completed correctly.
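For reference, the recall@k figures above can be understood with a small self-contained sketch. This is a generic definition of the metric under binary relevance labels, not the evaluation code the training script actually uses:

```python
def recall_at_k(ranked_labels, num_relevant, k):
    """recall@k = (# relevant passages in the top k) / (# relevant overall).

    ranked_labels: 0/1 relevance of the reranked candidates, best score first.
    num_relevant: total number of relevant passages for the query.
    """
    if num_relevant == 0:
        return 0.0
    return sum(ranked_labels[:k]) / num_relevant

# One query with 2 relevant passages; only one lands in the top 10.
labels = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(recall_at_k(labels, num_relevant=2, k=10))  # → 0.5
print(recall_at_k(labels, num_relevant=2, k=20))  # → 1.0
```

The reported values are these per-query recalls averaged over all dev queries.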
The training script saves the model under `runs/`, with the run date in the file name. You can pass this file as the `--model` parameter for reranking. The number of negative samples used in training can be changed with `--neg-sample` (default: 10).
By default, the script trains a model that optimizes MRR@10. You can change `mrr_at_10` in this function (and at the other location indicated) to `recall_at_20` to train a model that optimizes recall@20 instead. You can also define your own function in the same format and update the corresponding places mentioned above to use a different optimization objective.
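To illustrate what swapping the objective means, here is a hedged sketch of two interchangeable ranking metrics. The function names match the ones mentioned above, but the actual signatures and plumbing in `train_ltr_model.py` may differ; treat this as a shape to follow, not the script's real code:

```python
def mrr_at_10(ranked_labels):
    """Reciprocal rank of the first relevant hit within the top 10, else 0."""
    for rank, label in enumerate(ranked_labels[:10], start=1):
        if label:
            return 1.0 / rank
    return 0.0

def recall_at_20(ranked_labels, num_relevant):
    """Fraction of all relevant passages retrieved in the top 20."""
    if num_relevant == 0:
        return 0.0
    return sum(ranked_labels[:20]) / num_relevant

# First relevant passage at rank 3; a second one at rank 21 (outside the cutoff).
labels = [0, 0, 1, 0] + [0] * 16 + [1]
print(mrr_at_10(labels))        # first hit at rank 3 → 1/3
print(recall_at_20(labels, 2))  # 1 of 2 relevant in the top 20 → 0.5
```

Any custom objective plugged in this way should likewise map a ranked list of relevance labels to a single score that LambdaRank can maximize.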
Reproduction Log

- Results reproduced by @Dahlia-Chehata on 2021-07-18 (commit a6b6545)