Doubt Regarding Retrieving Documents in Stemmed Version #937

souravsaha · 2022-01-12T12:59:53Z

Hello,
I may be asking something very obvious. As far as I can see, the msmarco-passage collection is not stemmed, and it still contains stop words. My requirement is simple: I want to use BM25 to retrieve documents, but over stemmed and analyzed text.
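
To make the requirement concrete, here is a toy sketch of what I mean by "stemmed and analyzed" text. The stop-word list, the suffix rules, and the names `toy_stem` and `analyze` are made up for illustration; they are not Pyserini or Lucene APIs, and a real setup would use a proper stemmer such as Porter instead of this crude suffix stripping.

```python
import re

# Tiny illustrative stop-word list (an assumption, not Lucene's default set).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Crude suffix rules for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("Retrieving the stemmed documents"))
# prints ['retriev', 'stemm', 'document']
```

The same analysis would have to be applied to both the passages and the queries, otherwise the stemmed query terms would not match the indexed terms.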

One naive approach is to stem the msmarco-passage corpus, remove the stop words, and then index the result. Do you have any other solutions or workarounds?
Are the BM25 scores reported in the survey also not computed over a stemmed version of the index? I mean, for first-stage retrieval (BM25, LMDIR, etc.) we need to use a stemmed version, right? Or do you maintain two versions of the index? Thanks a lot for your help.