
LoCo Benchmark - BM25 & Insights #23

Open

calpt opened this issue Feb 7, 2024 · 9 comments
@calpt

calpt commented Feb 7, 2024

Hey, thanks for sharing this very interesting work!

I was interested in the recent LoCo benchmark composed for long-context retrieval and thought it would be useful to first have results for a very simple lexical baseline, to put the scores in the blog post into context. As this was not yet done there, I ran BM25 (via Elasticsearch) on all benchmark tasks based on your eval script. Full results, in comparison to the best-performing M2-BERT-32768 (80M), are below (NDCG@10 for all).

BM25

| Retrieval Encoders | Tau Scrolls Summ. Screen | Tau Scrolls Gov. Report | Tau Scrolls QMSUM | QASPER - Title to Article | QASPER - Abstract to Article | Average |
|---|---|---|---|---|---|---|
| BM25 | 97.4 | 98.7 | 59.4 | 94.0 | 99.4 | 89.8 |
| M2-BERT-32768 (80M) | 98.6 | 98.5 | 69.5 | 97.4 | 98.7 | 92.5 |

BM25 seems to be very competitive on LoCo, coming close to the best model tested in the post's evaluation and outperforming all other tested embedding models. Thus, lexical overlap between queries and correct documents seems to be very high on the benchmark tasks.

QMSum Analysis

Looking a bit closer at the results, we can see that for 4 of 5 tasks, NDCG@10 is well above 90, meaning that BM25 is able to retrieve the correct documents nearly perfectly. The only exception is QMSum, so I looked into its data a bit more closely:

Originally, QMSum is a summarization dataset consisting of three parts: a corpus of 232 long meeting transcripts, a set of 272 questions, and 272 query-based summaries of the transcripts. In the tau/scrolls format, each query and its transcript are joined together in the "input" field, whereas the summary is given in the "output" field. This gives 272 input-output pairs. LoCo now simply uses "output" as the query and "input" as the document, giving 272 queries and 272 documents.
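For reference, a minimal sketch of how I understand this construction, assuming the tau/scrolls "qmsum" validation split loaded via Hugging Face datasets (the BEIR-style dicts and the Query_*/Passage_* IDs are just my own bookkeeping):

```python
# Sketch of the query/document construction described above.
# Assumption: the validation split is used, since "output" is hidden in the test split.
from datasets import load_dataset

qmsum = load_dataset("tau/scrolls", "qmsum", split="validation")

queries, corpus, qrels = {}, {}, {}
for i, row in enumerate(qmsum):
    qid, did = f"Query_{i}", f"Passage_{i}"
    queries[qid] = row["output"]                       # query-based summary -> query
    corpus[did] = {"title": "", "text": row["input"]}  # "<question>\n<transcript>" -> document
    qrels[qid] = {did: 1}                              # each query has exactly one relevant document
```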

This means that in the LoCo document corpus for QMSum, multiple documents are based on the same long meeting transcript, each paired with a different question. E.g., the first 4 documents are:

Passage_0 -> What was agreed upon on sample transcripts? Professor E: So . OK . Doesn't look like it crashed . That's great ...
Passage_1 -> What was said on speech overlap? Professor E: So . OK . Doesn't look like it crashed . That's great ...
Passage_2 -> What's the current status of recordings and transcriptions? Professor E: So . OK . Doesn't look like it crashed . That's great ...
Passage_3 -> What was the future of data collection? Professor E: So . OK . Doesn't look like it crashed . That's great ...

The truncated part is identical in all four, meaning that the overwhelming majority of each document (9,748 words on average) is identical apart from the question stated in the first few words. Therefore, only these first few words are relevant for distinguishing between the documents within such a group.

As an ablation, I removed the questions at the start of all documents, "merged" the resulting identical documents into one, and then ran BM25 again. This improves NDCG@10 to 78.7.
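Roughly, the ablation looks like this (the assumption that the question and transcript are separated by the first newline is mine; adjust if the actual delimiter differs):

```python
# Sketch of the ablation: strip the leading question from each document, collapse
# documents whose remaining transcripts are identical, and remap the qrels.
def strip_question(doc_text: str) -> str:
    # Assumption: everything up to the first newline is the question.
    _, _, transcript = doc_text.partition("\n")
    return transcript.strip()

def merge_corpus(corpus: dict, qrels: dict):
    transcript_to_id, new_corpus, remap = {}, {}, {}
    for did, doc in corpus.items():
        transcript = strip_question(doc["text"])
        if transcript not in transcript_to_id:
            new_id = f"Merged_{len(new_corpus)}"
            transcript_to_id[transcript] = new_id
            new_corpus[new_id] = {"title": "", "text": transcript}
        remap[did] = transcript_to_id[transcript]
    # Point each query's relevance judgment at the merged document ID.
    new_qrels = {
        qid: {remap[did]: rel for did, rel in rels.items()}
        for qid, rels in qrels.items()
    }
    return new_corpus, new_qrels
```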


Just wanted to share these quick insights into the LoCo benchmark; maybe this is useful to someone!

@DanFu09
Collaborator

DanFu09 commented Feb 7, 2024

Interesting, this is a really great analysis! We also noticed this and have been working on an update to the benchmark (LoCoV1). We haven't put it out yet but will do so soon (and add this as a great baseline).

CC @jonsaadfalcon

@jonsaadfalcon
Collaborator

Thank you for sharing @calpt! If you have an evaluation script for BM25 available, I'd love to take a look and try it out on our new evaluation datasets.

@DanFu09
Collaborator

DanFu09 commented Feb 8, 2024

+1, would love to see the script @calpt! The scores are a good bit higher than when we ran BM25 internally, so would love to see if we did something wrong!

@calpt
Author

calpt commented Feb 8, 2024

Sure, I basically just took your loco_eval.py script, removed everything but the data loading, plugged in the BM25 implementation & evaluation from BEIR (roughly like this: https://gist.github.com/calpt/56d0d47724a061c4a7bd4a9a8fd990d2), and spun up a local ES Docker container.
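In case it helps, the core of it is roughly this; a sketch assuming a local Elasticsearch instance and queries/corpus/qrels dicts in BEIR format (the index name is arbitrary; see the gist for the actual code):

```python
# Rough sketch of the BM25 evaluation via BEIR's Elasticsearch-based lexical search.
from beir.retrieval.search.lexical import BM25Search
from beir.retrieval.evaluation import EvaluateRetrieval

bm25 = BM25Search(index_name="loco-task", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(bm25)

results = retriever.retrieve(corpus, queries)  # BM25 ranking per query
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])
```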

Looking forward to LoCo v1!

@DanFu09
Collaborator

DanFu09 commented Feb 8, 2024 via email

@mahjongmen

Hey @DanFu09, would love to know if you have an update on this!

Our team at Cohere will likely report on an adjusted version of QMSum (what @calpt described above).

@DanFu09
Collaborator

DanFu09 commented Mar 30, 2024 via email

@iNeil77

iNeil77 commented May 13, 2024

Hello @DanFu09! I found this benchmark quite exciting and was wondering if you got the chance to upload the newer version to HuggingFace.

@DanFu09
Collaborator

DanFu09 commented May 20, 2024

@iNeil77 here you go, Jon's tweet and blog have links: https://x.com/JonSaadFalcon/status/1792623213698232808
