
LoCo Benchmark - BM25 & Insights #23

Open

calpt opened this issue Feb 7, 2024 · 9 comments
@calpt

calpt commented Feb 7, 2024

Hey, thanks for sharing this very interesting work!

I was interested in the recent LoCo benchmark composed for long-context retrieval and thought it would be useful to first have results for a very simple lexical baseline, to put the scores in the blog post into context. As this was not yet done there, I ran BM25 (via Elasticsearch) on all benchmark tasks based on your eval script. Full results, in comparison to the best-performing M2-BERT-32768 (80M), are below (NDCG@10 for all).

BM25

| Retrieval Encoders | Tau Scrolls Summ. Screen | Tau Scrolls Gov. Report | Tau Scrolls QMSUM | QASPER - Title to Article | QASPER - Abstract to Article | Average |
|---|---|---|---|---|---|---|
| BM25 | 97.4 | 98.7 | 59.4 | 94.0 | 99.4 | 89.8 |
| M2-BERT-32768 (80M) | 98.6 | 98.5 | 69.5 | 97.4 | 98.7 | 92.5 |

BM25 seems to be very competitive on LoCo, coming close to the best model tested in the post's evaluation and outperforming all other tested embedding models. Thus, lexical overlap between queries and correct documents seems to be very high on the benchmark tasks.

QMSum Analysis

Looking a bit closer at the results, we can see that for 4 of 5 tasks, NDCG@10 is well above 90, meaning that BM25 is able to retrieve the correct documents nearly perfectly. The only exception is QMSum, so I looked into its data a bit more closely:

Originally, QMSum is a summarization dataset consisting of three parts: a corpus of 232 long meeting transcripts, a set of 272 questions, and 272 query-based summaries of the transcripts. In the tau/scrolls format, each query and its transcript are joined together in the "input" field, whereas the summary is given in the "output" field. This gives 272 input-output pairs. LoCo now simply uses "output" as the query and "input" as the document, giving 272 queries and 272 documents.
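For reference, a minimal sketch of how I understand this construction, assuming the tau/scrolls "qmsum" validation split loaded via Hugging Face datasets (the BEIR-style dicts and the Query_*/Passage_* IDs are just my own bookkeeping):

```python
# Sketch of the query/document construction described above.
# Assumption: the validation split is used, since "output" is hidden in the test split.
from datasets import load_dataset

qmsum = load_dataset("tau/scrolls", "qmsum", split="validation")

queries, corpus, qrels = {}, {}, {}
for i, row in enumerate(qmsum):
    qid, did = f"Query_{i}", f"Passage_{i}"
    queries[qid] = row["output"]                       # query-based summary -> query
    corpus[did] = {"title": "", "text": row["input"]}  # "<question>\n<transcript>" -> document
    qrels[qid] = {did: 1}                              # each query has exactly one relevant document
```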

This means that in the LoCo document corpus for QMSum, multiple documents are based on the same long meeting transcript, each paired with a different question. E.g., the first 4 documents are:

Passage_0 -> What was agreed upon on sample transcripts? Professor E: So . OK . Doesn't look like it crashed . That's great ...
Passage_1 -> What was said on speech overlap? Professor E: So . OK . Doesn't look like it crashed . That's great ...
Passage_2 -> What's the current status of recordings and transcriptions? Professor E: So . OK . Doesn't look like it crashed . That's great ...
Passage_3 -> What was the future of data collection? Professor E: So . OK . Doesn't look like it crashed . That's great ...

The truncated part is identical in all four, meaning that the overwhelming majority of each document (9,748 words on average) is identical apart from the question stated in the first few words. Therefore, only these first few words are relevant for distinguishing between the documents within such a group.

As an ablation, I removed the questions at the start of all documents, "merged" the resulting identical documents into one, and then ran BM25 again. This improves NDCG@10 to 78.7.
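Roughly, the ablation looks like this (the assumption that the question and transcript are separated by the first newline is mine; adjust if the actual delimiter differs):

```python
# Sketch of the ablation: strip the leading question from each document, collapse
# documents whose remaining transcripts are identical, and remap the qrels.
def strip_question(doc_text: str) -> str:
    # Assumption: everything up to the first newline is the question.
    _, _, transcript = doc_text.partition("\n")
    return transcript.strip()

def merge_corpus(corpus: dict, qrels: dict):
    transcript_to_id, new_corpus, remap = {}, {}, {}
    for did, doc in corpus.items():
        transcript = strip_question(doc["text"])
        if transcript not in transcript_to_id:
            new_id = f"Merged_{len(new_corpus)}"
            transcript_to_id[transcript] = new_id
            new_corpus[new_id] = {"title": "", "text": transcript}
        remap[did] = transcript_to_id[transcript]
    # Point each query's relevance judgment at the merged document ID.
    new_qrels = {
        qid: {remap[did]: rel for did, rel in rels.items()}
        for qid, rels in qrels.items()
    }
    return new_corpus, new_qrels
```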


Just wanted to share these quick insights into the LoCo benchmark; maybe this is useful to someone!

@DanFu09
Collaborator

DanFu09 commented Feb 7, 2024

Interesting, this is a really great analysis! We also noticed this and have been working on an update to the benchmark (LoCoV1). We haven't put it out yet but will do so soon (and add this as a great baseline).

CC @jonsaadfalcon

@jonsaadfalcon
Collaborator

Thank you for sharing @calpt! If you have an evaluation script for BM25 available, I'd love to take a look and try it out on our new evaluation datasets.

@DanFu09
Collaborator

DanFu09 commented Feb 8, 2024

+1, would love to see the script @calpt! The scores are a good bit higher than when we ran BM25 internally, so would love to see if we did something wrong!

@calpt
Author

calpt commented Feb 8, 2024

Sure, I basically just took your loco_eval.py script, removed everything but the data loading, plugged in the BM25 implementation & evaluation from BEIR (roughly like this: https://gist.github.com/calpt/56d0d47724a061c4a7bd4a9a8fd990d2), and spun up a local ES Docker container.
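In case it helps, the core of it is roughly this; a sketch assuming a local Elasticsearch instance and queries/corpus/qrels dicts in BEIR format (the index name is arbitrary; see the gist for the actual code):

```python
# Rough sketch of the BM25 evaluation via BEIR's Elasticsearch-based lexical search.
from beir.retrieval.search.lexical import BM25Search
from beir.retrieval.evaluation import EvaluateRetrieval

bm25 = BM25Search(index_name="loco-task", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(bm25)

results = retriever.retrieve(corpus, queries)  # BM25 ranking per query
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])
```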

Looking forward to LoCo v1!

@DanFu09
Collaborator

DanFu09 commented Feb 8, 2024 via email

@mahjongmen

Hey @DanFu09, would love to know if you have an update on this!

Our team at Cohere will likely report on an adjusted version of QMSum (what @calpt described above).

@DanFu09
Collaborator

DanFu09 commented Mar 30, 2024 via email

@iNeil77

iNeil77 commented May 13, 2024

Hello @DanFu09! I found this benchmark quite exciting and was wondering if you got the chance to upload the newer version to HuggingFace.

@DanFu09
Collaborator

DanFu09 commented May 20, 2024

@iNeil77 here you go, Jon's tweet and blog have links: https://x.com/JonSaadFalcon/status/1792623213698232808
