Add OpenAI text-embedding-3-large msmarco 2cr (#1863)
manveertamber authored Apr 14, 2024
1 parent 2445f15 commit 184a212
Showing 8 changed files with 219 additions and 31 deletions.
169 changes: 139 additions & 30 deletions docs/2cr/msmarco-v1-passage.html
@@ -5941,11 +5941,120 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</div>
<!-- Tabs content -->

</div></td>
</tr>
<!-- Condition: OpenAI text-embedding-3-large: pre-encoded queries -->
<tr class="accordion-toggle collapsed" id="row52" data-toggle="collapse" data-parent="#row52" href="#collapse52">
<td class="expand-button"></td>
<td style="min-width: 85px"></td>
<td style="min-width: 400px">OpenAI text-embedding-3-large: pre-encoded queries</td>
<td>0.5259</td>
<td>0.7173</td>
<td>0.8991</td>
<td></td>
<td>0.5134</td>
<td>0.7163</td>
<td>0.8884</td>
<td></td>
<td>0.3342</td>
<td>0.9885</td>
</tr>
<tr class="hide-table-padding">
<td></td>
<td colspan="11">
<div id="collapse52" class="collapse in p-3">

<!-- Tabs navs -->
<ul class="nav nav-tabs mb-3" id="row52-tabs" role="tablist">
<li class="nav-item" role="presentation">
<a class="nav-link active" id="row52-tab1-header" data-mdb-toggle="tab" href="#row52-tab1" role="tab" aria-controls="row52-tab1" aria-selected="true" style="text-transform:none">TREC 2019</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row52-tab2-header" data-mdb-toggle="tab" href="#row52-tab2" role="tab" aria-controls="row52-tab2" aria-selected="false" style="text-transform:none">TREC 2020</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row52-tab3-header" data-mdb-toggle="tab" href="#row52-tab3" role="tab" aria-controls="row52-tab3" aria-selected="false" style="text-transform:none">dev</a>
</li>
</ul>
<!-- Tabs navs -->

<!-- Tabs content -->
<div class="tab-content" id="row52-content">
<div class="tab-pane fade show active" id="row52-tab1" role="tabpanel" aria-labelledby="row52-tab1">
Command to generate run on TREC 2019 queries:

<blockquote class="mycode">
<pre><code>python -m pyserini.search.faiss \
--threads 16 --batch-size 512 \
--index msmarco-v1-passage.openai-text-embedding-3-large \
--topics dl19-passage --encoded-queries openai-text-embedding-3-large-dl19-passage \
--output run.msmarco-v1-passage.openai-text-embedding-3-large.dl19.txt
</code></pre></blockquote>
Evaluation commands:

<blockquote class="mycode">
<pre><code>python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage \
run.msmarco-v1-passage.openai-text-embedding-3-large.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage \
run.msmarco-v1-passage.openai-text-embedding-3-large.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage \
run.msmarco-v1-passage.openai-text-embedding-3-large.dl19.txt
</code></pre>
</blockquote>

</div>
<div class="tab-pane fade" id="row52-tab2" role="tabpanel" aria-labelledby="row52-tab2">
Command to generate run on TREC 2020 queries:

<blockquote class="mycode">
<pre><code>python -m pyserini.search.faiss \
--threads 16 --batch-size 512 \
--index msmarco-v1-passage.openai-text-embedding-3-large \
--topics dl20 --encoded-queries openai-text-embedding-3-large-dl20 \
--output run.msmarco-v1-passage.openai-text-embedding-3-large.dl20.txt
</code></pre></blockquote>
Evaluation commands:

<blockquote class="mycode">
<pre><code>python -m pyserini.eval.trec_eval -c -l 2 -m map dl20-passage \
run.msmarco-v1-passage.openai-text-embedding-3-large.dl20.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl20-passage \
run.msmarco-v1-passage.openai-text-embedding-3-large.dl20.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl20-passage \
run.msmarco-v1-passage.openai-text-embedding-3-large.dl20.txt
</code></pre>
</blockquote>

</div>
<div class="tab-pane fade" id="row52-tab3" role="tabpanel" aria-labelledby="row52-tab3">
Command to generate run on dev queries:

<blockquote class="mycode">
<pre><code>python -m pyserini.search.faiss \
--threads 16 --batch-size 512 \
--index msmarco-v1-passage.openai-text-embedding-3-large \
--topics msmarco-passage-dev-subset --encoded-queries openai-text-embedding-3-large-msmarco-passage-dev-subset \
--output run.msmarco-v1-passage.openai-text-embedding-3-large.dev.txt
</code></pre></blockquote>
Evaluation commands:

<blockquote class="mycode">
<pre><code>python -m pyserini.eval.trec_eval -c -M 10 -m recip_rank msmarco-passage-dev-subset \
run.msmarco-v1-passage.openai-text-embedding-3-large.dev.txt
python -m pyserini.eval.trec_eval -c -m recall.1000 msmarco-passage-dev-subset \
run.msmarco-v1-passage.openai-text-embedding-3-large.dev.txt
</code></pre>
</blockquote>

</div>
</div>
<!-- Tabs content -->

</div></td>
</tr>
<tr><td style="border-bottom: 0"></td></tr>
<!-- Condition: cosDPR-distil: PyTorch -->
<tr class="accordion-toggle collapsed" id="row52" data-toggle="collapse" data-parent="#row52" href="#collapse52">
<tr class="accordion-toggle collapsed" id="row53" data-toggle="collapse" data-parent="#row53" href="#collapse53">
<td class="expand-button"></td>
<td style="min-width: 85px">[13]</td>
<td style="min-width: 400px">cosDPR-distil: PyTorch</td>
@@ -5963,25 +6072,25 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
<tr class="hide-table-padding">
<td></td>
<td colspan="11">
<div id="collapse52" class="collapse in p-3">
<div id="collapse53" class="collapse in p-3">

<!-- Tabs navs -->
<ul class="nav nav-tabs mb-3" id="row52-tabs" role="tablist">
<ul class="nav nav-tabs mb-3" id="row53-tabs" role="tablist">
<li class="nav-item" role="presentation">
<a class="nav-link active" id="row52-tab1-header" data-mdb-toggle="tab" href="#row52-tab1" role="tab" aria-controls="row52-tab1" aria-selected="true" style="text-transform:none">TREC 2019</a>
<a class="nav-link active" id="row53-tab1-header" data-mdb-toggle="tab" href="#row53-tab1" role="tab" aria-controls="row53-tab1" aria-selected="true" style="text-transform:none">TREC 2019</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row52-tab2-header" data-mdb-toggle="tab" href="#row52-tab2" role="tab" aria-controls="row52-tab2" aria-selected="false" style="text-transform:none">TREC 2020</a>
<a class="nav-link" id="row53-tab2-header" data-mdb-toggle="tab" href="#row53-tab2" role="tab" aria-controls="row53-tab2" aria-selected="false" style="text-transform:none">TREC 2020</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row52-tab3-header" data-mdb-toggle="tab" href="#row52-tab3" role="tab" aria-controls="row52-tab3" aria-selected="false" style="text-transform:none">dev</a>
<a class="nav-link" id="row53-tab3-header" data-mdb-toggle="tab" href="#row53-tab3" role="tab" aria-controls="row53-tab3" aria-selected="false" style="text-transform:none">dev</a>
</li>
</ul>
<!-- Tabs navs -->

<!-- Tabs content -->
<div class="tab-content" id="row52-content">
<div class="tab-pane fade show active" id="row52-tab1" role="tabpanel" aria-labelledby="row52-tab1">
<div class="tab-content" id="row53-content">
<div class="tab-pane fade show active" id="row53-tab1" role="tabpanel" aria-labelledby="row53-tab1">
Command to generate run on TREC 2019 queries:

<blockquote class="mycode">
@@ -6005,7 +6114,7 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</blockquote>

</div>
<div class="tab-pane fade" id="row52-tab2" role="tabpanel" aria-labelledby="row52-tab2">
<div class="tab-pane fade" id="row53-tab2" role="tabpanel" aria-labelledby="row53-tab2">
Command to generate run on TREC 2020 queries:

<blockquote class="mycode">
@@ -6029,7 +6138,7 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</blockquote>

</div>
<div class="tab-pane fade" id="row52-tab3" role="tabpanel" aria-labelledby="row52-tab3">
<div class="tab-pane fade" id="row53-tab3" role="tabpanel" aria-labelledby="row53-tab3">
Command to generate run on dev queries:

<blockquote class="mycode">
@@ -6058,7 +6167,7 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</tr>
<tr><td style="border-bottom: 0"></td></tr>
<!-- Condition: BGE-base-en-v1.5: PyTorch -->
<tr class="accordion-toggle collapsed" id="row53" data-toggle="collapse" data-parent="#row53" href="#collapse53">
<tr class="accordion-toggle collapsed" id="row54" data-toggle="collapse" data-parent="#row54" href="#collapse54">
<td class="expand-button"></td>
<td style="min-width: 85px">[14]</td>
<td style="min-width: 400px">BGE-base-en-v1.5: PyTorch</td>
@@ -6076,25 +6185,25 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
<tr class="hide-table-padding">
<td></td>
<td colspan="11">
<div id="collapse53" class="collapse in p-3">
<div id="collapse54" class="collapse in p-3">

<!-- Tabs navs -->
<ul class="nav nav-tabs mb-3" id="row53-tabs" role="tablist">
<ul class="nav nav-tabs mb-3" id="row54-tabs" role="tablist">
<li class="nav-item" role="presentation">
<a class="nav-link active" id="row53-tab1-header" data-mdb-toggle="tab" href="#row53-tab1" role="tab" aria-controls="row53-tab1" aria-selected="true" style="text-transform:none">TREC 2019</a>
<a class="nav-link active" id="row54-tab1-header" data-mdb-toggle="tab" href="#row54-tab1" role="tab" aria-controls="row54-tab1" aria-selected="true" style="text-transform:none">TREC 2019</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row53-tab2-header" data-mdb-toggle="tab" href="#row53-tab2" role="tab" aria-controls="row53-tab2" aria-selected="false" style="text-transform:none">TREC 2020</a>
<a class="nav-link" id="row54-tab2-header" data-mdb-toggle="tab" href="#row54-tab2" role="tab" aria-controls="row54-tab2" aria-selected="false" style="text-transform:none">TREC 2020</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row53-tab3-header" data-mdb-toggle="tab" href="#row53-tab3" role="tab" aria-controls="row53-tab3" aria-selected="false" style="text-transform:none">dev</a>
<a class="nav-link" id="row54-tab3-header" data-mdb-toggle="tab" href="#row54-tab3" role="tab" aria-controls="row54-tab3" aria-selected="false" style="text-transform:none">dev</a>
</li>
</ul>
<!-- Tabs navs -->

<!-- Tabs content -->
<div class="tab-content" id="row53-content">
<div class="tab-pane fade show active" id="row53-tab1" role="tabpanel" aria-labelledby="row53-tab1">
<div class="tab-content" id="row54-content">
<div class="tab-pane fade show active" id="row54-tab1" role="tabpanel" aria-labelledby="row54-tab1">
Command to generate run on TREC 2019 queries:

<blockquote class="mycode">
@@ -6120,7 +6229,7 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</blockquote>

</div>
<div class="tab-pane fade" id="row53-tab2" role="tabpanel" aria-labelledby="row53-tab2">
<div class="tab-pane fade" id="row54-tab2" role="tabpanel" aria-labelledby="row54-tab2">
Command to generate run on TREC 2020 queries:

<blockquote class="mycode">
@@ -6146,7 +6255,7 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</blockquote>

</div>
<div class="tab-pane fade" id="row53-tab3" role="tabpanel" aria-labelledby="row53-tab3">
<div class="tab-pane fade" id="row54-tab3" role="tabpanel" aria-labelledby="row54-tab3">
Command to generate run on dev queries:

<blockquote class="mycode">
@@ -6177,7 +6286,7 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</tr>
<tr><td style="border-bottom: 0"></td></tr>
<!-- Condition: Cohere Embed English v3.0: pre-encoded queries -->
<tr class="accordion-toggle collapsed" id="row54" data-toggle="collapse" data-parent="#row54" href="#collapse54">
<tr class="accordion-toggle collapsed" id="row55" data-toggle="collapse" data-parent="#row55" href="#collapse55">
<td class="expand-button"></td>
<td style="min-width: 85px"></td>
<td style="min-width: 400px">Cohere Embed English v3.0: pre-encoded queries</td>
@@ -6195,25 +6304,25 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
<tr class="hide-table-padding">
<td></td>
<td colspan="11">
<div id="collapse54" class="collapse in p-3">
<div id="collapse55" class="collapse in p-3">

<!-- Tabs navs -->
<ul class="nav nav-tabs mb-3" id="row54-tabs" role="tablist">
<ul class="nav nav-tabs mb-3" id="row55-tabs" role="tablist">
<li class="nav-item" role="presentation">
<a class="nav-link active" id="row54-tab1-header" data-mdb-toggle="tab" href="#row54-tab1" role="tab" aria-controls="row54-tab1" aria-selected="true" style="text-transform:none">TREC 2019</a>
<a class="nav-link active" id="row55-tab1-header" data-mdb-toggle="tab" href="#row55-tab1" role="tab" aria-controls="row55-tab1" aria-selected="true" style="text-transform:none">TREC 2019</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row54-tab2-header" data-mdb-toggle="tab" href="#row54-tab2" role="tab" aria-controls="row54-tab2" aria-selected="false" style="text-transform:none">TREC 2020</a>
<a class="nav-link" id="row55-tab2-header" data-mdb-toggle="tab" href="#row55-tab2" role="tab" aria-controls="row55-tab2" aria-selected="false" style="text-transform:none">TREC 2020</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row54-tab3-header" data-mdb-toggle="tab" href="#row54-tab3" role="tab" aria-controls="row54-tab3" aria-selected="false" style="text-transform:none">dev</a>
<a class="nav-link" id="row55-tab3-header" data-mdb-toggle="tab" href="#row55-tab3" role="tab" aria-controls="row55-tab3" aria-selected="false" style="text-transform:none">dev</a>
</li>
</ul>
<!-- Tabs navs -->

<!-- Tabs content -->
<div class="tab-content" id="row54-content">
<div class="tab-pane fade show active" id="row54-tab1" role="tabpanel" aria-labelledby="row54-tab1">
<div class="tab-content" id="row55-content">
<div class="tab-pane fade show active" id="row55-tab1" role="tabpanel" aria-labelledby="row55-tab1">
Command to generate run on TREC 2019 queries:

<blockquote class="mycode">
@@ -6236,7 +6345,7 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</blockquote>

</div>
<div class="tab-pane fade" id="row54-tab2" role="tabpanel" aria-labelledby="row54-tab2">
<div class="tab-pane fade" id="row55-tab2" role="tabpanel" aria-labelledby="row55-tab2">
Command to generate run on TREC 2020 queries:

<blockquote class="mycode">
@@ -6259,7 +6368,7 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</blockquote>

</div>
<div class="tab-pane fade" id="row54-tab3" role="tabpanel" aria-labelledby="row54-tab3">
<div class="tab-pane fade" id="row55-tab3" role="tabpanel" aria-labelledby="row55-tab3">
Command to generate run on dev queries:

<blockquote class="mycode">
4 changes: 4 additions & 0 deletions docs/prebuilt-indexes.md
@@ -939,6 +939,10 @@ Detailed configuration information for the pre-built indexes are stored in [`pys
<dt></dt><b><code>msmarco-v1-passage.cohere-embed-english-v3.0</code></b>
<dd>Faiss FlatIP index of the MS MARCO passage corpus encoded by Cohere Embed English v3.0
</dd>
<dt></dt><b><code>msmarco-v1-passage.openai-text-embedding-3-large</code></b>
[<a href="../pyserini/resources/index-metadata/faiss-flat.msmarco-v1-passage.openai-text-embedding-3-large.20240410.c13cd6.README.md">readme</a>]
<dd>Faiss FlatIP index of the MS MARCO passage corpus encoded by OpenAI text-embedding-3-large
</dd>
<dt></dt><b><code>msmarco-v1-doc.ance-maxp</code></b>
<dd>Faiss FlatIP index of the MS MARCO document corpus encoded by the ANCE MaxP encoder
</dd>
23 changes: 23 additions & 0 deletions pyserini/2cr/msmarco-v1-passage.yaml
@@ -1201,6 +1201,29 @@ conditions:
- MAP: 0.4938
nDCG@10: 0.6666
R@1K: 0.8919
- name: openai-text-embedding-3-large
display: "OpenAI text-embedding-3-large: pre-encoded queries"
display-html: "OpenAI text-embedding-3-large: pre-encoded queries"
display-row: ""
command: python -m pyserini.search.faiss --threads ${dense_threads} --batch-size ${dense_batch_size} --index msmarco-v1-passage.openai-text-embedding-3-large --topics $topics --encoded-queries openai-text-embedding-3-large-$topics --output $output
topics:
- topic_key: msmarco-passage-dev-subset
eval_key: msmarco-passage-dev-subset
scores:
- MRR@10: 0.3342
R@1K: 0.9885
- topic_key: dl19-passage
eval_key: dl19-passage
scores:
- MAP: 0.5259
nDCG@10: 0.7173
R@1K: 0.8991
- topic_key: dl20
eval_key: dl20-passage
scores:
- MAP: 0.5134
nDCG@10: 0.7163
R@1K: 0.8884
- name: cohere-embed-english-v3.0
display: "Cohere Embed English v3.0: pre-encoded queries"
display-html: "Cohere Embed English v3.0: pre-encoded queries"
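The `command` field in the new condition uses `$`-style placeholders (`${dense_threads}`, `${dense_batch_size}`, `$topics`, `$output`). A minimal sketch of how such a template could be expanded, assuming `string.Template`-style substitution; the helper name `expand_command` and its default values are hypothetical and chosen to match the commands shown in the generated HTML:

```python
from string import Template

# Command template copied from the YAML condition in this commit.
COMMAND = (
    "python -m pyserini.search.faiss --threads ${dense_threads} "
    "--batch-size ${dense_batch_size} "
    "--index msmarco-v1-passage.openai-text-embedding-3-large "
    "--topics $topics --encoded-queries openai-text-embedding-3-large-$topics "
    "--output $output"
)


def expand_command(topics: str, output: str, threads: int = 16, batch_size: int = 512) -> str:
    """Fill in the placeholders of the command template (hypothetical helper)."""
    return Template(COMMAND).substitute(
        dense_threads=threads,
        dense_batch_size=batch_size,
        topics=topics,
        output=output,
    )


if __name__ == "__main__":
    print(expand_command(
        "dl20",
        "run.msmarco-v1-passage.openai-text-embedding-3-large.dl20.txt",
    ))
```

Because `$topics` appears twice, the encoded-queries key is always derived from the topic key, which is why the DL20 entry is registered as `openai-text-embedding-3-large-dl20` rather than with a `-passage` suffix.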
1 change: 1 addition & 0 deletions pyserini/2cr/msmarco.py
@@ -103,6 +103,7 @@
'',
'openai-ada2',
'openai-ada2-hyde',
'openai-text-embedding-3-large',
'',
'cosdpr-distil-pytorch',
'',
30 changes: 30 additions & 0 deletions pyserini/encoded_query_info.py
@@ -515,6 +515,36 @@
"total_queries": 6980,
"downloaded": False
},
"openai-text-embedding-3-large-dl19-passage": {
"description": "TREC DL19 passage queries encoded by OpenAI text-embedding-3-large.",
"urls": [
"https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-openai-text-embedding-3-large-dl19-passage-20240410-c13cd6.tar.gz",
],
"md5": "a2e4ad9dc3288d97b77577552df9ee2b",
"size (bytes)": 541753,
"total_queries": 43,
"downloaded": False
},
"openai-text-embedding-3-large-dl20": {
"description": "TREC DL20 passage queries encoded by OpenAI text-embedding-3-large.",
"urls": [
"https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-openai-text-embedding-3-large-dl20-passage-20240410-c13cd6.tar.gz",
],
"md5": "9911fb1012ff5651f0cf832a81943967",
"size (bytes)": 2515768,
"total_queries": 200,
"downloaded": False
},
"openai-text-embedding-3-large-msmarco-passage-dev-subset": {
"description": "MS MARCO passage dev set queries encoded by OpenAI text-embedding-3-large.",
"urls": [
"https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-openai-text-embedding-3-large-msmarco-passage-dev-subset-20240410-c13cd6.tar.gz",
],
"md5": "4b0bce9c7cb0b55e49920d340924c92f",
"size (bytes)": 87687020,
"total_queries": 6980,
"downloaded": False
},
"atomic-v0.2.1-text-ViT-L-14.laion2b_s32b_b82k-validation": {
"description": "AToMiC text v0.2.1 validation set encoded by ViT-L-14.laion2b_s32b_b82k.",
"urls": [
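Each encoded-query entry records an `md5` checksum and a `size (bytes)` value for its tarball. A minimal sketch of how a downloaded archive could be checked against such an entry, using only the standard library; the helper names `md5_of_file` and `matches_entry` are hypothetical and are not Pyserini functions:

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_entry(path, entry):
    """Check a local file against the 'md5' and 'size (bytes)' fields of an entry."""
    p = Path(path)
    # Compare the cheap size check first, then the checksum.
    return p.stat().st_size == entry["size (bytes)"] and md5_of_file(p) == entry["md5"]
```

Usage would look like `matches_entry("queries.tar.gz", info["openai-text-embedding-3-large-dl19-passage"])`, where `info` stands for the dictionary this file defines and the local file name is a placeholder.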
13 changes: 13 additions & 0 deletions pyserini/prebuilt_index_info.py
@@ -3344,6 +3344,19 @@
"downloaded": False,
"texts": "msmarco-v1-passage"
},
"msmarco-v1-passage.openai-text-embedding-3-large": {
"description": "Faiss FlatIP index of the MS MARCO passage corpus encoded by OpenAI text-embedding-3-large",
"filename": "faiss-flat.msmarco-v1-passage.openai-text-embedding-3-large.20240410.c13cd6.tar.gz",
"readme": "faiss-flat.msmarco-v1-passage.openai-text-embedding-3-large.20240410.c13cd6.README.md",
"urls": [
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/faiss/faiss-flat.msmarco-v1-passage.openai-text-embedding-3-large.20240410.c13cd6.tar.gz"
],
"md5": "e52f046b1decc9bf3a55ac0ff70780d0",
"size compressed (bytes)": 87658796879,
"documents": 8841823,
"downloaded": False,
"texts": "msmarco-v1-passage"
},
"msmarco-v1-doc.ance-maxp": {
"description": "Faiss FlatIP index of the MS MARCO document corpus encoded by the ANCE MaxP encoder",
"filename": "faiss.msmarco-v1-doc.ance_maxp.20210304.b2a1b0.tar.gz",
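The new index entry lists 8,841,823 documents and a compressed size of 87,658,796,879 bytes. As a rough sanity check, the sketch below estimates the uncompressed size of the raw vectors. It assumes 3072-dimensional embeddings (the default output size of text-embedding-3-large, which is not stated in this commit) stored as 4-byte floats, and it ignores document IDs and index metadata:

```python
# Values taken from the prebuilt index entry in this commit.
DOCUMENTS = 8_841_823
COMPRESSED_BYTES = 87_658_796_879

# Assumptions, not stated in the commit: embedding width and float32 storage.
ASSUMED_DIM = 3072
BYTES_PER_FLOAT32 = 4


def raw_flat_index_bytes(num_docs: int, dim: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Size of the raw vector matrix a flat (uncompressed) index would hold."""
    return num_docs * dim * bytes_per_value


if __name__ == "__main__":
    raw = raw_flat_index_bytes(DOCUMENTS, ASSUMED_DIM)
    print(f"estimated raw vectors: {raw / 1e9:.1f} GB")
    print(f"compressed tarball:    {COMPRESSED_BYTES / 1e9:.1f} GB")
    print(f"compressed / raw:      {COMPRESSED_BYTES / raw:.2f}")
```

Under these assumptions the download is on the order of a hundred gigabytes once unpacked, so it is worth checking free disk space before fetching the index.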