
Commit 6de3f47

Merge pull request #3337 from vespa-engine/jobergum/add-test
feat(CI): add to tests for the new "Hybrid Text Search Tutorial"
2 parents fe91433 + 210b8b1 commit 6de3f47

5 files changed: +97 −48 lines

_data/sidebar.yml

Lines changed: 2 additions & 0 deletions

@@ -167,6 +167,8 @@ docs:
       url: /en/tutorials/models-hot-swap.html
     - page: Text Search
       url: /en/tutorials/text-search.html
+    - page: Hybrid Text Search
+      url: /en/tutorials/hybrid-search.html
     - page: Text Search ML
       url: /en/tutorials/text-search-ml.html
     - page: Quick Start

en/getting-started.html

Lines changed: 25 additions & 0 deletions

@@ -41,6 +41,10 @@
   A text search tutorial and introduction to text ranking with Vespa using traditional information retrieval techniques like BM25.
 </li>

+<li><a href="tutorials/hybrid-search.html">Tutorial: Hybrid Text Search</a>.
+  A search tutorial and introduction to hybrid text ranking with Vespa combining BM25 with text embedding models.
+</li>
+
 <li><a href="tutorials/text-search-ml.html">Tutorial: Improving Text Search with Machine Learning</a>.
   This tutorial builds
   on the <a href="tutorials/text-search.html">text search tutorial</a> but introduces Learning to Rank to improve relevance.
@@ -56,6 +60,11 @@
 combining retrieval over inverted index structures with vector search.
 </p>

+<strong>RAG (Retrieval-Augmented Generation)</strong>
+<p>
+Learn how to use Vespa for RAG in the <a href="llms-rag.html#">Retrieval-augmented generation (RAG) in Vespa</a> guide.
+</p>
+
 <strong>Recommendation</strong>
 <p>
 Learn how to use Vespa for content recommendation/personalization in the
@@ -81,6 +90,16 @@
 <li><a href="tensorflow.html">Ranking with TensorFlow models</a></li>
 </ul>

+<strong>Embedding Model Inference</strong>
+<p>
+Vespa supports integrating <a href="embedding.html">embedding</a> models, this avoids transfering large amounts of embedding vector data
+over the network and allows for efficient serving of embedding models.
+<ul>
+  <li><a href="embedding.html#huggingface-embedder">Huggingface Embedder</a> Use single-vector embedding models from Hugging face</li>
+  <li><a href="embedding.html#colbert-embedder">ColBERT Embedder</a> Use multi-vector embedding models</li>
+  <li><a href="embedding.html#splade-embedder">Splade Embedder</a> Use sparse learned single vector embedding models</li>
+</ul>
+
 <strong>ML Model Lifecycle</strong>
 <p>
 The <a href="tutorials/models-hot-swap.html">Models hot swap tutorial</a>
@@ -92,6 +111,12 @@
 <strong>E-Commerce Search</strong>
 <p>The <a href="use-case-shopping.html">e-commerce shopping sample application</a> demonstrates Vespa grouping,
 true in-place partial updates, custom ranking and more.</p>
+
+<strong>Examples and starting sample applications</strong>
+<p>
+There are many examples and starting applications on
+<a href="https://github.com/vespa-engine/sample-apps/">GitHub</a> and <a href="https://pyvespa.readthedocs.io/en/latest/examples.html">PyVespa examples</a>.
+</p>
 </td>
 </tr>

en/tutorials/hybrid-search.md

Lines changed: 49 additions & 46 deletions
@@ -135,10 +135,10 @@ schema doc {
     }

     field embedding type tensor<bfloat16>(v[384]) {
-        indexing: input title." ".input text | embed | attribute
-        attribute {
-            distance-metric: angular
-        }
+        indexing: input title." ".input text | embed | attribute
+        attribute {
+            distance-metric: angular
+        }
     }

     rank-profile bm25 {

(The removed and added lines above are identical in content; the change is indentation/whitespace only.)
@@ -149,7 +149,7 @@ schema doc {

     rank-profile semantic {
         inputs {
-            query(e) tensor<bfloat16>(v[384])
+            query(e) tensor<bfloat16>(v[384])
         }
         first-phase {
             expression: closeness(field, embedding)

(Again a whitespace-only change.)
@@ -170,7 +170,7 @@ The [string](../reference/schema-reference.html#string) data type represents bot
 and there are significant differences between [index and attribute](../text-matching.html#index-and-attribute). The above
 schema includes default `match` modes for `attribute` and `index` property for visibility.

-Note that we are enabling [BM25](../reference/bm25.html) for `title` and `text`.
+Note that we are enabling [BM25](../reference/bm25.html) for `title` and `text`
 by including `index: enable-bm25`. The language field is the only field that is not the NFCorpus dataset.
 We hardcode its value to "en" since the dataset is English. Using `set_language` avoids automatic language detection and uses the value when processing the other
 text fields. Read more in [linguistics](../linguistics.html).
@@ -189,9 +189,9 @@ Our `embedding` vector field is of [tensor](../tensor-user-guide.html) type with
 field embedding type tensor<bfloat16>(v[384]) {
     indexing: input title." ".input text | embed arctic | attribute
     attribute {
-        distance-metric: angular
+        distance-metric: angular
     }
-}
+}
 ```
 The `indexing` expression creates the input to the `embed` inference call (in our example the concatenation of the title and the text field). Since
 the dataset is small, we do not specify `index` which would build [HNSW](../approximate-nn-hnsw.html) datastructures for faster (but approximate) vector search. This guide uses [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) as the text embedding model. The model is

(The two - / + pairs above are whitespace-only changes.)
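Because the `embed` call runs inside Vespa at feed time, the feed payload only carries the text fields; the tutorial feeds a prepared `vespa-docs.jsonl` file with the vespa CLI (see the feed step below). As a rough sketch of the same idea through the document/v1 API — the namespace, document id, field names and field values here are illustrative assumptions, not the tutorial's data:

```python
# Hedged sketch: feed one document without any client-side embedding.
# The schema's "indexing: ... | embed arctic | attribute" computes the vector
# inside Vespa, so the payload only carries text fields.
import requests

doc = {
    "fields": {
        "doc_id": "MED-10",        # illustrative field names and values
        "title": "Example title",
        "text": "Example body text ...",
        "language": "en",
    }
}
# Namespace "tutorial" and the id scheme are assumptions; the document type "doc"
# matches the schema above.
url = "http://localhost:8080/document/v1/tutorial/doc/docid/MED-10"
response = requests.post(url, json=doc, timeout=10)
print(response.status_code, response.json())
```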
@@ -250,7 +250,7 @@ Some notes about the elements above:
 - `<container>` defines the [container cluster](../jdisc/index.html) for document, query and result processing
 - `<search>` sets up the [query endpoint](../query-api.html). The default port is 8080.
 - `<document-api>` sets up the [document endpoint](../reference/document-v1-api-reference.html) for feeding.
-- `component` with type `hugging-face-embedder` configures the embedder in the application package. This include where to fetch the model files from, the prepend
+- `component` with type `hugging-face-embedder` configures the embedder in the application package. This includes where to fetch the model files from, the prepend
 instructions, and the pooling strategy. See [huggingface-embedder](../embedding.html#huggingface-embedder) for details and other embedders supported.
 - `<content>` defines how documents are stored and searched
 - `<min-redundancy>` denotes how many copies to keep of each document.
@@ -316,38 +316,37 @@ $ vespa feed -t http://localhost:8080 vespa-docs.jsonl
 </pre>
 </div>

-On an M1, we expect output like the following:
+The output should look like this (rates may vary depending on your machine HW):

 <pre>{% highlight json%}
 {
   "feeder.operation.count": 3633,
-  "feeder.seconds": 39.723,
+  "feeder.seconds": 148.515,
   "feeder.ok.count": 3633,
-  "feeder.ok.rate": 91.459,
+  "feeder.ok.rate": 24.462,
   "feeder.error.count": 0,
   "feeder.inflight.count": 0,
-  "http.request.count": 13157,
-  "http.request.bytes": 21102792,
-  "http.request.MBps": 0.531,
+  "http.request.count": 3633,
+  "http.request.bytes": 2985517,
+  "http.request.MBps": 0.020,
   "http.exception.count": 0,
-  "http.response.count": 13157,
-  "http.response.bytes": 1532828,
-  "http.response.MBps": 0.039,
-  "http.response.error.count": 9524,
-  "http.response.latency.millis.min": 0,
-  "http.response.latency.millis.avg": 1220,
-  "http.response.latency.millis.max": 13703,
+  "http.response.count": 3633,
+  "http.response.bytes": 348320,
+  "http.response.MBps": 0.002,
+  "http.response.error.count": 0,
+  "http.response.latency.millis.min": 316,
+  "http.response.latency.millis.avg": 787,
+  "http.response.latency.millis.max": 1704,
   "http.response.code.counts": {
-    "200": 3633,
-    "429": 9524
+    "200": 3633
   }
 }{% endhighlight %}</pre>

 Notice:

 - `feeder.ok.rate` which is the throughput (Note that this step includes embedding inference). See [embedder-performance](../embedding.html#embedder-performance) for details on embedding inference performance. In this case, embedding inference is the bottleneck for overall indexing throughput.
-- `http.response.code.counts` matches with `feeder.ok.count` - The dataset has 3633 documents. The `429` are harmless. Vespa asks the client
-to slow down the feed speed because of resource contention.
+- `http.response.code.counts` matches with `feeder.ok.count` - The dataset has 3633 documents. Note that if you observe any `429` responses, these are
+harmless. Vespa asks the client to slow down the feed speed because of resource contention.


 ## Sample queries
@@ -356,14 +355,16 @@ We can now run a few sample queries to demonstrate various ways to perform searc
 <div class="pre-parent">
 <button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
 <pre data-test="exec" data-test-assert-contains="PLAIN-2">
-$ ir_datasets export beir/nfcorpus/test queries --fields query_id text |head -1
+$ ir_datasets export beir/nfcorpus/test queries --fields query_id text | head -1
 </pre>
 </div>

 <pre>
 PLAIN-2 Do Cholesterol Statin Drugs Cause Breast Cancer?
 </pre>

+If you see a pipe related error from the above command, you can safely ignore it.
+
 Here, `PLAIN-2` is the query id of the first test query. We'll use this test query to demonstrate querying Vespa.

 ### Lexical search with BM25 scoring
@@ -393,7 +394,7 @@ This query returns the following [JSON result response](../reference/default-res
   "id": "toplevel",
   "relevance": 1.0,
   "fields": {
-    "totalCount": 65
+    "totalCount": 46
   },
   "coverage": {
     "coverage": 100,
@@ -423,7 +424,7 @@ This query returns the following [JSON result response](../reference/default-res
 {% endhighlight %}</pre>

 The query retrieves and ranks `MED-10` as the most relevant document—notice the `totalCount` which is the number of documents that were retrieved for ranking
-phases. In this case, we exposed 65 documents to first-phase ranking, it is higher than our target, but also fewer than the total number of documents that match any query terms.
+phases. In this case, we exposed about 50 documents to first-phase ranking, it is higher than our target, but also fewer than the total number of documents that match any query terms.

 In the example below, we change the grammar from the default `weakAnd` to `any`, and the query matches 1780, or almost 50% of the indexed documents.
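The query that produced this response appears earlier in the tutorial and is not part of this diff. As a hedged sketch, an equivalent BM25 request from Python could look like the following; the local endpoint, the `doc` schema and the `bm25` rank profile come from the tutorial, while the use of `requests` and the exact request shape are assumptions:

```python
# Hedged sketch: query Vespa's /search/ API with the tutorial's "bm25" rank profile.
import requests

query = "Do Cholesterol Statin Drugs Cause Breast Cancer?"
response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": "select * from doc where userQuery()",  # weakAnd is the default grammar
        "query": query,
        "ranking": "bm25",
        "hits": 10,
    },
    timeout=10,
)
result = response.json()
print(result["root"]["fields"]["totalCount"])  # documents exposed to first-phase ranking
for hit in result["root"].get("children", []):
    print(hit["relevance"], hit["id"])
```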

@@ -542,7 +543,7 @@ This query returns the following [JSON result response](../reference/default-res
 }{% endhighlight %}</pre>

 The result of this vector-based search differed from the previous sparse keyword search, with a different relevant document at position 1. In this case,
-the relevance score is 0.606 and calculated by the `closeness` function in the `semantic` rank-profile.
+the relevance score is 0.606 and calculated by the `closeness` function in the `semantic` rank-profile. Note that more documents were retrieved than the `targetHits`.

 ```
 rank-profile semantic {
@@ -562,7 +563,7 @@ Note that similarity scores of embedding vectors are often optimized via contras

 ## Evaluate ranking accuracy

-The previous section demonstrated how to combine the Vespa query language with rank profiles to
+The previous section demonstrated how to combine the Vespa query language with rank profiles
 to implement two different retrieval and ranking strategies.

 In the following section we evaluate all 323 test queries with both models to compare their overall effectiveness, measured using [nDCG@10](https://en.wikipedia.org/wiki/Discounted_cumulative_gain). `nDCG@10` is the official evaluation metric of the BEIR benchmark and is an appropriate metric for test sets with graded relevance judgments.
@@ -648,22 +649,22 @@ if __name__ == "__main__":
 Then execute the script:
 <div class="pre-parent">
 <button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
-<pre data-test="exec" data-test-assert-contains="nDCG@10: 0.3">
+<pre data-test="exec" data-test-assert-contains="bm25: 0.32">
 $ python3 evaluate_ranking.py --ranking bm25 --mode sparse
 </pre>
 </div>

 The script will produce the following output:

 <pre>
-Ranking metric NDCG@10 for rank profile bm25: 0.3195
+Ranking metric NDCG@10 for rank profile bm25: 0.3210
 </pre>

 Now, we can evaluate the dense model using the same script:

 <div class="pre-parent">
 <button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
-<pre data-test="exec" data-test-assert-contains="nDCG@10: 0.3">
+<pre data-test="exec" data-test-assert-contains="semantic: 0.3">
 $ python3 evaluate_ranking.py --ranking semantic --mode dense
 </pre>
 </div>
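The `evaluate_ranking.py` script itself is not included in this diff. As an assumed sketch of the kind of evaluation it performs — not the tutorial's actual code — nDCG@10 for a Vespa run over the BEIR nfcorpus test queries can be computed with `ir_datasets` and `ir_measures` roughly like this (the `doc_id` field name and the request shape are assumptions):

```python
# Hedged sketch of an nDCG@10 evaluation over BEIR nfcorpus qrels.
import ir_datasets
import requests
from ir_measures import calc_aggregate, nDCG

def search(query_text: str, ranking: str = "bm25") -> dict:
    """Return {doc_id: relevance} for one query against the local Vespa instance."""
    response = requests.post(
        "http://localhost:8080/search/",
        json={
            "yql": "select * from doc where userQuery()",
            "query": query_text,
            "ranking": ranking,
            "hits": 10,
        },
        timeout=10,
    )
    hits = response.json()["root"].get("children", [])
    # "doc_id" as the stored NFCorpus document id is an assumption about the schema.
    return {hit["fields"]["doc_id"]: hit["relevance"] for hit in hits}

dataset = ir_datasets.load("beir/nfcorpus/test")
run = {q.query_id: search(q.text) for q in dataset.queries_iter()}
print(calc_aggregate([nDCG @ 10], dataset.qrels_iter(), run))
```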
@@ -679,6 +680,8 @@ more [measures](https://ir-measur.es/en/latest/measures.html), for example, incl
 metrics = [nDCG@10, P(rel=2)@10]
 </pre>

+Also note that the exact nDCG@10 values may vary slightly between runs.
+
 ## Hybrid Search & Ranking

 We demonstrated and evaluated two independent retrieval and ranking strategies in the previous sections.
@@ -810,7 +813,7 @@ The above query returns the following [JSON result response](../reference/defaul
   "id": "toplevel",
   "relevance": 1.0,
   "fields": {
-    "totalCount": 105
+    "totalCount": 87
   },
   "coverage": {
     "coverage": 100,
@@ -843,7 +846,7 @@ The above query returns the following [JSON result response](../reference/defaul
 }{% endhighlight %}</pre>

 What is going on here is that we are combining the two top-k query operators using a boolean OR (disjunection).
-The `totalCount` is the number of documents retrieved into ranking (About 100, which is higher than 10 + 10).
+The `totalCount` is the number of documents retrieved into ranking (About 90, which is higher than 10 + 10).
 The `relevance` is the score assigned by `hybrid` rank-profile. Notice that the `matchfeatures` field shows all the feature scores. This is
 useful for debugging and understanding the ranking behavior, also for feature logging.
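The hybrid query itself appears earlier in the tutorial and is not part of this diff. The request below is a hedged Python sketch of the pattern described above: the `doc` schema, the `embedding` field, the `arctic` embedder id and the `hybrid` rank profile are taken from the tutorial text, while the request shape and the server-side `embed(arctic, @query)` query input are assumptions to check against the query-api and embedding documentation:

```python
# Hedged sketch: combine lexical retrieval (userQuery) and vector retrieval
# (nearestNeighbor) with a boolean OR, ranked by the "hybrid" rank profile.
import requests

query = "Do Cholesterol Statin Drugs Cause Breast Cancer?"
response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": "select * from doc where ({targetHits:10}nearestNeighbor(embedding, e)) or userQuery()",
        "query": query,
        # Assumed: Vespa computes the query embedding server-side with the configured embedder.
        "input.query(e)": "embed(arctic, @query)",
        "ranking": "hybrid",
        "hits": 10,
    },
    timeout=10,
)
print(response.json()["root"]["fields"]["totalCount"])
```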

@@ -923,7 +926,7 @@ $ python3 evaluate_ranking.py --ranking hybrid --mode hybrid
 Which outputs

 <pre>
-Ranking metric NDCG@10 for rank profile hybrid: 0.3275
+Ranking metric NDCG@10 for rank profile hybrid: 0.3287
 </pre>

 The `nDCG@10` score is slightly higher than the profiles that only use one of the ranking strategies.
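The `hybrid-sum`, `hybrid-normalize-bm25-with-atan`, `hybrid-rrf` and `hybrid-linear-normalize` rank profiles compared below are defined in the tutorial's schema and are not part of this diff. As a standalone toy illustration of the two fusion ideas they cover — normalize-then-sum and reciprocal rank fusion — using made-up scores:

```python
# Toy illustration (not the tutorial's rank expressions): fusing a BM25 ranking
# and a vector-similarity ranking for the same query.
bm25_scores = {"MED-10": 25.1, "MED-14": 21.4, "MED-2": 18.0}      # made-up scores
semantic_scores = {"MED-14": 0.61, "MED-10": 0.58, "MED-7": 0.55}  # made-up scores

def min_max_normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / ((hi - lo) or 1.0) for doc, s in scores.items()}

# Normalized-sum fusion: scale each score list to [0, 1], then add per document.
norm_sum = {}
for scores in (min_max_normalize(bm25_scores), min_max_normalize(semantic_scores)):
    for doc, s in scores.items():
        norm_sum[doc] = norm_sum.get(doc, 0.0) + s

# Reciprocal rank fusion: score = sum over rankings of 1 / (k + rank).
def rrf(rankings, k=60):
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(sorted(ranking, key=ranking.get, reverse=True), start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused

print(sorted(norm_sum.items(), key=lambda x: -x[1]))
print(sorted(rrf([bm25_scores, semantic_scores]).items(), key=lambda x: -x[1]))
```

In Vespa these combinations are expressed directly in the rank profiles being compared, so no client-side fusion like the above is needed; the sketch only illustrates what the profiles compute conceptually.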
@@ -1063,30 +1066,30 @@ $ python3 evaluate_ranking.py --ranking hybrid-sum --mode hybrid
 </div>

 <pre>
-Ranking metric NDCG@10 for rank profile hybrid-sum: 0.3232
+Ranking metric NDCG@10 for rank profile hybrid-sum: 0.3244
 </pre>


 <div class="pre-parent">
 <button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
-<pre data-test="exec" data-test-assert-contains="0.33">
+<pre data-test="exec" data-test-assert-contains="0.34">
 $ python3 evaluate_ranking.py --ranking hybrid-normalize-bm25-with-atan --mode hybrid
 </pre>
 </div>

 <pre>
-Ranking metric NDCG@10 for rank profile hybrid-normalize-bm25-with-atan: 0.3386
+Ranking metric NDCG@10 for rank profile hybrid-normalize-bm25-with-atan: 0.3410
 </pre>

 <div class="pre-parent">
 <button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
-<pre data-test="exec" data-test-assert-contains="0.31">
+<pre data-test="exec" data-test-assert-contains="0.32">
 $ python3 evaluate_ranking.py --ranking hybrid-rrf --mode hybrid
 </pre>
 </div>

 <pre>
-Ranking metric NDCG@10 for rank profile hybrid-rrf: 0.3176
+Ranking metric NDCG@10 for rank profile hybrid-rrf: 0.3207
 </pre>

 <div class="pre-parent">
@@ -1097,21 +1100,21 @@ $ python3 evaluate_ranking.py --ranking hybrid-linear-normalize --mode hybrid
 </div>

 <pre>
-Ranking metric NDCG@10 for rank profile hybrid-linear-normalize: 0.3356
+Ranking metric NDCG@10 for rank profile hybrid-linear-normalize: 0.3387
 </pre>

 On this particular dataset, the `hybrid-normalize-bm25-with-atan` rank profile performs the best, but the difference is small. This also demonstrates that hybrid search
 and ranking is a complex problem and that the effectiveness of the hybrid model depends on the dataset and the retrieval strategies.

 These results (which is the best) might not
-transfer to your specific retrieval use case and dataset, so it is important to evaluate the effectiveness of a hybrid model on your specific dataset and having
-your own relevance judgments.
+transfer to your specific retrieval use case and dataset, so it is important to evaluate the effectiveness of a hybrid model on
+your specific dataset.

 See [Improving retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for more information on how to collect relevance judgments for your dataset.

 ### Summary

-In this tutorial, we demonstrated combining two retrieval strategies using the Vespa query language and how to expression hybriding ranking using the Vespa ranking framework.
+In this tutorial, we demonstrated combining two retrieval strategies using the Vespa query language and how to express hybrid ranking using the Vespa ranking framework.

 We showed how to express hybrid queries using the Vespa query language and how to combine the two retrieval strategies using the Vespa ranking framework. We also showed how to evaluate the effectiveness of the hybrid ranking model using one of the datasets that are a part of the BEIR benchmark. We hope this tutorial has given you a good understanding of how to combine different retrieval strategies using Vespa, and that there is not a single silver bullet for all retrieval problems.
test/_test_config.yml

Lines changed: 1 addition & 1 deletion

@@ -3,6 +3,7 @@
 # Use a comma-separated list of URLs to test multiple pages as one unit, like the news-tutorial

 urls:
+- en/tutorials/hybrid-search.md
 - en/searching-multi-valued-fields.md
 - en/reranking-in-searcher.md
 - en/vespa-quick-start-java.html
@@ -18,7 +19,6 @@ urls:
 - >-
   en/tutorials/text-search.md,
   en/tutorials/text-search-ml.md
-


 # https://docs.vespa.ai/en/operations/multinode-systems.html Tests not implemented for AWS EC2/ECS procedures
 # Kubernetes testing is blocked on running minikube on Centos, in Docker - needs more work
