Merge pull request #3337 from vespa-engine/jobergum/add-test
feat(CI): add to tests for the new "Hybrid Text Search Tutorial"
kkraune authored Aug 28, 2024
2 parents fe91433 + 210b8b1 commit 6de3f47
Showing 5 changed files with 97 additions and 48 deletions.
2 changes: 2 additions & 0 deletions _data/sidebar.yml
@@ -167,6 +167,8 @@ docs:
url: /en/tutorials/models-hot-swap.html
- page: Text Search
url: /en/tutorials/text-search.html
- page: Hybrid Text Search
url: /en/tutorials/hybrid-search.html
- page: Text Search ML
url: /en/tutorials/text-search-ml.html
- page: Quick Start
25 changes: 25 additions & 0 deletions en/getting-started.html
@@ -41,6 +41,10 @@
A text search tutorial and introduction to text ranking with Vespa using traditional information retrieval techniques like BM25.
</li>

<li><a href="tutorials/hybrid-search.html">Tutorial: Hybrid Text Search</a>.
A search tutorial and introduction to hybrid text ranking with Vespa, combining BM25 with text embedding models.
</li>

<li><a href="tutorials/text-search-ml.html">Tutorial: Improving Text Search with Machine Learning</a>.
This tutorial builds
on the <a href="tutorials/text-search.html">text search tutorial</a> but introduces Learning to Rank to improve relevance.
@@ -56,6 +60,11 @@
combining retrieval over inverted index structures with vector search.
</p>

<strong>RAG (Retrieval-Augmented Generation)</strong>
<p>
Learn how to use Vespa for RAG in the <a href="llms-rag.html#">Retrieval-augmented generation (RAG) in Vespa</a> guide.
</p>

<strong>Recommendation</strong>
<p>
Learn how to use Vespa for content recommendation/personalization in the
@@ -81,6 +90,16 @@
<li><a href="tensorflow.html">Ranking with TensorFlow models</a></li>
</ul>

<strong>Embedding Model Inference</strong>
<p>
Vespa supports integrating <a href="embedding.html">embedding</a> models. This avoids transferring large amounts of embedding vector data
over the network and allows for efficient serving of embedding models.
</p>
<ul>
<li><a href="embedding.html#huggingface-embedder">Huggingface Embedder</a>: use single-vector embedding models from Hugging Face</li>
<li><a href="embedding.html#colbert-embedder">ColBERT Embedder</a>: use multi-vector embedding models</li>
<li><a href="embedding.html#splade-embedder">Splade Embedder</a>: use sparse learned single-vector embedding models</li>
</ul>

<strong>ML Model Lifecycle</strong>
<p>
The <a href="tutorials/models-hot-swap.html">Models hot swap tutorial</a>
@@ -92,6 +111,12 @@
<strong>E-Commerce Search</strong>
<p>The <a href="use-case-shopping.html">e-commerce shopping sample application</a> demonstrates Vespa grouping,
true in-place partial updates, custom ranking and more.</p>

<strong>Examples and starting sample applications</strong>
<p>
There are many example and starter applications on
<a href="https://github.com/vespa-engine/sample-apps/">GitHub</a> and in the <a href="https://pyvespa.readthedocs.io/en/latest/examples.html">PyVespa examples</a>.
</p>
</td>
</tr>

95 changes: 49 additions & 46 deletions en/tutorials/hybrid-search.md
@@ -135,10 +135,10 @@ schema doc {
}

field embedding type tensor&lt;bfloat16&gt;(v[384]) {
indexing: input title." ".input text | embed | attribute
attribute {
distance-metric: angular
}
indexing: input title." ".input text | embed | attribute
attribute {
distance-metric: angular
}
}

rank-profile bm25 {
@@ -149,7 +149,7 @@

rank-profile semantic {
inputs {
query(e) tensor&lt;bfloat16&gt;(v[384])
query(e) tensor&lt;bfloat16&gt;(v[384])
}
first-phase {
expression: closeness(field, embedding)
@@ -170,7 +170,7 @@ The [string](../reference/schema-reference.html#string) data type represents bot
and there are significant differences between [index and attribute](../text-matching.html#index-and-attribute). The above
schema includes the default `match` modes for the `attribute` and `index` properties for visibility.

Note that we are enabling [BM25](../reference/bm25.html) for `title` and `text`.
Note that we are enabling [BM25](../reference/bm25.html) for `title` and `text`
by including `index: enable-bm25`. The language field is the only field that is not part of the NFCorpus dataset.
We hardcode its value to "en" since the dataset is English. Using `set_language` avoids automatic language detection and uses the value when processing the other
text fields. Read more in [linguistics](../linguistics.html).
@@ -189,9 +189,9 @@ Our `embedding` vector field is of [tensor](../tensor-user-guide.html) type with
field embedding type tensor<bfloat16>(v[384]) {
indexing: input title." ".input text | embed arctic | attribute
attribute {
distance-metric: angular
distance-metric: angular
}
}
}
```
The `indexing` expression creates the input to the `embed` inference call (in our example, the concatenation of the title and the text field). Since
the dataset is small, we do not specify `index`, which would build [HNSW](../approximate-nn-hnsw.html) data structures for faster (but approximate) vector search. This guide uses [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) as the text embedding model. The model is
@@ -250,7 +250,7 @@ Some notes about the elements above:
- `<container>` defines the [container cluster](../jdisc/index.html) for document, query and result processing
- `<search>` sets up the [query endpoint](../query-api.html). The default port is 8080.
- `<document-api>` sets up the [document endpoint](../reference/document-v1-api-reference.html) for feeding.
- `component` with type `hugging-face-embedder` configures the embedder in the application package. This include where to fetch the model files from, the prepend
- `component` with type `hugging-face-embedder` configures the embedder in the application package. This includes where to fetch the model files from, the prepend
instructions, and the pooling strategy. See [huggingface-embedder](../embedding.html#huggingface-embedder) for details and other embedders supported.
- `<content>` defines how documents are stored and searched
- `<min-redundancy>` denotes how many copies to keep of each document.
@@ -316,38 +316,37 @@ $ vespa feed -t http://localhost:8080 vespa-docs.jsonl
</pre>
</div>

On an M1, we expect output like the following:
The output should look like this (rates may vary depending on your machine's hardware):

<pre>{% highlight json%}
{
"feeder.operation.count": 3633,
"feeder.seconds": 39.723,
"feeder.seconds": 148.515,
"feeder.ok.count": 3633,
"feeder.ok.rate": 91.459,
"feeder.ok.rate": 24.462,
"feeder.error.count": 0,
"feeder.inflight.count": 0,
"http.request.count": 13157,
"http.request.bytes": 21102792,
"http.request.MBps": 0.531,
"http.request.count": 3633,
"http.request.bytes": 2985517,
"http.request.MBps": 0.020,
"http.exception.count": 0,
"http.response.count": 13157,
"http.response.bytes": 1532828,
"http.response.MBps": 0.039,
"http.response.error.count": 9524,
"http.response.latency.millis.min": 0,
"http.response.latency.millis.avg": 1220,
"http.response.latency.millis.max": 13703,
"http.response.count": 3633,
"http.response.bytes": 348320,
"http.response.MBps": 0.002,
"http.response.error.count": 0,
"http.response.latency.millis.min": 316,
"http.response.latency.millis.avg": 787,
"http.response.latency.millis.max": 1704,
"http.response.code.counts": {
"200": 3633,
"429": 9524
"200": 3633
}
}{% endhighlight %}</pre>

Notice:

- `feeder.ok.rate`, which is the throughput (note that this step includes embedding inference). See [embedder-performance](../embedding.html#embedder-performance) for details on embedding inference performance. In this case, embedding inference is the bottleneck for overall indexing throughput.
- `http.response.code.counts` matches with `feeder.ok.count` - The dataset has 3633 documents. The `429` are harmless. Vespa asks the client
to slow down the feed speed because of resource contention.
- `http.response.code.counts` matches `feeder.ok.count` - the dataset has 3633 documents. Note that any `429` responses you may observe are
harmless: Vespa asks the client to slow down the feed speed because of resource contention.


## Sample queries
Expand All @@ -356,14 +355,16 @@ We can now run a few sample queries to demonstrate various ways to perform searc
<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="PLAIN-2">
$ ir_datasets export beir/nfcorpus/test queries --fields query_id text |head -1
$ ir_datasets export beir/nfcorpus/test queries --fields query_id text | head -1
</pre>
</div>

<pre>
PLAIN-2 Do Cholesterol Statin Drugs Cause Breast Cancer?
</pre>

If you see a pipe-related error from the above command, you can safely ignore it.

Here, `PLAIN-2` is the query id of the first test query. We'll use this test query to demonstrate querying Vespa.
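
As an alternative to the shell pipeline, the same test query can be fetched in Python. This is a minimal sketch assuming the `ir_datasets` Python package (which provides the CLI used above) is installed:

```python
import ir_datasets

# Load the BEIR NFCorpus test split - the same dataset as the CLI export above.
dataset = ir_datasets.load("beir/nfcorpus/test")

# The first test query should be PLAIN-2.
first_query = next(iter(dataset.queries_iter()))
print(first_query.query_id, first_query.text)
```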

### Lexical search with BM25 scoring
@@ -393,7 +394,7 @@ This query returns the following [JSON result response](../reference/default-res
"id": "toplevel",
"relevance": 1.0,
"fields": {
"totalCount": 65
"totalCount": 46
},
"coverage": {
"coverage": 100,
@@ -423,7 +424,7 @@ This query returns the following [JSON result response](../reference/default-res
{% endhighlight %}</pre>

The query retrieves and ranks `MED-10` as the most relevant document—notice the `totalCount` which is the number of documents that were retrieved for ranking
phases. In this case, we exposed 65 documents to first-phase ranking, it is higher than our target, but also fewer than the total number of documents that match any query terms.
phases. In this case, we exposed about 50 documents to first-phase ranking; this is higher than our target, but still fewer than the total number of documents that match any query terms.
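
For reference, the lexical query above can also be issued programmatically against the [query API](../query-api.html) using the `requests` package. The sketch below assumes the default local endpoint and the `userQuery()` + `ranking=bm25` form used in this tutorial; adjust the YQL if your query differs:

```python
import requests

# A rough programmatic equivalent of the lexical BM25 query above.
response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": "select * from doc where userQuery()",
        "query": "Do Cholesterol Statin Drugs Cause Breast Cancer?",
        "ranking": "bm25",
        "hits": 10,
    },
    timeout=10,
)
result = response.json()
print(result["root"]["fields"]["totalCount"])
```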

In the example below, we change the grammar from the default `weakAnd` to `any`, and the query matches 1780 documents, or almost 50% of the indexed documents.

Expand Down Expand Up @@ -542,7 +543,7 @@ This query returns the following [JSON result response](../reference/default-res
}{% endhighlight %}</pre>

The result of this vector-based search differed from the previous sparse keyword search, with a different relevant document at position 1. In this case,
the relevance score is 0.606 and calculated by the `closeness` function in the `semantic` rank-profile.
the relevance score is 0.606, calculated by the `closeness` function in the `semantic` rank-profile. Note that more documents were retrieved than the `targetHits`.
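
As a rough illustration of where that score comes from, and assuming Vespa's documented definition of `closeness` as `1 / (1 + distance)` with the `angular` distance metric measuring the angle between the vectors, a score of about 0.606 corresponds to an angle of roughly 0.65 radians:

```python
import math

def angular_distance(a, b):
    # Angle in radians between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return math.acos(dot / (norm_a * norm_b))

def closeness(a, b):
    # closeness(field, embedding) = 1 / (1 + distance) for the angular metric.
    return 1.0 / (1.0 + angular_distance(a, b))

# A closeness of ~0.606 implies an angular distance of ~0.65 radians:
print(1.0 / 0.606 - 1.0)
```

For reference, the `semantic` rank-profile is shown below.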

```
rank-profile semantic {
@@ -562,7 +563,7 @@ Note that similarity scores of embedding vectors are often optimized via contras

## Evaluate ranking accuracy

The previous section demonstrated how to combine the Vespa query language with rank profiles to
The previous section demonstrated how to combine the Vespa query language with rank profiles
to implement two different retrieval and ranking strategies.

In the following section we evaluate all 323 test queries with both models to compare their overall effectiveness, measured using [nDCG@10](https://en.wikipedia.org/wiki/Discounted_cumulative_gain). `nDCG@10` is the official evaluation metric of the BEIR benchmark and is an appropriate metric for test sets with graded relevance judgments.
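
The tutorial's `evaluate_ranking.py` script (executed below) takes care of querying Vespa and scoring the results. As a minimal sketch of the metric computation itself, assuming the `ir_measures` package and a run represented as `{query_id: {doc_id: score}}`, `nDCG@10` can be computed like this:

```python
import ir_datasets
import ir_measures
from ir_measures import nDCG

# Relevance judgments (qrels) for the BEIR NFCorpus test split.
dataset = ir_datasets.load("beir/nfcorpus/test")
qrels = list(dataset.qrels_iter())

# A run maps each query id to retrieved doc ids and their ranking scores.
# This tiny run is hypothetical, just to show the expected shape.
run = {"PLAIN-2": {"MED-10": 25.5}}

print(ir_measures.calc_aggregate([nDCG @ 10], qrels, run))
```
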
@@ -648,22 +649,22 @@ if __name__ == "__main__":
Then execute the script:
<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="nDCG@10: 0.3">
<pre data-test="exec" data-test-assert-contains="bm25: 0.32">
$ python3 evaluate_ranking.py --ranking bm25 --mode sparse
</pre>
</div>

The script will produce the following output:

<pre>
Ranking metric NDCG@10 for rank profile bm25: 0.3195
Ranking metric NDCG@10 for rank profile bm25: 0.3210
</pre>

Now, we can evaluate the dense model using the same script:

<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="nDCG@10: 0.3">
<pre data-test="exec" data-test-assert-contains="semantic: 0.3">
$ python3 evaluate_ranking.py --ranking semantic --mode dense
</pre>
</div>
@@ -679,6 +680,8 @@ more [measures](https://ir-measur.es/en/latest/measures.html), for example, incl
metrics = [nDCG@10, P(rel=2)@10]
</pre>

Also note that the exact nDCG@10 values may vary slightly between runs.

## Hybrid Search & Ranking

We demonstrated and evaluated two independent retrieval and ranking strategies in the previous sections.
@@ -810,7 +813,7 @@ The above query returns the following [JSON result response](../reference/defaul
"id": "toplevel",
"relevance": 1.0,
"fields": {
"totalCount": 105
"totalCount": 87
},
"coverage": {
"coverage": 100,
@@ -843,7 +846,7 @@ The above query returns the following [JSON result response](../reference/defaul
}{% endhighlight %}</pre>

Here, we combine the two top-k query operators using a boolean OR (disjunction).
The `totalCount` is the number of documents retrieved into ranking (About 100, which is higher than 10 + 10).
The `totalCount` is the number of documents retrieved into ranking (about 90, which is higher than 10 + 10).
The `relevance` is the score assigned by the `hybrid` rank-profile. Notice that the `matchfeatures` field shows all the feature scores. This is
useful for debugging and understanding the ranking behavior, and also for feature logging.
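
For reference, this OR-combination can also be expressed programmatically against the [query API](../query-api.html) using the `requests` package. The sketch below assumes the common hybrid form of a `userQuery()` clause OR'ed with a `nearestNeighbor` operator, with the query text embedded via `embed(@query)`; the tutorial's exact query may differ:

```python
import requests

# A sketch of a hybrid query: lexical weakAnd retrieval OR'ed with
# nearestNeighbor vector retrieval, ranked by the `hybrid` rank-profile.
query_text = "Do Cholesterol Statin Drugs Cause Breast Cancer?"
response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": "select * from doc where userQuery() or "
               "({targetHits:10}nearestNeighbor(embedding, e))",
        "query": query_text,
        "input.query(e)": "embed(@query)",
        "ranking": "hybrid",
        "hits": 10,
    },
    timeout=10,
)
result = response.json()
print(result["root"]["fields"]["totalCount"])
for hit in result["root"]["children"]:
    print(hit["relevance"], hit["id"])
```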

@@ -923,7 +926,7 @@ $ python3 evaluate_ranking.py --ranking hybrid --mode hybrid
Which outputs

<pre>
Ranking metric NDCG@10 for rank profile hybrid: 0.3275
Ranking metric NDCG@10 for rank profile hybrid: 0.3287
</pre>

The `nDCG@10` score is slightly higher than that of the profiles that use only one of the ranking strategies.
@@ -1063,30 +1066,30 @@ $ python3 evaluate_ranking.py --ranking hybrid-sum --mode hybrid
</div>

<pre>
Ranking metric NDCG@10 for rank profile hybrid-sum: 0.3232
Ranking metric NDCG@10 for rank profile hybrid-sum: 0.3244
</pre>


<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="0.33">
<pre data-test="exec" data-test-assert-contains="0.34">
$ python3 evaluate_ranking.py --ranking hybrid-normalize-bm25-with-atan --mode hybrid
</pre>
</div>

<pre>
Ranking metric NDCG@10 for rank profile hybrid-normalize-bm25-with-atan: 0.3386
Ranking metric NDCG@10 for rank profile hybrid-normalize-bm25-with-atan: 0.3410
</pre>

<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="0.31">
<pre data-test="exec" data-test-assert-contains="0.32">
$ python3 evaluate_ranking.py --ranking hybrid-rrf --mode hybrid
</pre>
</div>

<pre>
Ranking metric NDCG@10 for rank profile hybrid-rrf: 0.3176
Ranking metric NDCG@10 for rank profile hybrid-rrf: 0.3207
</pre>

<div class="pre-parent">
@@ -1097,21 +1100,21 @@
</div>

<pre>
Ranking metric NDCG@10 for rank profile hybrid-linear-normalize: 0.3356
Ranking metric NDCG@10 for rank profile hybrid-linear-normalize: 0.3387
</pre>

On this particular dataset, the `hybrid-normalize-bm25-with-atan` rank profile performs the best, but the difference is small. This also demonstrates that hybrid search
and ranking is a complex problem and that the effectiveness of the hybrid model depends on the dataset and the retrieval strategies.
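
To make the comparison concrete, here is a small sketch, in plain Python rather than Vespa ranking expressions, of the two score-fusion ideas these profiles are named after. The exact expressions in the tutorial's rank-profiles may differ:

```python
import math

def atan_normalize(bm25_score: float) -> float:
    # Squash an unbounded BM25 score into [0, 1) so it can be added to a
    # bounded closeness score on a comparable scale.
    return (2.0 / math.pi) * math.atan(bm25_score)

def reciprocal_rank(rank: int, k: int = 60) -> float:
    # Reciprocal rank fusion: each retriever contributes 1 / (k + rank)
    # per document, and contributions are summed across retrievers.
    return 1.0 / (k + rank)

# Fuse a BM25 score of 20 with a closeness score of 0.6:
print(atan_normalize(20.0) + 0.6)

# A document ranked 1st by BM25 and 3rd by vector search:
print(reciprocal_rank(1) + reciprocal_rank(3))
```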

These results (i.e., which profile performs best) might not
transfer to your specific retrieval use case and dataset, so it is important to evaluate the effectiveness of a hybrid model on your specific dataset and having
your own relevance judgments.
transfer to your specific retrieval use case and dataset, so it is important to evaluate the effectiveness of a hybrid model on
your specific dataset.

See [Improving retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for more information on how to collect relevance judgments for your dataset.

### Summary

In this tutorial, we demonstrated combining two retrieval strategies using the Vespa query language and how to expression hybriding ranking using the Vespa ranking framework.
In this tutorial, we demonstrated combining two retrieval strategies using the Vespa query language and how to express hybrid ranking using the Vespa ranking framework.

We showed how to express hybrid queries using the Vespa query language and how to combine the two retrieval strategies using the Vespa ranking framework. We also showed how to evaluate the effectiveness of the hybrid ranking model using one of the datasets from the BEIR benchmark. We hope this tutorial has given you a good understanding of how to combine different retrieval strategies using Vespa, and shown that there is no single silver bullet for all retrieval problems.

2 changes: 1 addition & 1 deletion test/_test_config.yml
@@ -3,6 +3,7 @@
# Use a comma-separated list of URLs to test multiple pages as one unit, like the news-tutorial

urls:
- en/tutorials/hybrid-search.md
- en/searching-multi-valued-fields.md
- en/reranking-in-searcher.md
- en/vespa-quick-start-java.html
@@ -18,7 +19,6 @@ urls:
- >-
en/tutorials/text-search.md,
en/tutorials/text-search-ml.md
# https://docs.vespa.ai/en/operations/multinode-systems.html Tests not implemented for AWS EC2/ECS procedures
# Kubernetes testing is blocked on running minikube on Centos, in Docker - needs more work