Merge pull request #3337 from vespa-engine/jobergum/add-test
feat(CI): add to tests for the new "Hybrid Text Search Tutorial"
kkraune authored Aug 28, 2024
2 parents fe91433 + 210b8b1 commit 6de3f47
Showing 5 changed files with 97 additions and 48 deletions.
2 changes: 2 additions & 0 deletions _data/sidebar.yml
@@ -167,6 +167,8 @@ docs:
url: /en/tutorials/models-hot-swap.html
- page: Text Search
url: /en/tutorials/text-search.html
- page: Hybrid Text Search
url: /en/tutorials/hybrid-search.html
- page: Text Search ML
url: /en/tutorials/text-search-ml.html
- page: Quick Start
25 changes: 25 additions & 0 deletions en/getting-started.html
@@ -41,6 +41,10 @@
A text search tutorial and introduction to text ranking with Vespa using traditional information retrieval techniques like BM25.
</li>

<li><a href="tutorials/hybrid-search.html">Tutorial: Hybrid Text Search</a>.
A search tutorial and introduction to hybrid text ranking with Vespa, combining BM25 with text embedding models.
</li>

<li><a href="tutorials/text-search-ml.html">Tutorial: Improving Text Search with Machine Learning</a>.
This tutorial builds
on the <a href="tutorials/text-search.html">text search tutorial</a> but introduces Learning to Rank to improve relevance.
@@ -56,6 +60,11 @@
combining retrieval over inverted index structures with vector search.
</p>

<strong>RAG (Retrieval-Augmented Generation)</strong>
<p>
Learn how to use Vespa for RAG in the <a href="llms-rag.html#">Retrieval-augmented generation (RAG) in Vespa</a> guide.
</p>

<strong>Recommendation</strong>
<p>
Learn how to use Vespa for content recommendation/personalization in the
@@ -81,6 +90,16 @@
<li><a href="tensorflow.html">Ranking with TensorFlow models</a></li>
</ul>

<strong>Embedding Model Inference</strong>
<p>
Vespa supports integrating <a href="embedding.html">embedding</a> models. This avoids transferring large amounts of embedding vector data
over the network and allows for efficient serving of embedding models.
</p>
<ul>
<li><a href="embedding.html#huggingface-embedder">Huggingface Embedder</a>: use single-vector embedding models from Hugging Face</li>
<li><a href="embedding.html#colbert-embedder">ColBERT Embedder</a>: use multi-vector embedding models</li>
<li><a href="embedding.html#splade-embedder">Splade Embedder</a>: use sparse learned single-vector embedding models</li>
</ul>

<strong>ML Model Lifecycle</strong>
<p>
The <a href="tutorials/models-hot-swap.html">Models hot swap tutorial</a>
@@ -92,6 +111,12 @@
<strong>E-Commerce Search</strong>
<p>The <a href="use-case-shopping.html">e-commerce shopping sample application</a> demonstrates Vespa grouping,
true in-place partial updates, custom ranking and more.</p>

<strong>Examples and starting sample applications</strong>
<p>
There are many example and starter applications on
<a href="https://github.com/vespa-engine/sample-apps/">GitHub</a> and in the <a href="https://pyvespa.readthedocs.io/en/latest/examples.html">PyVespa examples</a>.
</p>
</td>
</tr>

95 changes: 49 additions & 46 deletions en/tutorials/hybrid-search.md
@@ -135,10 +135,10 @@ schema doc {
}

field embedding type tensor&lt;bfloat16&gt;(v[384]) {
indexing: input title." ".input text | embed | attribute
attribute {
distance-metric: angular
}
indexing: input title." ".input text | embed | attribute
attribute {
distance-metric: angular
}
}

rank-profile bm25 {
@@ -149,7 +149,7 @@

rank-profile semantic {
inputs {
query(e) tensor&lt;bfloat16&gt;(v[384])
query(e) tensor&lt;bfloat16&gt;(v[384])
}
first-phase {
expression: closeness(field, embedding)
@@ -170,7 +170,7 @@ The [string](../reference/schema-reference.html#string) data type represents bot
and there are significant differences between [index and attribute](../text-matching.html#index-and-attribute). The above
schema includes the default `match` modes for the `attribute` and `index` properties for visibility.

Note that we are enabling [BM25](../reference/bm25.html) for `title` and `text`.
Note that we are enabling [BM25](../reference/bm25.html) for `title` and `text`
by including `index: enable-bm25`. The language field is the only field that is not part of the NFCorpus dataset.
We hardcode its value to "en" since the dataset is English. Using `set_language` avoids automatic language detection and uses the value when processing the other
text fields. Read more in [linguistics](../linguistics.html).
@@ -189,9 +189,9 @@ Our `embedding` vector field is of [tensor](../tensor-user-guide.html) type with
field embedding type tensor<bfloat16>(v[384]) {
indexing: input title." ".input text | embed arctic | attribute
attribute {
distance-metric: angular
distance-metric: angular
}
}
}
```
The `indexing` expression creates the input to the `embed` inference call (in our example, the concatenation of the title and the text field). Since
the dataset is small, we do not specify `index`, which would build [HNSW](../approximate-nn-hnsw.html) data structures for faster (but approximate) vector search. This guide uses [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) as the text embedding model. The model is
@@ -250,7 +250,7 @@ Some notes about the elements above:
- `<container>` defines the [container cluster](../jdisc/index.html) for document, query and result processing
- `<search>` sets up the [query endpoint](../query-api.html). The default port is 8080.
- `<document-api>` sets up the [document endpoint](../reference/document-v1-api-reference.html) for feeding.
- `component` with type `hugging-face-embedder` configures the embedder in the application package. This include where to fetch the model files from, the prepend
- `component` with type `hugging-face-embedder` configures the embedder in the application package. This includes where to fetch the model files from, the prepend
instructions, and the pooling strategy. See [huggingface-embedder](../embedding.html#huggingface-embedder) for details and other embedders supported.
- `<content>` defines how documents are stored and searched
- `<min-redundancy>` denotes how many copies to keep of each document.
@@ -316,38 +316,37 @@ $ vespa feed -t http://localhost:8080 vespa-docs.jsonl
</pre>
</div>

On an M1, we expect output like the following:
The output should look like this (rates may vary depending on your machine's hardware):

<pre>{% highlight json%}
{
"feeder.operation.count": 3633,
"feeder.seconds": 39.723,
"feeder.seconds": 148.515,
"feeder.ok.count": 3633,
"feeder.ok.rate": 91.459,
"feeder.ok.rate": 24.462,
"feeder.error.count": 0,
"feeder.inflight.count": 0,
"http.request.count": 13157,
"http.request.bytes": 21102792,
"http.request.MBps": 0.531,
"http.request.count": 3633,
"http.request.bytes": 2985517,
"http.request.MBps": 0.020,
"http.exception.count": 0,
"http.response.count": 13157,
"http.response.bytes": 1532828,
"http.response.MBps": 0.039,
"http.response.error.count": 9524,
"http.response.latency.millis.min": 0,
"http.response.latency.millis.avg": 1220,
"http.response.latency.millis.max": 13703,
"http.response.count": 3633,
"http.response.bytes": 348320,
"http.response.MBps": 0.002,
"http.response.error.count": 0,
"http.response.latency.millis.min": 316,
"http.response.latency.millis.avg": 787,
"http.response.latency.millis.max": 1704,
"http.response.code.counts": {
"200": 3633,
"429": 9524
"200": 3633
}
}{% endhighlight %}</pre>

Notice:

- `feeder.ok.rate`, which is the throughput (note that this step includes embedding inference). See [embedder-performance](../embedding.html#embedder-performance) for details on embedding inference performance. In this case, embedding inference is the bottleneck for overall indexing throughput.
- `http.response.code.counts` matches with `feeder.ok.count` - The dataset has 3633 documents. The `429` are harmless. Vespa asks the client
to slow down the feed speed because of resource contention.
- `http.response.code.counts` matches `feeder.ok.count` - the dataset has 3633 documents. Note that any `429` responses you may observe are
harmless: Vespa asks the client to slow down the feed speed because of resource contention.


## Sample queries
Expand All @@ -356,14 +355,16 @@ We can now run a few sample queries to demonstrate various ways to perform searc
<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="PLAIN-2">
$ ir_datasets export beir/nfcorpus/test queries --fields query_id text |head -1
$ ir_datasets export beir/nfcorpus/test queries --fields query_id text | head -1
</pre>
</div>

<pre>
PLAIN-2 Do Cholesterol Statin Drugs Cause Breast Cancer?
</pre>

If you see a pipe-related error from the above command, you can safely ignore it.

Here, `PLAIN-2` is the query id of the first test query. We'll use this test query to demonstrate querying Vespa.
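
As an alternative to the shell pipeline, the same test query can be fetched in Python. This is a minimal sketch assuming the `ir_datasets` Python package (which provides the CLI used above) is installed:

```python
import ir_datasets

# Load the BEIR NFCorpus test split - the same dataset as the CLI export above.
dataset = ir_datasets.load("beir/nfcorpus/test")

# The first test query should be PLAIN-2.
first_query = next(iter(dataset.queries_iter()))
print(first_query.query_id, first_query.text)
```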

### Lexical search with BM25 scoring
@@ -393,7 +394,7 @@ This query returns the following [JSON result response](../reference/default-res
"id": "toplevel",
"relevance": 1.0,
"fields": {
"totalCount": 65
"totalCount": 46
},
"coverage": {
"coverage": 100,
@@ -423,7 +424,7 @@ This query returns the following [JSON result response](../reference/default-res
{% endhighlight %}</pre>

The query retrieves and ranks `MED-10` as the most relevant document—notice the `totalCount` which is the number of documents that were retrieved for ranking
phases. In this case, we exposed 65 documents to first-phase ranking, it is higher than our target, but also fewer than the total number of documents that match any query terms.
phases. In this case, we exposed about 50 documents to first-phase ranking; this is higher than our target, but still fewer than the total number of documents that match any query terms.
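
For reference, the lexical query above can also be issued programmatically against the [query API](../query-api.html) using the `requests` package. The sketch below assumes the default local endpoint and the `userQuery()` + `ranking=bm25` form used in this tutorial; adjust the YQL if your query differs:

```python
import requests

# A rough programmatic equivalent of the lexical BM25 query above.
response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": "select * from doc where userQuery()",
        "query": "Do Cholesterol Statin Drugs Cause Breast Cancer?",
        "ranking": "bm25",
        "hits": 10,
    },
    timeout=10,
)
result = response.json()
print(result["root"]["fields"]["totalCount"])
```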

In the example below, we change the grammar from the default `weakAnd` to `any`, and the query matches 1780 documents, or almost 50% of the indexed documents.

Expand Down Expand Up @@ -542,7 +543,7 @@ This query returns the following [JSON result response](../reference/default-res
}{% endhighlight %}</pre>

The result of this vector-based search differed from the previous sparse keyword search, with a different relevant document at position 1. In this case,
the relevance score is 0.606 and calculated by the `closeness` function in the `semantic` rank-profile.
the relevance score is 0.606, calculated by the `closeness` function in the `semantic` rank-profile. Note that more documents were retrieved than the `targetHits`.
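
As a rough illustration of where that score comes from, and assuming Vespa's documented definition of `closeness` as `1 / (1 + distance)` with the `angular` distance metric measuring the angle between the vectors, a score of about 0.606 corresponds to an angle of roughly 0.65 radians:

```python
import math

def angular_distance(a, b):
    # Angle in radians between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return math.acos(dot / (norm_a * norm_b))

def closeness(a, b):
    # closeness(field, embedding) = 1 / (1 + distance) for the angular metric.
    return 1.0 / (1.0 + angular_distance(a, b))

# A closeness of ~0.606 implies an angular distance of ~0.65 radians:
print(1.0 / 0.606 - 1.0)
```

For reference, the `semantic` rank-profile is shown below.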

```
rank-profile semantic {
@@ -562,7 +563,7 @@ Note that similarity scores of embedding vectors are often optimized via contras

## Evaluate ranking accuracy

The previous section demonstrated how to combine the Vespa query language with rank profiles to
The previous section demonstrated how to combine the Vespa query language with rank profiles
to implement two different retrieval and ranking strategies.

In the following section we evaluate all 323 test queries with both models to compare their overall effectiveness, measured using [nDCG@10](https://en.wikipedia.org/wiki/Discounted_cumulative_gain). `nDCG@10` is the official evaluation metric of the BEIR benchmark and is an appropriate metric for test sets with graded relevance judgments.
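
The tutorial's `evaluate_ranking.py` script (executed below) takes care of querying Vespa and scoring the results. As a minimal sketch of the metric computation itself, assuming the `ir_measures` package and a run represented as `{query_id: {doc_id: score}}`, `nDCG@10` can be computed like this:

```python
import ir_datasets
import ir_measures
from ir_measures import nDCG

# Relevance judgments (qrels) for the BEIR NFCorpus test split.
dataset = ir_datasets.load("beir/nfcorpus/test")
qrels = list(dataset.qrels_iter())

# A run maps each query id to retrieved doc ids and their ranking scores.
# This tiny run is hypothetical, just to show the expected shape.
run = {"PLAIN-2": {"MED-10": 25.5}}

print(ir_measures.calc_aggregate([nDCG @ 10], qrels, run))
```
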
@@ -648,22 +649,22 @@ if __name__ == "__main__":
Then execute the script:
<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="nDCG@10: 0.3">
<pre data-test="exec" data-test-assert-contains="bm25: 0.32">
$ python3 evaluate_ranking.py --ranking bm25 --mode sparse
</pre>
</div>

The script will produce the following output:

<pre>
Ranking metric NDCG@10 for rank profile bm25: 0.3195
Ranking metric NDCG@10 for rank profile bm25: 0.3210
</pre>

Now, we can evaluate the dense model using the same script:

<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="nDCG@10: 0.3">
<pre data-test="exec" data-test-assert-contains="semantic: 0.3">
$ python3 evaluate_ranking.py --ranking semantic --mode dense
</pre>
</div>
@@ -679,6 +680,8 @@ more [measures](https://ir-measur.es/en/latest/measures.html), for example, incl
metrics = [nDCG@10, P(rel=2)@10]
</pre>

Also note that the exact nDCG@10 values may vary slightly between runs.

## Hybrid Search & Ranking

We demonstrated and evaluated two independent retrieval and ranking strategies in the previous sections.
@@ -810,7 +813,7 @@ The above query returns the following [JSON result response](../reference/defaul
"id": "toplevel",
"relevance": 1.0,
"fields": {
"totalCount": 105
"totalCount": 87
},
"coverage": {
"coverage": 100,
@@ -843,7 +846,7 @@ The above query returns the following [JSON result response](../reference/defaul
}{% endhighlight %}</pre>

Here, we combine the two top-k query operators using a boolean OR (disjunction).
The `totalCount` is the number of documents retrieved into ranking (About 100, which is higher than 10 + 10).
The `totalCount` is the number of documents retrieved into ranking (about 90, which is higher than 10 + 10).
The `relevance` is the score assigned by the `hybrid` rank-profile. Notice that the `matchfeatures` field shows all the feature scores. This is
useful for debugging and understanding the ranking behavior, and also for feature logging.
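
For reference, this OR-combination can also be expressed programmatically against the [query API](../query-api.html) using the `requests` package. The sketch below assumes the common hybrid form of a `userQuery()` clause OR'ed with a `nearestNeighbor` operator, with the query text embedded via `embed(@query)`; the tutorial's exact query may differ:

```python
import requests

# A sketch of a hybrid query: lexical weakAnd retrieval OR'ed with
# nearestNeighbor vector retrieval, ranked by the `hybrid` rank-profile.
query_text = "Do Cholesterol Statin Drugs Cause Breast Cancer?"
response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": "select * from doc where userQuery() or "
               "({targetHits:10}nearestNeighbor(embedding, e))",
        "query": query_text,
        "input.query(e)": "embed(@query)",
        "ranking": "hybrid",
        "hits": 10,
    },
    timeout=10,
)
result = response.json()
print(result["root"]["fields"]["totalCount"])
for hit in result["root"]["children"]:
    print(hit["relevance"], hit["id"])
```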

@@ -923,7 +926,7 @@ $ python3 evaluate_ranking.py --ranking hybrid --mode hybrid
Which outputs

<pre>
Ranking metric NDCG@10 for rank profile hybrid: 0.3275
Ranking metric NDCG@10 for rank profile hybrid: 0.3287
</pre>

The `nDCG@10` score is slightly higher than that of the profiles that use only one of the ranking strategies.
@@ -1063,30 +1066,30 @@ $ python3 evaluate_ranking.py --ranking hybrid-sum --mode hybrid
</div>

<pre>
Ranking metric NDCG@10 for rank profile hybrid-sum: 0.3232
Ranking metric NDCG@10 for rank profile hybrid-sum: 0.3244
</pre>


<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="0.33">
<pre data-test="exec" data-test-assert-contains="0.34">
$ python3 evaluate_ranking.py --ranking hybrid-normalize-bm25-with-atan --mode hybrid
</pre>
</div>

<pre>
Ranking metric NDCG@10 for rank profile hybrid-normalize-bm25-with-atan: 0.3386
Ranking metric NDCG@10 for rank profile hybrid-normalize-bm25-with-atan: 0.3410
</pre>

<div class="pre-parent">
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="0.31">
<pre data-test="exec" data-test-assert-contains="0.32">
$ python3 evaluate_ranking.py --ranking hybrid-rrf --mode hybrid
</pre>
</div>

<pre>
Ranking metric NDCG@10 for rank profile hybrid-rrf: 0.3176
Ranking metric NDCG@10 for rank profile hybrid-rrf: 0.3207
</pre>

<div class="pre-parent">
@@ -1097,21 +1100,21 @@
</div>

<pre>
Ranking metric NDCG@10 for rank profile hybrid-linear-normalize: 0.3356
Ranking metric NDCG@10 for rank profile hybrid-linear-normalize: 0.3387
</pre>

On this particular dataset, the `hybrid-normalize-bm25-with-atan` rank profile performs the best, but the difference is small. This also demonstrates that hybrid search
and ranking is a complex problem and that the effectiveness of the hybrid model depends on the dataset and the retrieval strategies.
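
To make the comparison concrete, here is a small sketch, in plain Python rather than Vespa ranking expressions, of the two score-fusion ideas these profiles are named after. The exact expressions in the tutorial's rank-profiles may differ:

```python
import math

def atan_normalize(bm25_score: float) -> float:
    # Squash an unbounded BM25 score into [0, 1) so it can be added to a
    # bounded closeness score on a comparable scale.
    return (2.0 / math.pi) * math.atan(bm25_score)

def reciprocal_rank(rank: int, k: int = 60) -> float:
    # Reciprocal rank fusion: each retriever contributes 1 / (k + rank)
    # per document, and contributions are summed across retrievers.
    return 1.0 / (k + rank)

# Fuse a BM25 score of 20 with a closeness score of 0.6:
print(atan_normalize(20.0) + 0.6)

# A document ranked 1st by BM25 and 3rd by vector search:
print(reciprocal_rank(1) + reciprocal_rank(3))
```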

These results (i.e., which profile performs best) might not
transfer to your specific retrieval use case and dataset, so it is important to evaluate the effectiveness of a hybrid model on your specific dataset and having
your own relevance judgments.
transfer to your specific retrieval use case and dataset, so it is important to evaluate the effectiveness of a hybrid model on
your specific dataset.

See [Improving retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for more information on how to collect relevance judgments for your dataset.

### Summary

In this tutorial, we demonstrated combining two retrieval strategies using the Vespa query language and how to expression hybriding ranking using the Vespa ranking framework.
In this tutorial, we demonstrated combining two retrieval strategies using the Vespa query language and how to express hybrid ranking using the Vespa ranking framework.

We showed how to express hybrid queries using the Vespa query language and how to combine the two retrieval strategies using the Vespa ranking framework. We also showed how to evaluate the effectiveness of the hybrid ranking model using one of the datasets from the BEIR benchmark. We hope this tutorial has given you a good understanding of how to combine different retrieval strategies using Vespa, and shown that there is no single silver bullet for all retrieval problems.

2 changes: 1 addition & 1 deletion test/_test_config.yml
@@ -3,6 +3,7 @@
# Use a comma-separated list of URLs to test multiple pages as one unit, like the news-tutorial

urls:
- en/tutorials/hybrid-search.md
- en/searching-multi-valued-fields.md
- en/reranking-in-searcher.md
- en/vespa-quick-start-java.html
@@ -18,7 +19,6 @@ urls:
- >-
en/tutorials/text-search.md,
en/tutorials/text-search-ml.md
# https://docs.vespa.ai/en/operations/multinode-systems.html Tests not implemented for AWS EC2/ECS procedures
# Kubernetes testing is blocked on running minikube on Centos, in Docker - needs more work