From 4f6557826d8e20717ba99ee63e828186dfba1a6c Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Thu, 16 May 2024 10:18:02 +0200 Subject: [PATCH 01/10] Add cleanup --- en/tutorials/hybrid-search.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index 386272ac2e..884f1b74fe 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -292,6 +292,13 @@ $ vespa feed -t http://localhost:8080 vespa-docs.jsonl +
+ +
+$ docker rm -f vespa-hybrid
+
+
+ [^1]: Robertson, Stephen and Zaragoza, Hugo and others, 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval. From 82b066aeed534c3693bac1a4548d48e2853d2930 Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Thu, 16 May 2024 12:57:32 +0200 Subject: [PATCH 02/10] More content and mayb tests pass as well --- en/tutorials/hybrid-search.md | 246 ++++++++++++++++++++++++++++++++-- 1 file changed, 238 insertions(+), 8 deletions(-) diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index 884f1b74fe..1a5f94d342 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -10,7 +10,7 @@ The main goal is to set up a text search app that combines simple text scoring f such as [BM25](../reference/bm25.html) [^1] with vector search in combination with text-embedding models. We demonstrate obtaining the text embeddings within Vespa using Vespa's [embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) functionality. In this guide, we use [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) as the -text embedding model. +text embedding model. We can also recommend following the [text-search](text-search.html) tutorial first. For demonstration purposes, we use the small IR dataset that is part of the [BEIR](https://github.com/beir-cellar/beir) benchmark: [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/). The BEIR version has 2590 train queries, 323 test queries, and 3633 documents. In these experiments we only use the test queries to evaluate various hybrid search techniques. Later tutorials will demonstrate how to use the train split to learn how to rank documents. @@ -116,7 +116,7 @@ schema doc { index: enable-bm25 } field text type string { - indexing: index + indexing: index | summary match: text index: enable-bm25 } @@ -126,8 +126,8 @@ schema doc { } field embedding type tensor<bfloat16>(v[384]) { - indexing: input title . " " . input text | embed | summary | attribute - attribute: { + indexing: input title . " " . input text | embed | attribute + attribute { distance-metric: angular } } @@ -198,7 +198,7 @@ Write the following to `app/services.xml`: <component id="arctic" type="hugging-face-embedder"> <transformer-model url="https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/resolve/main/onnx/model_quantized.onnx"/> <tokenizer-model url="https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/raw/main/tokenizer.json"/> - <pooling>cls</pooling> + <pooling-strategy>cls</pooling-strategy> <prepend> <query>Represent this sentence for searching relevant passages: </query> </prepend> @@ -248,7 +248,7 @@ $ docker run --detach --name vespa-hybrid --hostname vespa-container \ -Notice that we publish two ports (:8080) is the data-plane port where we write and query documents, and 19071 is +Notice that we publish two ports: 8080 is the data-plane where we write and query documents, and 19071 is the control-plane where we can deploy the application. Configure the Vespa CLI to use the local container: @@ -273,7 +273,7 @@ Now, deploy the Vespa application from the `app` directory:
-
+
 $ vespa deploy --wait 300 app
 
@@ -281,7 +281,8 @@ $ vespa deploy --wait 300 app ## Feed the data -The data fed to Vespa must match the document type in the schema. +The data fed to Vespa must match the document type in the schema. This steps also performs embed inference inside Vespa +using the snowflake arctic embedding model. Remember the `component` definition in `services.xml` and the `embed` call in the schema.
@@ -291,6 +292,235 @@ $ vespa feed -t http://localhost:8080 vespa-docs.jsonl
+On an M1 we expect output like the following: + +
{% highlight json%}
+{
+  "feeder.operation.count": 3633,
+  "feeder.seconds": 39.723,
+  "feeder.ok.count": 3633,
+  "feeder.ok.rate": 91.459,
+  "feeder.error.count": 0,
+  "feeder.inflight.count": 0,
+  "http.request.count": 13157,
+  "http.request.bytes": 21102792,
+  "http.request.MBps": 0.531,
+  "http.exception.count": 0,
+  "http.response.count": 13157,
+  "http.response.bytes": 1532828,
+  "http.response.MBps": 0.039,
+  "http.response.error.count": 9524,
+  "http.response.latency.millis.min": 0,
+  "http.response.latency.millis.avg": 1220,
+  "http.response.latency.millis.max": 13703,
+  "http.response.code.counts": {
+    "200": 3633,
+    "429": 9524
+  }
+}{% endhighlight %}
+ +Notice: + +- `feeder.ok.rate` which is the throughput (including embedding inference). See [embedder-performance](../embedding.html#embedder-performance) for details on embedding inference performance. In this case, embedding inference is the bottleneck for overall indexing throughput. +- `http.response.code.counts` matches with `feeder.ok.count` - The dataset has 3633 documents. The `429` are harmless and is Vespa asking the client +to slow down feed speed because all resources are occupied. + + +## Sample queries +We can now run a few sample queries to demonstrate various ways to perform searches over this data using Vespa query language. + +
+ +
+$ ir_datasets export beir/nfcorpus/test queries | head -1
+
+
+ +
+PLAIN-2	Do Cholesterol Statin Drugs Cause Breast Cancer?
+
+ +Here, `PLAIN-2` is the query id of the first test query. We'll use this test query to demonstrate querying Vespa. + +### Sparse search using keywords with bm25 scoring +The following query uses [weakAnd](../using-wand-with-vespa.html) and where `targetHits` is a hint +of how many documents we want to expose to configurable [ranking phases](../phased-ranking.html). + +
+ +
+$ vespa query \
+  'yql=select * from doc where {targetHits:10}userInput(@user-query)' \
+  'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \
+  'hits=1' \
+  'language=en' \
+  'ranking=bm25'
+
+
+ +Notice that we choose `ranking` to specify which rank profile to rank the documents retrieved by the query. + +
{% highlight json %}
+{
+    "root": {
+        "id": "toplevel",
+        "relevance": 1.0,
+        "fields": {
+            "totalCount": 65
+        },
+        "coverage": {
+            "coverage": 100,
+            "documents": 3633,
+            "full": true,
+            "nodes": 1,
+            "results": 1,
+            "resultsFull": 1
+        },
+        "children": [
+            {
+                "id": "id:doc:doc::MED-10",
+                "relevance": 25.521817426330887,
+                "source": "content",
+                "fields": {
+                    "sddocname": "doc",
+                    "documentid": "id:doc:doc::MED-10",
+                    "doc_id": "MED-10",
+                    "title": "Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland",
+                    "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients."
+                }
+            }
+        ]
+    }
+}
+
+{% endhighlight %}
+

The query retrieves and ranks `MED-10` as the most relevant document. Notice the `totalCount` field, which is the number of documents exposed to the ranking
+phases. In this case, we exposed 65 documents: more than our target of 10, but far fewer than the number of documents that match at least one of the query terms. Changing the
+grammar from the default `weakAnd` to `any`, as in the query below, matches 1780 documents, or almost 50% of the indexed documents.
+
+ +
+$ vespa query \
+  'yql=select * from doc where {targetHits:10, grammar:"any"}userInput(@user-query)' \
+  'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \
+  'hits=1' \
+  'language=en' \
+  'ranking=bm25'
+
+
+ +The bm25 profile calculates the relevance score: + +
+rank-profile bm25 {
+        first-phase {
+            expression: bm25(title) + bm25(text)
+        }
+    }
+
+ +So, in this case, `relevance` is the sum of the two BM25 scores. The retrieved document looks relevant, we can look at the judgement for this query `PLAIN-2`: + +
+ +
+$ ir_datasets export beir/nfcorpus/test qrels |grep "PLAIN-2 "
+
+
+ +This lists documents that have been judged for the query `PLAIN-2`. Notice line two, the MED-10 document is judged as very relevant with the grade 2 for the query PLAIN-2. +This dataset has graded relevance judgments where a grade of 1 is less relevant than 2. Our task is to develop a ranking model that ranks all the highly relevant documents (grade 2) before the ones with grade 1. + +
+PLAIN-2 0 MED-2427 2
+PLAIN-2 0 MED-10 2
+PLAIN-2 0 MED-2429 2
+PLAIN-2 0 MED-2430 2
+PLAIN-2 0 MED-2431 2
+PLAIN-2 0 MED-14 2
+PLAIN-2 0 MED-2432 2
+PLAIN-2 0 MED-2428 1
+PLAIN-2 0 MED-2440 1
+PLAIN-2 0 MED-2434 1
+PLAIN-2 0 MED-2435 1
+PLAIN-2 0 MED-2436 1
+PLAIN-2 0 MED-2437 1
+PLAIN-2 0 MED-2438 1
+PLAIN-2 0 MED-2439 1
+PLAIN-2 0 MED-3597 1
+PLAIN-2 0 MED-3598 1
+PLAIN-2 0 MED-3599 1
+PLAIN-2 0 MED-4556 1
+PLAIN-2 0 MED-4559 1
+PLAIN-2 0 MED-4560 1
+PLAIN-2 0 MED-4828 1
+PLAIN-2 0 MED-4829 1
+PLAIN-2 0 MED-4830 1
+
+ +### Dense search using vector search + +Now, we turn to embedding-based retrieval, where we embed the query text using the configured text-embedding model and perform +an exact nearestNeighbor search. We use [embed query](.//embedding.html#embedding-a-query-text) to produce the +input tensor `query(e)` that was defined in the `semantic` rank-profile. + +
+ +
+$ vespa query \
+  'yql=select * from doc where {targetHits:10}nearestNeighbor(embedding,e)' \
+  'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \
+  'input.query(e)=embed(@user-query)' \
+  'hits=1' \
+  'ranking=semantic'
+
+
+ +This query returns the following, also in this case, we got more hits exposed to ranking than our target. + +
{% highlight json %}
+{
+    "root": {
+        "id": "toplevel",
+        "relevance": 1.0,
+        "fields": {
+            "totalCount": 64
+        },
+        "coverage": {
+            "coverage": 100,
+            "documents": 3633,
+            "full": true,
+            "nodes": 1,
+            "results": 1,
+            "resultsFull": 1
+        },
+        "children": [
+            {
+                "id": "id:doc:doc::MED-2429",
+                "relevance": 0.6061378635706601,
+                "source": "content",
+                "fields": {
+                    "sddocname": "doc",
+                    "documentid": "id:doc:doc::MED-2429",
+                    "doc_id": "MED-2429",
+                    "title": "Statin use and risk of breast cancer: a meta-analysis of observational studies.",
+                    "text": "Emerging evidence suggests that statins' may decrease the risk of cancers. However, available evidence on breast cancer is conflicting. We, therefore, examined the association between statin use and risk of breast cancer by conducting a detailed meta-analysis of all observational studies published regarding this subject. PubMed database and bibliographies of retrieved articles were searched for epidemiological studies published up to January 2012, investigating the relationship between statin use and breast cancer. Before meta-analysis, the studies were evaluated for publication bias and heterogeneity. Combined relative risk (RR) and 95 % confidence interval (CI) were calculated using a random-effects model (DerSimonian and Laird method). Subgroup analyses, sensitivity analysis, and cumulative meta-analysis were also performed. A total of 24 (13 cohort and 11 case-control) studies involving more than 2.4 million participants, including 76,759 breast cancer cases contributed to this analysis. We found no evidence of publication bias and evidence of heterogeneity among the studies. Statin use and long-term statin use did not significantly affect breast cancer risk (RR = 0.99, 95 % CI = 0.94, 1.04 and RR = 1.03, 95 % CI = 0.96, 1.11, respectively). When the analysis was stratified into subgroups, there was no evidence that study design substantially influenced the effect estimate. Sensitivity analysis confirmed the stability of our results. Cumulative meta-analysis showed a change in trend of reporting risk of breast cancer from positive to negative in statin users between 1993 and 2011. Our meta-analysis findings do not support the hypothesis that statins' have a protective effect against breast cancer. More randomized clinical trials and observational studies are needed to confirm this association with underlying biological mechanisms in the future."
+                }
+            }
+        ]
+    }
+}{% endhighlight %}
+ +The result of this vector based search differed from the previous sparse keyword search, with a different document ranked at the top. This top-ranking document, labeled as 'MED-2429', is also considered highly relevant based on the graded judgments. + +## Evaluate ranking accuracy + + + + +## Cleanup
From 63eb017752ae441f960574d1b983e1462ba7bb5f Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Thu, 16 May 2024 14:13:51 +0200 Subject: [PATCH 03/10] always something --- en/tutorials/hybrid-search.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index 1a5f94d342..c9eedb73be 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -10,7 +10,7 @@ The main goal is to set up a text search app that combines simple text scoring f such as [BM25](../reference/bm25.html) [^1] with vector search in combination with text-embedding models. We demonstrate obtaining the text embeddings within Vespa using Vespa's [embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) functionality. In this guide, we use [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) as the -text embedding model. We can also recommend following the [text-search](text-search.html) tutorial first. +text embedding model. For demonstration purposes, we use the small IR dataset that is part of the [BEIR](https://github.com/beir-cellar/beir) benchmark: [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/). The BEIR version has 2590 train queries, 323 test queries, and 3633 documents. In these experiments we only use the test queries to evaluate various hybrid search techniques. Later tutorials will demonstrate how to use the train split to learn how to rank documents. @@ -126,7 +126,7 @@ schema doc { } field embedding type tensor<bfloat16>(v[384]) { - indexing: input title . " " . input text | embed | attribute + indexing: input title." ".input text | embed | attribute attribute { distance-metric: angular } @@ -344,7 +344,8 @@ Here, `PLAIN-2` is the query id of the first test query. We'll use this test que ### Sparse search using keywords with bm25 scoring The following query uses [weakAnd](../using-wand-with-vespa.html) and where `targetHits` is a hint -of how many documents we want to expose to configurable [ranking phases](../phased-ranking.html). +of how many documents we want to expose to configurable [ranking phases](../phased-ranking.html). Refer +to [text search tutorial](text-search.html#querying-the-data) for more on querying with `userInput`.
@@ -513,13 +514,18 @@ This query returns the following, also in this case, we got more hits exposed to } }{% endhighlight %} -The result of this vector based search differed from the previous sparse keyword search, with a different document ranked at the top. This top-ranking document, labeled as 'MED-2429', is also considered highly relevant based on the graded judgments. +The result of this vector-based search differed from the previous sparse keyword search, with a different document ranked at the top. This top-ranking document, labeled as 'MED-2429', is also considered highly relevant based on the graded judgments. ## Evaluate ranking accuracy +Now, we looked at two ways to retrieve and rank the results. Now, we need to evaluate all 323 test queries, and then we can compare their effectiveness. +## Hybrid + + + ## Cleanup
From 77d86886796894f1e113eb45e36f847b2a64ac17 Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Thu, 16 May 2024 14:52:33 +0200 Subject: [PATCH 04/10] More content and let us try to install ir_measures --- en/tutorials/hybrid-search.md | 69 ++++++++++++++++++++++++++++++++--- 1 file changed, 63 insertions(+), 6 deletions(-) diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index c9eedb73be..9bcac4ab4a 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -26,7 +26,7 @@ This tutorial uses [Vespa-CLI](../vespa-cli.html) to deploy, feed, and query Ves
-$ pip3 install --ignore-installed vespacli ir_datasets
+$ pip3 install --ignore-installed vespacli ir_datasets ir_measures
 
@@ -422,7 +422,8 @@ rank-profile bm25 { } -So, in this case, `relevance` is the sum of the two BM25 scores. The retrieved document looks relevant, we can look at the judgement for this query `PLAIN-2`: +So, in this case, `relevance` is the sum of the two BM25 scores. The retrieved document looks relevant; we can look at the graded judgment for this query `PLAIN-2`. The +following exports the query relevance judgments and we grep for the query id that we are interested in:
@@ -431,8 +432,8 @@ $ ir_datasets export beir/nfcorpus/test qrels |grep "PLAIN-2 "
-This lists documents that have been judged for the query `PLAIN-2`. Notice line two, the MED-10 document is judged as very relevant with the grade 2 for the query PLAIN-2. -This dataset has graded relevance judgments where a grade of 1 is less relevant than 2. Our task is to develop a ranking model that ranks all the highly relevant documents (grade 2) before the ones with grade 1. +This lists documents judged for the query `PLAIN-2`. Notice line two, the MED-10 document is judged as very relevant with the grade 2 for the query PLAIN-2. +This dataset has graded relevance judgments where a grade of 1 is less relevant than 2.
 PLAIN-2 0 MED-2427 2
@@ -469,7 +470,7 @@ input tensor `query(e)` that was defined in the `semantic` rank-profile.
 
 
-
+
 $ vespa query \
   'yql=select * from doc where {targetHits:10}nearestNeighbor(embedding,e)' \
   'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \
@@ -479,7 +480,7 @@ $ vespa query \
 
-This query returns the following, also in this case, we got more hits exposed to ranking than our target. +This query returns the following response:
{% highlight json %}
 {
@@ -519,8 +520,64 @@ The result of this vector-based search differed from the previous sparse keyword
 ## Evaluate ranking accuracy 
 Now, we looked at two ways to retrieve and rank the results. Now,  we need to evaluate all 323 test queries, and then we can compare their effectiveness. 
 
+For this we write a small script that combines ir_datasets with ir_measures. Save this to `evaluate_ranking.py`
 
+
+ +
+import requests
+import ir_measures
+import ir_datasets
+from ir_measures import nDCG, P, R
+
+
+def parse_response(response, qid):
+    run = []
+    hits = response['root'].get('children',[])
+    for hit in hits:
+      id = hit['fields']['doc_id']
+      relevance = hit['relevance']
+      run.append(ir_measures.ScoredDoc(qid, id, relevance))
+    return run
+
+def search(query,qid, ranking, hits=10, language="en"):    
+    query_request = {
+        'yql': 'select doc_id from doc where ({targetHits:10}userInput(@user-query))',
+        'user-query': query, 
+        'ranking': ranking,
+        'hits' : hits, 
+        'language': language
+    }
+    response = requests.post("http://localhost:8080/search/", json=query_request)
+    if response.ok:
+        return parse_response(response.json(), qid)
+    else:
+      print("Search request failed with response " + str(response.json()))
+      return []
+
+def main():
+  dataset = ir_datasets.load("beir/nfcorpus/test")
+  runs = []
+  for query in dataset.queries_iter():
+    qid = query.query_id
+    query_text = query.text
+    run = search(query_text,qid, "bm25", hits=100)
+    runs.extend(run)
+  metric = ir_measures.calc_aggregate([nDCG@10, P@10, R@100], dataset.qrels, runs)
+  print(metric)
+if __name__ == "__main__":
+    main()
+
+
+Then execute the script, which runs the queries and calculates three IR-related metrics + +
+ +
+$ python3 evaluate_ranking.py
+
+
## Hybrid From 2aaecd5a9af86c38f2b6ba137c14d3b45fd2160e Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Tue, 21 May 2024 15:37:06 +0200 Subject: [PATCH 05/10] more content and also style changes --- css/style.scss | 20 ++- en/tutorials/hybrid-search.md | 241 +++++++++++++++++++++------------- js/process_pre.js | 4 +- 3 files changed, 174 insertions(+), 91 deletions(-) diff --git a/css/style.scss b/css/style.scss index bfc908c09f..e2cf45852a 100644 --- a/css/style.scss +++ b/css/style.scss @@ -500,6 +500,24 @@ table { background-color: $color-brand-300; } +.filepath { + -webkit-font-smoothing: auto; + overflow: auto; + page-break-inside: avoid; + display: block; + line-height: 1.42857143; + color: #d0d0d0; + word-break: break-all; + word-wrap: break-word; + border: 1px solid #404040; + white-space: pre-wrap; + background-color: #404040; + border-radius: 4px; + padding: 14px; + margin-bottom: 10px; + margin-top: 10px; +} + /* Query results */ .search-result-list { @@ -610,4 +628,4 @@ blockquote { background-color: #e8e8e8; padding: 5px 5px 5px 5px; border-radius: 5px; -} \ No newline at end of file +} diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index 9bcac4ab4a..dddf4e00ba 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -7,13 +7,13 @@ redirect_from: This tutorial will guide you through setting up a hybrid text search application. The main goal is to set up a text search app that combines simple text scoring features -such as [BM25](../reference/bm25.html) [^1] with vector search in combination with text-embedding models. We -demonstrate obtaining the text embeddings within Vespa using Vespa's [embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) +such as [BM25](../reference/bm25.html) [^1] with vector search in combination with text-embedding models. +We demonstrate how to obtain text embeddings within Vespa using Vespa's [embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) functionality. In this guide, we use [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) as the text embedding model. -For demonstration purposes, we use the small IR dataset that is part of the [BEIR](https://github.com/beir-cellar/beir) benchmark: [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/). The BEIR version has 2590 train queries, 323 test queries, and 3633 documents. In these experiments -we only use the test queries to evaluate various hybrid search techniques. Later tutorials will demonstrate how to use the train split to learn how to rank documents. +For demonstration purposes, we use the small IR dataset that is part of the [BEIR](https://github.com/beir-cellar/beir) benchmark: [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/). The BEIR version of this dataset has 2590 train queries, 323 test queries, and 3633 documents. In these experiments +we only use the test queries. Later tutorials will demonstrate how to use the train split to learn how to rank documents. {% include pre-req.html memory="4 GB" extra-reqs='
  • Python3
  • @@ -55,7 +55,7 @@ accuracy. We will create a small script that converts the above output to Vespa
    -
    +
    {% highlight python %}
     import sys
     import json
     
    @@ -68,8 +68,7 @@ for line in sys.stdin:
           **doc
         }
       }
    -  print(json.dumps(vespa_doc))
    -
+  print(json.dumps(vespa_doc)){% endhighlight %}
    @@ -100,6 +99,8 @@ A [schema](../schemas.html) is a document-type configuration; a single vespa app For this application, we define a schema `doc` which must be saved in a file named `schemas/doc.sd` in the app directory. Write the following to `app/schemas/doc.sd`: +
    +
     schema doc {
         document doc {
    @@ -148,6 +149,7 @@ schema doc {
         }
     }
     
    +
    A lot is happening here; let us go through it in detail. #### Document type and fields @@ -159,7 +161,7 @@ The [string](../reference/schema-reference.html#string) data type represents bot and there are significant differences between [index and attribute](../text-matching.html#index-and-attribute). The above schema includes default `match` modes for `attribute` and `index` property for visibility. -Note that we are enabling the usage of [BM25](../reference/bm25.html) for `title` and `text`. +Note that we are enabling [BM25](../reference/bm25.html) for `title` and `text`. by including `index: enable-bm25`. The language field is the only field not in the NFCorpus dataset. We hardcode its value to "en" since the dataset is English. Using `set_language` avoids automatic language detection and uses the value when processing the other text fields. Read more in [linguistics](../linguistics.html). @@ -171,15 +173,32 @@ add indexing/storage overhead. String fields grouped using fieldsets must share [match](../reference/schema-reference.html#match) and [linguistic processing](../linguistics.html) settings because the query processing that searches a field or fieldset uses *one* type of transformation. +#### Embedding inference +Our `embedding` field is a [tensor](../tensor-user-guide.html) with a single dense dimension of 384 values. + +``` +field embedding type tensor(v[384]) { + indexing: input title." ".input text | embed arctic | attribute + attribute { + distance-metric: angular + } + } +``` +The `indexing` expression creates the input to the `embed` inference call (in our example the concatenation of the title and the text field). Since +the dataset is small, we do not specify `index` which would build [HNSW](../approximate-nn-hnsw.html) datastructures for faster (but approximate) vector search. This guide uses [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) as the text embedding model. The model is +trained with cosine similarity, which maps to Vespa's `angular` [distance-metric](../reference/schema-reference.html#distance-metric) for +nearestNeighbor search. #### Ranking to determine matched documents ordering You can define many [rank profiles](../ranking.html), named collections of score calculations, and ranking phases. -In this basic starting point, we have a `bm25` rank-profile that uses [bm25](../reference/bm25.html). We sum the two field-level BM25 scores -using a Vespa [ranking expression](../ranking-expressions-features.html). This example uses a single [ranking phase](../phased-ranking.html). +In this starting point, we have two simple rank-profile's: +- a `bm25` rank-profile that uses [bm25](../reference/bm25.html). We sum the two field-level BM25 scores +using a Vespa [ranking expression](../ranking-expressions-features.html). +- a `semantic` rank-profile which is used in combination Vespa's nearestNeighbor query operator (vector search). -Then we have a `semantic` rank-profile which is used in combination with nearestNeighbor query operator (vector search). +Both profiles specify a single [ranking phase](../phased-ranking.html). ### Services Specification @@ -187,35 +206,38 @@ The [services.xml](../reference/services.html) defines the services that make up the Vespa application — which services to run and how many nodes per service. Write the following to `app/services.xml`: -
    -<?xml version="1.0" encoding="UTF-8"?>
    -<services version="1.0">
    -
    -    <container id="default" version="1.0">
    -        <search />
    -        <document-processing />
    -        <document-api />
    -        <component id="arctic" type="hugging-face-embedder">
    -          <transformer-model url="https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/resolve/main/onnx/model_quantized.onnx"/>
    -          <tokenizer-model url="https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/raw/main/tokenizer.json"/>
    -          <pooling-strategy>cls</pooling-strategy>
    -          <prepend>
    -            <query>Represent this sentence for searching relevant passages: </query>
    -          </prepend>
    -      </component>
    -    </container>
    -
    -    <content id="content" version="1.0">
    -        <min-redundancy>1</min-redundancy>
    -        <documents>
    -            <document type="doc" mode="index" />
    -        </documents>
    -        <nodes>
    -            <node distribution-key="0" hostalias="node1" />
    -        </nodes>
    -    </content>
    -</services>
    +
    + +
    {% highlight xml%}
    +
    +
    +    
    +        
    +        
    +        
    +        
    +          
    +          
    +          cls
    +          
    +            Represent this sentence for searching relevant passages: 
    +          
    +      
    +    
    +
    +    
    +        1
    +        
    +            
    +        
    +        
    +            
    +        
    +    
    +
    +{% endhighlight %}
     
    +
    Some notes about the elements above: @@ -249,7 +271,7 @@ $ docker run --detach --name vespa-hybrid --hostname vespa-container \
    Notice that we publish two ports: 8080 is the data-plane where we write and query documents, and 19071 is -the control-plane where we can deploy the application. +the control-plane where we can deploy the application. Note that the data-plane port is inactive before deploying the application. Configure the Vespa CLI to use the local container:
    @@ -259,7 +281,7 @@ $ vespa config set target local
    -Starting the container can take a short while. Make sure +Starting the container can take a short while. Make sure that the configuration service is running by using `vespa status`.
    @@ -281,10 +303,9 @@ $ vespa deploy --wait 300 app ## Feed the data -The data fed to Vespa must match the document type in the schema. This steps also performs embed inference inside Vespa +The data fed to Vespa must match the document type in the schema. This step performs embed inference inside Vespa using the snowflake arctic embedding model. Remember the `component` definition in `services.xml` and the `embed` call in the schema. -
    @@ -292,7 +313,7 @@ $ vespa feed -t http://localhost:8080 vespa-docs.jsonl
     
    -On an M1 we expect output like the following: +On an M1, we expect output like the following:
    {% highlight json%}
     {
    @@ -321,13 +342,13 @@ On an M1 we expect output like the following:
     
     Notice:
     
    -- `feeder.ok.rate` which is the throughput (including embedding inference). See [embedder-performance](../embedding.html#embedder-performance) for details on embedding inference performance. In this case, embedding inference is the bottleneck for overall indexing throughput. 
    -- `http.response.code.counts` matches with `feeder.ok.count` - The dataset has 3633 documents. The `429` are harmless and is Vespa asking the client
    -to slow down feed speed because all resources are occupied.
+- `feeder.ok.rate` is the feed throughput (note that this step includes embedding inference). See [embedder-performance](../embedding.html#embedder-performance) for details on embedding inference performance. In this case, embedding inference is the bottleneck for overall indexing throughput. 
+- the `200` count in `http.response.code.counts` matches `feeder.ok.count`: the dataset has 3633 documents. The `429` responses are harmless; Vespa asks the client
+to slow down the feed speed because of resource contention.
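+
+As a quick sanity check, the reported rate is just the number of successful operations divided by the feed time:
+
+{% highlight python %}
+# feeder.ok.rate is feeder.ok.count divided by feeder.seconds (values from the report above)
+print(round(3633 / 39.723, 2))  # ~91.46 operations/sec, matching feeder.ok.rate
+{% endhighlight %}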
     
     
     ## Sample queries 
    -We can now run a few sample queries to demonstrate various ways to perform searches over this data using Vespa query language.
    +We can now run a few sample queries to demonstrate various ways to perform searches over this data using the [Vespa query language](../query-language.html).
     
     
    @@ -342,7 +363,7 @@ PLAIN-2 Do Cholesterol Statin Drugs Cause Breast Cancer? Here, `PLAIN-2` is the query id of the first test query. We'll use this test query to demonstrate querying Vespa. -### Sparse search using keywords with bm25 scoring +### Lexical search with BM25 scoring The following query uses [weakAnd](../using-wand-with-vespa.html) and where `targetHits` is a hint of how many documents we want to expose to configurable [ranking phases](../phased-ranking.html). Refer to [text search tutorial](text-search.html#querying-the-data) for more on querying with `userInput`. @@ -360,6 +381,7 @@ $ vespa query \
    Notice that we choose `ranking` to specify which rank profile to rank the documents retrieved by the query. +This query returns the following [JSON result response](../reference/default-result-format.html):
    {% highlight json %}
     {
    @@ -412,7 +434,7 @@ $ vespa query \
     
    -The bm25 profile calculates the relevance score: +The bm25 profile calculates the relevance score ( "relevance": 25.5..)
     rank-profile bm25 {
    @@ -422,8 +444,7 @@ rank-profile bm25 {
         }
     
    -So, in this case, `relevance` is the sum of the two BM25 scores. The retrieved document looks relevant; we can look at the graded judgment for this query `PLAIN-2`. The -following exports the query relevance judgments and we grep for the query id that we are interested in: +So, in this case, `relevance` is the sum of the two BM25 scores. The retrieved document looks relevant; we can look at the graded judgment for this query `PLAIN-2`. The following exports the query relevance judgments (we grep for the query id that we are interested in):
    @@ -432,7 +453,7 @@ $ ir_datasets export beir/nfcorpus/test qrels |grep "PLAIN-2 "
    -This lists documents judged for the query `PLAIN-2`. Notice line two, the MED-10 document is judged as very relevant with the grade 2 for the query PLAIN-2. +The following is the output from the above command. Notice line two, the `MED-10` document retrieved above, is judged as very relevant with the grade 2 for the query PLAIN-2. This dataset has graded relevance judgments where a grade of 1 is less relevant than 2.
    @@ -462,11 +483,11 @@ PLAIN-2 0 MED-4829 1
     PLAIN-2 0 MED-4830 1
     
    -### Dense search using vector search +### Dense search using text embedding Now, we turn to embedding-based retrieval, where we embed the query text using the configured text-embedding model and perform -an exact nearestNeighbor search. We use [embed query](.//embedding.html#embedding-a-query-text) to produce the -input tensor `query(e)` that was defined in the `semantic` rank-profile. +an exact `nearestNeighbor` search. We use [embed query](.//embedding.html#embedding-a-query-text) to produce the +input tensor `query(e)`, defined in the `semantic` rank-profile in the schema.
    @@ -480,7 +501,7 @@ $ vespa query \
    -This query returns the following response: +This query returns the following [JSON result response](../reference/default-result-format.html):
    {% highlight json %}
     {
    @@ -515,63 +536,86 @@ This query returns the following response:
         }
     }{% endhighlight %}
    -The result of this vector-based search differed from the previous sparse keyword search, with a different document ranked at the top. This top-ranking document, labeled as 'MED-2429', is also considered highly relevant based on the graded judgments. +The result of this vector-based search differed from the previous sparse keyword search, with a different relevant document @1. ## Evaluate ranking accuracy -Now, we looked at two ways to retrieve and rank the results. Now, we need to evaluate all 323 test queries, and then we can compare their effectiveness. +The previous section demonstrated how to combine the Vespa query language with rank-profile's +to implement two different retrieval and ranking strategies. + +In the following section we evaluate all 323 test queries with both models to compare their overall effectiveness, measured using [nDCG@10](https://en.wikipedia.org/wiki/Discounted_cumulative_gain).`nDCG@10` is the official evaluation metric of the BEIR benchmark and is an appropriate metric for test sets with graded relevance judgments. -For this we write a small script that combines ir_datasets with ir_measures. Save this to `evaluate_ranking.py` +For this evaluation task, we need to write a small script. The following script iterates over the queries in the test set, executes the query against the Vespa instance, and reads +the response from Vespa. It then evaluates and prints the metric.
    +{% highlight python %}
     import requests
    -import ir_measures
     import ir_datasets
    -from ir_measures import nDCG, P, R
    +from ir_measures import calc_aggregate, nDCG, ScoredDoc
    +from enum import Enum
    +from typing import List
     
    +class RModel(Enum):
    +    SPARSE = 1
    +    DENSE = 2
    +    HYBRID = 3
     
    -def parse_response(response, qid):
    -    run = []
    +def parse_vespa_response(response:dict, qid:str) -> List[ScoredDoc]:
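+    # Convert the hits in a Vespa response into ir_measures ScoredDoc entries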
    +    result = []
         hits = response['root'].get('children',[])
         for hit in hits:
    -      id = hit['fields']['doc_id']
    +      doc_id = hit['fields']['doc_id']
           relevance = hit['relevance']
    -      run.append(ir_measures.ScoredDoc(qid, id, relevance))
    -    return run
    -
    -def search(query,qid, ranking, hits=10, language="en"):    
    +      result.append(ScoredDoc(qid, doc_id, relevance))
    +    return result
    +
    +def search(query:str, qid:str, ranking:str, 
    +           hits=10, language="en", mode=RModel.SPARSE) -> List[ScoredDoc]:
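+    # Build the YQL retrieval expression for the chosen mode, run the query, and parse the hits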
    +    yql = "select doc_id from doc where ({targetHits:100}userInput(@user-query))"
    +    if mode == RModel.DENSE:
    +        yql = "select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, e))"
         query_request = {
    -        'yql': 'select doc_id from doc where ({targetHits:10}userInput(@user-query))',
    +        'yql': yql,
             'user-query': query, 
    -        'ranking': ranking,
    +        'ranking.profile': ranking,
             'hits' : hits, 
             'language': language
         }
    +    if mode == RModel.DENSE:
    +        query_request['input.query(e)'] = "embed(@user-query)"
    +
         response = requests.post("http://localhost:8080/search/", json=query_request)
         if response.ok:
    -        return parse_response(response.json(), qid)
    +        return parse_vespa_response(response.json(), qid)
         else:
           print("Search request failed with response " + str(response.json()))
           return []
     
     def main():
       dataset = ir_datasets.load("beir/nfcorpus/test")
    -  runs = []
    +  sparse_results = []
    +  dense_results = []
    +  metrics = [nDCG@10]
       for query in dataset.queries_iter():
         qid = query.query_id
         query_text = query.text
    -    run = search(query_text,qid, "bm25", hits=100)
    -    runs.extend(run)
    -  metric = ir_measures.calc_aggregate([nDCG@10, P@10, R@100], dataset.qrels, runs)
    -  print(metric)
    +    sparse_results.extend(search(query_text, qid, "bm25", mode=RModel.SPARSE))
    +    dense_results.extend(search(query_text, qid, "semantic", mode=RModel.DENSE))
    +
    +  sparse_metrics = calc_aggregate(metrics, dataset.qrels, sparse_results)
    +  dense_metrics = calc_aggregate(metrics, dataset.qrels, dense_results)
    +
    +  print("Sparse BM25: nDCG@10 {:.4f}".format(sparse_metrics[nDCG@10]))
    +  print("Dense Semantic: nDCG@10 {:.4f}".format(dense_metrics[nDCG@10]))
    +  
    +
     if __name__ == "__main__":
    -    main()
    -
    + main(){% endhighlight %}
    -Then execute the script, which runs the queries and calculates three IR-related metrics - +Then execute the script:
    @@ -579,20 +623,41 @@ $ python3 evaluate_ranking.py
     
    -## Hybrid +The script will produce the following output: + +
    +Sparse BM25: nDCG@10 0.3195
    +Dense Semantic: nDCG@10 0.3077
    +
    + +This is the *average* `nDCG@10` score across all the 327 test queries for both methods. In a real-life scenario, we would also +weight the query frequency (The set here has unique queries). You can also experiment beyond a single metric and modify the +script to calculate more [measures](https://ir-measur.es/en/latest/measures.html), for example, including precision with +a relevance label cutoff of 2: + +
    +metrics = [nDCG@10, P(rel=2)@10]
    +
    + +## Hybrid Search & Ranking + +We demonstrated and evaluated two independent retrieval and ranking strategies in the previous sections. Now, we want to explore hybrid search techniques +where we combine: + +- traditional lexical keyword matching with an unsupervised text scoring method (BM25) for two fields +- vector search using a supervised method (text embedding) for one field (a dense vector representation of a concatenation of the title and the text ). +First, we need to express how we will combine the `userInput` with `nearestNeighbor` in the Vespa query language so that we can *retrieve* using either of the methods. ## Cleanup
    -
    -$ docker rm -f vespa-hybrid
    -
    -
    - - +
+$ docker rm -f vespa-hybrid
+
    +
    [^1]: Robertson, Stephen and Zaragoza, Hugo and others, 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval. diff --git a/js/process_pre.js b/js/process_pre.js index e98c7f47f4..e67dd8d549 100644 --- a/js/process_pre.js +++ b/js/process_pre.js @@ -12,8 +12,8 @@ function processFilePREs() { let elem = elems[i]; if (elem.getAttribute("data-test") === "file") { let html = elem.innerHTML; - elem.innerHTML = html.replace(//g, "?>").replace(//g, ">"); - elem.insertAdjacentHTML("beforebegin", "
    file: " + elem.getAttribute("data-path") + "
    "); + //elem.innerHTML = html.replace(//g, "?>").replace(//g, ">"); + elem.insertAdjacentHTML("afterend", "
    Write to file: " + elem.getAttribute("data-path") + "
    "); } } }; From 411cd53aeb6e4d9e43c22b5c249e3931e7673ce8 Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Mon, 3 Jun 2024 09:45:43 +0200 Subject: [PATCH 06/10] test a go --- en/tutorials/hybrid-search.md | 158 ++++++++++++++++++++++++++++++++-- 1 file changed, 153 insertions(+), 5 deletions(-) diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index dddf4e00ba..a6028c2698 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -630,8 +630,8 @@ Sparse BM25: nDCG@10 0.3195 Dense Semantic: nDCG@10 0.3077 -This is the *average* `nDCG@10` score across all the 327 test queries for both methods. In a real-life scenario, we would also -weight the query frequency (The set here has unique queries). You can also experiment beyond a single metric and modify the +This is the *average* `nDCG@10` score across all the 327 test queries for both methods. In a real-word scenario, we would also +weigh the query frequency (The set here has unique queries). You can also experiment beyond a single metric and modify the script to calculate more [measures](https://ir-measur.es/en/latest/measures.html), for example, including precision with a relevance label cutoff of 2: @@ -644,11 +644,159 @@ metrics = [nDCG@10, P(rel=2)@10] We demonstrated and evaluated two independent retrieval and ranking strategies in the previous sections. Now, we want to explore hybrid search techniques where we combine: -- traditional lexical keyword matching with an unsupervised text scoring method (BM25) for two fields -- vector search using a supervised method (text embedding) for one field (a dense vector representation of a concatenation of the title and the text ). +- traditional lexical keyword matching with a text scoring method (BM25) +- embedding-based search using a generic text embedding model -First, we need to express how we will combine the `userInput` with `nearestNeighbor` in the Vespa query language so that we can *retrieve* using either of the methods. +With Vespa, there is a distinction between retrieval (matching) and configurable [ranking](../ranking.html). In the Vespa ranking phases, we can express arbitrary +scoring complexity with the full power of the Vespa [ranking](../ranking.html) framework. Meanwhile, top-k retrieval relies on simple built-in functions associated with Vespa's top-k query operators. +These operators aim to avoid scoring all documents in the collection for a query by using a simplistic scoring function to identify the top-k documents. +These top-k query operators use `index` structures to accelerate the query evaluation, avoiding scoring all documents using heuristics. In the context of hybrid text +search, the following Vespa top-k query operators are relevant: + +- YQL `{targetHits:k}nearestNeighbor()` for dense representations (text embeddings) using +a configured [distance-metric](reference/schema-reference.html#distance-metric) as the scoring function. +- YQL `{targetHits:k}userInput(@user-query)` which by default uses [weakAnd](../using-wand-with-vespa.html) for sparse representations + + +We can combine these using boolean query operators like AND/OR/RANK to express a hybrid search query. Then, there is a wild number of +ways that we can combine various signals in [ranking](../ranking.html). + + +### Define our first simple hybrid rank profile + +First, we can add our first simple hybrid rank profile that combines the dense and sparse components using multiplication to +combine them into a single score. + +
    +closeness(field, embedding) * (1 + bm25(title) + bm25(text))
    +
+
+
+- the `closeness(field, embedding)` rank-feature returns a score in the range 0 to 1 inclusive
+- each of the per-field BM25 scores is in the range of 0 to infinity
+
+We add a bias constant (1) to avoid the overall score becoming 0 if the document does not match any query terms,
+as the BM25 scores would then be 0. We also add `match-features` to be able to debug each of the scores.
+
+
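+To make the effect of the bias term concrete, here is a small Python sketch of the first-phase expression (the feature values are made up for illustration):
+
+{% highlight python %}
+# Sketch of the hybrid first-phase expression: closeness * (1 + bm25(title) + bm25(text))
+def hybrid_score(closeness: float, bm25_title: float, bm25_text: float) -> float:
+    # closeness is in [0, 1]; the bias constant 1 keeps the score
+    # non-zero for documents that only match the vector side
+    return closeness * (1 + bm25_title + bm25_text)
+
+print(hybrid_score(0.6, 0.0, 0.0))   # 0.6   - a pure vector match still scores
+print(hybrid_score(0.6, 8.2, 17.4))  # 15.96 - keyword matches scale the score up
+{% endhighlight %}
+
+The complete schema, now with the `hybrid` rank profile, looks like this:
+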
    + +
    +schema doc {
    +    document doc {
    +        field language type string {
    +            indexing: "en" | set_language 
    +        }
    +        field doc_id type string {
    +            indexing: attribute | summary
    +            match: word
    +        }
    +        field title type string {
    +            indexing: index | summary
    +            match: text
    +            index: enable-bm25
    +        }
    +        field text type string {
    +            indexing: index | summary
    +            match: text
    +            index: enable-bm25
    +        }
    +    }
    +    fieldset default {
    +        fields: title, text
    +    }
    +    
    +    field embedding type tensor<bfloat16>(v[384]) {
    +      indexing: input title." ".input text | embed | attribute
    +      attribute {
    +        distance-metric: angular
    +      }
    +    }
    +  
    +    rank-profile hybrid {
    +        inputs {
    +          query(e) tensor<bfloat16>(v[384])
    +        }
    +        first-phase {
    +            expression: closeness(field, embedding) * (1 + (bm25(title) + bm25(text)))
    +        }
    +        match-features: bm25(title) bm25(text) closeness(field, embedding)
    +    }
    +}
    +
    +
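+
+Because the profile lists `match-features`, each hit in the response carries its feature values, so you can recompute the first-phase score client-side. A small sketch, using the feature values from the `MED-10` hit in the hybrid query example shown further down:
+
+{% highlight python %}
+# Recompute the hybrid first-phase score from a hit's matchfeatures
+def first_phase(mf: dict) -> float:
+    return mf["closeness(field,embedding)"] * (1 + mf["bm25(title)"] + mf["bm25(text)"])
+
+mf = {  # matchfeatures of the MED-10 hit shown later in this section
+    "bm25(text)": 17.35556767018612,
+    "bm25(title)": 8.166249756144769,
+    "closeness(field,embedding)": 0.5994655395517325,
+}
+print(first_phase(mf))  # ~15.8989, equal to the hit's relevance score
+{% endhighlight %}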
    + +Now, re-deploy the Vespa application from the `app` directory: + +
    + +
    +$ vespa deploy --wait 300 app
    +
    +
    + +After that, we can start experimenting with how to express hybrid queries using the Vespa query language. + +### Hybrid query examples + +#### Hybrid query with OR operator + +
    + +
    +$ vespa query \
    +  'yql=select * from doc where ({targetHits:10}userInput(@user-query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
    +  'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \
    +  'input.query(e)=embed(@user-query)' \
    +  'hits=1' \
    +  'language=en' \
    +  'ranking=hybrid'
    +
    +
    +With this query, we express that we want to retrieve the top 10 documents that match the query using either the sparse or dense representation. Then, in the ranking phase, we determine how we score the retrieved documents, using the `hybrid` rank-profile. + +The query returns the following [JSON result response](../reference/default-result-format.html): + +
    {% highlight json %}
    +{
    +    "root": {
    +        "id": "toplevel",
    +        "relevance": 1.0,
    +        "fields": {
    +            "totalCount": 105
    +        },
    +        "coverage": {
    +            "coverage": 100,
    +            "documents": 3633,
    +            "full": true,
    +            "nodes": 1,
    +            "results": 1,
    +            "resultsFull": 1
    +        },
    +        "children": [
    +            {
    +                "id": "id:doc:doc::MED-10",
    +                "relevance": 15.898915593367988,
    +                "source": "content",
    +                "fields": {
    +                    "matchfeatures": {
    +                        "bm25(text)": 17.35556767018612,
    +                        "bm25(title)": 8.166249756144769,
    +                        "closeness(field,embedding)": 0.5994655395517325
    +                    },
    +                    "sddocname": "doc",
    +                    "documentid": "id:doc:doc::MED-10",
    +                    "doc_id": "MED-10",
    +                    "title": "Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland",
    +                    "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients."
    +                }
    +            }
    +        ]
    +    }
    +}{% endhighlight %}
    + +What is going on here is that we are combining the two top-k query operators using a boolean OR. The `totalCount` is the number of documents retrieved into +configurable ranking. The `relevance` is the hybrid score (assigned by the rank-profile `hybrid`). Notice that the `matchfeatures` field shows the individual scores. ## Cleanup From 0ae023b0c1b119c6109cc66439f12f22554250fc Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Thu, 20 Jun 2024 13:58:15 +0200 Subject: [PATCH 07/10] more test --- en/tutorials/hybrid-search.md | 284 ++++++++++++++++++++++++++++++---- 1 file changed, 258 insertions(+), 26 deletions(-) diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index a6028c2698..41abe4b44e 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -549,8 +549,7 @@ the response from Vespa. It then evaluates and prints the metric.
    -
    -{% highlight python %}
    +
    {% highlight python %}
     import requests
     import ir_datasets
     from ir_measures import calc_aggregate, nDCG, ScoredDoc
    @@ -576,6 +575,8 @@ def search(query:str, qid:str, ranking:str,
         yql = "select doc_id from doc where ({targetHits:100}userInput(@user-query))"
         if mode == RModel.DENSE:
             yql = "select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, e))"
    +    elif mode == RModel.HYBRID:
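+        # Hybrid: retrieve with weakAnd OR nearestNeighbor; ranking then scores the union of matches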
    +        yql = "select doc_id from doc where ({targetHits:100}userInput(@user-query)) OR ({targetHits:10}nearestNeighbor(embedding, e))"
         query_request = {
             'yql': yql,
             'user-query': query, 
    @@ -583,7 +584,7 @@ def search(query:str, qid:str, ranking:str,
             'hits' : hits, 
             'language': language
         }
    -    if mode == RModel.DENSE:
    +    if mode == RModel.DENSE or mode == RModel.HYBRID:
             query_request['input.query(e)'] = "embed(@user-query)"
     
         response = requests.post("http://localhost:8080/search/", json=query_request)
    @@ -594,22 +595,28 @@ def search(query:str, qid:str, ranking:str,
           return []
     
     def main():
    +  import argparse
    +  parser = argparse.ArgumentParser(description='Evaluate ranking models')
    +  parser.add_argument('--ranking', type=str, required=True, help='Vespa ranking profile')
    +  parser.add_argument('--mode', type=str, default="sparse", help='retrieval mode, valid values are sparse, dense, hybrid')
    +  args = parser.parse_args()
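+  # Any value other than sparse or dense selects hybrid retrieval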
    +  mode = RModel.HYBRID
    +  if args.mode == "sparse":
    +    mode = RModel.SPARSE
    +  elif args.mode == "dense":
    +    mode = RModel.DENSE
    +     
    +
       dataset = ir_datasets.load("beir/nfcorpus/test")
    -  sparse_results = []
    -  dense_results = []
    +  results = []
       metrics = [nDCG@10]
       for query in dataset.queries_iter():
         qid = query.query_id
         query_text = query.text
    -    sparse_results.extend(search(query_text, qid, "bm25", mode=RModel.SPARSE))
    -    dense_results.extend(search(query_text, qid, "semantic", mode=RModel.DENSE))
    -
    -  sparse_metrics = calc_aggregate(metrics, dataset.qrels, sparse_results)
    -  dense_metrics = calc_aggregate(metrics, dataset.qrels, dense_results)
    -
    -  print("Sparse BM25: nDCG@10 {:.4f}".format(sparse_metrics[nDCG@10]))
    -  print("Dense Semantic: nDCG@10 {:.4f}".format(dense_metrics[nDCG@10]))
    -  
    +    results.extend(search(query_text, qid, args.ranking, mode=mode))
    +    
    +  metrics = calc_aggregate(metrics, dataset.qrels, results)
    +  print("Ranking metric NDCG@10 for rank profile {}: {:.4f}".format(args.ranking, metrics[nDCG@10]))
     
     if __name__ == "__main__":
         main(){% endhighlight %}
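+
+Before running the full evaluation, you can sanity-check the `ir_measures` plumbing on a toy example.
+A minimal sketch (the qrels and scores below are made up, not from NFCorpus):
+
+{% highlight python %}
+from ir_measures import calc_aggregate, nDCG, Qrel, ScoredDoc
+
+# Hypothetical judgments and a hypothetical ranked run for a single query.
+qrels = [Qrel("Q1", "MED-10", 2), Qrel("Q1", "MED-2427", 1)]
+run = [ScoredDoc("Q1", "MED-10", 25.5), ScoredDoc("Q1", "MED-14", 12.1)]
+
+# With a single query, the aggregate nDCG@10 equals that query's score.
+print(calc_aggregate([nDCG@10], qrels, run))
+{% endhighlight %}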
    @@ -619,21 +626,31 @@ Then execute the script:
    -$ python3 evaluate_ranking.py
    +$ python3 evaluate_ranking.py --ranking bm25 --mode sparse
     
    The script will produce the following output:
    -Sparse BM25: nDCG@10 0.3195
    -Dense Semantic: nDCG@10 0.3077
    +Ranking metric NDCG@10 for rank profile bm25: 0.3195
     
    -This is the *average* `nDCG@10` score across all the 327 test queries for both methods. In a real-word scenario, we would also -weigh the query frequency (The set here has unique queries). You can also experiment beyond a single metric and modify the -script to calculate more [measures](https://ir-measur.es/en/latest/measures.html), for example, including precision with -a relevance label cutoff of 2: +Now, we can evaluate the dense model using the same script: + +
    + +
    +$ python3 evaluate_ranking.py --ranking semantic --mode dense
    +
    +
    + +
    +Ranking metric NDCG@10 for rank profile semantic: 0.3077
    +
+Note that the _average_ `nDCG@10` score is computed across all the 323 test queries.
+You can also experiment beyond a single metric and modify the script to calculate
+more [measures](https://ir-measur.es/en/latest/measures.html), for example, including precision with a relevance label cutoff of 2:
     metrics = [nDCG@10, P(rel=2)@10]
    @@ -641,11 +658,12 @@ metrics = [nDCG@10, P(rel=2)@10]
     
     ## Hybrid Search & Ranking
     
    -We demonstrated and evaluated two independent retrieval and ranking strategies in the previous sections. Now, we want to explore hybrid search techniques
    +We demonstrated and evaluated two independent retrieval and ranking strategies in the previous sections. 
    +Now, we want to explore hybrid search techniques
     where we combine:
     
     - traditional lexical keyword matching with a text scoring method (BM25) 
    -- embedding-based search using a generic text embedding model 
    +- embedding-based search using a text embedding model 
     
     With Vespa, there is a distinction between retrieval (matching) and configurable [ranking](../ranking.html). In the Vespa ranking phases, we can express arbitrary
     scoring complexity with the full power of the Vespa [ranking](../ranking.html) framework. Meanwhile, top-k retrieval relies on simple built-in functions associated with Vespa's top-k query operators.  
    @@ -738,8 +756,12 @@ $ vespa deploy --wait 300 app
     After that, we can start experimenting with how to express hybrid queries using the Vespa query language. 
     
     ### Hybrid query examples
    +The following demonstrates combining the two top-k query operators using the Vespa query language. In a later section, we will show
    +how to combine the two retrieval strategies using the Vespa ranking framework. This section focuses on the retrieval part
    +that exposes matched documents to the ranking phase(s).
     
     #### Hybrid query with OR operator
    +The following query exposes documents to ranking that match the query using *either (OR)* the sparse or dense representation. 
     
     
@@ -753,7 +775,8 @@ $ vespa query \
   'ranking=hybrid'
</pre>
</div>

-With this query, we express that we want to retrieve the top 10 documents that match the query using either the sparse or dense representation. Then, in the ranking phase, we determine how we score the retrieved documents, using the `hybrid` rank-profile.
+The documents retrieved into ranking are scored by the `hybrid` rank-profile. Note that both top-k query operators might expose more documents than
+their `targetHits` setting.

 The query returns the following [JSON result response](../reference/default-result-format.html):
    -With this query, we express that we want to retrieve the top 10 documents that match the query using either the sparse or dense representation. Then, in the ranking phase, we determine how we score the retrieved documents, using the `hybrid` rank-profile. +The documents retrieved into ranking is scored by the `hybrid` rank-profile. Note that both top-k query operators might expose more than +the the `targetHits` setting. The query returns the following [JSON result response](../reference/default-result-format.html): @@ -795,8 +818,217 @@ The query returns the following [JSON result response](../reference/default-resu } }{% endhighlight %} -What is going on here is that we are combining the two top-k query operators using a boolean OR. The `totalCount` is the number of documents retrieved into -configurable ranking. The `relevance` is the hybrid score (assigned by the rank-profile `hybrid`). Notice that the `matchfeatures` field shows the individual scores. +What is going on here is that we are combining the two top-k query operators using a boolean OR. The `totalCount` is the number of documents retrieved into ranking (About 100, which is higher than 10 + 10). The `relevance` is the score assigned by `hybrid` rank-profile. Notice that the `matchfeatures` field shows the individual scores. + +#### Hybrid query with AND operator +The following combines the two top-k operators using AND, meaning that the retrieved documents must match the sparse and dense top-k representations. + +
    + +
    +$ vespa query \
    +  'yql=select * from doc where ({targetHits:10}userInput(@user-query)) and ({targetHits:10}nearestNeighbor(embedding,e))' \
    +  'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \
    +  'input.query(e)=embed(@user-query)' \
    +  'hits=1' \
    +  'language=en' \
    +  'ranking=hybrid'
    +
    +
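+
+The boolean combinations shown in these examples can also be generated programmatically.
+A small plain-Python sketch (the helper below is illustrative and not part of the tutorial's evaluation script):
+
+{% highlight python %}
+# Sketch: build the YQL string for each retrieval mode demonstrated above.
+def build_yql(mode: str) -> str:
+    sparse = "({targetHits:100}userInput(@user-query))"
+    dense = "({targetHits:10}nearestNeighbor(embedding, e))"
+    where = {
+        "sparse": sparse,
+        "dense": dense,
+        "or": f"{sparse} or {dense}",    # union: match either representation
+        "and": f"{sparse} and {dense}",  # intersection: must match both
+    }[mode]
+    return f"select doc_id from doc where {where}"
+
+print(build_yql("and"))
+{% endhighlight %}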
    + +#### Hybrid query with rank query operator +The following combines the two top-k operators using the `rank` query operator, which allows us to retrieve only the first +operand of the rank operator, but where the remaining operands allow computing query, document interaction (match) features +that can be used in ranking phases. This +query is meaningful because we can use the match features in the ranking expressions but retrieve only by the dense representation. This +is usually the most resource-effective way (fastest) to combine the two representations. + +
    + +
    +$ vespa query \
    +  'yql=select * from doc where rank(({targetHits:10}nearestNeighbor(embedding,e)), ({targetHits:10}userInput(@user-query)))' \
    +  'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \
    +  'input.query(e)=embed(@user-query)' \
    +  'hits=1' \
    +  'language=en' \
    +  'ranking=hybrid'
    +
    +
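+
+The features computed by the second operand of `rank` surface in each hit's `matchfeatures` field.
+A small sketch of inspecting them over HTTP, mirroring the request pattern of the evaluation script
+(the endpoint and parameter names are the ones used earlier in this tutorial):
+
+{% highlight python %}
+import requests
+
+# Run the rank-operator query and print per-hit relevance and match features.
+query_request = {
+    "yql": "select doc_id from doc where rank(({targetHits:10}nearestNeighbor(embedding,e)), ({targetHits:10}userInput(@user-query)))",
+    "user-query": "Do Cholesterol Statin Drugs Cause Breast Cancer?",
+    "input.query(e)": "embed(@user-query)",
+    "ranking": "hybrid",
+    "hits": 3,
+    "language": "en",
+}
+response = requests.post("http://localhost:8080/search/", json=query_request)
+for hit in response.json().get("root", {}).get("children", []):
+    # matchfeatures holds bm25(title), bm25(text) and closeness(field, embedding)
+    print(hit["relevance"], hit["fields"].get("matchfeatures"))
+{% endhighlight %}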
    +We can also invert the order of the operands to the `rank` query operator that retrieves by the sparse representation +but uses the dense representation to compute match features for ranking. + +
    + +
    +$ vespa query \
    +  'yql=select * from doc where rank(({targetHits:10}userInput(@user-query)),({targetHits:10}nearestNeighbor(embedding,e)))' \
    +  'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \
    +  'input.query(e)=embed(@user-query)' \
    +  'hits=1' \
    +  'language=en' \
    +  'ranking=hybrid'
    +
    +
    + +This way of performing hybrid retrieval allows retrieving only by the sparse representation and uses the dense representation to compute match features for ranking. + +## Hybrid ranking + +In the previous section, we demonstrated combining the two top-k query operators using boolean operators. This section will show combining the two retrieval strategies using the Vespa ranking framework. + + +
    + +
    +$ python3 evaluate_ranking.py --ranking hybrid --mode hybrid
    +
    +
    + +Which outputs + +
    +Ranking metric NDCG@10 for rank profile hybrid: 0.3275
    +
    + +The `nDCG@10` score is higher than the individual models. + +Now, we can experiment with more complex ranking expressions that combine the two retrieval strategies. + +
    + +
    +schema doc {
    +    document doc {
    +        field language type string {
    +            indexing: "en" | set_language 
    +        }
    +        field doc_id type string {
    +            indexing: attribute | summary
    +            match: word
    +        }
    +        field title type string {
    +            indexing: index | summary
    +            match: text
    +            index: enable-bm25
    +        }
    +        field text type string {
    +            indexing: index | summary
    +            match: text
    +            index: enable-bm25
    +        }
    +    }
    +    fieldset default {
    +        fields: title, text
    +    }
    +    
    +    field embedding type tensor<bfloat16>(v[384]) {
    +      indexing: input title." ".input text | embed | attribute
    +      attribute {
    +        distance-metric: angular
    +      }
    +    }
    +  
    +    rank-profile hybrid {
    +        inputs {
    +          query(e) tensor<bfloat16>(v[384])
    +        }
    +        first-phase {
    +            expression: closeness(field, embedding) * (1 + (bm25(title) + bm25(text)))
    +        }
    +        match-features: bm25(title) bm25(text) closeness(field, embedding)
    +    }
    +
    +    rank-profile hybrid-normalize-bm25-with-atan inherits hybrid {
    +        
    +        function scale(val) {
    +            expression: 2*atan(val/8)/(3.14159)
    +        }
    +        function normalized_bm25() {
    +            expression: scale(bm25(title) + bm25(text)) 
    +        }
    +        function cosine() {
    +            expression: cos(distance(field, embedding))
    +        }
    +        first-phase {
    +            expression: normalized_bm25 + cosine
    +        }
    +        match-features {
    +            normalized_bm25 
    +            cosine 
    +            bm25(title)
    +            bm25(text)
    +        }
    +    }
    +
    +    rank-profile hybrid-rrf inherits hybrid-normalize-bm25-with-atan{
    +
    +        function bm25_score() {
    +            expression: bm25(title) + bm25(text)
    +        }
    +        global-phase {
    +            rerank-count: 100
    +            expression: reciprocal_rank(bm25_score) + reciprocal_rank(cosine)
    +        }
    +        match-features: bm25(title) bm25(text) bm25_score cosine
    +    }
    +
    +    rank-profile hybrid-linear-normalize inherits hybrid-normalize-bm25-with-atan{
    +
    +        function bm25_score() {
    +            expression: bm25(title) + bm25(text)
    +        }
    +        global-phase {
    +            rerank-count: 100
    +            expression: normalize_linear(bm25_score) + normalize_linear(cosine)
    +        }
    +        match-features: bm25(title) bm25(text) bm25_score cosine
    +    }
    +}  
    +
    +
    + +Now, re-deploy the Vespa application from the `app` directory: + +
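+
+To build intuition for the `scale` function in `hybrid-normalize-bm25-with-atan`, it helps to evaluate
+it for a few BM25 sums. A small plain-Python sketch (no Vespa required; `math.pi` stands in for the
+3.14159 constant used in the profile):
+
+{% highlight python %}
+import math
+
+# scale(val) = 2*atan(val/8)/pi squashes an unbounded BM25 sum into [0, 1).
+def scale(val: float) -> float:
+    return 2 * math.atan(val / 8) / math.pi
+
+# Larger BM25 sums saturate towards 1.0, so very large scores cannot dominate.
+for bm25_sum in [0, 5, 10, 25, 50]:
+    print(f"bm25_sum={bm25_sum:>3} scaled={scale(bm25_sum):.3f}")
+{% endhighlight %}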
    + +
    +$ vespa deploy --wait 300 app
    +
    +
    + + +Then, we can evaluate the new hybrid profiles using the script: + +
    + +
    +$ python3 evaluate_ranking.py --ranking hybrid-normalize-bm25-with-atan --mode hybrid
    +
    +
    + +
    +Ranking metric NDCG@10 for rank profile hybrid-normalize-bm25-with-atan: 0.3386
    +
    + +
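+
+The `hybrid-rrf` profile fuses the two rankings by rank position instead of by score. A minimal
+plain-Python sketch of reciprocal rank fusion (the document ids are hypothetical, and the constant
+k=60 is the value commonly used for RRF; Vespa's `reciprocal_rank` normalizer takes a configurable constant):
+
+{% highlight python %}
+# Sketch: fuse two ranked lists with reciprocal rank fusion.
+def rrf(ranked_lists, k=60):
+    scores = {}
+    for ranking in ranked_lists:
+        for rank, doc in enumerate(ranking, start=1):
+            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
+    return scores
+
+bm25_order = ["MED-10", "MED-14", "MED-2427"]    # hypothetical bm25_score order
+cosine_order = ["MED-14", "MED-10", "MED-4830"]  # hypothetical cosine order
+fused = rrf([bm25_order, cosine_order])
+print(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))
+{% endhighlight %}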
    + +
    +$ python3 evaluate_ranking.py --ranking hybrid-rrf --mode hybrid
    +
    +
    + +
    +Ranking metric NDCG@10 for rank profile hybrid-rrf: 0.3176
    +
    + +
    + +
    +$ python3 evaluate_ranking.py --ranking hybrid-linear-normalize --mode hybrid
    +
    +
    + ## Cleanup From 3351ea3559f6678be5e4eeb060e3514e72136b50 Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Wed, 21 Aug 2024 11:10:58 +0200 Subject: [PATCH 08/10] Add summary --- en/tutorials/hybrid-search.md | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index 41abe4b44e..b7b089ee72 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -939,6 +939,12 @@ schema doc { match-features: bm25(title) bm25(text) closeness(field, embedding) } + rank-profile hybrid-sum inherits hybrid { + first-phase { + expression: closeness(field, embedding) + ((bm25(title) + bm25(text))) + } + } + rank-profile hybrid-normalize-bm25-with-atan inherits hybrid { function scale(val) { @@ -1000,6 +1006,18 @@ $ vespa deploy --wait 300 app Then, we can evaluate the new hybrid profiles using the script: +
    + +
    +$ python3 evaluate_ranking.py --ranking hybrid-sum --mode hybrid
    +
    +
    + +
    +Ranking metric NDCG@10 for rank profile hybrid-sum: 0.3232
    +
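+
+The lower score for `hybrid-sum` illustrates the scale mismatch between the two signals.
+A quick plain-Python illustration (the scores below are made up):
+
+{% highlight python %}
+# closeness is bounded in [0, 1]; a BM25 sum is unbounded, so plain addition
+# is dominated by the BM25 component.
+closeness, bm25_sum = 0.61, 24.8
+print(closeness + bm25_sum)        # ~25.4: ordering mostly driven by BM25
+print(closeness * (1 + bm25_sum))  # the multiplicative hybrid keeps both signals
+{% endhighlight %}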
    + +
    @@ -1029,6 +1047,13 @@ $ python3 evaluate_ranking.py --ranking hybrid-linear-normalize --mode hybrid
     
    +
    +Ranking metric NDCG@10 for rank profile hybrid-linear-normalize: 0.3356
    +
    + +### Summary + +In this tutorial, we demonstrated combining two retrieval strategies using the Vespa query language and ranking framework. We showed how to express hybrid queries using the Vespa query language and how to combine the two retrieval strategies using the Vespa ranking framework. We also showed how to evaluate the effectiveness of the hybrid ranking model using one of the datasets that are a part of the BEIR benchmark. ## Cleanup From 9e58a817c3166374c9924c73e975d3b68f0c24f5 Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Wed, 21 Aug 2024 11:42:07 +0200 Subject: [PATCH 09/10] Fix links and some wording --- en/tutorials/hybrid-search.md | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index b7b089ee72..b12a85d46f 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -486,7 +486,7 @@ PLAIN-2 0 MED-4830 1 ### Dense search using text embedding Now, we turn to embedding-based retrieval, where we embed the query text using the configured text-embedding model and perform -an exact `nearestNeighbor` search. We use [embed query](.//embedding.html#embedding-a-query-text) to produce the +an exact `nearestNeighbor` search. We use [embed query](../embedding.html#embedding-a-query-text) to produce the input tensor `query(e)`, defined in the `semantic` rank-profile in the schema.
    @@ -673,7 +673,7 @@ These top-k query operators use `index` structures to accelerate the query evalu search, the following Vespa top-k query operators are relevant: - YQL `{targetHits:k}nearestNeighbor()` for dense representations (text embeddings) using -a configured [distance-metric](reference/schema-reference.html#distance-metric) as the scoring function. +a configured [distance-metric](../reference/schema-reference.html#distance-metric) as the scoring function. - YQL `{targetHits:k}userInput(@user-query)` which by default uses [weakAnd](../using-wand-with-vespa.html) for sparse representations @@ -835,13 +835,16 @@ $ vespa query \ 'ranking=hybrid'
+This means that the retrieved documents must match both the sparse and dense representations. For the sparse keyword query matching, the `weakAnd` operator is used by default,
+and it requires that at least one term in the query matches the document (in the searched fieldset).

 #### Hybrid query with rank query operator
-The following combines the two top-k operators using the `rank` query operator, which allows us to retrieve only the first
-operand of the rank operator, but where the remaining operands allow computing query, document interaction (match) features
-that can be used in ranking phases. This
-query is meaningful because we can use the match features in the ranking expressions but retrieve only by the dense representation. This
-is usually the most resource-effective way (fastest) to combine the two representations.
+The following combines the two top-k operators using the [rank](../reference/query-language-reference.html#rank) query operator, which allows us to retrieve
+using only the first operand of the rank operator, while the remaining operands allow computing (match) features
+that can be used in ranking phases.
+
+This query is meaningful because we can use the computed features in the ranking expressions but retrieve only by the dense representation. This
+is usually the most resource-effective way to combine the two representations.
    @@ -856,7 +859,7 @@ $ vespa query \
    We can also invert the order of the operands to the `rank` query operator that retrieves by the sparse representation -but uses the dense representation to compute match features for ranking. +but uses the dense representation to compute features for ranking.
    @@ -871,7 +874,7 @@ $ vespa query \
    -This way of performing hybrid retrieval allows retrieving only by the sparse representation and uses the dense representation to compute match features for ranking. +This way of performing hybrid retrieval allows retrieving only by the sparse representation and uses the dense representation to compute features for ranking. ## Hybrid ranking From f6501d9e6055d29db712ff5caf96e85f9e113b7e Mon Sep 17 00:00:00 2001 From: Jo Kristian Bergum Date: Mon, 26 Aug 2024 14:07:36 +0200 Subject: [PATCH 10/10] Address review comments and add more words --- en/tutorials/hybrid-search.md | 158 +++++++++++++++++++++++----------- 1 file changed, 107 insertions(+), 51 deletions(-) diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index b12a85d46f..e31d7b036d 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -5,15 +5,20 @@ redirect_from: - /documentation/tutorials/hybrid-search.html --- -This tutorial will guide you through setting up a hybrid text search application. + +Hybrid search combines different retrieval methods to improve search quality. This tutorial distinguishes between two core components of search: + +* **Retrieval**: Identifying a subset of potentially relevant documents from a large corpus. Traditional lexical methods like [BM25](../reference/bm25.html) excel at this, as do modern, embedding-based [vector search](../vector-search.html) approaches. +* **Ranking**: Ordering retrieved documents by relevance to refine the results. Vespa's flexible [ranking framework](../ranking.html) enables complex scoring mechanisms. + +This tutorial demonstrates building a hybrid search application with Vespa that leverages the strengths of both lexical and embedding-based approaches. + We'll use the [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) dataset from the [BEIR](https://github.com/beir-cellar/beir) benchmark and explore various hybrid search techniques using Vespa's query language and ranking features. + The main goal is to set up a text search app that combines simple text scoring features such as [BM25](../reference/bm25.html) [^1] with vector search in combination with text-embedding models. We demonstrate how to obtain text embeddings within Vespa using Vespa's [embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) functionality. In this guide, we use [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) as the -text embedding model. - -For demonstration purposes, we use the small IR dataset that is part of the [BEIR](https://github.com/beir-cellar/beir) benchmark: [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/). The BEIR version of this dataset has 2590 train queries, 323 test queries, and 3633 documents. In these experiments -we only use the test queries. Later tutorials will demonstrate how to use the train split to learn how to rank documents. +text embedding model. It is a small model that is fast to run and has a small memory footprint. {% include pre-req.html memory="4 GB" extra-reqs='
  • Python3
  • @@ -22,11 +27,11 @@ we only use the test queries. Later tutorials will demonstrate how to use the tr ## Installing vespa-cli and ir_datasets This tutorial uses [Vespa-CLI](../vespa-cli.html) to deploy, feed, and query Vespa. We also use -[ir-datasets](https://ir-datasets.com/) to obtain the dataset. +[ir-datasets](https://ir-datasets.com/) to obtain the NFCorpus relevance dataset.
    -$ pip3 install --ignore-installed vespacli ir_datasets ir_measures
    +$ pip3 install --ignore-installed vespacli ir_datasets ir_measures requests
     
    @@ -45,13 +50,13 @@ Which outputs: {"doc_id": "MED-10", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995\u20132003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08\u20139.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38\u20130.55 and HR 0.54, 95% CI 0.44\u20130.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins\u2019 effect on survival in breast cancer patients.", "title": "Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland", "url": "http://www.ncbi.nlm.nih.gov/pubmed/25329299"} {% endhighlight %} -The NFCorpus documents have four fields +The NFCorpus documents have four fields: - The `doc_id` and `url` - The `text` and the `title` We are interested in the title and the text, and we want to be able to search across these two fields. We also need to store the `doc_id` to evaluate [ranking](../ranking.html) -accuracy. We will create a small script that converts the above output to Vespa JSON feed format. Create a `convert.py` file: +accuracy. We will create a small script that converts the above output to [Vespa JSON document](../reference/document-json-format.html) format. Create a `convert.py` file:
    @@ -72,7 +77,8 @@ for line in sys.stdin:
-Then we can export the documents using ir_datasets and pipe it to the `convert.py` script:
+With this script, we can convert the document dump to the Vespa JSON feed format. Use
+the following command to convert the entire dataset:
    @@ -81,6 +87,7 @@ $ ir_datasets export beir/nfcorpus docs --format jsonl | python3 convert.py > ve
    +Now, we will create the Vespa application package and schema to index the documents. ## Create a Vespa Application Package @@ -96,7 +103,8 @@ $ mkdir -p app/schemas ### Schema A [schema](../schemas.html) is a document-type configuration; a single vespa application can have multiple schemas with document types. -For this application, we define a schema `doc` which must be saved in a file named `schemas/doc.sd` in the app directory. +For this application, we define a schema `doc`, which must be saved in a file named `schemas/doc.sd` in the application package directory. + Write the following to `app/schemas/doc.sd`:
    @@ -162,7 +170,7 @@ and there are significant differences between [index and attribute](../text-matc schema includes default `match` modes for `attribute` and `index` property for visibility. Note that we are enabling [BM25](../reference/bm25.html) for `title` and `text`. -by including `index: enable-bm25`. The language field is the only field not in the NFCorpus dataset. +by including `index: enable-bm25`. The language field is the only field that is not the NFCorpus dataset. We hardcode its value to "en" since the dataset is English. Using `set_language` avoids automatic language detection and uses the value when processing the other text fields. Read more in [linguistics](../linguistics.html). @@ -174,7 +182,7 @@ add indexing/storage overhead. String fields grouped using fieldsets must share the query processing that searches a field or fieldset uses *one* type of transformation. #### Embedding inference -Our `embedding` field is a [tensor](../tensor-user-guide.html) with a single dense dimension of 384 values. +Our `embedding` vector field is of [tensor](../tensor-user-guide.html) type with a single named dimension (`v`) of 384 values. ``` field embedding type tensor(v[384]) { @@ -230,9 +238,6 @@ Write the following to `app/services.xml`: - - - {% endhighlight %} @@ -245,14 +250,11 @@ Some notes about the elements above: - `` sets up the [query endpoint](../query-api.html). The default port is 8080. - `` sets up the [document endpoint](../reference/document-v1-api-reference.html) for feeding. - `component` with type `hugging-face-embedder` configures the embedder in the application package. This include where to fetch the model files from, the prepend -instructions, and the pooling strategy. +instructions, and the pooling strategy. See [huggingface-embedder](../embedding.html#huggingface-embedder) for details and other embedders supported. - `` defines how documents are stored and searched - `` denotes how many copies to keep of each document. - `` assigns the document types in the _schema_ to content clusters — - the content cluster capacity can be increased by adding node elements — - see [elasticity](../elasticity.html). - (See also the [reference](../reference/services-content.html) for more on content cluster setup.) -- `` defines the hosts for the content cluster. + ## Deploy the application package @@ -353,7 +355,7 @@ We can now run a few sample queries to demonstrate various ways to perform searc
    -$ ir_datasets export beir/nfcorpus/test queries | head -1
    +$ ir_datasets export beir/nfcorpus/test queries --fields query_id text |head -1
     
@@ -364,6 +366,7 @@ PLAIN-2 Do Cholesterol Statin Drugs Cause Breast Cancer?
 Here, `PLAIN-2` is the query id of the first test query. We'll use this test query to demonstrate querying Vespa.
 
 ### Lexical search with BM25 scoring
+
 The following query uses [weakAnd](../using-wand-with-vespa.html) and where `targetHits` is a hint
 of how many documents we want to expose to configurable [ranking phases](../phased-ranking.html).
 Refer to [text search tutorial](text-search.html#querying-the-data) for more on querying with `userInput`.
@@ -419,8 +422,9 @@ This query returns the following [JSON result response](../reference/default-res
 {% endhighlight %}
 
 The query retrieves and ranks `MED-10` as the most relevant document—notice the `totalCount` which is the number of documents that were retrieved for ranking
-phases. In this case, we exposed 65 documents, it is higher than our target, but also much fewer than the total number of documents that match any query terms like below, changing the
-grammar from the default `weakAnd` to `any` matches 1780, or almost 50% of the indexed documents.
+phases. In this case, we exposed 65 documents to first-phase ranking; this is higher than our target, but still far fewer than the total number of documents that match any query terms.
+
+In the example below, we change the grammar from the default `weakAnd` to `any`, and the query matches 1780 documents, or almost 50% of the indexed documents.
    @@ -434,7 +438,7 @@ $ vespa query \
    -The bm25 profile calculates the relevance score ( "relevance": 25.5..) +The bm25 rank profile calculates the relevance score ( "relevance": 25.5..), this was configured in the schema as:
     rank-profile bm25 {
    @@ -453,8 +457,8 @@ $ ir_datasets export beir/nfcorpus/test qrels |grep "PLAIN-2 "
     
-The following is the output from the above command. Notice line two, the `MED-10` document retrieved above, is judged as very relevant with the grade 2 for the query PLAIN-2.
-This dataset has graded relevance judgments where a grade of 1 is less relevant than 2.
+The following is the output from the above command. Notice on line two that the `MED-10` document retrieved above is judged as very relevant with grade 2 (perfect) for the query_id PLAIN-2.
+This dataset has graded relevance judgments, where a grade of 1 is less relevant than 2. Documents retrieved by the system without a relevance judgment are assumed to be irrelevant (grade 0).
     PLAIN-2 0 MED-2427 2
    @@ -536,16 +540,34 @@ This query returns the following [JSON result response](../reference/default-res
         }
     }{% endhighlight %}
-The result of this vector-based search differed from the previous sparse keyword search, with a different relevant document @1.
+The result of this vector-based search differed from the previous sparse keyword search, with a different relevant document at position 1. In this case,
+the relevance score is 0.606, calculated by the `closeness` function in the `semantic` rank-profile.
+
+```
+rank-profile semantic {
+    inputs {
+      query(e) tensor<bfloat16>(v[384])
+    }
+    first-phase {
+        expression: closeness(field, embedding)
+    }
+  }
+```
+
+Where [closeness(field, embedding)](../reference/rank-features.html#attribute-match-features-normalized) is a ranking feature that calculates the cosine similarity between the query and the document embedding. It returns the inverse of the distance between the two vectors: a small distance means a higher closeness. This is because Vespa sorts results in descending order of relevance.
+Descending order means the largest score will appear at the top of the ranked list.
+
+Note that similarity scores of embedding vectors are often optimized via contrastive or ranking losses, which make them difficult to interpret.
 
 ## Evaluate ranking accuracy
 
-The previous section demonstrated how to combine the Vespa query language with rank-profile's
+
+The previous section demonstrated how to combine the Vespa query language with rank profiles
 to implement two different retrieval and ranking strategies.
 
-In the following section we evaluate all 323 test queries with both models to compare their overall effectiveness, measured using [nDCG@10](https://en.wikipedia.org/wiki/Discounted_cumulative_gain).`nDCG@10` is the official evaluation metric of the BEIR benchmark and is an appropriate metric for test sets with graded relevance judgments.
+In the following section we evaluate all 323 test queries with both models to compare their overall effectiveness, measured using [nDCG@10](https://en.wikipedia.org/wiki/Discounted_cumulative_gain). `nDCG@10` is the official evaluation metric of the BEIR benchmark and is an appropriate metric for test sets with graded relevance judgments.
 
 For this evaluation task, we need to write a small script. The following script iterates over the queries in the test set, executes the query against the Vespa instance, and reads
-the response from Vespa. It then evaluates and prints the metric.
+the response from Vespa. It then evaluates and prints the metric. The overall effectiveness is measured using the average of the per-query `nDCG@10` scores.
    @@ -659,15 +681,16 @@ metrics = [nDCG@10, P(rel=2)@10] ## Hybrid Search & Ranking We demonstrated and evaluated two independent retrieval and ranking strategies in the previous sections. -Now, we want to explore hybrid search techniques -where we combine: +Now, we want to explore hybrid search techniques where we combine: - traditional lexical keyword matching with a text scoring method (BM25) - embedding-based search using a text embedding model -With Vespa, there is a distinction between retrieval (matching) and configurable [ranking](../ranking.html). In the Vespa ranking phases, we can express arbitrary -scoring complexity with the full power of the Vespa [ranking](../ranking.html) framework. Meanwhile, top-k retrieval relies on simple built-in functions associated with Vespa's top-k query operators. -These operators aim to avoid scoring all documents in the collection for a query by using a simplistic scoring function to identify the top-k documents. +With Vespa, there is a distinction between retrieval (matching) and configurable [ranking](../ranking.html). + +In the Vespa ranking phases, we can express arbitrary scoring complexity with the full power of the Vespa [ranking](../ranking.html) framework. +Meanwhile, top-k retrieval relies on simple built-in functions associated with Vespa's top-k query operators. +These top-k operators aim to avoid scoring all documents in the collection for a query by using a simplistic scoring function to identify the top-k documents. These top-k query operators use `index` structures to accelerate the query evaluation, avoiding scoring all documents using heuristics. In the context of hybrid text search, the following Vespa top-k query operators are relevant: @@ -677,7 +700,7 @@ a configured [distance-metric](../reference/schema-reference.html#distance-metri - YQL `{targetHits:k}userInput(@user-query)` which by default uses [weakAnd](../using-wand-with-vespa.html) for sparse representations -We can combine these using boolean query operators like AND/OR/RANK to express a hybrid search query. Then, there is a wild number of +We can combine these operators using boolean query operators like AND/OR/RANK to express a hybrid search query. Then, there is a wild number of ways that we can combine various signals in [ranking](../ranking.html). @@ -690,7 +713,7 @@ combine them into a single score. closeness(field, embedding) * (1 + bm25(title) + bm25(text)) -- the `closeness(field, embeddding)` rank-feature returns a score in the range 0 to 1 inclusive +- the [closeness(field, embedding)](../reference/rank-features.html#attribute-match-features-normalized) rank-feature returns a normalized score in the range 0 to 1 inclusive - Any of the per-field BM25 scores are in the range of 0 to infinity We add a bias constant (1) to avoid the overall score becoming 0 if the document does not match any query terms, @@ -757,10 +780,10 @@ After that, we can start experimenting with how to express hybrid queries using ### Hybrid query examples The following demonstrates combining the two top-k query operators using the Vespa query language. In a later section, we will show -how to combine the two retrieval strategies using the Vespa ranking framework. This section focuses on the retrieval part -that exposes matched documents to the ranking phase(s). +how to combine the two retrieval strategies using the Vespa ranking framework. This section focuses on the top-k retrieval part +that exposes matched documents to the Vespa [ranking](../ranking.html) phase(s). 
-#### Hybrid query with OR operator +#### Hybrid query using the OR operator The following query exposes documents to ranking that match the query using *either (OR)* the sparse or dense representation.
    @@ -778,7 +801,7 @@ $ vespa query \ The documents retrieved into ranking is scored by the `hybrid` rank-profile. Note that both top-k query operators might expose more than the the `targetHits` setting. -The query returns the following [JSON result response](../reference/default-result-format.html): +The above query returns the following [JSON result response](../reference/default-result-format.html):
    {% highlight json %}
     {
    @@ -818,10 +841,13 @@ The query returns the following [JSON result response](../reference/default-resu
         }
     }{% endhighlight %}
-What is going on here is that we are combining the two top-k query operators using a boolean OR. The `totalCount` is the number of documents retrieved into ranking (About 100, which is higher than 10 + 10). The `relevance` is the score assigned by `hybrid` rank-profile. Notice that the `matchfeatures` field shows the individual scores.
+What is going on here is that we are combining the two top-k query operators using a boolean OR (disjunction).
+The `totalCount` is the number of documents retrieved into ranking (about 100, which is higher than 10 + 10).
+The `relevance` is the score assigned by the `hybrid` rank-profile. Notice that the `matchfeatures` field shows all the feature scores. This is
+useful for debugging and understanding the ranking behavior, and also for feature logging.
 
 #### Hybrid query with AND operator
-The following combines the two top-k operators using AND, meaning that the retrieved documents must match the sparse and dense top-k representations.
+The following combines the two top-k operators using AND, meaning that the retrieved documents must match both the sparse and dense top-k operators.
    @@ -859,7 +885,8 @@ $ vespa query \
    We can also invert the order of the operands to the `rank` query operator that retrieves by the sparse representation -but uses the dense representation to compute features for ranking. +but uses the dense representation to compute features for ranking. This is very useful in cases where we do not want +to build HNSW indexes (adds memory and slows down indexing), but still be able to use semantic signals in ranking phases.
    @@ -874,11 +901,15 @@ $ vespa query \
    -This way of performing hybrid retrieval allows retrieving only by the sparse representation and uses the dense representation to compute features for ranking. +This way of performing hybrid retrieval allows retrieving only by the sparse representation and uses the dense vector +representation to compute features for ranking. ## Hybrid ranking -In the previous section, we demonstrated combining the two top-k query operators using boolean operators. This section will show combining the two retrieval strategies using the Vespa ranking framework. +In the previous section, we demonstrated combining the two top-k query operators using boolean query operators. + +This section will show combining the two retrieval strategies using the Vespa ranking framework. We can first start evaluating +the effectiveness of the hybrid rank profile that combines the two retrieval strategies.
    @@ -894,9 +925,10 @@ Which outputs Ranking metric NDCG@10 for rank profile hybrid: 0.3275 -The `nDCG@10` score is higher than the individual models. +The `nDCG@10` score is slightly higher than the profiles that only use one of the ranking strategies. -Now, we can experiment with more complex ranking expressions that combine the two retrieval strategies. +Now, we can experiment with more complex ranking expressions that combine the two retrieval strategies. W +e add a few more rank profiles to the schema that combine the two retrieval strategies in different ways.
    @@ -1006,8 +1038,21 @@ $ vespa deploy --wait 300 app
    +Let us break down the new rank profiles: -Then, we can evaluate the new hybrid profiles using the script: +- `hybrid-sum` combines the two retrieval strategies using addition. This is a simple way to combine the two strategies. But since the BM25 scores are not normalized (unbound) +and the closeness score is normalized (0-1), the BM25 scores will dominate the closeness score. +- `hybrid-normalize-bm25-with-atan` combines the two strategies using a normalized BM25 score and the cosine similarity. The BM25 scores are normalized using the `atan` function. +- `hybrid-rrf` combines the two strategies using the reciprocal rank feature. This is a way to combine the two strategies using a reciprocal rank feature. +- `hybrid-linear-normalize` combines the two strategies using a linear normalization function. This is a way to combine the two strategies using a linear normalization function. + +The two last profiles are using `global-phase` to rerank the top 100 documents using the reciprocal rank and linear normalization functions. This can only be done in the global phase +as it requires access to all the documents that are retrieved into ranking and in a multi-node setup, this requires communication between the nodes and knowledge of +the score distribution across all the nodes. In addition, each ranking phase can only order the documents by a single score. + +### Evaluate the new rank profiles + +Adding new rank-profiles is a hot change. Once we have deployed the application, we can evaluate the new hybrid profiles using the script:
@@ -1054,9 +1099,20 @@ $ python3 evaluate_ranking.py --ranking hybrid-linear-normalize --mode hybrid
 
 Ranking metric NDCG@10 for rank profile hybrid-linear-normalize: 0.3356
 
+On this particular dataset, the `hybrid-normalize-bm25-with-atan` rank profile performs the best, but the difference is small. This also demonstrates that hybrid search
+and ranking is a complex problem and that the effectiveness of a hybrid model depends on the dataset and the retrieval strategies.
+
+These results (which profile is the best) might not
+transfer to your specific retrieval use case and dataset, so it is important to evaluate the effectiveness of a hybrid model on your specific dataset and to have
+your own relevance judgments.
+
+See [Improving retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for more information on how to collect relevance judgments for your dataset.
+
 ### Summary
 
-In this tutorial, we demonstrated combining two retrieval strategies using the Vespa query language and ranking framework. We showed how to express hybrid queries using the Vespa query language and how to combine the two retrieval strategies using the Vespa ranking framework. We also showed how to evaluate the effectiveness of the hybrid ranking model using one of the datasets that are a part of the BEIR benchmark.
+In this tutorial, we demonstrated combining two retrieval strategies using the Vespa query language and how to express hybrid ranking using the Vespa ranking framework.
+
+We showed how to express hybrid queries using the Vespa query language and how to combine the two retrieval strategies using the Vespa ranking framework. We also showed how to evaluate the effectiveness of the hybrid ranking model using one of the datasets that are a part of the BEIR benchmark. We hope this tutorial has given you a good understanding of how to combine different retrieval strategies using Vespa, and that there is no single silver bullet for all retrieval problems.
 
 ## Cleanup