diff --git a/css/style.scss b/css/style.scss index bfc908c09f..e2cf45852a 100644 --- a/css/style.scss +++ b/css/style.scss @@ -500,6 +500,24 @@ table { background-color: $color-brand-300; } +.filepath { + -webkit-font-smoothing: auto; + overflow: auto; + page-break-inside: avoid; + display: block; + line-height: 1.42857143; + color: #d0d0d0; + word-break: break-all; + word-wrap: break-word; + border: 1px solid #404040; + white-space: pre-wrap; + background-color: #404040; + border-radius: 4px; + padding: 14px; + margin-bottom: 10px; + margin-top: 10px; +} + /* Query results */ .search-result-list { @@ -610,4 +628,4 @@ blockquote { background-color: #e8e8e8; padding: 5px 5px 5px 5px; border-radius: 5px; -} \ No newline at end of file +} diff --git a/en/tutorials/hybrid-search.md b/en/tutorials/hybrid-search.md index 10a89998d9..af4e3bc8de 100644 --- a/en/tutorials/hybrid-search.md +++ b/en/tutorials/hybrid-search.md @@ -5,15 +5,20 @@ redirect_from: - /documentation/tutorials/hybrid-search.html --- -This tutorial will guide you through setting up a hybrid text search application. + +Hybrid search combines different retrieval methods to improve search quality. This tutorial distinguishes between two core components of search: + +* **Retrieval**: Identifying a subset of potentially relevant documents from a large corpus. Traditional lexical methods like [BM25](../reference/bm25.html) excel at this, as do modern, embedding-based [vector search](../vector-search.html) approaches. +* **Ranking**: Ordering retrieved documents by relevance to refine the results. Vespa's flexible [ranking framework](../ranking.html) enables complex scoring mechanisms. + +This tutorial demonstrates building a hybrid search application with Vespa that leverages the strengths of both lexical and embedding-based approaches. 
+ We'll use the [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) dataset from the [BEIR](https://github.com/beir-cellar/beir) benchmark and explore various hybrid search techniques using Vespa's query language and ranking features. + The main goal is to set up a text search app that combines simple text scoring features -such as [BM25](../reference/bm25.html) [^1] with vector search in combination with text-embedding models. We -demonstrate obtaining the text embeddings within Vespa using Vespa's [embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) +such as [BM25](../reference/bm25.html) [^1] with vector search in combination with text-embedding models. +We demonstrate how to obtain text embeddings within Vespa using Vespa's [embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) functionality. In this guide, we use [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) as the -text embedding model. - -For demonstration purposes, we use the small IR dataset that is part of the [BEIR](https://github.com/beir-cellar/beir) benchmark: [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/). The BEIR version has 2590 train queries, 323 test queries, and 3633 documents. In these experiments -we only use the test queries to evaluate various hybrid search techniques. Later tutorials will demonstrate how to use the train split to learn how to rank documents. +text embedding model. It is a small model that is fast to run and has a small memory footprint. {% include pre-req.html memory="4 GB" extra-reqs='
-$ pip3 install --ignore-installed vespacli ir_datasets +$ pip3 install --ignore-installed vespacli ir_datasets ir_measures requests
+{% highlight python %} import sys import json @@ -68,12 +73,12 @@ for line in sys.stdin: **doc } } - print(json.dumps(vespa_doc)) -+ print(json.dumps(vespa_doc)){% endhighlight %}
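For clarity, here is a self-contained sketch of the conversion step above. The BEIR-style input field names (`_id`, `title`, `text`) and the exact helper name are assumptions based on this tutorial's `doc` schema, not part of the diff:

```python
import json

# Hypothetical sketch: convert one BEIR-style JSONL document line into a
# Vespa feed operation for the `doc` schema used in this tutorial.
def to_vespa_operation(line: str) -> str:
    doc = json.loads(line)
    vespa_doc = {
        "put": f"id:doc:doc::{doc['_id']}",  # Vespa document id scheme
        "fields": {
            "doc_id": doc["_id"],
            "title": doc.get("title", ""),
            "text": doc.get("text", ""),
        },
    }
    return json.dumps(vespa_doc)

sample = '{"_id": "MED-10", "title": "Statin Use and Breast Cancer Survival", "text": "..."}'
print(to_vespa_operation(sample))
```

Piping each exported line through a function like this yields one feed operation per document, which the Vespa CLI can feed directly.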
schema doc { document doc { @@ -116,7 +125,7 @@ schema doc { index: enable-bm25 } field text type string { - indexing: index + indexing: index | summary match: text index: enable-bm25 } @@ -126,8 +135,8 @@ schema doc { } field embedding type tensor<bfloat16>(v[384]) { - indexing: input title . " " . input text | embed | summary | attribute - attribute: { + indexing: input title." ".input text | embed | attribute + attribute { distance-metric: angular } } @@ -148,6 +157,7 @@ schema doc { } }+
-<?xml version="1.0" encoding="UTF-8"?> -<services version="1.0"> - - <container id="default" version="1.0"> - <search /> - <document-processing /> - <document-api /> - <component id="arctic" type="hugging-face-embedder"> - <transformer-model url="https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/resolve/main/onnx/model_quantized.onnx"/> - <tokenizer-model url="https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/raw/main/tokenizer.json"/> - <pooling>cls</pooling> - <prepend> - <query>Represent this sentence for searching relevant passages: </query> - </prepend> - </component> - </container> - - <content id="content" version="1.0"> - <min-redundancy>1</min-redundancy> - <documents> - <document type="doc" mode="index" /> - </documents> - <nodes> - <node distribution-key="0" hostalias="node1" /> - </nodes> - </content> -</services> +-Notice that we publish two ports (:8080) is the data-plane port where we write and query documents, and 19071 is -the control-plane where we can deploy the application. +Notice that we publish two ports: 8080 is the data-plane where we write and query documents, and 19071 is +the control-plane where we can deploy the application. Note that the data-plane port is inactive before deploying the application. Configure the Vespa CLI to use the local container:+ +Some notes about the elements above: @@ -224,14 +251,11 @@ Some notes about the elements above: - `{% highlight xml%} +++ + +{% endhighlight %}+ + ++ + + + ++ + cls ++ +Represent this sentence for searching relevant passages: ++ +1 ++ ++ ` sets up the [query endpoint](../query-api.html). The default port is 8080. - ` ` sets up the [document endpoint](../reference/document-v1-api-reference.html) for feeding. - `component` with type `hugging-face-embedder` configures the embedder in the application package. This include where to fetch the model files from, the prepend -instructions, and the pooling strategy. +instructions, and the pooling strategy. 
See [huggingface-embedder](../embedding.html#huggingface-embedder) for details and other supported embedders.
- `<content>` defines how documents are stored and searched.
- `<min-redundancy>` denotes how many copies to keep of each document.
- `<documents>` assigns the document types in the _schema_ to content clusters —
  the content cluster capacity can be increased by adding node elements —
  see [elasticity](../elasticity.html).
  (See also the [reference](../reference/services-content.html) for more on content cluster setup.)
- `<nodes>` defines the hosts for the content cluster.

## Deploy the application package

@@ -249,8 +273,8 @@
$ docker run --detach --name vespa-hybrid --hostname vespa-container \
+$ vespa deploy --wait 300 app
{% highlight json%}
+{
+  "feeder.operation.count": 3633,
+  "feeder.seconds": 39.723,
+  "feeder.ok.count": 3633,
+  "feeder.ok.rate": 91.459,
+  "feeder.error.count": 0,
+  "feeder.inflight.count": 0,
+  "http.request.count": 13157,
+  "http.request.bytes": 21102792,
+  "http.request.MBps": 0.531,
+  "http.exception.count": 0,
+  "http.response.count": 13157,
+  "http.response.bytes": 1532828,
+  "http.response.MBps": 0.039,
+  "http.response.error.count": 9524,
+  "http.response.latency.millis.min": 0,
+  "http.response.latency.millis.avg": 1220,
+  "http.response.latency.millis.max": 13703,
+  "http.response.code.counts": {
+    "200": 3633,
+    "429": 9524
+  }
+}{% endhighlight %}

+Notice:
+
+- `feeder.ok.rate` is the feed throughput in documents per second. Note that this step includes embedding inference; see [embedder-performance](../embedding.html#embedder-performance) for details on embedding inference performance. In this case, embedding inference is the bottleneck for overall indexing throughput.
+- The number of `200` responses in `http.response.code.counts` matches `feeder.ok.count`: the dataset has 3633 documents. The `429` responses are harmless; Vespa asks the client
+to slow down the feed speed because of resource contention.

+## Sample queries
+We can now run a few sample queries to demonstrate various ways to perform searches over this data using the [Vespa query language](../query-language.html).
+$ ir_datasets export beir/nfcorpus/test queries --fields query_id text |head -1 ++
+PLAIN-2 Do Cholesterol Statin Drugs Cause Breast Cancer? ++ +Here, `PLAIN-2` is the query id of the first test query. We'll use this test query to demonstrate querying Vespa. + +### Lexical search with BM25 scoring + +The following query uses [weakAnd](../using-wand-with-vespa.html) and where `targetHits` is a hint +of how many documents we want to expose to configurable [ranking phases](../phased-ranking.html). Refer +to [text search tutorial](text-search.html#querying-the-data) for more on querying with `userInput`. + +
+$ vespa query \ + 'yql=select * from doc where {targetHits:10}userInput(@user-query)' \ + 'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \ + 'hits=1' \ + 'language=en' \ + 'ranking=bm25' ++
{% highlight json %} +{ + "root": { + "id": "toplevel", + "relevance": 1.0, + "fields": { + "totalCount": 65 + }, + "coverage": { + "coverage": 100, + "documents": 3633, + "full": true, + "nodes": 1, + "results": 1, + "resultsFull": 1 + }, + "children": [ + { + "id": "id:doc:doc::MED-10", + "relevance": 25.521817426330887, + "source": "content", + "fields": { + "sddocname": "doc", + "documentid": "id:doc:doc::MED-10", + "doc_id": "MED-10", + "title": "Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland", + "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). 
The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients."
+      }
+    }
+  ]
+ }
+}
+
+{% endhighlight %}

+The query retrieves and ranks `MED-10` as the most relevant document. Notice `totalCount`, which is the number of documents that were retrieved into the ranking
+phases. In this case, 65 documents were exposed to first-phase ranking: more than our target of 10, but far fewer than the number of documents matching at least one query term.
+
+In the example below, we change the grammar from the default `weakAnd` to `any`; the query then matches 1780 documents, almost 50% of the indexed collection.
+$ vespa query \ + 'yql=select * from doc where {targetHits:10, grammar:"any"}userInput(@user-query)' \ + 'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \ + 'hits=1' \ + 'language=en' \ + 'ranking=bm25' ++
+rank-profile bm25 { + first-phase { + expression: bm25(title) + bm25(text) + } + } ++ +So, in this case, `relevance` is the sum of the two BM25 scores. The retrieved document looks relevant; we can look at the graded judgment for this query `PLAIN-2`. The following exports the query relevance judgments (we grep for the query id that we are interested in): + +
+$ ir_datasets export beir/nfcorpus/test qrels |grep "PLAIN-2 " ++
+PLAIN-2 0 MED-2427 2 +PLAIN-2 0 MED-10 2 +PLAIN-2 0 MED-2429 2 +PLAIN-2 0 MED-2430 2 +PLAIN-2 0 MED-2431 2 +PLAIN-2 0 MED-14 2 +PLAIN-2 0 MED-2432 2 +PLAIN-2 0 MED-2428 1 +PLAIN-2 0 MED-2440 1 +PLAIN-2 0 MED-2434 1 +PLAIN-2 0 MED-2435 1 +PLAIN-2 0 MED-2436 1 +PLAIN-2 0 MED-2437 1 +PLAIN-2 0 MED-2438 1 +PLAIN-2 0 MED-2439 1 +PLAIN-2 0 MED-3597 1 +PLAIN-2 0 MED-3598 1 +PLAIN-2 0 MED-3599 1 +PLAIN-2 0 MED-4556 1 +PLAIN-2 0 MED-4559 1 +PLAIN-2 0 MED-4560 1 +PLAIN-2 0 MED-4828 1 +PLAIN-2 0 MED-4829 1 +PLAIN-2 0 MED-4830 1 ++ +### Dense search using text embedding + +Now, we turn to embedding-based retrieval, where we embed the query text using the configured text-embedding model and perform +an exact `nearestNeighbor` search. We use [embed query](../embedding.html#embedding-a-query-text) to produce the +input tensor `query(e)`, defined in the `semantic` rank-profile in the schema. + +
+$ vespa query \ + 'yql=select * from doc where {targetHits:10}nearestNeighbor(embedding,e)' \ + 'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \ + 'input.query(e)=embed(@user-query)' \ + 'hits=1' \ + 'ranking=semantic' ++
{% highlight json %} +{ + "root": { + "id": "toplevel", + "relevance": 1.0, + "fields": { + "totalCount": 64 + }, + "coverage": { + "coverage": 100, + "documents": 3633, + "full": true, + "nodes": 1, + "results": 1, + "resultsFull": 1 + }, + "children": [ + { + "id": "id:doc:doc::MED-2429", + "relevance": 0.6061378635706601, + "source": "content", + "fields": { + "sddocname": "doc", + "documentid": "id:doc:doc::MED-2429", + "doc_id": "MED-2429", + "title": "Statin use and risk of breast cancer: a meta-analysis of observational studies.", + "text": "Emerging evidence suggests that statins' may decrease the risk of cancers. However, available evidence on breast cancer is conflicting. We, therefore, examined the association between statin use and risk of breast cancer by conducting a detailed meta-analysis of all observational studies published regarding this subject. PubMed database and bibliographies of retrieved articles were searched for epidemiological studies published up to January 2012, investigating the relationship between statin use and breast cancer. Before meta-analysis, the studies were evaluated for publication bias and heterogeneity. Combined relative risk (RR) and 95 % confidence interval (CI) were calculated using a random-effects model (DerSimonian and Laird method). Subgroup analyses, sensitivity analysis, and cumulative meta-analysis were also performed. A total of 24 (13 cohort and 11 case-control) studies involving more than 2.4 million participants, including 76,759 breast cancer cases contributed to this analysis. We found no evidence of publication bias and evidence of heterogeneity among the studies. Statin use and long-term statin use did not significantly affect breast cancer risk (RR = 0.99, 95 % CI = 0.94, 1.04 and RR = 1.03, 95 % CI = 0.96, 1.11, respectively). When the analysis was stratified into subgroups, there was no evidence that study design substantially influenced the effect estimate. 
Sensitivity analysis confirmed the stability of our results. Cumulative meta-analysis showed a change in trend of reporting risk of breast cancer from positive to negative in statin users between 1993 and 2011. Our meta-analysis findings do not support the hypothesis that statins’ have a protective effect against breast cancer. More randomized clinical trials and observational studies are needed to confirm this association with underlying biological mechanisms in the future."
+      }
+    }
+  ]
+ }
+}{% endhighlight %}

+The result of this vector-based search differs from the previous sparse keyword search: a different relevant document is returned at position 1. In this case,
+the relevance score is 0.606, calculated by the `closeness` function in the `semantic` rank-profile:

```
rank-profile semantic {
    inputs {
        query(e) tensor<bfloat16>(v[384])
    }
    first-phase {
        expression: closeness(field, embedding)
    }
}
```
{% highlight python %} +import requests +import ir_datasets +from ir_measures import calc_aggregate, nDCG, ScoredDoc +from enum import Enum +from typing import List + +class RModel(Enum): + SPARSE = 1 + DENSE = 2 + HYBRID = 3 + +def parse_vespa_response(response:dict, qid:str) -> List[ScoredDoc]: + result = [] + hits = response['root'].get('children',[]) + for hit in hits: + doc_id = hit['fields']['doc_id'] + relevance = hit['relevance'] + result.append(ScoredDoc(qid, doc_id, relevance)) + return result + +def search(query:str, qid:str, ranking:str, + hits=10, language="en", mode=RModel.SPARSE) -> List[ScoredDoc]: + yql = "select doc_id from doc where ({targetHits:100}userInput(@user-query))" + if mode == RModel.DENSE: + yql = "select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, e))" + elif mode == RModel.HYBRID: + yql = "select doc_id from doc where ({targetHits:100}userInput(@user-query)) OR ({targetHits:10}nearestNeighbor(embedding, e))" + query_request = { + 'yql': yql, + 'user-query': query, + 'ranking.profile': ranking, + 'hits' : hits, + 'language': language + } + if mode == RModel.DENSE or mode == RModel.HYBRID: + query_request['input.query(e)'] = "embed(@user-query)" + + response = requests.post("http://localhost:8080/search/", json=query_request) + if response.ok: + return parse_vespa_response(response.json(), qid) + else: + print("Search request failed with response " + str(response.json())) + return [] + +def main(): + import argparse + parser = argparse.ArgumentParser(description='Evaluate ranking models') + parser.add_argument('--ranking', type=str, required=True, help='Vespa ranking profile') + parser.add_argument('--mode', type=str, default="sparse", help='retrieval mode, valid values are sparse, dense, hybrid') + args = parser.parse_args() + mode = RModel.HYBRID + if args.mode == "sparse": + mode = RModel.SPARSE + elif args.mode == "dense": + mode = RModel.DENSE + + + dataset = ir_datasets.load("beir/nfcorpus/test") + results = 
[] + metrics = [nDCG@10] + for query in dataset.queries_iter(): + qid = query.query_id + query_text = query.text + results.extend(search(query_text, qid, args.ranking, mode=mode)) + + metrics = calc_aggregate(metrics, dataset.qrels, results) + print("Ranking metric NDCG@10 for rank profile {}: {:.4f}".format(args.ranking, metrics[nDCG@10])) + +if __name__ == "__main__": + main(){% endhighlight %}+
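The script reports `nDCG@10` via `ir_measures`. As a rough illustration of what that metric computes, here is a minimal sketch of one common nDCG formulation (linear gain with a log2 discount; ir_measures' exact definition may differ in detail):

```python
import math

# Sketch of nDCG@k: DCG discounts each graded relevance label by
# log2(rank + 1), and nDCG divides by the DCG of the ideal
# (relevance-sorted) ordering, so a perfect ranking scores 1.0.
def ndcg_at_k(relevances, k=10):
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance labels of the top hits, in ranked order:
print(round(ndcg_at_k([2, 0, 1, 2, 0]), 4))
```

Placing the grade-2 documents first would raise the score toward 1.0, which is why nDCG rewards ranking quality, not just retrieval.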
+$ python3 evaluate_ranking.py --ranking bm25 --mode sparse ++
+Ranking metric NDCG@10 for rank profile bm25: 0.3195 ++ +Now, we can evaluate the dense model using the same script: + +
+$ python3 evaluate_ranking.py --ranking semantic --mode dense ++
+Ranking metric NDCG@10 for rank profile semantic: 0.3077

+Note that the _average_ `nDCG@10` score is computed across all 323 test queries.
+You can also experiment beyond a single metric and modify the script to calculate
+more [measures](https://ir-measur.es/en/latest/measures.html), for example, including precision with a relevance label cutoff of 2:
+metrics = [nDCG@10, P(rel=2)@10]

+## Hybrid Search & Ranking
+
+We demonstrated and evaluated two independent retrieval and ranking strategies in the previous sections.
+Now, we want to explore hybrid search techniques where we combine:
+
+- traditional lexical keyword matching with a text scoring method (BM25)
+- embedding-based search using a text embedding model
+
+With Vespa, there is a distinction between retrieval (matching) and configurable [ranking](../ranking.html).
+
+In the Vespa ranking phases, we can express arbitrary scoring complexity with the full power of the Vespa [ranking](../ranking.html) framework.
+Top-k retrieval, on the other hand, relies on simple built-in scoring functions associated with Vespa's top-k query operators.
+These operators use index structures and heuristics to identify the top-k documents without scoring every document in the collection. In the context of hybrid text
+search, the following Vespa top-k query operators are relevant:
+
+- YQL `{targetHits:k}nearestNeighbor()` for dense representations (text embeddings), using
+a configured [distance-metric](../reference/schema-reference.html#distance-metric) as the scoring function
+- YQL `{targetHits:k}userInput(@user-query)`, which by default uses [weakAnd](../using-wand-with-vespa.html) for sparse representations
+
+We can combine these operators using boolean query operators like AND/OR/RANK to express a hybrid search query. There is then a wide range of
+ways to combine the various signals in [ranking](../ranking.html).
+
+### Define our first simple hybrid rank profile
+
+First, we add a simple hybrid rank profile that multiplies the dense and sparse components to combine them into a single score.
+closeness(field, embedding) * (1 + bm25(title) + bm25(text))

+- the [closeness(field, embedding)](../reference/rank-features.html#attribute-match-features-normalized) rank-feature returns a normalized score in the range 0 to 1 inclusive
+- each per-field BM25 score is unbounded, in the range 0 to infinity
+
+We add a bias constant (1) to avoid the overall score becoming 0 if the document does not match any query terms,
+as the BM25 scores would then be 0. We also add `match-features` so we can debug each of the scores.
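To build intuition for the `closeness(field, embedding)` feature, here is a hedged sketch of how a closeness score can be derived from the `angular` distance metric configured in the schema (Vespa's implementation is the authoritative reference; this is an approximation for illustration):

```python
import math

# Sketch: under the angular distance-metric, the distance is the angle
# between the query and document vectors, and closeness = 1 / (1 + distance),
# so identical directions give 1.0 and larger angles give smaller scores.
def closeness(query, doc):
    dot = sum(q * d for q, d in zip(query, doc))
    norms = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(d * d for d in doc))
    angle = math.acos(max(-1.0, min(1.0, dot / norms)))  # angular distance
    return 1.0 / (1.0 + angle)

print(closeness([1.0, 0.0], [1.0, 0.0]))  # same direction
print(closeness([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

The bounded (0, 1] range is what makes this feature safe to multiply or add with other signals in the profiles below.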
+schema doc { + document doc { + field language type string { + indexing: "en" | set_language + } + field doc_id type string { + indexing: attribute | summary + match: word + } + field title type string { + indexing: index | summary + match: text + index: enable-bm25 + } + field text type string { + indexing: index | summary + match: text + index: enable-bm25 + } + } + fieldset default { + fields: title, text + } + + field embedding type tensor<bfloat16>(v[384]) { + indexing: input title." ".input text | embed | attribute + attribute { + distance-metric: angular + } + } + + rank-profile hybrid { + inputs { + query(e) tensor<bfloat16>(v[384]) + } + first-phase { + expression: closeness(field, embedding) * (1 + (bm25(title) + bm25(text))) + } + match-features: bm25(title) bm25(text) closeness(field, embedding) + } +} ++
+$ vespa deploy --wait 300 app ++
+$ vespa query \ + 'yql=select * from doc where ({targetHits:10}userInput(@user-query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \ + 'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \ + 'input.query(e)=embed(@user-query)' \ + 'hits=1' \ + 'language=en' \ + 'ranking=hybrid' ++
{% highlight json %} +{ + "root": { + "id": "toplevel", + "relevance": 1.0, + "fields": { + "totalCount": 105 + }, + "coverage": { + "coverage": 100, + "documents": 3633, + "full": true, + "nodes": 1, + "results": 1, + "resultsFull": 1 + }, + "children": [ + { + "id": "id:doc:doc::MED-10", + "relevance": 15.898915593367988, + "source": "content", + "fields": { + "matchfeatures": { + "bm25(text)": 17.35556767018612, + "bm25(title)": 8.166249756144769, + "closeness(field,embedding)": 0.5994655395517325 + }, + "sddocname": "doc", + "documentid": "id:doc:doc::MED-10", + "doc_id": "MED-10", + "title": "Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland", + "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). 
The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients."
+      }
+    }
+  ]
+ }
+}{% endhighlight %}

+What is going on here is that we combine the two top-k query operators using a boolean OR (disjunction).
+The `totalCount` is the number of documents retrieved into ranking (about 100, which is higher than 10 + 10).
+The `relevance` is the score assigned by the `hybrid` rank-profile. Notice that the `matchfeatures` field shows all the feature scores. This is
+useful for debugging and understanding the ranking behavior, and also for feature logging.
+
+#### Hybrid query with AND operator
+The following combines the two top-k operators using AND, meaning that the retrieved documents must match both the sparse and dense top-k operators.
+$ vespa query \ + 'yql=select * from doc where ({targetHits:10}userInput(@user-query)) and ({targetHits:10}nearestNeighbor(embedding,e))' \ + 'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \ + 'input.query(e)=embed(@user-query)' \ + 'hits=1' \ + 'language=en' \ + 'ranking=hybrid' ++
+$ vespa query \ + 'yql=select * from doc where rank(({targetHits:10}nearestNeighbor(embedding,e)), ({targetHits:10}userInput(@user-query)))' \ + 'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \ + 'input.query(e)=embed(@user-query)' \ + 'hits=1' \ + 'language=en' \ + 'ranking=hybrid' ++
+$ vespa query \ + 'yql=select * from doc where rank(({targetHits:10}userInput(@user-query)),({targetHits:10}nearestNeighbor(embedding,e)))' \ + 'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \ + 'input.query(e)=embed(@user-query)' \ + 'hits=1' \ + 'language=en' \ + 'ranking=hybrid' ++
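The three combinators differ in which documents are *retrieved* into ranking, not in how the retrieved documents are scored. A toy sketch with illustrative doc-id sets:

```python
# Toy doc-id sets (illustrative only) showing which documents each
# combination of the two top-k operators retrieves; ranking of the
# retrieved set is a separate, later step.
sparse = {"MED-10", "MED-14", "MED-2427"}   # e.g. retrieved by weakAnd
dense = {"MED-10", "MED-2429", "MED-4556"}  # e.g. retrieved by nearestNeighbor

or_retrieved = sparse | dense    # OR: union of both candidate sets
and_retrieved = sparse & dense   # AND: must be retrieved by both
rank_retrieved = sparse          # rank(sparse, dense): only the first operand
                                 # retrieves; the second operand only
                                 # contributes match features for ranking

print(sorted(or_retrieved))
print(sorted(and_retrieved))
print(sorted(rank_retrieved))
```

This is why OR exposes roughly the sum of the two target sizes to ranking, while AND exposes only their overlap.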
+$ python3 evaluate_ranking.py --ranking hybrid --mode hybrid ++
+Ranking metric NDCG@10 for rank profile hybrid: 0.3275

+The `nDCG@10` score is slightly higher than for the profiles that use only one of the ranking strategies.
+
+Now we can experiment with more complex ranking expressions. We add a few more rank profiles to the schema that combine the two retrieval strategies in different ways.
+schema doc { + document doc { + field language type string { + indexing: "en" | set_language + } + field doc_id type string { + indexing: attribute | summary + match: word + } + field title type string { + indexing: index | summary + match: text + index: enable-bm25 + } + field text type string { + indexing: index | summary + match: text + index: enable-bm25 + } + } + fieldset default { + fields: title, text + } + + field embedding type tensor<bfloat16>(v[384]) { + indexing: input title." ".input text | embed | attribute + attribute { + distance-metric: angular + } + } + + rank-profile hybrid { + inputs { + query(e) tensor<bfloat16>(v[384]) + } + first-phase { + expression: closeness(field, embedding) * (1 + (bm25(title) + bm25(text))) + } + match-features: bm25(title) bm25(text) closeness(field, embedding) + } + + rank-profile hybrid-sum inherits hybrid { + first-phase { + expression: closeness(field, embedding) + ((bm25(title) + bm25(text))) + } + } + + rank-profile hybrid-normalize-bm25-with-atan inherits hybrid { + + function scale(val) { + expression: 2*atan(val/8)/(3.14159) + } + function normalized_bm25() { + expression: scale(bm25(title) + bm25(text)) + } + function cosine() { + expression: cos(distance(field, embedding)) + } + first-phase { + expression: normalized_bm25 + cosine + } + match-features { + normalized_bm25 + cosine + bm25(title) + bm25(text) + } + } + + rank-profile hybrid-rrf inherits hybrid-normalize-bm25-with-atan{ + + function bm25_score() { + expression: bm25(title) + bm25(text) + } + global-phase { + rerank-count: 100 + expression: reciprocal_rank(bm25_score) + reciprocal_rank(cosine) + } + match-features: bm25(title) bm25(text) bm25_score cosine + } + + rank-profile hybrid-linear-normalize inherits hybrid-normalize-bm25-with-atan{ + + function bm25_score() { + expression: bm25(title) + bm25(text) + } + global-phase { + rerank-count: 100 + expression: normalize_linear(bm25_score) + normalize_linear(cosine) + } + match-features: 
bm25(title) bm25(text) bm25_score cosine + } +} ++
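To see what the `scale()` function in `hybrid-normalize-bm25-with-atan` does to raw BM25 sums, here is a small sketch (using `math.pi` where the profile hardcodes 3.14159):

```python
import math

# scale() squashes an unbounded BM25 sum into [0, 1) so it can be added
# to a cosine similarity on a comparable footing; the divisor controls
# how quickly scores saturate toward 1.
def scale(val, divisor=8.0):
    return 2.0 * math.atan(val / divisor) / math.pi

for bm25_sum in (0.0, 4.0, 8.0, 16.0, 32.0):
    print(f"{bm25_sum:5.1f} -> {scale(bm25_sum):.3f}")
```

Note that `scale(8.0)` lands exactly at 0.5, so the divisor sets the BM25 sum that counts as a "mid-strength" lexical match.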
+$ vespa deploy --wait 300 app ++
+$ python3 evaluate_ranking.py --ranking hybrid-sum --mode hybrid ++
+Ranking metric NDCG@10 for rank profile hybrid-sum: 0.3232 ++ + +
+$ python3 evaluate_ranking.py --ranking hybrid-normalize-bm25-with-atan --mode hybrid ++
+Ranking metric NDCG@10 for rank profile hybrid-normalize-bm25-with-atan: 0.3386 ++ +
+$ python3 evaluate_ranking.py --ranking hybrid-rrf --mode hybrid ++
+Ranking metric NDCG@10 for rank profile hybrid-rrf: 0.3176 ++ +
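The `hybrid-rrf` profile's global-phase expression is an instance of reciprocal rank fusion (RRF). A standalone sketch of the idea (k=60 is a common default; Vespa's `reciprocal_rank` feature is the actual implementation):

```python
# Reciprocal rank fusion: each signal contributes 1 / (k + rank), so only
# the *order* each signal induces matters, not the scale of its scores.
def rrf(score_lists, k=60.0):
    fused = {}
    for scores in score_lists:  # each entry: {doc_id: raw score}
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Illustrative scores only; doc ids borrowed from earlier examples.
bm25_scores = {"MED-10": 25.5, "MED-2429": 18.2, "MED-14": 12.0}
cosine_scores = {"MED-10": 0.61, "MED-2429": 0.60, "MED-4556": 0.55}
print(rrf([bm25_scores, cosine_scores]))
```

Because RRF discards score magnitudes, it needs no normalization, which is exactly why it is a popular baseline for fusing BM25 with vector similarity.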
+$ python3 evaluate_ranking.py --ranking hybrid-linear-normalize --mode hybrid ++
+Ranking metric NDCG@10 for rank profile hybrid-linear-normalize: 0.3356

+On this particular dataset, the `hybrid-normalize-bm25-with-atan` rank profile performs best, but the differences are small. This also demonstrates that hybrid search
+and ranking is a complex problem: the effectiveness of a hybrid model depends on the dataset and the retrieval strategies.
+
+These results (which profile is best) might not transfer to your specific retrieval use case and dataset, so it is important to evaluate the effectiveness of a hybrid model on your own dataset, with
+your own relevance judgments.
+
+See [Improving retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for more information on how to collect relevance judgments for your dataset.
+
+### Summary
+
+In this tutorial, we demonstrated how to express hybrid queries that combine two retrieval strategies using the Vespa query language, and how to express hybrid ranking using the Vespa ranking framework. We also showed how to evaluate the effectiveness of hybrid ranking models using one of the datasets from the BEIR benchmark. We hope this tutorial has given you a good understanding of how to combine different retrieval strategies with Vespa, and shown that there is no single silver bullet for all retrieval problems.
+
+## Cleanup
+ $ docker rm -f vespa-hybrid ++
file: " + elem.getAttribute("data-path") + ""); + //elem.innerHTML = html.replace(//g, "?>").replace(//g, ">"); + elem.insertAdjacentHTML("afterend", "