diff --git a/notebooks/enterprise-search/app-search-engine-exporter.ipynb b/notebooks/enterprise-search/app-search-engine-exporter.ipynb
index 9bfab3f7..da37cdf5 100644
--- a/notebooks/enterprise-search/app-search-engine-exporter.ipynb
+++ b/notebooks/enterprise-search/app-search-engine-exporter.ipynb
@@ -66,7 +66,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 2,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -100,7 +100,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 3,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -129,7 +129,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 4,
 "metadata": {
 "id": "kpV8K5jHvRK6"
 },
@@ -147,6 +147,22 @@
 "    )"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's take a quick look at the synonyms we've migrated. We'll do this via the `GET _synonyms/<synonyms_set>` endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(json.dumps(elasticsearch.synonyms.get_synonym(id=ENGINE_NAME).body, indent=2))"
+ ]
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -186,8 +202,7 @@
 "    }\n",
 "    )\n",
 "\n",
- "\n",
- "elasticsearch.query_ruleset.put(ruleset_id=ENGINE_NAME, rules=query_rules)"
+ "elasticsearch.query_rules.put_ruleset(ruleset_id=ENGINE_NAME, rules=query_rules)"
 ]
 },
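+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As with the synonyms above, we can take a quick look at the query ruleset we've just migrated, this time via the `GET _query_rules/<ruleset_id>` endpoint. This verification cell is optional."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# fetch the ruleset back to confirm the migration worked\n",
+ "print(json.dumps(elasticsearch.query_rules.get_ruleset(ruleset_id=ENGINE_NAME).body, indent=2))"
+ ]
+ },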
 {
 "cell_type": "markdown",
 "metadata": {
@@ -265,9 +280,16 @@
 "Also note that below, we set up variables for our `SOURCE_INDEX` and `DEST_INDEX`. If you want your destination index to be named differently, you can edit it here as these variables are used throughout the rest of the notebook."
 ]
 },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "First, we'll define our source and destination indices. We'll also ensure that the destination index is deleted if it already exists, so that we start fresh."
+ ]
+ },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 7,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -277,8 +299,210 @@
 "\n",
 "# delete the index if it's already created\n",
 "if elasticsearch.indices.exists(index=DEST_INDEX):\n",
- "    elasticsearch.indices.delete(index=DEST_INDEX)\n",
+ "    elasticsearch.indices.delete(index=DEST_INDEX)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we'll create our settings, which include the filters and analyzers to use for our text fields.\n",
+ "\n",
+ "These are similar to the Elasticsearch analyzers we use for App Search. The main difference is that we are also adding a synonyms filter so that we can\n",
+ "leverage the Elasticsearch synonym set we created in a previous step. If you want a different mapping for text fields, feel free to modify.\n",
+ "\n",
+ "To start with, we'll define a number of filters that we can reuse in our analyzers. These include:\n",
+ "* `front_ngram`: defines a front-loaded [edge n-gram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html) that can help create prefixes for terms.\n",
+ "* `bigram_max_size`: defines a [maximum length](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html) for any bigram. In our example, we exclude any bigrams larger than 16 characters.\n",
+ "* `en-stem-filter`: defines [a stemmer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html) for use with English text.\n",
+ "* `bigram_joiner_unigrams`: a filter that [adds word n-grams](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html) to our token stream. This helps expand the query to capture more context.\n",
+ "* `delimiter`: a [word delimiter graph token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html) with rules we've set on how to explicitly split tokens in our input.\n",
+ "* `en-stop-words-filter`: a default [stop token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html) to remove common English terms from our input.\n",
+ "* `synonyms-filter`: a [synonym graph token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html) that allows us to reuse the synonym set that we've defined above.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "settings_analysis_filters = {\n",
+ "    \"front_ngram\": {\"type\": \"edge_ngram\", \"min_gram\": \"1\", \"max_gram\": \"12\"},\n",
+ "    \"bigram_joiner\": {\n",
+ "        \"max_shingle_size\": \"2\",\n",
+ "        \"token_separator\": \"\",\n",
+ "        \"output_unigrams\": \"false\",\n",
+ "        \"type\": \"shingle\",\n",
+ "    },\n",
+ "    \"bigram_max_size\": {\"type\": \"length\", \"max\": \"16\", \"min\": \"0\"},\n",
+ "    \"en-stem-filter\": {\"name\": \"light_english\", \"type\": \"stemmer\"},\n",
+ "    \"bigram_joiner_unigrams\": {\n",
+ "        \"max_shingle_size\": \"2\",\n",
+ "        \"token_separator\": \"\",\n",
+ "        \"output_unigrams\": \"true\",\n",
+ "        \"type\": \"shingle\",\n",
+ "    },\n",
+ "    \"delimiter\": {\n",
+ "        \"split_on_numerics\": \"true\",\n",
+ "        \"generate_word_parts\": \"true\",\n",
+ "        \"preserve_original\": \"false\",\n",
+ "        \"catenate_words\": \"true\",\n",
+ "        \"generate_number_parts\": \"true\",\n",
+ "        \"catenate_all\": \"true\",\n",
+ "        \"split_on_case_change\": \"true\",\n",
+ "        \"type\": \"word_delimiter_graph\",\n",
+ "        \"catenate_numbers\": \"true\",\n",
+ "        \"stem_english_possessive\": \"true\",\n",
+ "    },\n",
+ "    \"en-stop-words-filter\": {\"type\": \"stop\", \"stopwords\": \"_english_\"},\n",
+ "    \"synonyms-filter\": {\n",
+ "        \"type\": \"synonym_graph\",\n",
+ "        \"synonyms_set\": ENGINE_NAME,\n",
+ "        \"updateable\": True,\n",
+ "    },\n",
+ "}"
+ ]
+ },
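+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an optional sanity check, we can preview what one of these filters produces before any index exists: the [analyze API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html) accepts transient, inline filter definitions. The sketch below runs the `front_ngram` filter over an arbitrary sample word; this cell is safe to skip."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# preview the front_ngram filter inline, without creating an index first\n",
+ "sample = elasticsearch.indices.analyze(\n",
+ "    tokenizer=\"standard\",\n",
+ "    filter=[\"lowercase\", settings_analysis_filters[\"front_ngram\"]],\n",
+ "    text=\"Yosemite\",\n",
+ ")\n",
+ "print([token[\"token\"] for token in sample[\"tokens\"]])"
+ ]
+ },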
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we'll create the analyzers that utilize these filters. The various analyzers will be used in different parts of our field mapping for text, and will help us index and query our text in different ways. These include:\n",
 "\n",
+ "* `iq_text_delimiter` is used for tokenizing and searching terms split on our specified delimiters in our text.\n",
+ "* `i_prefix` and `q_prefix` define our indexing and query tokenizers for creating prefix versions of our terms.\n",
+ "* `iq_text_stem` is used to create and query on stemmed versions of our tokens.\n",
+ "* `i_text_bigram` and `q_text_bigram` define our tokenizers for indexing and querying to create bigram terms.\n",
+ "* `i_text_base` and `q_text_base` define the indexing and query tokenization rules for general text tokenization."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "settings_analyzer = {\n",
+ "    \"i_prefix\": {\n",
+ "        \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\", \"front_ngram\"],\n",
+ "        \"tokenizer\": \"standard\",\n",
+ "    },\n",
+ "    \"iq_text_delimiter\": {\n",
+ "        \"filter\": [\n",
+ "            \"delimiter\",\n",
+ "            \"cjk_width\",\n",
+ "            \"lowercase\",\n",
+ "            \"asciifolding\",\n",
+ "            \"en-stop-words-filter\",\n",
+ "            \"en-stem-filter\",\n",
+ "        ],\n",
+ "        \"tokenizer\": \"whitespace\",\n",
+ "    },\n",
+ "    \"q_prefix\": {\n",
+ "        \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\"],\n",
+ "        \"tokenizer\": \"standard\",\n",
+ "    },\n",
+ "    \"i_text_base\": {\n",
+ "        \"filter\": [\n",
+ "            \"cjk_width\",\n",
+ "            \"lowercase\",\n",
+ "            \"asciifolding\",\n",
+ "            \"en-stop-words-filter\",\n",
+ "        ],\n",
+ "        \"tokenizer\": \"standard\",\n",
+ "    },\n",
+ "    \"q_text_base\": {\n",
+ "        \"filter\": [\n",
+ "            \"cjk_width\",\n",
+ "            \"lowercase\",\n",
+ "            \"asciifolding\",\n",
+ "            \"en-stop-words-filter\",\n",
+ "            \"synonyms-filter\",\n",
+ "        ],\n",
+ "        \"tokenizer\": \"standard\",\n",
+ "    },\n",
+ "    \"iq_text_stem\": {\n",
+ "        \"filter\": [\n",
+ "            \"cjk_width\",\n",
+ "            \"lowercase\",\n",
+ "            \"asciifolding\",\n",
+ "            \"en-stop-words-filter\",\n",
+ "            \"en-stem-filter\",\n",
+ "        ],\n",
+ "        \"tokenizer\": \"standard\",\n",
+ "    },\n",
+ "    \"i_text_bigram\": {\n",
+ "        \"filter\": [\n",
+ "            \"cjk_width\",\n",
+ "            \"lowercase\",\n",
+ "            \"asciifolding\",\n",
+ "            \"en-stem-filter\",\n",
+ "            \"bigram_joiner\",\n",
+ "            \"bigram_max_size\",\n",
+ "        ],\n",
+ "        \"tokenizer\": \"standard\",\n",
+ "    },\n",
+ "    \"q_text_bigram\": {\n",
+ "        \"filter\": [\n",
+ "            \"cjk_width\",\n",
+ "            \"lowercase\",\n",
+ "            \"asciifolding\",\n",
+ "            \"synonyms-filter\",\n",
+ "            \"en-stem-filter\",\n",
+ "            \"bigram_joiner_unigrams\",\n",
+ "            \"bigram_max_size\",\n",
+ "        ],\n",
+ "        \"tokenizer\": \"standard\",\n",
+ "    },\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Finally, we'll combine our filters and analyzers into a settings object that we can use to define our destination index's settings.\n",
+ "\n",
+ "More information on creating custom analyzers can be found in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "settings = {\n",
+ "    \"analysis\": {\n",
+ "        \"filter\": settings_analysis_filters,\n",
+ "        \"analyzer\": settings_analyzer,\n",
+ "    }\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now that we have our analysis settings built, we'll get the current schema from our App Search engine and use that to build the mappings for the destination index that we will be migrating the data into.\n",
+ "\n",
+ "For any text fields, we'll explicitly define the mappings for how we want these fields to be stored. We define a number of fields here to emulate what App Search does under the hood. These include:\n",
+ "* A `keyword` field that ignores any token greater than 2048 characters in length.\n",
+ "* A `delimiter` field that captures any delimiters that we've defined in the above `delimiter` analysis.\n",
+ "* A `joined` field that uses our bigram analysis from above. This will create pairs of joined tokens that can be used for phrase queries.\n",
+ "* A `prefix` field that uses our prefix analysis from above. This is used for prefix matching to allow for partial matches as well as autocomplete queries.\n",
+ "* A `stem` field that captures the stemmed versions of our tokens.\n",
+ "\n",
+ "Finally, the overall text field will be fully stored and analyzed using our base analyzer that we've defined above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
 "# get the App Search engine schema\n",
 "schema = app_search.get_schema(engine_name=ENGINE_NAME)\n",
 "\n",
@@ -326,125 +550,22 @@
 "                \"index_options\": \"freqs\",\n",
 "                \"analyzer\": \"i_text_base\",\n",
 "                \"search_analyzer\": \"q_text_base\",\n",
- "            }\n",
- "\n",
- "# These are similar to the Elasticsearch analyzers we use for App Search.\n",
- "# The main difference is that we are also adding a synonyms filter so that we can\n",
- "# leverage the Elasticsearch synonym set we created in a previous step.\n",
- "# If you want a different mapping for text fields, feel free to modify.\n",
- "settings = {\n",
- "    \"analysis\": {\n",
- "        \"filter\": {\n",
- "            \"front_ngram\": {\"type\": \"edge_ngram\", \"min_gram\": \"1\", \"max_gram\": \"12\"},\n",
- "            \"bigram_joiner\": {\n",
- "                \"max_shingle_size\": \"2\",\n",
- "                \"token_separator\": \"\",\n",
- "                \"output_unigrams\": \"false\",\n",
- "                \"type\": \"shingle\",\n",
- "            },\n",
- "            \"bigram_max_size\": {\"type\": \"length\", \"max\": \"16\", \"min\": \"0\"},\n",
- "            \"en-stem-filter\": {\"name\": \"light_english\", \"type\": \"stemmer\"},\n",
- "            \"bigram_joiner_unigrams\": {\n",
- "                \"max_shingle_size\": \"2\",\n",
- "                \"token_separator\": \"\",\n",
- "                \"output_unigrams\": \"true\",\n",
- "                \"type\": \"shingle\",\n",
- "            },\n",
- "            \"delimiter\": {\n",
- "                \"split_on_numerics\": \"true\",\n",
- "                \"generate_word_parts\": \"true\",\n",
- "                \"preserve_original\": \"false\",\n",
- "                \"catenate_words\": \"true\",\n",
- "                \"generate_number_parts\": \"true\",\n",
- "                \"catenate_all\": \"true\",\n",
- "                \"split_on_case_change\": \"true\",\n",
- "                \"type\": \"word_delimiter_graph\",\n",
- "                \"catenate_numbers\": \"true\",\n",
- "                \"stem_english_possessive\": \"true\",\n",
- "            },\n",
- "            \"en-stop-words-filter\": {\"type\": \"stop\", \"stopwords\": \"_english_\"},\n",
- "            \"synonyms-filter\": {\n",
- "                \"type\": \"synonym_graph\",\n",
- "                \"synonyms_set\": ENGINE_NAME,\n",
- "                \"updateable\": True,\n",
- "            },\n",
- "        },\n",
- "        \"analyzer\": {\n",
- "            \"i_prefix\": {\n",
- "                \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\", \"front_ngram\"],\n",
- "                \"tokenizer\": \"standard\",\n",
- "            },\n",
- "            \"iq_text_delimiter\": {\n",
- "                \"filter\": [\n",
- "                    \"delimiter\",\n",
- "                    \"cjk_width\",\n",
- "                    \"lowercase\",\n",
- "                    \"asciifolding\",\n",
- "                    \"en-stop-words-filter\",\n",
- "                    \"en-stem-filter\",\n",
- "                ],\n",
- "                \"tokenizer\": \"whitespace\",\n",
- "            },\n",
- "            \"q_prefix\": {\n",
- "                \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\"],\n",
- "                \"tokenizer\": \"standard\",\n",
- "            },\n",
- "            \"i_text_base\": {\n",
- "                \"filter\": [\n",
- "                    \"cjk_width\",\n",
- "                    \"lowercase\",\n",
- "                    \"asciifolding\",\n",
- "                    \"en-stop-words-filter\",\n",
- "                ],\n",
- "                \"tokenizer\": \"standard\",\n",
- "            },\n",
- "            \"q_text_base\": {\n",
- "                \"filter\": [\n",
- "                    \"cjk_width\",\n",
- "                    \"lowercase\",\n",
- "                    \"asciifolding\",\n",
- "                    \"en-stop-words-filter\",\n",
- "                    \"synonyms-filter\",\n",
- "                ],\n",
- "                \"tokenizer\": \"standard\",\n",
- "            },\n",
- "            \"iq_text_stem\": {\n",
- "                \"filter\": [\n",
- "                    \"cjk_width\",\n",
- "                    \"lowercase\",\n",
- "                    \"asciifolding\",\n",
- "                    \"en-stop-words-filter\",\n",
- "                    \"en-stem-filter\",\n",
- "                ],\n",
- "                \"tokenizer\": \"standard\",\n",
- "            },\n",
- "            \"i_text_bigram\": {\n",
- "                \"filter\": [\n",
- "                    \"cjk_width\",\n",
- "                    \"lowercase\",\n",
- "                    \"asciifolding\",\n",
- "                    \"en-stem-filter\",\n",
- "                    \"bigram_joiner\",\n",
- "                    \"bigram_max_size\",\n",
- "                ],\n",
- "                \"tokenizer\": \"standard\",\n",
- "            },\n",
- "            \"q_text_bigram\": {\n",
- "                \"filter\": [\n",
- "                    \"cjk_width\",\n",
- "                    \"lowercase\",\n",
- "                    \"asciifolding\",\n",
- "                    \"synonyms-filter\",\n",
- "                    \"en-stem-filter\",\n",
- "                    \"bigram_joiner_unigrams\",\n",
- "                    \"bigram_max_size\",\n",
- "                ],\n",
- "                \"tokenizer\": \"standard\",\n",
- "            },\n",
- "        },\n",
- "    }\n",
- "}\n",
- "\n",
+ "            }"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And now, we create our destination index that uses our mappings and analysis settings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
 "# and actually create our index\n",
 "elasticsearch.indices.create(\n",
 "    index=DEST_INDEX, mappings={\"properties\": mapping}, settings=settings\n",
@@ -455,13 +576,15 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "# Add `sparse_vector` fields for semantic search (optional)\n",
+ "# Add semantic text fields for semantic search (optional)\n",
 "\n",
- "One of the advantages of having our exported index directly in Elasticsearch is that we can easily take advantage of doing semantic search with ELSER. To do this, we'll need to add a `sparse_vector` field to our index, set up an ingest pipeline, and reindex our data.\n",
+ "One of the advantages of exporting our index directly to Elasticsearch is that we can easily perform semantic search with ELSER. To do this, we'll need to add an inference endpoint using ELSER, and a `semantic_text` field to our index to use it.\n",
 "\n",
 "Note that to use this feature, your cluster must have at least one ML node set up with enough resources allocated to it.\n",
 "\n",
- "Let's first start by adding `sparse_vector` fields to our new index mapping."
+ "If you have not already, be sure that your ELSER v2 model is [set up and deployed](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html).\n",
+ "\n",
+ "Let's start by creating our inference endpoint using the [Create inference API](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-inference-api.html)."
 ]
 },
 {
@@ -470,29 +593,34 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "# by default we are adding a `sparse_vector` field for all text fields in our engine\n",
- "# feel free to modify this list to only include the fields that are relevant\n",
- "SPARSE_VECTOR_FIELDS = [\n",
- "    field_name + \"_semantic\" for field_name in schema if schema[field_name] == \"text\"\n",
- "]\n",
- "\n",
- "sparse_vector_fields = {}\n",
- "for field_name in SPARSE_VECTOR_FIELDS:\n",
- "    # this is added so we can use semantic search with ELSER\n",
- "    sparse_vector_fields[field_name] = {\"type\": \"sparse_vector\"}\n",
- "\n",
- "elasticsearch.indices.put_mapping(index=DEST_INDEX, properties=sparse_vector_fields)"
+ "# delete our inference endpoint if it already exists; a missing endpoint returns\n",
+ "# a 404, which we ignore here rather than letting the request raise an error\n",
+ "elasticsearch.options(ignore_status=404).inference.delete(\n",
+ "    inference_id=\"elser_inference_endpoint\"\n",
+ ")\n",
+ "\n",
+ "# and create our endpoint using the ELSER v2 model\n",
+ "elasticsearch.inference.put(\n",
+ "    inference_id=\"elser_inference_endpoint\",\n",
+ "    inference_config={\n",
+ "        \"service\": \"elasticsearch\",\n",
+ "        \"service_settings\": {\n",
+ "            \"model_id\": \".elser_model_2_linux-x86_64\",\n",
+ "            \"num_allocations\": 1,\n",
+ "            \"num_threads\": 1,\n",
+ "        },\n",
+ "    },\n",
+ "    task_type=\"sparse_embedding\",\n",
+ ")"
 ]
 },
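+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Optionally, we can fetch the endpoint back to confirm it was created before moving on."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(elasticsearch.inference.get(inference_id=\"elser_inference_endpoint\"))"
+ ]
+ },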
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "## Setup an ingest pipeline using ELSER\n",
+ "## Using semantic text fields for ingest and query\n",
 "\n",
- "> If you have not already deployed ELSER, follow this [guide](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) on how to download and deploy the model. Without this step, you will receive errors below when you run the `reindex` command.\n",
+ "Next, we'll augment our text fields with `semantic_text` fields in our index. We'll do this by creating a `semantic_text` field and providing a `copy_to` directive on the original source field to copy its text into the semantic text field.\n",
 "\n",
- "Assuming you have downloaded and deployed ELSER in your deployment, we can now define an ingest pipeline that will enrich the documents with the `sparse_vector` fields that can be used with semantic search."
+ "In the example below, we are using the `description` and `title` fields from our example index to add semantic search on those fields."
 ]
 },
 {
@@ -501,43 +629,24 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "PIPELINE = \"elser-ingest-pipeline-\" + ENGINE_NAME\n",
- "\n",
- "processors = []\n",
+ "# by default we are adding a `semantic_text` field for the \"description\" and \"title\" fields in our schema\n",
+ "# feel free to modify this list to only include the fields that are relevant\n",
+ "SEMANTIC_TEXT_FIELDS = [\"description\", \"title\"]\n",
+ "\n",
+ "# add the semantic_text field to our mapping for each field defined\n",
+ "for field_name in SEMANTIC_TEXT_FIELDS:\n",
+ "    semantic_field_name = field_name + \"_semantic\"\n",
+ "    mapping[semantic_field_name] = {\n",
+ "        \"type\": \"semantic_text\",\n",
+ "        \"inference_id\": \"elser_inference_endpoint\",\n",
+ "    }\n",
 "\n",
- "for output_field in SPARSE_VECTOR_FIELDS:\n",
- "    input_field = output_field.removesuffix(\"_semantic\")\n",
- "    processors.append(\n",
- "        {\n",
- "            \"inference\": {\n",
- "                \"model_id\": \".elser_model_2\",\n",
- "                \"input_output\": [\n",
- "                    {\"input_field\": input_field, \"output_field\": output_field}\n",
- "                ],\n",
- "                \"on_failure\": [\n",
- "                    {\n",
- "                        \"append\": {\n",
- "                            \"field\": \"_source._ingest.inference_errors\",\n",
- "                            \"allow_duplicates\": False,\n",
- "                            \"value\": [\n",
- "                                {\n",
- "                                    \"message\": \"Processor failed for field '\"\n",
- "                                    + input_field\n",
- "                                    + \"' with message '{{ _ingest.on_failure_message }}'\",\n",
- "                                    \"timestamp\": \"{{{ _ingest.timestamp }}}\",\n",
- "                                }\n",
- "                            ],\n",
- "                        }\n",
- "                    }\n",
- "                ],\n",
- "            }\n",
- "        }\n",
- "    )\n",
+ "# and for our text fields, add a \"copy_to\" directive to copy the text to the semantic_text field\n",
+ "for field_name in SEMANTIC_TEXT_FIELDS:\n",
+ "    semantic_field_name = field_name + \"_semantic\"\n",
+ "    mapping[field_name].update({\"copy_to\": semantic_field_name})\n",
 "\n",
- "# create the ingest pipeline\n",
- "elasticsearch.ingest.put_pipeline(\n",
- "    id=PIPELINE, description=\"Ingest pipeline for ELSER\", processors=processors\n",
- ")"
+ "elasticsearch.indices.put_mapping(index=DEST_INDEX, properties=mapping)"
 ]
 },
 {
 "cell_type": "markdown",
@@ -545,18 +654,18 @@
 "metadata": {},
 "source": [
 "## Reindex the data\n",
- "Now that we have created the Elasticsearch index and the ingest pipeline, it's time to reindex our data in the new index. The pipeline definition we created above will create a field for each of the `SPARSE_VECTOR_FIELDS` we defined with a `_semantic` suffix, and then infer the sparse vector values from ELSER as the reindex takes place."
+ "Now that we have created the Elasticsearch index, it's time to reindex our data into the new index. If you are using the `semantic_text` fields as defined above with a `_semantic` suffix, the reindexing process will automatically infer the sparse vector values from ELSER as the reindex takes place."
 ]
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 15,
 "metadata": {},
 "outputs": [],
 "source": [
 "reindex_task = elasticsearch.reindex(\n",
 "    source={\"index\": SOURCE_INDEX},\n",
- "    dest={\"index\": DEST_INDEX, \"pipeline\": PIPELINE},\n",
+ "    dest={\"index\": DEST_INDEX},\n",
 "    wait_for_completion=False,\n",
 ")\n",
 "\n",
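+ "# Note: with wait_for_completion=False the reindex runs as a background task.\n",
+ "# If you want to block until it finishes, one sketch (requires `import time`):\n",
+ "#     while not elasticsearch.tasks.get(task_id=reindex_task[\"task\"])[\"completed\"]:\n",
+ "#         time.sleep(5)\n",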
@@ -606,12 +715,17 @@
 "}'\n",
 "```\n",
 "\n",
- "From the output of the API call above, we can see the actual Elasticsearch query that will be used. Below, we are using this query as a base to build our own App Search like query using query rules and our Elasticsearch synonyms. The query is further enhanced by augmentation with the built-in App Search multifield types for such things as stemming and prefix matching."
+ "From the output of the API call above, we can see the actual Elasticsearch query that will be used. Below, we are using this query as a base to build our own App Search-like query using query rules and our Elasticsearch synonyms. The query is further enhanced by augmentation with the built-in App Search multifield types for such things as stemming and prefix matching.\n",
+ "\n",
+ "To walk through what is happening in the query below: first, we gather some preliminary information about the fields we want to query and return.\n",
+ "1) We gather the fields we want for our result. This includes all the keys in the schema from above.\n",
+ "2) Next, we gather all of the text fields in our schema.\n",
+ "3) And finally, we gather the \"best fields\", which are those we want to query on using our stemmer."
 ]
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 19,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -620,8 +734,30 @@
 "result_fields = list(schema.keys())\n",
 "\n",
 "text_fields = [field_name for field_name in schema if schema[field_name] == \"text\"]\n",
- "best_fields = [field_name + \".stem\" for field_name in text_fields]\n",
+ "best_fields = [field_name + \".stem\" for field_name in text_fields]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, from our text fields, we create a set of fields with specified weights for our various analyzers.\n",
+ "\n",
+ "* For the text field itself, we weight this as neutral, with a `1.0`.\n",
+ "* For any stem fields, we weight these _slightly_ less to pull in closely stemmed words in the query.\n",
+ "* For any prefixes, we apply a minimal weight to ensure these do not dominate our scoring.\n",
+ "* For any potential bigram phrase matches, we weight these as well with a `0.75`.\n",
+ "* Finally, for our delimiter-analyzed terms, we weight these somewhere in the middle.\n",
+ "\n",
+ "These are the default weightings that App Search uses. Feel free to experiment with these values to find a balance that works for you."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [],
+ "source": [
 "cross_fields = []\n",
 "\n",
 "for text_field in text_fields:\n",
 "    cross_fields.append(text_field + \"^1.0\")\n",
 "    cross_fields.append(text_field + \".stem^0.95\")\n",
 "    cross_fields.append(text_field + \".prefix^0.1\")\n",
 "    cross_fields.append(text_field + \".joined^0.75\")\n",
- "    cross_fields.append(text_field + \".delimiter^0.4\")\n",
+ "    cross_fields.append(text_field + \".delimiter^0.4\")"
+ ]
+ },
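+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To see what this weighting produces, we can print the generated field list; each entry is a `field^boost` pair that the query below will use. This cell is optional."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(json.dumps(cross_fields, indent=2))"
+ ]
+ },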
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we're ready to create the actual payload for our query. This is analogous to the query that App Search uses when querying.\n",
+ "\n",
+ "Within this query, we first set an organic query rule. This defines a boolean query under the hood that allows a match to be found and scored either in the cross fields we defined above, or in the \"best fields\" as defined.\n",
 "\n",
+ "For the results, we sort on our score descending as the primary sort, with the document id as the secondary.\n",
 "\n",
+ "We apply highlights to our results, request a return size of the top 10 hits, and for each hit, return the result fields."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
 "app_search_query_payload = {\n",
 "    \"query\": {\n",
- "        \"rule_query\": {\n",
+ "        \"rule\": {\n",
 "            \"organic\": {\n",
 "                \"bool\": {\n",
 "                    \"should\": [\n",
@@ -658,7 +814,7 @@
 "                    ]\n",
 "                }\n",
 "            },\n",
- "            \"ruleset_id\": ENGINE_NAME,\n",
+ "            \"ruleset_ids\": [ENGINE_NAME],\n",
 "            \"match_criteria\": {\"user_query\": QUERY_STRING},\n",
 "        }\n",
 "    },\n",
@@ -707,11 +863,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### How to do semantic search using ELSER\n",
+ "### How to do semantic search using ELSER with semantic text fields\n",
 "\n",
- "If you [enabled and reindexed your data with ELSER](#add-sparse_vector-fields-for-semantic-search-optional), we can now use this to do semantic search.\n",
- "For each `spare_vector` we will generate a `text_expansion` query. These `text_expansion` queries will be added as `should` clauses to a top-level `bool` query.\n",
- "We also use `min_score` because we want to exclude less relevant results. "
+ "If you [enabled and reindexed your data with ELSER](#add-semantic-text-fields-for-semantic-search-optional), we can now use this to do semantic search.\n",
+ "For each `semantic_text` field type, we can define a [semantic query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-semantic-query.html) to easily perform a semantic search on these fields.\n"
 ]
 },
 {
@@ -721,19 +876,21 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "# replace with your own\n",
- "QUERY_STRING = \"Which national park has dangerous wild animals?\"\n",
- "text_expansion_queries = []\n",
+ "QUERY_STRING = \"best sunset view\"\n",
+ "semantic_text_queries = []\n",
 "\n",
- "for field_name in SPARSE_VECTOR_FIELDS:\n",
- "    text_expansion_queries.append(\n",
+ "for field_name in SEMANTIC_TEXT_FIELDS:\n",
+ "    semantic_field_name = field_name + \"_semantic\"\n",
+ "    semantic_text_queries.append(\n",
 "        {\n",
- "            \"text_expansion\": {\n",
- "                field_name: {\"model_id\": \".elser_model_2\", \"model_text\": QUERY_STRING}\n",
+ "            \"semantic\": {\n",
+ "                \"field\": semantic_field_name,\n",
+ "                \"query\": QUERY_STRING,\n",
 "            }\n",
 "        }\n",
 "    )\n",
 "\n",
- "semantic_query = {\"bool\": {\"should\": text_expansion_queries}}\n",
+ "semantic_query = {\"bool\": {\"should\": semantic_text_queries}}\n",
 "print(f\"Elasticsearch query:\\n{json.dumps(semantic_query, indent=2)}\\n\")"
 ]
 },
@@ -743,7 +900,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "results = elasticsearch.search(index=DEST_INDEX, query=semantic_query, min_score=20)\n",
+ "results = elasticsearch.search(index=DEST_INDEX, query=semantic_query, min_score=1)\n",
 "print(f\"Query results:\\n{json.dumps(results.body, indent=2)}\\n\")"
 ]
 },
@@ -751,7 +908,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### How to combine App Search queries with ELSER\n",
+ "### How to combine App Search queries with semantic text\n",
 "\n",
 "We will now provide an example on how to combine the previous two queries into a single query that applies both BM25 search and semantic search.\n",
 "In the previous examples, we have a `bool` query with `should` clauses.\n",
@@ -790,10 +947,8 @@
 "source": [
 "payload = app_search_query_payload.copy()\n",
 "\n",
- "for text_expansion_query in text_expansion_queries:\n",
- "    payload[\"query\"][\"rule_query\"][\"organic\"][\"bool\"][\"should\"].append(\n",
- "        text_expansion_query\n",
- "    )\n",
+ "for semantic_text_query in semantic_text_queries:\n",
+ "    payload[\"query\"][\"rule\"][\"organic\"][\"bool\"][\"should\"].append(semantic_text_query)\n",
 "\n",
 "print(f\"Elasticsearch payload:\\n{json.dumps(payload, indent=2)}\\n\")"
 ]
 },
@@ -823,7 +978,7 @@
 "provenance": []
 },
 "kernelspec": {
- "display_name": "Python 3.12.3 64-bit",
+ "display_name": "Python 3",
 "language": "python",
 "name": "python3"
 },
@@ -837,12 +992,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.12.3"
- },
- "vscode": {
- "interpreter": {
- "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
- }
+ "version": "3.11.9"
 }
 },
 "nbformat": 4,