Update App Search Migration Notebook to Correct Some Errors and Warnings, and Add More Descriptive Text #348
Conversation
Found 1 changed notebook. Review the changes at https://app.gitnotebooks.com/elastic/elasticsearch-labs/pull/348
The changes look good!
The high level feedback I have though, outside of some of the specific feedback that I gave, is that we should consider pointing people to retrievers in these examples. This will be especially important for the rule query which will not work with rerankers (but the rule retriever will work with rerank retrievers such as RRF or semantic reranking). It wouldn't be too hard to migrate these from the query DSL to retriever syntax and it has the added advantage that it will be consistent with what we do in Playground. Thanks!
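To make the reviewer's suggestion concrete, here is a minimal sketch of what a retriever-first version of the rule query could look like, with the rule retriever wrapping an RRF retriever. The index name, field names, and query string are placeholders, not values from this PR; with a connected client the payload would be passed to `search`.

```python
# Hedged sketch: a rule retriever wrapping an RRF retriever, so query
# rules can be combined with rerankers. All names below are placeholders.
ENGINE_NAME = "my-search-engine"  # placeholder index/ruleset name
user_query = "pugs"               # placeholder query string

search_body = {
    "retriever": {
        "rule": {
            "ruleset_ids": [ENGINE_NAME],
            "match_criteria": {"query_string": user_query},
            # The inner retriever can itself be a reranker such as RRF:
            "retriever": {
                "rrf": {
                    "retrievers": [
                        {"standard": {"query": {"match": {"title": user_query}}}},
                        {"standard": {"query": {"semantic": {
                            "field": "title_semantic",  # hypothetical semantic_text field
                            "query": user_query,
                        }}}},
                    ]
                }
            },
        }
    }
}

# With a connected client this would be sent as, e.g.:
# elasticsearch.search(index=ENGINE_NAME, body=search_body)
```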
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"print(json.dumps(elasticsearch.synonyms.get_synonym(id=ENGINE_NAME).body, indent=2))" |
I think this is a client bug, because `GET synonyms` is the appropriate call to use here - I would think this method name should be plural. (See documentation here)
Yeah, I ran through the notebook to verify - I would definitely say it's a bug in the specification.
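For readers hitting the same naming quirk, a small sketch of the underlying REST endpoint the cell maps to. `ENGINE_NAME` is a placeholder, and the raw-request fallback is only an assumption about how one might work around the client method name:

```python
# Hedged sketch: the notebook cell corresponds to GET _synonyms/<id>.
# ENGINE_NAME is a placeholder synonyms set id.
ENGINE_NAME = "my-search-engine"

# Equivalent raw endpoint, useful if the client method name changes:
endpoint = f"/_synonyms/{ENGINE_NAME}"

# With a connected client, either of these should work:
# elasticsearch.synonyms.get_synonym(id=ENGINE_NAME)   # current (singular) method name
# elasticsearch.perform_request("GET", endpoint)       # raw-request fallback
```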
@@ -186,8 +202,7 @@
" }\n",
" )\n",
"\n",
"\n",
"elasticsearch.query_ruleset.put(ruleset_id=ENGINE_NAME, rules=query_rules)"
"elasticsearch.query_rules.put_ruleset(ruleset_id=ENGINE_NAME, rules=query_rules)\n"
Thanks for updating - this was refactored when we added the CRUD API for individual rules.
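For context on the refactored call, here is a hedged sketch of a ruleset payload as it would be passed to `query_rules.put_ruleset`. The rule id, document ids, and query string are illustrative placeholders, not values from this PR:

```python
# Hedged sketch: since per-rule CRUD was added, rulesets live under the
# `query_rules` namespace. Everything below is a placeholder example.
ENGINE_NAME = "my-search-engine"  # placeholder ruleset id

query_rules = [
    {
        "rule_id": "promote-pugs",  # hypothetical rule id
        "type": "pinned",
        "criteria": [
            {"type": "exact", "metadata": "query_string", "values": ["pugs"]}
        ],
        "actions": {"ids": ["doc-1", "doc-2"]},  # hypothetical document ids
    }
]

# With a connected client:
# elasticsearch.query_rules.put_ruleset(ruleset_id=ENGINE_NAME, rules=query_rules)
```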
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"First, we'll start by defining our source and destination indices. We'll also ensure that if the destination index exists, to delete it first so we start fresh." |
"First, we'll start by defining our source and destination indices. We'll also ensure that if the destination index exists, to delete it first so we start fresh." | |
"First, we'll start by defining our source and destination indices. We'll also ensure the destination index is deleted if it exists, so that we start fresh." |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Next, we'll create our settings which includes filters and and analyzers to use for our text fields.\n", |
Out of scope for this PR, but I feel like this could be a good documentation guide outside of a notebook that we could link to here. Similar to some of the other startup guides that @leemthompo has been creating.
Yep - I have a forthcoming blog post in which I was planning to dive into this a bit more as well.
@@ -711,7 +882,9 @@
"\n",
"If you [enabled and reindexed your data with ELSER](#add-sparse_vector-fields-for-semantic-search-optional), we can now use this to do semantic search.\n",
"For each `sparse_vector` we will generate a `text_expansion` query. These `text_expansion` queries will be added as `should` clauses to a top-level `bool` query.\n",
"We also use `min_score` because we want to exclude less relevant results. "
"We also use `min_score` because we want to exclude less relevant results. \n",
We shouldn't be generating `text_expansion` here if we can help it; we should use `sparse_vector` or, better yet, `semantic` queries instead.
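A minimal sketch of the suggested replacement: a `sparse_vector` query in place of the deprecated `text_expansion` query. The field name, inference endpoint id, and query string are hypothetical placeholders, not values from this PR:

```python
# Hedged sketch of a sparse_vector query, the suggested replacement for
# text_expansion. All names are placeholders.
user_query = "best dog breeds"  # placeholder query string

sv_query = {
    "sparse_vector": {
        "field": "description_semantic",      # hypothetical ELSER-backed field
        "inference_id": "my-elser-endpoint",  # hypothetical inference endpoint id
        "query": user_query,
    }
}

# As before, several of these can be added as `should` clauses to a
# top-level `bool` query, with a connected client:
# elasticsearch.search(index="my-index", query={"bool": {"should": [sv_query]}})
```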
It's all good on my side if @markjhoy wants to follow up on this in a follow-up PR or do it here.
Changing from `text_expansion` to `semantic` will require changing the section on how we index these fields as well - right now the notebook has a guide on how to use ingest pipelines to use ELSER with `text_expansion`.
Using `semantic_text` instead would be much better - we can also point users to E5 when dealing with a non-English dataset.
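To illustrate the `semantic_text` route being discussed, here is a hedged sketch of a mapping with a `semantic_text` field plus the matching `semantic` query. The field names and inference endpoint id are placeholders; for a non-English dataset, the endpoint would point at E5 instead of ELSER:

```python
# Hedged sketch: semantic_text mapping and a semantic query.
# All names are placeholders.
mapping = {
    "properties": {
        "description": {"type": "text"},
        "description_semantic": {
            "type": "semantic_text",
            "inference_id": "my-elser-endpoint",  # hypothetical; use an E5 endpoint for non-English data
        },
    }
}

semantic_query = {
    "semantic": {
        "field": "description_semantic",
        "query": "best dog breeds",  # placeholder query string
    }
}

# With a connected client:
# elasticsearch.indices.create(index="my-index", mappings=mapping)
# elasticsearch.search(index="my-index", query=semantic_query)
```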
I think the changes look good, provided we also address the feedback from Kathleen.
We can make the changes to move from ingest pipelines with ELSER to `semantic_text` in a follow-up PR - it's up to you.
I have not looked into whether the CI failures are related.
Ugh - it looks like the CI failures are version-specific issues... I'll have to see if I can either (a) lock the notebook to 8.16+, or (b) make the code version-specific for backwards compatibility...
Oh wait - never mind this... the errors we are seeing in the CI are actually from other notebooks, namely:
It appears that our migration notebook is fine...
Looks good, thanks for iterating!
I left some feedback, but it is non-blocking. I do think we should consider being retrievers-first here, but I defer to you and the ent search deprecation team on the final decision.
"\n", | ||
"One of the advantages of having our exported index directly in Elasticsearch is that we can easily take advantage of doing semantic search with ELSER. To do this, we'll need to add a `sparse_vector` field to our index, set up an ingest pipeline, and reindex our data.\n", | ||
"One of the advantages of having our exported index directly in Elasticsearch is that we can easily take advantage of doing semantic search with ELSER. To do this, we'll need to add an inference endpoint using ELSER, and a `semantic_text` field to our index to use it.\n", |
Nit: Trying to avoid two uses of `advantage` in the same sentence.
"One of the advantages of having our exported index directly in Elasticsearch is that we can easily take advantage of doing semantic search with ELSER. To do this, we'll need to add an inference endpoint using ELSER, and a `semantic_text` field to our index to use it.\n", | |
"One of the advantages of exporting our index directly to Elasticsearch is that we can easily perform semantic search with ELSER. To do this, we'll need to add an inference endpoint using ELSER, and a `semantic_text` field to our index to use it.\n", |
" inference_config={\n", | ||
" \"service\": \"elasticsearch\",\n", | ||
" \"service_settings\": {\n", | ||
" \"model_id\": \".elser_model_2_linux-x86_64\",\n", |
Should we add a note that this will not work on aarch64? Or is that moot because we're using cloud for the examples?
Yeah - since it's cloud - it's a moot point.
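For readers following along, the inference-endpoint setup shown in the diff above can be sketched as follows. The endpoint id and the allocation settings are illustrative assumptions; the `model_id` is the linux-optimized ELSER model from the diff, which is fine on Elastic Cloud:

```python
# Hedged sketch: creating an ELSER inference endpoint, following the
# diff above. The endpoint id and allocation settings are placeholders.
inference_config = {
    "service": "elasticsearch",
    "service_settings": {
        "model_id": ".elser_model_2_linux-x86_64",  # linux-optimized build; fine on cloud
        "num_allocations": 1,  # assumed values, tune for your deployment
        "num_threads": 1,
    },
}

# With a connected client:
# elasticsearch.inference.put(
#     task_type="sparse_embedding",
#     inference_id="my-elser-endpoint",  # hypothetical endpoint id
#     inference_config=inference_config,
# )
```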
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reindex the data\n",
"Now that we have created the Elasticsearch index and the ingest pipeline, it's time to reindex our data in the new index. The pipeline definition we created above will create a field for each of the `SPARSE_VECTOR_FIELDS` we defined with a `_semantic` suffix, and then infer the sparse vector values from ELSER as the reindex takes place."
"Now that we have created the Elasticsearch index, it's time to reindex our data in the new index. If you are using the `semantic_text` fields as defined above with a `_semantic` suffix, and then the reindexing process with automatically infer the sparse vector values from ELSER and use those for the vectors as the reindex takes place."
"Now that we have created the Elasticsearch index, it's time to reindex our data in the new index. If you are using the `semantic_text` fields as defined above with a `_semantic` suffix, and then the reindexing process with automatically infer the sparse vector values from ELSER and use those for the vectors as the reindex takes place." | |
"Now that we have created the Elasticsearch index, it's time to reindex our data in the new index. If you are using the `semantic_text` fields as defined above with a `_semantic` suffix, and then the reindexing process will automatically infer the sparse vector values from ELSER and use those for the vectors as the reindex takes place." |
@@ -658,7 +816,7 @@
" ]\n",
" }\n",
" },\n",
" \"ruleset_id\": ENGINE_NAME,\n",
" \"ruleset_ids\": ENGINE_NAME,\n",
`ruleset_ids` needs to be an array of strings.
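A minimal sketch of the fix being asked for here: `ruleset_ids` takes a list of ruleset ids, not a bare string. `ENGINE_NAME` and the query string are placeholders:

```python
# Hedged sketch: ruleset_ids must be a list of strings.
ENGINE_NAME = "my-search-engine"  # placeholder ruleset id

rule_query_fragment = {
    "ruleset_ids": [ENGINE_NAME],  # a list, not ENGINE_NAME on its own
    "match_criteria": {"query_string": "pugs"},  # placeholder criteria
}
```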
"print(f\"Query results:\\n{json.dumps(results.body, indent=2)}\\n\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### How to combine App Search queries with ELSER\n", | ||
"### How to combine App Search queries with Semantic Text\n", |
Do we want this to be capitalized?
"\n", | ||
"app_search_query_payload = {\n", | ||
" \"query\": {\n", | ||
" \"rule_query\": {\n", | ||
" \"rule\": {\n", |
It would have been nice to use retrievers here, perhaps we could wrap this query under a standard retriever (or a rule retriever!) or even use RRF?
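The simplest version of this idea is wrapping the existing rule query under a `standard` retriever, sketched below. All names are placeholders; this is one possible shape, not the notebook's actual payload:

```python
# Hedged sketch: wrapping a rule query under a standard retriever so
# the request is retriever-first. All names are placeholders.
ENGINE_NAME = "my-search-engine"  # placeholder index/ruleset name
user_query = "pugs"               # placeholder query string

retriever_payload = {
    "retriever": {
        "standard": {
            "query": {
                "rule": {
                    "organic": {"match": {"title": user_query}},
                    "ruleset_ids": [ENGINE_NAME],
                    "match_criteria": {"query_string": user_query},
                }
            }
        }
    }
}

# With a connected client:
# elasticsearch.search(index=ENGINE_NAME, body=retriever_payload)
```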
Thanks @kderusso! Yeah - for the time being, I'd like to get this version out now and come back to do a pass using retrievers. I still need to brush up a bit more on them myself, but at least this version with the semantic text fields is a solid upgrade.
The failures in the CI are from other notebooks and not this (isolated) one. Going to merge this.
Updates the App Search migration notebook for the following:
- Use `put_ruleset` instead of `put` when adding the query rules (was throwing an error with the latest Elasticsearch client)