
Update App Search Migration Notebook to Correct Some Errors and Warnings, and Add More Descriptive Text #348

Merged
merged 5 commits into main on Nov 8, 2024

Conversation

@markjhoy (Contributor) commented Nov 1, 2024

Updates the App Search migration notebook for the following:

  • Prints the synonym set created after it is created
  • Uses put_ruleset instead of put when adding the query rules (was throwing an error with the latest Elasticsearch client)
  • Breaks the index creation piece up into smaller code chunks, and adds descriptive text for what each part does
  • Adds a bit more clarity around the ELSER model name
  • Adds more description around the App Search query; how it's built and what the sections do.


gitnotebooks bot commented Nov 1, 2024

Found 1 changed notebook. Review the changes at https://app.gitnotebooks.com/elastic/elasticsearch-labs/pull/348

@kderusso (Member) left a comment

The changes look good!

The high level feedback I have though, outside of some of the specific feedback that I gave, is that we should consider pointing people to retrievers in these examples. This will be especially important for the rule query which will not work with rerankers (but the rule retriever will work with rerank retrievers such as RRF or semantic reranking). It wouldn't be too hard to migrate these from the query DSL to retriever syntax and it has the added advantage that it will be consistent with what we do in Playground. Thanks!
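For a concrete picture of that suggestion, here is a rough sketch of a rule retriever wrapping a standard retriever (the index, field, and criteria values below are illustrative assumptions, not the notebook's actual ones; rule retrievers require a recent Elasticsearch version):

    # Hedged sketch: rule retriever wrapping a standard retriever.
    # DEST_INDEX, the "title" field, and the criteria values are assumptions.
    response = elasticsearch.search(
        index=DEST_INDEX,
        retriever={
            "rule": {
                "retriever": {
                    "standard": {"query": {"match": {"title": "sample user search"}}}
                },
                "ruleset_ids": [ENGINE_NAME],
                "match_criteria": {"query": "sample user search"},
            }
        },
    )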

"metadata": {},
"outputs": [],
"source": [
"print(json.dumps(elasticsearch.synonyms.get_synonym(id=ENGINE_NAME).body, indent=2))"
Member:

I think this is a client bug, because GET synonyms is the appropriate call to use here; I would think this should be plural. (See documentation here)

Contributor Author:

I would tend to agree - but the client has it as get_synonym ... :/

[screenshot showing the client's get_synonym method]

Member:

Yeah, I ran through the notebook to verify; I would definitely say it's a bug in the specification.

@@ -186,8 +202,7 @@
" }\n",
" )\n",
"\n",
"\n",
"elasticsearch.query_ruleset.put(ruleset_id=ENGINE_NAME, rules=query_rules)"
"elasticsearch.query_rules.put_ruleset(ruleset_id=ENGINE_NAME, rules=query_rules)\n"
Member:

Thanks for updating - this was refactored when we added the CRUD API for individual rules.
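For reference, a minimal sketch of the updated client call (the rule definition here is illustrative, not the notebook's exact rules):

    # Hedged sketch of query_rules.put_ruleset; the rule contents are assumptions.
    query_rules = [
        {
            "rule_id": "rule-1",
            "type": "pinned",
            "criteria": [
                {"type": "exact", "metadata": "query", "values": ["sample user search"]}
            ],
            "actions": {"ids": ["doc-1"]},
        }
    ]
    elasticsearch.query_rules.put_ruleset(ruleset_id=ENGINE_NAME, rules=query_rules)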

"cell_type": "markdown",
"metadata": {},
"source": [
"First, we'll start by defining our source and destination indices. We'll also ensure that if the destination index exists, to delete it first so we start fresh."
Member:

Suggested change
"First, we'll start by defining our source and destination indices. We'll also ensure that if the destination index exists, to delete it first so we start fresh."
"First, we'll start by defining our source and destination indices. We'll also ensure the destination index is deleted if it exists, so that we start fresh."

"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll create our settings which includes filters and and analyzers to use for our text fields.\n",
Member:

Out of the scope of this PR, but I feel like this could be a good documentation guide outside of a notebook, that we could link to here. Similar to some of the other startup guides that @leemthompo has been creating.

Contributor Author:

Yep - I have a forthcoming blog post that I was planning on diving into this a bit more as well.
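In the meantime, the kind of analysis settings that cell builds looks roughly like this (a hedged sketch; the filter and analyzer shown are modeled on App Search's defaults, not necessarily the notebook's exact definitions):

    # Hedged sketch of App Search-style analysis settings.
    index_settings = {
        "analysis": {
            "filter": {
                "front_ngram": {"type": "edge_ngram", "min_gram": 1, "max_gram": 12}
            },
            "analyzer": {
                "i_prefix": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["cjk_width", "lowercase", "asciifolding", "front_ngram"],
                }
            },
        }
    }
    elasticsearch.indices.create(index=DEST_INDEX, settings=index_settings)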

@@ -711,7 +882,9 @@
"\n",
"If you [enabled and reindexed your data with ELSER](#add-sparse_vector-fields-for-semantic-search-optional), we can now use this to do semantic search.\n",
"For each `spare_vector` we will generate a `text_expansion` query. These `text_expansion` queries will be added as `should` clauses to a top-level `bool` query.\n",
"We also use `min_score` because we want to exclude less relevant results. "
"We also use `min_score` because we want to exclude less relevant results. \n",
Member:

We shouldn't be generating text_expansion here if we can help it, we should use sparse_vector or better yet semantic queries instead.

Reply:

It's all good on my side if @markjhoy wants to follow up on this in a follow-up PR or do it here.
Changing from text_expansion to the semantic query will require changing the section on how we index these fields as well; right now the notebook has a guide on using ingest pipelines to run ELSER with text_expansion.
Using semantic_text instead would be much better; we can also point users to E5 when dealing with a non-English dataset.
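For comparison, the semantic query being suggested reduces the search side to something like this (a hedged sketch; the field name is an assumption):

    # Hedged sketch: querying a semantic_text field with the `semantic` query.
    response = elasticsearch.search(
        index=DEST_INDEX,
        query={
            "semantic": {
                "field": "description_semantic",  # assumed field name
                "query": "sample user search",
            }
        },
    )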

@ioanatia left a comment

I think the changes look good, provided we also address the feedback from Kathleen.
We can make the changes to move from ingest pipelines with ELSER to semantic_text in a follow-up PR or here; it's up to you.

I have not looked into whether the CI failures are related.

@markjhoy (Contributor Author) commented Nov 5, 2024

I have not looked into whether the CI failures are related.

Ugh - it looks like the CI failures are version-specific issues... I'll have to see if I can either (a) lock the notebook to 8.16+, or (b) make the code version-specific for backwards compatibility...

@markjhoy (Contributor Author) commented Nov 5, 2024
Oh wait - never mind this... the errors we are seeing in the CI are actually from other notebooks, namely:

  • notebooks/document-chunking/with-index-pipelines.ipynb... Failed
  • notebooks/document-chunking/with-langchain-splitters.ipynb... Failed
  • notebooks/integrations/hugging-face/loading-model-from-hugging-face.ipynb... Failed
  • notebooks/langchain/langchain-using-own-model.ipynb... Failed

It appears that our migration notebook is fine...

@markjhoy (Contributor Author) commented Nov 7, 2024

I've updated the notebook to use semantic_text search. That new field type makes it so easy to do semantic search now.

@kderusso and @ioanatia

@kderusso (Member) left a comment

Looks good, thanks for iterating!

I left some feedback, but it is non-blocking. I do think we should consider being retrievers-first here, but I defer to you and the ent search deprecation team on the final decision.

"\n",
"One of the advantages of having our exported index directly in Elasticsearch is that we can easily take advantage of doing semantic search with ELSER. To do this, we'll need to add a `sparse_vector` field to our index, set up an ingest pipeline, and reindex our data.\n",
"One of the advantages of having our exported index directly in Elasticsearch is that we can easily take advantage of doing semantic search with ELSER. To do this, we'll need to add an inference endpoint using ELSER, and a `semantic_text` field to our index to use it.\n",
Member:

Nit: Trying to avoid two uses of advantage in the same sentence

Suggested change
"One of the advantages of having our exported index directly in Elasticsearch is that we can easily take advantage of doing semantic search with ELSER. To do this, we'll need to add an inference endpoint using ELSER, and a `semantic_text` field to our index to use it.\n",
"One of the advantages of exporting our index directly to Elasticsearch is that we can easily perform semantic search with ELSER. To do this, we'll need to add an inference endpoint using ELSER, and a `semantic_text` field to our index to use it.\n",

" inference_config={\n",
" \"service\": \"elasticsearch\",\n",
" \"service_settings\": {\n",
" \"model_id\": \".elser_model_2_linux-x86_64\",\n",
Member:

Should we add a note that this will not work on aarch64? Or is that moot because we're using cloud for the examples?

Contributor Author:

Yeah - since it's cloud - it's a moot point.
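Putting the pieces from this diff together, the setup looks roughly like the following (a hedged sketch; the inference endpoint id and field name are assumptions):

    # Hedged sketch: create an ELSER inference endpoint...
    elasticsearch.inference.put(
        task_type="sparse_embedding",
        inference_id="my-elser-endpoint",  # assumed id
        inference_config={
            "service": "elasticsearch",
            "service_settings": {
                "model_id": ".elser_model_2_linux-x86_64",
                "num_allocations": 1,
                "num_threads": 1,
            },
        },
    )

    # ...and map a semantic_text field that uses it.
    elasticsearch.indices.put_mapping(
        index=DEST_INDEX,
        properties={
            "description_semantic": {  # assumed field name
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint",
            }
        },
    )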

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reindex the data\n",
"Now that we have created the Elasticsearch index and the ingest pipeline, it's time to reindex our data in the new index. The pipeline definition we created above will create a field for each of the `SPARSE_VECTOR_FIELDS` we defined with a `_semantic` suffix, and then infer the sparse vector values from ELSER as the reindex takes place."
"Now that we have created the Elasticsearch index, it's time to reindex our data in the new index. If you are using the `semantic_text` fields as defined above with a `_semantic` suffix, and then the reindexing process with automatically infer the sparse vector values from ELSER and use those for the vectors as the reindex takes place."
Member:

Suggested change
"Now that we have created the Elasticsearch index, it's time to reindex our data in the new index. If you are using the `semantic_text` fields as defined above with a `_semantic` suffix, and then the reindexing process with automatically infer the sparse vector values from ELSER and use those for the vectors as the reindex takes place."
"Now that we have created the Elasticsearch index, it's time to reindex our data in the new index. If you are using the `semantic_text` fields as defined above with a `_semantic` suffix, and then the reindexing process will automatically infer the sparse vector values from ELSER and use those for the vectors as the reindex takes place."

@@ -658,7 +816,7 @@
" ]\n",
" }\n",
" },\n",
" \"ruleset_id\": ENGINE_NAME,\n",
" \"ruleset_ids\": ENGINE_NAME,\n",
Member:

ruleset_ids needs to be an array of strings
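In other words, something like this (the criteria and organic query are illustrative):

    # Hedged sketch: ruleset_ids takes a list of strings, not a bare string.
    rule_query = {
        "rule": {
            "ruleset_ids": [ENGINE_NAME],
            "match_criteria": {"query": "sample user search"},  # assumed criteria
            "organic": {"match": {"title": "sample user search"}},  # assumed organic query
        }
    }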

"print(f\"Query results:\\n{json.dumps(results.body, indent=2)}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How to combine App Search queries with ELSER\n",
"### How to combine App Search queries with Semantic Text\n",
Member:

Do we want this to be capitalized?

"\n",
"app_search_query_payload = {\n",
" \"query\": {\n",
" \"rule_query\": {\n",
" \"rule\": {\n",
Member:

It would have been nice to use retrievers here, perhaps we could wrap this query under a standard retriever (or a rule retriever!) or even use RRF?
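One possible shape for that, sketched under the assumption of a semantic_text field named description_semantic (none of these names come from the notebook itself):

    # Hedged sketch: a rule retriever wrapping an RRF retriever that combines
    # the lexical (organic) query with a semantic query.
    response = elasticsearch.search(
        index=DEST_INDEX,
        retriever={
            "rule": {
                "retriever": {
                    "rrf": {
                        "retrievers": [
                            {
                                "standard": {
                                    "query": {"match": {"title": "sample user search"}}
                                }
                            },
                            {
                                "standard": {
                                    "query": {
                                        "semantic": {
                                            "field": "description_semantic",
                                            "query": "sample user search",
                                        }
                                    }
                                }
                            },
                        ]
                    }
                },
                "ruleset_ids": [ENGINE_NAME],
                "match_criteria": {"query": "sample user search"},
            }
        },
    )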

@markjhoy (Contributor Author) commented Nov 7, 2024

I left some feedback, but it is non-blocking. I do think we should consider being retrievers-first here, but I defer to you and the ent search deprecation team on the final decision.

Thanks @kderusso! Yeah - for the time being, I'd like to get this version out now and come back to do a pass using retrievers. I still need to brush up on them a bit myself, but at least this version with the semantic_text fields is a solid upgrade.

@markjhoy (Contributor Author) commented Nov 8, 2024

The failures in the CI are from other notebooks and not this (isolated) one. Going to merge this.

@markjhoy merged commit 90891b7 into main on Nov 8, 2024
3 of 5 checks passed