From b59f3c7ea774464c772de6a206b595cfff25fe02 Mon Sep 17 00:00:00 2001
From: Max Jakob
Date: Wed, 18 Sep 2024 10:51:29 +0200
Subject: [PATCH] Chunking notebooks: mention semantic_text (#280)

* Chunking notebooks: mention semantic_text
* refer to 8.15
* add link to notebook
---
 notebooks/document-chunking/tokenization.ipynb         | 9 ++++++++-
 notebooks/document-chunking/with-index-pipelines.ipynb | 9 ++++++++-
 .../document-chunking/with-langchain-splitters.ipynb   | 9 ++++++++-
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/notebooks/document-chunking/tokenization.ipynb b/notebooks/document-chunking/tokenization.ipynb
index 74e64ee2..87678838 100644
--- a/notebooks/document-chunking/tokenization.ipynb
+++ b/notebooks/document-chunking/tokenization.ipynb
@@ -15,7 +15,14 @@
     "\n",
     "For users of Elasticsearch it is important to know how texts are broken up into tokens because currently only the [first 512 tokens per field](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512) are considered. This means that when you index longer texts, all tokens after the 512th are ignored in your semantic search. Hence it is valuable to know the number of tokens for your input texts before choosing the right model and indexing method.\n",
     "\n",
-    "Currently it is not possible to get the token count information via the API, so here we share the code for calculating token counts. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing. Currently (as of version 8.12) this has to be done by the user. Future versions will remove this necessity and Elasticsearch will automatically create chunks behind the scenes."
+    "Currently it is not possible to get the token count information via the API, so here we share the code for calculating token counts. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing.\n",
+    "\n",
+    "# Prefer the `semantic_text` field type\n",
+    "\n",
+    "**Elasticsearch version 8.15 introduced the [`semantic_text`](https://www.elastic.co/guide/en/elasticsearch/reference/master/semantic-text.html) field type which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into this:**\n",
+    "\n",
+    "- ****\n",
+    "- ****"
    ]
   },
   {

diff --git a/notebooks/document-chunking/with-index-pipelines.ipynb b/notebooks/document-chunking/with-index-pipelines.ipynb
index e58c1273..472c592b 100644
--- a/notebooks/document-chunking/with-index-pipelines.ipynb
+++ b/notebooks/document-chunking/with-index-pipelines.ipynb
@@ -13,7 +13,14 @@
     "This interactive notebook will:\n",
     "- load the model \"sentence-transformers__all-minilm-l6-v2\" from Hugging Face and into Elasticsearch ML Node\n",
     "- create an index and ingest pipeline that will chunk large fields into smaller passages and vectorize them using the model\n",
-    "- perform a search and return docs with the most relevant passages"
+    "- perform a search and return docs with the most relevant passages\n",
+    "\n",
+    "# Prefer the `semantic_text` field type\n",
+    "\n",
+    "**Elasticsearch version 8.15 introduced the [`semantic_text`](https://www.elastic.co/guide/en/elasticsearch/reference/master/semantic-text.html) field type which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into this:**\n",
+    "\n",
+    "- ****\n",
+    "- ****"
    ]
   },
   {

diff --git a/notebooks/document-chunking/with-langchain-splitters.ipynb b/notebooks/document-chunking/with-langchain-splitters.ipynb
index 34870c42..b45901f6 100644
--- a/notebooks/document-chunking/with-langchain-splitters.ipynb
+++ b/notebooks/document-chunking/with-langchain-splitters.ipynb
@@ -12,7 +12,14 @@
     "This interactive notebook will:\n",
     "- load the model \"sentence-transformers__all-minilm-l6-v2\" from Hugging Face and into Elasticsearch ML Node\n",
     "- Use LangChain splitters to chunk the passages into sentences and index them into Elasticsearch with nested dense vector\n",
-    "- perform a search and return docs with the most relevant passages"
+    "- perform a search and return docs with the most relevant passages\n",
+    "\n",
+    "# Prefer the `semantic_text` field type\n",
+    "\n",
+    "**Elasticsearch version 8.15 introduced the [`semantic_text`](https://www.elastic.co/guide/en/elasticsearch/reference/master/semantic-text.html) field type which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into this:**\n",
+    "\n",
+    "- ****\n",
+    "- ****"
    ]
   },
   {
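The `semantic_text` recommendation this patch adds can be sketched as an index mapping. This snippet is not part of the PR itself; it is a minimal illustration, and the field name and the `inference_id` value are assumptions chosen for the example:

```python
# Sketch of an index mapping using the semantic_text field type
# (available in Elasticsearch 8.15+). With this mapping, Elasticsearch
# chunks and embeds the "content" field behind the scenes, so the manual
# chunking shown in the notebooks above is no longer required.
# The field name and inference endpoint id are illustrative assumptions.
mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint",
            }
        }
    }
}

# With the official Python client, the mapping would be applied roughly as:
#   from elasticsearch import Elasticsearch
#   client = Elasticsearch(...)
#   client.indices.create(index="my-index", **mapping)
```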