refine full_text_search_with_langchain
Signed-off-by: ChengZi <[email protected]>
zc277584121 committed Jan 13, 2025
1 parent ea49f33 commit 4fd8ad2
Showing 1 changed file with 22 additions and 22 deletions.
@@ -22,7 +22,7 @@
"\n",
"[Full-text search](https://milvus.io/docs/full-text-search.md#Full-Text-Search) is a traditional method for retrieving documents that contain specific terms or phrases by directly matching keywords within the text. It ranks results based on relevance, typically determined by factors such as term frequency and proximity. While semantic search excels at understanding intent and context, full-text search provides precision for exact keyword matching, making it a valuable complementary tool. The BM25 algorithm is a popular ranking method for full-text search, particularly useful in Retrieval-Augmented Generation (RAG).\n",
"\n",
"Since [Milvus 2.5](https://milvus.io/blog/introduce-milvus-2-5-full-text-search-powerful-metadata-filtering-and-more.md), full-text search is natively supported through the `Sparse-BM25` approach, by representing the BM25 algorithm as sparse vectors. Milvus accepts raw text as input and automatically converts it into sparse vectors stored in a specified field, eliminating the need for manual sparse embedding generation.\n",
"Since [Milvus 2.5](https://milvus.io/blog/introduce-milvus-2-5-full-text-search-powerful-metadata-filtering-and-more.md), full-text search is natively supported through the Sparse-BM25 approach, by representing the BM25 algorithm as sparse vectors. Milvus accepts raw text as input and automatically converts it into sparse vectors stored in a specified field, eliminating the need for manual sparse embedding generation.\n",
"\n",
"LangChain's integration with Milvus has also introduced this feature, simplifying the process of incorporating full-text search into RAG applications. By combining full-text search with semantic search with dense vectors, you can achieve a hybrid approach that leverages both semantic context from dense embeddings and precise keyword relevance from word matching. This integration enhances the accuracy, relevance, and user experience of search systems.\n",
"\n",
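To make the BM25 ranking described above concrete, here is a plain-Python sketch of the scoring formula. This is an illustrative standalone implementation, not the Sparse-BM25 code inside Milvus; the whitespace tokenization and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions for the sketch:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 formula.

    Illustrative sketch only: tokenization is naive whitespace splitting,
    with no analyzer, stemming, or stop-word handling.
    """
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)  # average doc length
    n = len(docs)
    df = Counter()  # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)  # term frequency within this doc
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["I like this apple", "I like swimming", "I like dogs"]
print(bm25_scores("apple", docs))  # only the first doc contains "apple"
```

Milvus performs this computation internally over sparse vectors, so you never call anything like this yourself; the sketch only shows why exact keyword matches rank highly.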
@@ -97,7 +97,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Install and start the Milvus server following this [guide](https://milvus.io/docs/install_standalone-docker-compose.md). And set your Milvus server `URI` (or optional `TOKEN`)"
"Specify your Milvus server `URI` (and optionally the `TOKEN`). To install and start the Milvus server, follow this [guide](https://milvus.io/docs/install_standalone-docker-compose.md)."
]
},
{
@@ -126,7 +126,7 @@
"from langchain_core.documents import Document\n",
"\n",
"docs = [\n",
" Document(page_content=\"I like apple\", metadata={\"category\": \"fruit\"}),\n",
" Document(page_content=\"I like this apple\", metadata={\"category\": \"fruit\"}),\n",
" Document(page_content=\"I like swimming\", metadata={\"category\": \"sport\"}),\n",
" Document(page_content=\"I like dogs\", metadata={\"category\": \"pets\"}),\n",
"]"
@@ -181,17 +181,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the code above, we define an instance of `BM25BuiltInFunction` and pass it to the `Milvus` object. `BM25BuiltInFunction` is a lightweight wrapper class for the [`Function`](https://milvus.io/docs/manage-collections.md#Function) in Milvus.\n",
"In the code above, we define an instance of `BM25BuiltInFunction` and pass it to the `Milvus` object. `BM25BuiltInFunction` is a lightweight wrapper class for [`Function`](https://milvus.io/docs/manage-collections.md#Function) in Milvus.\n",
"\n",
"You can specify the input and output fields for this function in the parameters of the `BM25BuiltInFunction` instance by passing the following two field parameters:\n",
"You can specify the input and output fields for this function in the parameters of the `BM25BuiltInFunction`:\n",
"- `input_field_names` (str): The name of the input field, default is `text`. It indicates which field this function reads as input.\n",
"- `output_field_names` (str): The name of the output field, default is `sparse`. It indicates which field this function outputs the computed result to.\n",
"\n",
"Note that in the Milvus initialization parameters mentioned above, we also specify `vector_field=[\"dense\", \"sparse\"]`. Since the `sparse` field is the output field defined by the `BM25BuiltInFunction`, the other `dense` field will be automatically assigned to the output field of OpenAIEmbeddings.\n",
"Note that in the Milvus initialization parameters mentioned above, we also specify `vector_field=[\"dense\", \"sparse\"]`. Since the `sparse` field is the output field of the `BM25BuiltInFunction`, the other `dense` field will be automatically assigned as the output field of OpenAIEmbeddings.\n",
"\n",
"In practice, especially when combining multiple embeddings or functions, we recommend clearly specifying the input and output fields for each function to avoid confusion.\n",
"In practice, especially when combining multiple embeddings or functions, we recommend explicitly specifying the input and output fields for each function to avoid ambiguity.\n",
"\n",
"In the following example, it specifies the input and output fields of BM25BuiltInFunction, and three vector fields, which makes it clear which field each built-in function and each vector embedding.\n"
"In the following example, we explicitly specify the input and output fields of `BM25BuiltInFunction`, making it clear which field the built-in function reads from and writes to.\n"
]
},
{
@@ -241,16 +241,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we have three vector fields. Among them, `sparse` is used as the output field for `BM25BuiltInFunction`, while the other two, `dense1` and `dense2`, are automatically assigned as the output fields for the two `OpenAIEmbeddings` models. \n",
"In this example, we have three vector fields. Among them, `sparse` is used as the output field for `BM25BuiltInFunction`, while the other two, `dense1` and `dense2`, are automatically assigned as the output fields for the two `OpenAIEmbeddings` models (based on the order). \n",
"\n",
"In this way, you can define multiple vector fields and assign different combinations of embeddings or functions to them, enabling hybrid search.\n"
"In this way, you can define multiple vector fields and assign different combinations of embeddings or functions to them, to implement hybrid search."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When performing hybrid search, we just pass in the query text and optionally set the topK and reranker parameters. The `vectorstore` instance will automatically handle the vector embeddings and built-in functions and finally use a reranker to refine the results. From the user's end, we don't need to care about the underlying implementation details of the searching process."
"When performing hybrid search, we just need to pass in the query text and optionally set the topK and reranker parameters. The `vectorstore` instance will automatically handle the vector embeddings and built-in functions and finally use a reranker to refine the results. The underlying implementation details of the searching process are hidden from the user."
]
},
{
@@ -261,7 +261,7 @@
{
"data": {
"text/plain": [
"[Document(metadata={'pk': 454646931479251826, 'category': 'fruit'}, page_content='I like apple')]"
"[Document(metadata={'category': 'fruit', 'pk': 454646931479251897}, page_content='I like this apple')]"
]
},
"execution_count": 6,
@@ -271,15 +271,15 @@
],
"source": [
"vectorstore.similarity_search(\n",
" \"Do I like apple?\", k=1\n",
" \"Do I like apples?\", k=1\n",
") # , ranker_type=\"weighted\", ranker_params={\"weights\":[0.3, 0.3, 0.4]})"
]
},
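The commented-out `ranker_type="weighted"` parameter above hints at how hybrid results are fused. Below is a hypothetical standalone sketch of weighted-sum score fusion over normalized dense and sparse scores; the function name and the min-max normalization choice are illustrative assumptions, not the langchain-milvus internals:

```python
def weighted_rerank(dense_scores, sparse_scores, weights=(0.5, 0.5)):
    """Fuse dense (semantic) and sparse (BM25) scores with a weighted sum.

    Illustrative sketch: each channel is min-max normalized so the weights
    act on comparable ranges before the sum.
    """
    def normalize(scores):
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

    dense, sparse = normalize(dense_scores), normalize(sparse_scores)
    return [weights[0] * d + weights[1] * s for d, s in zip(dense, sparse)]

# With sparse weighted higher, the keyword-strong second doc wins overall
# even though the first doc has the best dense (semantic) score.
print(weighted_rerank([0.9, 0.2, 0.1], [0.1, 0.8, 0.3], weights=(0.3, 0.7)))
```

In the notebook itself this fusion happens server-side; passing `ranker_params={"weights": [...]}` controls the per-field weights without any client-side code like this.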
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information about how to use the hybrid search, you can refer to the [Hybrid Search introduction](https://milvus.io/docs/multi-vector-search.md#Hybrid-Search) and this [LangChain Milvus hybrid search tutorial](https://milvus.io/docs/milvus_hybrid_search_retriever.md) ."
"For more information about hybrid search, you can refer to the [Hybrid Search introduction](https://milvus.io/docs/multi-vector-search.md#Hybrid-Search) and this [LangChain Milvus hybrid search tutorial](https://milvus.io/docs/milvus_hybrid_search_retriever.md) ."
]
},
{
@@ -288,7 +288,7 @@
"source": [
"### BM25 search without embedding\n",
"\n",
"If you want to perform lexical frequency-based full-text search using only a single BM25 function without using any embedding-based semantic similarity search, you can set the embedding parameter input to `None` and keep only the `builtin_function` parameter input as the BM25 function instance. For example: "
"If you want to perform only full-text search with the BM25 function, without any embedding-based semantic search, you can set the embedding parameter to `None` and keep only the `builtin_function` specified as the BM25 function instance. The vector field then contains only the \"sparse\" field. For example: "
]
},
{
@@ -331,11 +331,11 @@
"source": [
"## Customize analyzer\n",
"\n",
"Analyzers are essential tools in text processing that convert raw text into structured, searchable formats. They play a key role in enabling efficient indexing and retrieval by breaking down input text into tokens and refining these tokens through a combination of tokenizers and filters. For more information, you can refer [this guide](https://milvus.io/docs/analyzer-overview.md#Analyzer-Overview) to learn more about analyzers in Milvus.\n",
"Analyzers are essential in full-text search: they break sentences into tokens and perform lexical analysis such as stemming and stop-word removal. Analyzers are usually language-specific. You can refer to [this guide](https://milvus.io/docs/analyzer-overview.md#Analyzer-Overview) to learn more about analyzers in Milvus.\n",
"\n",
"Milvus supports two types of analyzers: **Built-in Analyzers** and **Custom Analyzers**. By default, the `BM25BuiltInFunction` will use the [default standard analyzer](https://milvus.io/docs/standard-analyzer.md), which makes it effective for most languages. \n",
"Milvus supports two types of analyzers: **Built-in Analyzers** and **Custom Analyzers**. By default, the `BM25BuiltInFunction` will use the [standard built-in analyzer](https://milvus.io/docs/standard-analyzer.md), a basic analyzer that splits text on whitespace and punctuation. \n",
"\n",
"However, if you want to use a different analyzer or customize the analyzer, you can pass in the `analyzer_params` parameter in the `BM25BuiltInFunction` initialization.\n",
"If you want to use a different analyzer or customize the analyzer, you can pass in the `analyzer_params` parameter in the `BM25BuiltInFunction` initialization.\n",
"\n"
]
},
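To make the tokenizer-plus-filters pipeline concrete, here is a minimal standalone sketch of an analyzer chain (standard-style tokenization, then lowercase, length, and stop-word filters, mirroring the `analyzer_params` used below). It is purely illustrative and independent of Milvus' actual analyzer implementation; the stop-word set and length cap are assumptions:

```python
import re

def analyze(text, stop_words=frozenset({"of", "to"}), max_len=40):
    """Sketch of an analyzer chain: tokenize on non-word characters,
    lowercase, then drop over-long tokens and stop words."""
    tokens = re.findall(r"\w+", text.lower())  # standard-style tokenization
    return [t for t in tokens if len(t) <= max_len and t not in stop_words]

print(analyze("Analyzers convert raw text to searchable tokens!"))
# → ['analyzers', 'convert', 'raw', 'text', 'searchable', 'tokens']
```

In the notebook, the equivalent pipeline is expressed declaratively through `analyzer_params` and executed inside Milvus at index and query time.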
@@ -387,7 +387,7 @@
{
"data": {
"text/plain": [
"{'auto_id': True, 'description': '', 'fields': [{'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535, 'enable_match': True, 'enable_analyzer': True, 'analyzer_params': {'tokenizer': 'standard', 'filter': ['lowercase', {'type': 'length', 'max': 40}, {'type': 'stop', 'stop_words': ['of', 'to']}]}}}, {'name': 'pk', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'dense', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 1536}}, {'name': 'sparse', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'is_function_output': True}, {'name': 'category', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}], 'enable_dynamic_field': False, 'functions': [{'name': 'bm25_function_333d45a1', 'description': '', 'type': <FunctionType.BM25: 1>, 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}"
"{'auto_id': True, 'description': '', 'fields': [{'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535, 'enable_match': True, 'enable_analyzer': True, 'analyzer_params': {'tokenizer': 'standard', 'filter': ['lowercase', {'type': 'length', 'max': 40}, {'type': 'stop', 'stop_words': ['of', 'to']}]}}}, {'name': 'pk', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'dense', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 1536}}, {'name': 'sparse', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'is_function_output': True}, {'name': 'category', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}], 'enable_dynamic_field': False, 'functions': [{'name': 'bm25_function_de368e79', 'description': '', 'type': <FunctionType.BM25: 1>, 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}"
]
},
"execution_count": 9,
@@ -410,8 +410,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Best practice of RAG\n",
"We have learned how to use the basic BM25 build-in function in LangChain and Milvus. Let's introduce the best practice of RAG in combination with this usage.\n",
"## Using Hybrid Search and Reranking in RAG\n",
"We have learned how to use the basic BM25 built-in function in LangChain and Milvus. Let's introduce an optimized RAG implementation with hybrid search and reranking.\n",
"\n",
"\n",
"![](../../../../images/advanced_rag/hybrid_and_rerank.png)\n",
@@ -617,7 +617,7 @@
{
"data": {
"text/plain": [
"'PAL (Program-aided Language models) and PoT (Program of Thoughts prompting) are approaches that involve using language models to generate programming language statements to solve natural language reasoning problems. This method offloads the solution step to a runtime, such as a Python interpreter, allowing for complex computation and reasoning to be handled externally. PAL and PoT rely on language models with strong coding skills to effectively perform these tasks.'"
"'PAL (Program-aided Language models) and PoT (Program of Thoughts prompting) are approaches that involve using language models to generate programming language statements to solve natural language reasoning problems. This method offloads the solution step to a runtime, such as a Python interpreter, allowing for complex computation and reasoning to be handled externally. PAL and PoT rely on language models with strong coding skills to effectively generate and execute these programming statements.'"
]
},
"execution_count": 15,
