diff --git a/bootcamp/tutorials/integration/build_RAG_with_milvus_and_deepseek.ipynb b/bootcamp/tutorials/integration/build_RAG_with_milvus_and_deepseek.ipynb new file mode 100644 index 000000000..07bdf06ed --- /dev/null +++ b/bootcamp/tutorials/integration/build_RAG_with_milvus_and_deepseek.ipynb @@ -0,0 +1,608 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Build RAG with Milvus and DeepSeek\n", + "\n", + "[DeepSeek](https://www.deepseek.com/) enables developers to build and scale AI applications with high-performance language models. It offers efficient inference, flexible APIs, and advanced Mixture-of-Experts (MoE) architectures for robust reasoning and retrieval tasks. \n", + "\n", + "In this tutorial, we’ll show you how to build a Retrieval-Augmented Generation (RAG) pipeline using Milvus and DeepSeek." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Preparation\n", + "### Dependencies and Environment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "! pip install --upgrade \"pymilvus[model]\" openai requests tqdm" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false + }, + "source": [ + "> If you are using Google Colab, you may need to **restart the runtime** to enable the dependencies just installed (click the \"Runtime\" menu at the top of the screen, and select \"Restart session\" from the dropdown menu)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "DeepSeek provides an OpenAI-compatible API. You can log in to its official website and prepare the [API key](https://platform.deepseek.com/api_keys) `DEEPSEEK_API_KEY` as an environment variable."
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "ExecuteTime": { + "end_time": "2025-01-02T07:37:28.876916Z", + "start_time": "2025-01-02T07:37:28.868684Z" + }, + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "os.environ[\"DEEPSEEK_API_KEY\"] = \"***********\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare the data\n", + "\n", + "We use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge base for our RAG pipeline, which is a good data source for a simple RAG setup.\n", + "\n", + "Download the zip file and extract the documents to the folder `milvus_docs`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip\n", + "! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We load all markdown files from the folder `milvus_docs/en/faq`. For each document, we simply use \"# \" as a separator, which roughly splits the file into its main sections."
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "ExecuteTime": { + "end_time": "2025-01-02T07:37:35.050694Z", + "start_time": "2025-01-02T07:37:35.044643Z" + } + }, + "outputs": [], + "source": [ + "from glob import glob\n", + "\n", + "text_lines = []\n", + "\n", + "for file_path in glob(\"milvus_docs/en/faq/*.md\", recursive=True):\n", + "    with open(file_path, \"r\") as file:\n", + "        file_text = file.read()\n", + "\n", + "    text_lines += file_text.split(\"# \")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare the LLM and Embedding Model\n", + "\n", + "DeepSeek provides an OpenAI-compatible API, so you can use the standard OpenAI client with minor adjustments to call the LLM." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "ExecuteTime": { + "end_time": "2025-01-02T07:37:36.799239Z", + "start_time": "2025-01-02T07:37:36.773324Z" + } + }, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "\n", + "deepseek_client = OpenAI(\n", + "    api_key=os.environ[\"DEEPSEEK_API_KEY\"],\n", + "    base_url=\"https://api.deepseek.com\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define an embedding model to generate text embeddings using the `milvus_model` module. We use the `DefaultEmbeddingFunction` model as an example, which is a pre-trained and lightweight embedding model." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "ExecuteTime": { + "end_time": "2025-01-02T07:37:38.654065Z", + "start_time": "2025-01-02T07:37:38.486294Z" + } + }, + "outputs": [], + "source": [ + "from pymilvus import model as milvus_model\n", + "\n", + "embedding_model = milvus_model.DefaultEmbeddingFunction()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Generate a test embedding and print its dimension and first few elements."
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "ExecuteTime": { + "end_time": "2025-01-02T07:37:40.127301Z", + "start_time": "2025-01-02T07:37:39.975947Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "768\n", + "[-0.04836066  0.07163023 -0.01130064 -0.03789345 -0.03320649 -0.01318448\n", + " -0.03041712 -0.02269499 -0.02317863 -0.00426028]\n" + ] + } + ], + "source": [ + "test_embedding = embedding_model.encode_queries([\"This is a test\"])[0]\n", + "embedding_dim = len(test_embedding)\n", + "print(embedding_dim)\n", + "print(test_embedding[:10])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load data into Milvus" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create the Collection" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "ExecuteTime": { + "end_time": "2025-01-02T07:37:54.533232Z", + "start_time": "2025-01-02T07:37:54.524949Z" + } + }, + "outputs": [], + "source": [ + "from pymilvus import MilvusClient\n", + "\n", + "milvus_client = MilvusClient(uri=\"./milvus_demo.db\")\n", + "\n", + "collection_name = \"my_rag_collection\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false + }, + "source": [ + "> Regarding the argument of `MilvusClient`:\n", + "> - Setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n", + "> - If you have a large amount of data, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). 
In this setup, please use the server URI, e.g. `http://localhost:19530`, as your `uri`.\n", + "> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check whether the collection already exists and drop it if it does." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "ExecuteTime": { + "end_time": "2025-01-02T07:37:57.334098Z", + "start_time": "2025-01-02T07:37:57.327142Z" + } + }, + "outputs": [], + "source": [ + "if milvus_client.has_collection(collection_name):\n", + "    milvus_client.drop_collection(collection_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a new collection with the specified parameters.\n", + "\n", + "If we don't specify any field information, Milvus will automatically create a default `id` field for the primary key and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "ExecuteTime": { + "end_time": "2025-01-02T07:38:00.902007Z", + "start_time": "2025-01-02T07:38:00.383848Z" + } + }, + "outputs": [], + "source": [ + "milvus_client.create_collection(\n", + "    collection_name=collection_name,\n", + "    dimension=embedding_dim,\n", + "    metric_type=\"IP\",  # Inner product distance\n", + "    consistency_level=\"Strong\",  # Strong consistency level\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Insert data\n", + "Iterate through the text lines, create embeddings, and then insert the data into Milvus.\n", + "\n", + "Note the new field `text`, which is not defined in the collection schema. 
It will be automatically added to the reserved JSON dynamic field, which can be treated as a normal field at a high level." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "ExecuteTime": { + "end_time": "2025-01-02T07:38:11.018550Z", + "start_time": "2025-01-02T07:38:01.996847Z" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Creating embeddings:   0%|          | 0/72 [00:00<?, ?it/s]" + ] + } + ], + "source": [ + "from tqdm import tqdm\n", + "\n", + "data = []\n", + "\n", + "doc_embeddings = embedding_model.encode_documents(text_lines)\n", + "\n", + "for i, line in enumerate(tqdm(text_lines, desc=\"Creating embeddings\")):\n", + "    data.append({\"id\": i, \"vector\": doc_embeddings[i], \"text\": line})\n", + "\n", + "milvus_client.insert(collection_name=collection_name, data=data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Build RAG\n", + "\n", + "### Retrieve data for a query\n", + "\n", + "Let's specify a frequent question about Milvus." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "question = \"How is data stored in milvus?\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Search for the question in the collection and retrieve the semantic top-3 matches." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "search_res = milvus_client.search(\n", + "    collection_name=collection_name,\n", + "    data=embedding_model.encode_queries(\n", + "        [question]\n", + "    ),  # Convert the question to an embedding vector\n", + "    limit=3,  # Return top 3 results\n", + "    search_params={\"metric_type\": \"IP\", \"params\": {}},  # Inner product distance\n", + "    output_fields=[\"text\"],  # Return the text field\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's take a look at the search results of the query." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "retrieved_lines_with_distances = [\n", + "    (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0]\n", + "]\n", + "print(json.dumps(retrieved_lines_with_distances, indent=4))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Use LLM to get a RAG response\n", + "\n", + "Convert the retrieved documents into a string format." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "context = \"\\n\".join(\n", + "    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define system and user prompts for the Language Model. This prompt is assembled with the retrieved documents from Milvus." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "SYSTEM_PROMPT = \"\"\"\n", + "Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.\n", + "\"\"\"\n", + "USER_PROMPT = f\"\"\"\n", + "Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n", + "<context>\n", + "{context}\n", + "</context>\n", + "<question>\n", + "{question}\n", + "</question>\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the `deepseek-chat` model provided by DeepSeek to generate a response based on the prompts." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "ExecuteTime": { + "end_time": "2025-01-02T07:53:44.744085Z", + "start_time": "2025-01-02T07:53:40.101321Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "In Milvus, data is stored in two main categories: inserted data and metadata.\n", + "\n", + "1. **Inserted Data**: This includes vector data, scalar data, and collection-specific schema. The inserted data is stored in persistent storage as incremental logs. Milvus supports various object storage backends for this purpose, such as MinIO, AWS S3, Google Cloud Storage (GCS), Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage (COS).\n", + "\n", + "2. **Metadata**: Metadata is generated within Milvus and is specific to each Milvus module. This metadata is stored in etcd, a distributed key-value store.\n", + "\n", + "Additionally, when data is inserted, it is first loaded into a message queue, and Milvus returns success at this stage. The data is then written to persistent storage as incremental logs by the data node. 
If the `flush()` function is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n" + ] + } + ], + "source": [ + "response = deepseek_client.chat.completions.create(\n", + " model=\"deepseek-chat\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", + " {\"role\": \"user\", \"content\": USER_PROMPT},\n", + " ],\n", + ")\n", + "print(response.choices[0].message.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Great! We have successfully built a RAG pipeline with Milvus and DeepSeek." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}