Merge pull request milvus-io#1437 from wxywb/master
Add ColPALI and HDBSCAN notebook
zc277584121 authored Oct 16, 2024
2 parents 5386218 + 836492d commit bd9c0f4
Showing 3 changed files with 816 additions and 0 deletions.
340 changes: 340 additions & 0 deletions bootcamp/tutorials/quickstart/hdbscan_clustering_with_milvus.ipynb
@@ -0,0 +1,340 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/milvus-io/bootcamp/blob/master/bootcamp/tutorials/quickstart/hdbscan_clustering_with_milvus.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a> <a href=\"bootcamp/bootcamp/tutorials/quickstart/hdbscan_clustering_with_milvus.ipynb at master · milvus-io/bootcamp\" target=\"_blank\">\n",
" <img src=\"https://img.shields.io/badge/View%20on%20GitHub-555555?style=flat&logo=github&logoColor=white\" alt=\"GitHub Repository\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HDBSCAN Clustering with Milvus\n",
"Data can be transformed into embeddings using deep learning models, which capture meaningful representations of the original data. By applying an unsupervised clustering algorithm, we can group similar data points together based on their inherent patterns. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a widely used clustering algorithm that efficiently groups data points by analyzing their density and distance. It is particularly useful for discovering clusters of varying shapes and sizes. In this notebook, we will use HDBSCAN with Milvus, a high-performance vector database, to cluster data points into distinct groups based on their embeddings.\n",
"\n",
"HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that relies on calculating distances between data points in embedding space. These embeddings, created by deep learning models, represent data in a high-dimensional form. To group similar data points, HDBSCAN determines their proximity and density, but efficiently computing these distances, especially for large datasets, can be challenging.\n",
"\n",
"Milvus, a high-performance vector database, optimizes this process by storing and indexing embeddings, allowing for fast retrieval of similar vectors. When used together, HDBSCAN and Milvus enable efficient clustering of large-scale datasets in embedding space.\n",
"\n",
"In this notebook, we will use the BGE-M3 embedding model to extract embeddings from a news headline dataset, utilize Milvus to efficiently calculate distances between embeddings to aid HDBSCAN in clustering, and then visualize the results for analysis using the UMAP method. This notebook is a Milvus adapation of [Dylan Castillo's article](https://dylancastillo.co/posts/clustering-documents-with-openai-langchain-hdbscan.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n",
"download news dataset from https://www.kaggle.com/datasets/dylanjcastillo/news-headlines-2024/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"pymilvus[model]\"\n",
"!pip install hdbscan\n",
"!pip install plotly\n",
"!pip install umap-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Data\n",
"Download news dataset from https://www.kaggle.com/datasets/dylanjcastillo/news-headlines-2024/, extract `news_data_dedup.csv` and put it into current directory.\n",
"\n",
"## Extract Embeddings to Milvus\n",
"We will create a collection using Milvus, and extract dense embeddings using BGE-M3 model."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-10-16 11:03:00.418817: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
"2024-10-16 11:03:00.432515: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2024-10-16 11:03:00.446648: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2024-10-16 11:03:00.450824: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2024-10-16 11:03:00.462393: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
"To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
"2024-10-16 11:03:01.128046: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7cf59571682e432aa4c9e8b1b102012b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 30 files: 0%| | 0/30 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Inference Embeddings: 100%|██████████| 55/55 [00:12<00:00, 4.47it/s]\n"
]
}
],
"source": [
"import pandas as pd\n",
"from dotenv import load_dotenv\n",
"from pymilvus.model.hybrid import BGEM3EmbeddingFunction\n",
"from pymilvus import FieldSchema, Collection, connections, CollectionSchema, DataType\n",
"\n",
"load_dotenv()\n",
"\n",
"df = pd.read_csv(\"news_data_dedup.csv\")\n",
"\n",
"\n",
"docs = [\n",
" f\"{title}\\n{description}\" for title, description in zip(df.title, df.description)\n",
"]\n",
"ef = BGEM3EmbeddingFunction()\n",
"\n",
"embeddings = ef(docs)[\"dense\"]\n",
"\n",
"connections.connect(uri=\"milvus.db\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> - If you only need a local vector database for small scale data or prototyping, setting the uri as a local file, e.g.`./milvus.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n",
"> - If you have large scale of data, say more than a million vectors, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, please use the server address and port as your uri, e.g.`http://localhost:19530`. If you enable the authentication feature on Milvus, use \"<your_username>:<your_password>\" as the token, otherwise don't set the token.\n",
"> - If you use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#cluster-details) in Zilliz Cloud."
]
},
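{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a minimal sketch of the alternative connection options mentioned in the note above. The server address, username, password, endpoint, and API key are placeholders rather than values used in this tutorial, so the lines are left commented out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Milvus server deployed with Docker or Kubernetes (placeholder address):\n",
"# connections.connect(uri=\"http://localhost:19530\")\n",
"\n",
"# The same server with authentication enabled (placeholder credentials):\n",
"# connections.connect(uri=\"http://localhost:19530\", token=\"<your_username>:<your_password>\")\n",
"\n",
"# Zilliz Cloud (placeholder Public Endpoint and API key):\n",
"# connections.connect(uri=\"<your_public_endpoint>\", token=\"<your_api_key>\")"
]
},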
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fields = [\n",
" FieldSchema(\n",
" name=\"id\", dtype=DataType.INT64, is_primary=True, auto_id=True\n",
" ), # Primary ID field\n",
" FieldSchema(\n",
" name=\"embedding\", dtype=DataType.FLOAT_VECTOR, dim=1024\n",
" ), # Float vector field (embedding)\n",
" FieldSchema(\n",
" name=\"text\", dtype=DataType.VARCHAR, max_length=65535\n",
" ), # Float vector field (embedding)\n",
"]\n",
"\n",
"schema = CollectionSchema(fields=fields, description=\"Embedding collection\")\n",
"\n",
"collection = Collection(name=\"news_data\", schema=schema)\n",
"\n",
"for doc, embedding in zip(docs, embeddings):\n",
" collection.insert({\"text\": doc, \"embedding\": embedding})\n",
" print(doc)\n",
"\n",
"index_params = {\"index_type\": \"FLAT\", \"metric_type\": \"L2\", \"params\": {}}\n",
"\n",
"collection.create_index(field_name=\"embedding\", index_params=index_params)\n",
"\n",
"collection.flush()"
]
},
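{
"cell_type": "markdown",
"metadata": {},
"source": [
"The FLAT index created above performs an exact, brute-force search. As noted in the next section, Milvus also supports more advanced indexes for large-scale data; the commented-out cell below is only a sketch of one such option (HNSW), and the parameter values are illustrative rather than tuned for this dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hypothetical approximate index as an alternative to FLAT (drop the existing\n",
"# index first if you want to try it; not required for this small dataset).\n",
"# hnsw_index_params = {\n",
"#     \"index_type\": \"HNSW\",\n",
"#     \"metric_type\": \"L2\",\n",
"#     \"params\": {\"M\": 16, \"efConstruction\": 200},\n",
"# }\n",
"# collection.create_index(field_name=\"embedding\", index_params=hnsw_index_params)"
]
},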
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Construct the Distance Matrix for HDBSCAN\n",
"HDBSCAN requires calculating distances between points for clustering, which can be computationally intensive. Since distant points have less influence on clustering assignments, we can improve efficiency by calculating the top-k nearest neighbors. In this example, we use the FLAT index, but for large-scale datasets, Milvus supports more advanced indexing methods to accelerate the search process.\n",
"Firstly, we need to get a iterator to iterate the Milvus collection we previously created."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import hdbscan\n",
"import numpy as np\n",
"import pandas as pd\n",
"import plotly.express as px\n",
"from umap import UMAP\n",
"from pymilvus import Collection\n",
"\n",
"collection = Collection(name=\"news_data\")\n",
"collection.load()\n",
"\n",
"iterator = collection.query_iterator(\n",
" batch_size=10, expr=\"id > 0\", output_fields=[\"id\", \"embedding\"]\n",
")\n",
"\n",
"search_params = {\n",
" \"metric_type\": \"L2\",\n",
" \"params\": {\"nprobe\": 10},\n",
"} # L2 is Euclidean distance\n",
"\n",
"ids = []\n",
"dist = {}\n",
"\n",
"embeddings = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will iterate all embeddings in the Milvus collection. For each embedding, we will search its top-k neighbors in the same collection, get their ids and distances. Then we also need to build a dictionary to map original ID to a continuous index in the distance matrix. When finished, we need to create a distance matrix which initialized with all elements as infinity and fill the elements we searched. In this way, the distance between far away points will be ignored. Finally we use HDBSCAN library to cluster the points using the distance matrix we created. We need to set metric to 'precomputed' to indicate the data is distance matrix rather than origianl embeddings."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"while True:\n",
" batch = iterator.next()\n",
" batch_ids = [data[\"id\"] for data in batch]\n",
" ids.extend(batch_ids)\n",
"\n",
" query_vectors = [data[\"embedding\"] for data in batch]\n",
" embeddings.extend(query_vectors)\n",
"\n",
" results = collection.search(\n",
" data=query_vectors,\n",
" limit=50,\n",
" anns_field=\"embedding\",\n",
" param=search_params,\n",
" output_fields=[\"id\"],\n",
" )\n",
" for i, batch_id in enumerate(batch_ids):\n",
" dist[batch_id] = []\n",
" for result in results[i]:\n",
" dist[batch_id].append((result.id, result.distance))\n",
"\n",
" if len(batch) == 0:\n",
" break\n",
"\n",
"ids2index = {}\n",
"\n",
"for id in dist:\n",
" ids2index[id] = len(ids2index)\n",
"\n",
"dist_metric = np.full((len(ids), len(ids)), np.inf, dtype=np.float64)\n",
"\n",
"for id in dist:\n",
" for result in dist[id]:\n",
" dist_metric[ids2index[id]][ids2index[result[0]]] = result[1]\n",
"\n",
"h = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=3, metric=\"precomputed\")\n",
"hdb = h.fit(dist_metric)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After this, the HDBSCAN clustering is finished. We can get some data and show its cluster. Note some data will not be assigned to any cluster, which means they are noise, because they are located at some sparse region."
]
},
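{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the sketch below counts the clusters and the noise points from `hdb.labels_` (HDBSCAN labels noise points as `-1`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = hdb.labels_  # one label per embedding, in the same order as `ids`\n",
"n_clusters = len(set(labels)) - (1 if -1 in labels else 0)\n",
"n_noise = int(np.sum(labels == -1))\n",
"print(f\"clusters found: {n_clusters}, noise points: {n_noise}\")"
]
},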
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clusters Visualization using UMAP \n",
"We have already clustered the data using HDBSCAN and can get the labels for each data point. However using some visualization techniques, we can get the whole picture of the clusters for a intuitional analysis. Now we are going the use UMAP to visualize the clusters. UMAP is a efficient methodused for dimensionality reduction, preserving the structure of high-dimensional data while projecting it into a lower-dimensional space for visualization or further analysis. With it, we can visualize original high-dimensional data in 2D or 3D space, and see the clusters clearly.\n",
"Here again, we iterate the data points and get the id and text for original data, then we use ploty to plot the data points with these metainfo in a figure, and use different colors to represent different clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import plotly.io as pio\n",
"\n",
"pio.renderers.default = \"notebook\"\n",
"\n",
"umap = UMAP(n_components=2, random_state=42, n_neighbors=80, min_dist=0.1)\n",
"\n",
"df_umap = (\n",
" pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=[\"x\", \"y\"])\n",
" .assign(cluster=lambda df: hdb.labels_.astype(str))\n",
" .query('cluster != \"-1\"')\n",
" .sort_values(by=\"cluster\")\n",
")\n",
"iterator = collection.query_iterator(\n",
" batch_size=10, expr=\"id > 0\", output_fields=[\"id\", \"text\"]\n",
")\n",
"\n",
"ids = []\n",
"texts = []\n",
"\n",
"while True:\n",
" batch = iterator.next()\n",
" if len(batch) == 0:\n",
" break\n",
" batch_ids = [data[\"id\"] for data in batch]\n",
" batch_texts = [data[\"text\"] for data in batch]\n",
" ids.extend(batch_ids)\n",
" texts.extend(batch_texts)\n",
"\n",
"show_texts = [texts[i] for i in df_umap.index]\n",
"\n",
"df_umap[\"hover_text\"] = show_texts\n",
"fig = px.scatter(\n",
" df_umap, x=\"x\", y=\"y\", color=\"cluster\", hover_data={\"hover_text\": True}\n",
")\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![image](../../../images/hdbscan_clustering_with_milvus.png)\n",
"\n",
"Here, we demonstrate that the data is well clustered, and you can hover over the points to check the text they represent. With this notebook, we hope you learn how to use HDBSCAN to cluster embeddings with Milvus efficiently, which can also be applied to other types of data. Combined with large language models, this approach allows for deeper analysis of your data at a large scale."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}