This project demonstrates how to build an image similarity search application that uses Azure Cosmos DB for MongoDB vCore as a vector database and Azure AI Vision to generate embeddings. It is a starting point for developing more sophisticated vector search solutions.
This sample application explores image similarity search on Azure Cosmos DB for MongoDB vCore using the SemArt Dataset. The dataset contains approximately 21k paintings gathered from the Web Gallery of Art. Each painting comes with attributes such as a title, a description, and the name of the artist.
Before you start, ensure that you have the following prerequisites installed and configured:
- An Azure subscription: create an Azure free account or an Azure for Students account.
- An Azure AI Vision resource or a multi-service resource for Azure AI services. The standard tier is recommended because the free tier allows only 20 transactions per minute.
  The multi-modal embeddings APIs are available in the following regions: East US, France Central, Korea Central, North Europe, Southeast Asia, West Europe, West US.
- An Azure Storage account: create one using the Azure CLI.
- An Azure Cosmos DB for MongoDB vCore cluster: create one in the Azure portal.
- Python 3.10, Visual Studio Code, Jupyter Notebook, and the Jupyter Extension for Visual Studio Code.
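
If you stay on the free tier, the 20 transactions per minute limit means requests have to be spaced out while embeddings are generated. The following is a minimal sketch of one way to do that, not code from this repository: the `MinIntervalThrottle` class and its parameters are hypothetical names introduced for illustration, and it simply enforces a minimum interval between calls.

```python
import time


class MinIntervalThrottle:
    """Spaces calls so that at most `max_per_minute` happen per minute.

    Hypothetical helper for illustration; not part of this repository.
    """

    def __init__(self, max_per_minute=20, clock=time.monotonic, sleep=time.sleep):
        # 20 calls per minute means at least 3 seconds between calls.
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the next call is allowed, then record the call time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `wait()` on a shared instance immediately before each request keeps the request rate at or below the configured limit.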
Before running the Python scripts and Jupyter Notebooks, you should:
- Clone this repository to have it available locally.
- Download the SemArt Dataset into the semart_dataset directory.
- Create a virtual environment and activate it.
- Install the required Python packages using the following command:
  `pip install -r requirements.txt`
- Generate a .env file from the .env.sample file provided in this repository.
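
To check that the .env file is readable before running the notebooks, a small loader can help. The sketch below uses only the standard library and assumes the file holds simple `KEY=VALUE` lines. The helpers `load_env_file` and `require` are hypothetical, and the real variable names are those listed in .env.sample; the project itself may load the file differently, for example with a library such as python-dotenv.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require(values, name):
    """Return a setting from the file or the process environment, or fail loudly."""
    value = values.get(name) or os.environ.get(name)
    if not value:
        raise KeyError(f"Missing required setting: {name}")
    return value
```

Failing early on a missing setting is usually easier to debug than an authentication error raised later by one of the Azure services.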
Sample | Description |
---|---|
Data Preprocessing | Cleans up the SemArt Dataset and creates the final dataset that is utilized in our application. |
Embeddings Generation | Generates vector embeddings for the images in the dataset using the Azure AI Vision Vectorize Image API and creates the final dataset that is utilized in the image search application. |
Upload images to Azure Blob Storage | Creates an Azure Blob Storage container and uploads the paintings' images. |
Insert data to Azure Cosmos DB for MongoDB vCore | Creates a collection in the Azure Cosmos DB for MongoDB vCore cluster and populates it with data from the dataset. |
Exact nearest neighbor search | Demonstrates text-to-image and image-to-image search approaches, along with a simple method for metadata filtering. |
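
To illustrate the idea behind the last sample, here is a minimal in-memory sketch of exact nearest neighbor search with a metadata filter. It is not the project's implementation: the field names (`title`, `artist`, `vector`), the toy two-dimensional vectors, and the choice of cosine similarity are assumptions made for illustration. Text-to-image and image-to-image search differ only in how the query vector is produced, since a multi-modal model embeds text and images in the same vector space.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_search(query_vector, documents, k=3, artist=None):
    """Score every document against the query and return the top k.

    If `artist` is given, only documents by that artist are considered
    (a simple metadata filter applied before scoring).
    """
    candidates = (
        doc for doc in documents
        if artist is None or doc["artist"] == artist
    )
    scored = [
        (cosine_similarity(query_vector, doc["vector"]), doc)
        for doc in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data with made-up vectors; real embeddings have many more dimensions.
paintings = [
    {"title": "A", "artist": "X", "vector": [1.0, 0.0]},
    {"title": "B", "artist": "Y", "vector": [0.0, 1.0]},
    {"title": "C", "artist": "X", "vector": [1.0, 1.0]},
]
top_two = exact_search([1.0, 0.0], paintings, k=2)
```

Because every stored vector is compared with the query, the result is exact, but the cost grows linearly with the number of images.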
Feel free to experiment with the project and modify the code to meet your specific use cases and requirements!