Update azure-cosmos-db to reflect postgresql support (#281)
* Update azure-cosmos-db to reflect postgresql support

* Update azure-cosmos-db.md

* Update azure-cosmos-db.md

* Add links to the retriever components for full pipeline examples
bilgeyucel authored Oct 22, 2024
1 parent 13a867b commit f91f178
Showing 1 changed file (integrations/azure-cosmos-db.md) with 29 additions and 69 deletions.
authors:
github: deepset-ai
twitter: deepset_ai
linkedin: https://www.linkedin.com/company/deepset-ai/
type: Document Store
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/azure-cosmos-db.png
version: Haystack 2.0

- [Overview](#overview)
- [Installation](#installation)
- [Usage (MongoDB)](#usage-mongodb)
- [Usage (PostgreSQL)](#usage-postgresql)

## Overview

[Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/introduction) is a fully managed NoSQL, relational, and vector database for modern app development. It offers single-digit millisecond response times, automatic and instant scalability, and guaranteed speed at any scale. It is the database that ChatGPT relies on to dynamically scale with high reliability and low maintenance. Haystack supports **MongoDB** and **PostgreSQL** clusters running on Azure Cosmos DB.

[Azure Cosmos DB for MongoDB](https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/introduction) makes it easy to use Azure Cosmos DB as if it were a MongoDB database. You can use your existing MongoDB skills and continue to use your favorite MongoDB drivers, SDKs, and tools by pointing your application to the connection string for your account using the API for MongoDB. Learn more in the [Azure Cosmos DB for MongoDB documentation](https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/).

[Azure Cosmos DB for PostgreSQL](https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql/introduction) is a managed service for PostgreSQL extended with the Citus open-source extension, which adds distributed tables for building highly scalable relational apps. You can start building apps on a single-node cluster, as you would with PostgreSQL. As your app's scalability and performance requirements grow, you can seamlessly scale to multiple nodes by transparently distributing your tables. Learn more in the [Azure Cosmos DB for PostgreSQL documentation](https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql/).

## Installation

It's possible to connect to your **MongoDB** cluster on Azure Cosmos DB through the `MongoDBAtlasDocumentStore`. For that, install the `mongodb-atlas-haystack` integration.
```bash
pip install mongodb-atlas-haystack
```

If you want to connect to your **PostgreSQL** cluster on Azure Cosmos DB, install the `pgvector-haystack` integration.
```bash
pip install pgvector-haystack
```

## Usage (MongoDB)

To use Azure Cosmos DB for MongoDB with `MongoDBAtlasDocumentStore`, you'll need to set up an Azure Cosmos DB for MongoDB vCore cluster through the Azure portal. For a step-by-step guide, refer to [Quickstart: Azure Cosmos DB for MongoDB vCore](https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/quickstart-portal).

Once your cluster is up, initialize the document store and write your documents to it. By default, `MongoDBAtlasDocumentStore` reads the connection string for your cluster from the `MONGO_CONNECTION_STRING` environment variable.

```python
from haystack import Document
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore

document_store = MongoDBAtlasDocumentStore(
    database_name="quickstartDB",  # your database name
    collection_name="sampleCollection",  # your collection name
    vector_search_index="haystack-test",  # your vector search index name
)

document_store.write_documents([Document(content="this is my first doc")])
```
Now, you can go ahead and build your Haystack pipeline using `MongoDBAtlasEmbeddingRetriever`. Check out the [MongoDBAtlasEmbeddingRetriever docs](https://docs.haystack.deepset.ai/docs/mongodbatlasembeddingretriever) for the full pipeline example.
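
If you'd like a quick sanity check before building the full pipeline, a minimal query pipeline might look like the sketch below. It reuses the `document_store` from above; the embedding model, component names, `top_k` value, and example question are illustrative assumptions, not requirements of the integration:

```python
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever

# Embed the query with the same model used to embed your documents,
# then fetch the closest matches from Azure Cosmos DB for MongoDB.
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model="intfloat/e5-base-v2"))
query_pipeline.add_component("retriever", MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=3))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = query_pipeline.run({"text_embedder": {"text": "Where does Mark live?"}})
print(result["retriever"]["documents"])
```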

## Usage (PostgreSQL)

To use Azure Cosmos DB for PostgreSQL with `PgvectorDocumentStore`, you'll need to set up a PostgreSQL cluster through the Azure portal. For a step-by-step guide, refer to [Quickstart: Azure Cosmos DB for PostgreSQL](https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql/quickstart-create-portal).

After setting up your cluster, configure the `PG_CONN_STR` environment variable using the connection string for your cluster. You can find the connection string by following the instructions [here](https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql/quickstart-connect-psql). The format should look like this:

```python
import os

os.environ['PG_CONN_STR'] = "host=c-<cluster>.<uniqueID>.postgres.cosmos.azure.com port=5432 dbname=citus user=citus password={your_password} sslmode=require"
```

Once this is done, you can initialize the [`PgvectorDocumentStore`](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore) in Haystack with the appropriate configuration.

```python
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

document_store = PgvectorDocumentStore(
    table_name="haystack_documents",
    embedding_dimension=1024,
    vector_function="cosine_similarity",
    search_strategy="hnsw",
    recreate_table=True,
)
```
Now, you can go ahead and build your Haystack pipeline using `PgvectorEmbeddingRetriever` and `PgvectorKeywordRetriever`. Check out the [PgvectorEmbeddingRetriever docs](https://docs.haystack.deepset.ai/docs/pgvectorembeddingretriever) for the full pipeline example.
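
As a rough sketch of embedding retrieval, assuming the `document_store` defined above and an embedding model whose output dimension matches `embedding_dimension=1024` (the model, component names, and example question below are illustrative choices):

```python
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# The embedder must produce vectors with the same dimension the document
# store was created with (1024 above); e5-large-v2 is one such model.
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model="intfloat/e5-large-v2"))
query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store, top_k=3))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = query_pipeline.run({"text_embedder": {"text": "Where does Mark live?"}})
print(result["retriever"]["documents"])
```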
