Update llm app readmes (#7536)
Co-authored-by: berkecanrizai <[email protected]>
GitOrigin-RevId: 697024be46cecfa892e9a40fddbb4ebbed1f6fcd
2 people authored and Manul from Pathway committed Oct 29, 2024
1 parent 18af9a6 commit c31390e
Showing 18 changed files with 483 additions and 203 deletions.
101 changes: 90 additions & 11 deletions examples/pipelines/adaptive-rag/README.md
@@ -32,27 +32,106 @@ We also set `strict_prompt=True`. This adjusts the prompt with additional instructions.

We encourage you to check the implementation of `answer_with_geometric_rag_strategy_from_index`.

## Customizing the pipeline

The code can be modified by changing the `app.yaml` configuration file. To read more about the YAML files used in Pathway templates, read [our guide](https://pathway.com/developers/user-guide/llm-xpack/yaml-templates).

In the `app.yaml` file we define:
- input connectors
- LLM
- embedder
- index
- host and port to run the app
- run options (caching, cache folder)

Any of these can be replaced or, if no longer needed, removed. For the list of available components, check the
Pathway [LLM xpack](https://pathway.com/developers/user-guide/llm-xpack/overview), or you can implement your own.
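For instance, the skeleton of `app.yaml` wires these pieces together along the following lines. This is a simplified sketch of this template's configuration, not the complete file: the index and server settings are omitted, and the optional fields shown (`format`, `with_metadata`) are illustrative defaults.

```yaml
# Documents to index
$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

# LLM used to answer questions
$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  cache_strategy: !pw.udfs.DiskCache

# Question answerer combining the LLM with the document index
question_answerer: !pw.xpacks.llm.question_answering.AdaptiveRAGQuestionAnswerer
  llm: $llm
```

Each `$name` entry is a reusable variable that later entries (such as `question_answerer`) can reference.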

You can also check our other templates: [demo-question-answering](https://github.com/pathwaycom/llm-app/tree/main/examples/pipelines/demo-question-answering),
[Multimodal RAG](https://github.com/pathwaycom/llm-app/tree/main/examples/pipelines/gpt_4o_multimodal_rag) or
[Private RAG](https://github.com/pathwaycom/llm-app/tree/main/examples/pipelines/private-rag). Since all of these differ
only in the YAML configuration file, you can also use them as inspiration for your custom pipeline.

Here are some examples of what can be modified.

### LLM Model

You can choose any of the GPT-3.5 Turbo, GPT-4, or GPT-4 Turbo models offered by OpenAI.
You can find the whole list on their [models page](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo).

You simply need to change the `model` to the one you want to use:
```yaml
$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0.05
  capacity: 8
```
The default model is `gpt-3.5-turbo`.

You can also use a different provider by using a different class from the [Pathway LLM xpack](https://pathway.com/developers/user-guide/llm-xpack/overview);
for example, here is the configuration for a locally run Mistral model served through Ollama.

```yaml
$llm: !pw.xpacks.llm.llms.LiteLLMChat
  model: "ollama/mistral"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0
  top_p: 1
  api_base: "http://localhost:11434"
```

### Webserver

You can configure the host and the port of the webserver.
Here is the default configuration:
```yaml
host: "0.0.0.0"
port: 8000
```

### Cache

You can configure whether you want to enable the cache, to avoid repeated API calls, and where the cache is stored.
Default values:
```yaml
with_cache: True
cache_backend: !pw.persistence.Backend.filesystem
  path: ".Cache"
```

### Data sources

You can configure the data sources by changing `$sources` in `app.yaml`.
You can add as many data sources as you want. You can have several sources of the same kind, for instance, several local sources from different folders.
The sections below describe how to configure the local, Google Drive, and SharePoint sources, but you can use any input [connector](https://pathway.com/developers/user-guide/connecting-to-data/connectors) from the Pathway package.

By default, the app uses a local data source to read documents from the `data` folder.

#### Local Data Source

The local data source is configured using a map with the tag `!pw.io.fs.read`. Set `path` to point to the folder with the files to be indexed.
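For example, a minimal local source entry in `$sources` might look like this (the `format` and `with_metadata` fields shown are illustrative defaults, not the only options):

```yaml
$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true
```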

#### Google Drive Data Source

The Google Drive data source is enabled using a map with the tag `!pw.io.gdrive.read`. The map must contain two main parameters:
- `object_id`, containing the ID of the folder that needs to be indexed. It can be found from the URL in the web interface, where it's the last part of the address. For example, the publicly available demo folder in Google Drive has the URL `https://drive.google.com/drive/folders/1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs`. Consequently, the last part of this address is `1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs`, hence this is the `object_id` you would need to specify.
- `service_user_credentials_file`, containing the path to the credentials file for the Google [service account](https://cloud.google.com/iam/docs/service-account-overview). For more details on setting up the service account and getting credentials, you can refer to [this tutorial](https://pathway.com/developers/user-guide/connectors/gdrive-connector/#setting-up-google-drive).

In addition, to speed up the indexing process you may want to specify the `refresh_interval` parameter (an integer number of seconds). It sets the interval between two consecutive folder scans. If unset, it defaults to 30 seconds.

For the full list of the available parameters, please refer to the Google Drive connector [documentation](https://pathway.com/developers/api-docs/pathway-io/gdrive#pathway.io.gdrive.read).
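Putting it together, a Google Drive entry in `$sources` could look like this (the credentials path is a placeholder you need to adjust to your setup):

```yaml
$sources:
  - !pw.io.gdrive.read
    object_id: "1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs"
    service_user_credentials_file: "gdrive_credentials.json"
    refresh_interval: 30
```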

#### SharePoint Data Source

This data source requires a Scale or Enterprise [license key](https://pathway.com/pricing) - you can obtain a free Scale key on the [Pathway website](https://pathway.com/get-license).

To use it, set the map tag to `!pw.xpacks.connectors.sharepoint.read`, and provide values for `url`, `tenant`, `client_id`, `cert_path`, `thumbprint` and `root_path`. To learn the meaning of these arguments, check the SharePoint connector [documentation](https://pathway.com/developers/api-docs/pathway-xpacks-sharepoint/#pathway.xpacks.connectors.sharepoint.read).
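A sketch of such an entry follows; every value is a placeholder to be replaced with your own tenant's data:

```yaml
$sources:
  - !pw.xpacks.connectors.sharepoint.read
    url: "https://company.sharepoint.com/sites/MySite"
    tenant: "<TENANT_ID>"
    client_id: "<CLIENT_ID>"
    cert_path: "certificate.pem"
    thumbprint: "<THUMBPRINT>"
    root_path: "Shared Documents/data"
```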

## Running the app
Depending on the configuration, you may need to set environment variables with LLM provider keys. By default, this template uses the OpenAI API, so you need to set the `OPENAI_API_KEY` environment variable or create an `.env` file in this directory with your key: `OPENAI_API_KEY=sk-...`. If you modify the code to use another LLM provider, you may need to set a relevant API key.

### With Docker
In order to let the pipeline get updated with each change in local files, you need to mount the folder into the Docker container. The following commands show how to do that.
7 changes: 6 additions & 1 deletion examples/pipelines/adaptive-rag/app.py
@@ -26,12 +26,17 @@ class App(BaseModel):
    port: int = 8000

    with_cache: bool = True
    cache_backend: InstanceOf[pw.persistence.Backend] = (
        pw.persistence.Backend.filesystem("./Cache")
    )
    terminate_on_error: bool = False

    def run(self) -> None:
        server = QASummaryRestServer(self.host, self.port, self.question_answerer)
        server.run(
            with_cache=self.with_cache,
            cache_backend=self.cache_backend,
            terminate_on_error=self.terminate_on_error,
        )

    model_config = ConfigDict(extra="forbid")
7 changes: 6 additions & 1 deletion examples/pipelines/adaptive-rag/app.yaml
@@ -63,9 +63,14 @@ question_answerer: !pw.xpacks.llm.question_answering.AdaptiveRAGQuestionAnswerer
strict_prompt: true


# Change host and port by uncommenting these lines
# host: "0.0.0.0"
# port: 8000

# Cache configuration
# with_cache: true
# cache_backend: !pw.persistence.Backend.filesystem
#   path: ".Cache"

# Set `terminate_on_error` to true if you want the program to terminate whenever any error is encountered
# terminate_on_error: false
85 changes: 50 additions & 35 deletions examples/pipelines/demo-document-indexing/README.md
@@ -39,69 +39,84 @@ Finally, the embeddings are indexed with the capabilities of Pathway's machine-learning library.

This folder contains several objects:
- `main.py`, the pipeline code using Pathway and written in Python;
- `app.yaml`, the file containing the configuration of the pipeline, such as the embedding model, sources, or the server address;
- `requirements.txt`, the text file listing the pip dependencies for running this pipeline. It can be passed to `pip install -r ...` to install everything needed to launch the pipeline locally;
- `Dockerfile`, the Docker configuration for running the pipeline in the container;
- `docker-compose.yml`, the docker-compose configuration for running the pipeline along with the chat UI;
- `.env`, a short environment variables configuration file where the OpenAI key must be stored;
- `files-for-indexing/`, a folder with exemplary files that can be used for the test runs.

## Customizing the pipeline

The code can be modified by changing the `app.yaml` configuration file. To read more about the YAML files used in Pathway templates, read [our guide](https://pathway.com/developers/user-guide/llm-xpack/yaml-templates).

In the `app.yaml` file we define:
- input connectors
- embedder
- index

Any of these can be replaced or, if no longer needed, removed. For the list of available components, check the
Pathway [LLM xpack](https://pathway.com/developers/user-guide/llm-xpack/overview), or you can implement your own.

Here are some examples of what can be modified.

### Embedding Model

By default this template uses the locally run model `mixedbread-ai/mxbai-embed-large-v1`. If you wish, you can replace it with any other model by changing
`$embedder` in `app.yaml`. For example, to use the OpenAI embedder, set:
```yaml
$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DiskCache
```
If you choose a provider that requires an API key, remember to set the appropriate environment variables (you can also set them in the `.env` file).
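For reference, the default local embedder in this template is configured roughly like this, mirroring the `$embedding_model` and `$embedder` entries in `app.yaml` (the exact fields may differ slightly):

```yaml
$embedding_model: "mixedbread-ai/mxbai-embed-large-v1"

$embedder: !pw.xpacks.llm.embedders.SentenceTransformerEmbedder
  model: $embedding_model
```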

### Webserver

You can configure the host and the port of the webserver.
Here is the default configuration:
```yaml
host: "0.0.0.0"
port: 8000
```

### Cache

You can configure whether you want to enable the cache, to avoid repeated API calls, and where the cache is stored.
Default values:
```yaml
with_cache: True
cache_backend: !pw.persistence.Backend.filesystem
  path: ".Cache"
```

### Data sources

You can configure the data sources by changing `$sources` in `app.yaml`.
You can add as many data sources as you want. You can have several sources of the same kind, for instance, several local sources reading from different folders.
The sections below describe how to configure the local, Google Drive, and SharePoint sources, but you can use any input [connector](https://pathway.com/developers/user-guide/connecting-to-data/connectors) from the Pathway package.

By default, the app uses a local data source to read documents from the `data` folder.

#### Local Data Source

The local data source is configured using a map with the tag `!pw.io.fs.read`. Set `path` to point to the folder with the files to be indexed.
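For example, to index the `files-for-indexing/` folder shipped with this pipeline (the `format` and `with_metadata` values are illustrative defaults):

```yaml
$sources:
  - !pw.io.fs.read
    path: files-for-indexing
    format: binary
    with_metadata: true
```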

#### Google Drive Data Source

The Google Drive data source is enabled using a map with the tag `!pw.io.gdrive.read`. The map must contain two main parameters:
- `object_id`, containing the ID of the folder that needs to be indexed. It can be found from the URL in the web interface, where it's the last part of the address. For example, the publicly available demo folder in Google Drive has the URL `https://drive.google.com/drive/folders/1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs`. Consequently, the last part of this address is `1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs`, hence this is the `object_id` you would need to specify.
- `service_user_credentials_file`, containing the path to the credentials file for the Google [service account](https://cloud.google.com/iam/docs/service-account-overview). For more details on setting up the service account and getting credentials, you can refer to [this tutorial](https://pathway.com/developers/user-guide/connectors/gdrive-connector/#setting-up-google-drive).

In addition, to speed up the indexing process you may want to specify the `refresh_interval` parameter (an integer number of seconds). It sets the interval between two consecutive folder scans. If unset, it defaults to 30 seconds.

For the full list of the available parameters, please refer to the Google Drive connector [documentation](https://pathway.com/developers/api-docs/pathway-io/gdrive#pathway.io.gdrive.read).
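As a concrete sketch, a Google Drive entry in `$sources` could look like this (replace the credentials path with the one for your service account):

```yaml
$sources:
  - !pw.io.gdrive.read
    object_id: "1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs"
    service_user_credentials_file: "gdrive_credentials.json"
    refresh_interval: 30
```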

#### SharePoint Data Source

This data source requires a Scale or Enterprise [license key](https://pathway.com/pricing) - you can obtain a free Scale key on the [Pathway website](https://pathway.com/get-license).

To use it, set the map tag to `!pw.xpacks.connectors.sharepoint.read`, and provide values for `url`, `tenant`, `client_id`, `cert_path`, `thumbprint` and `root_path`. To learn the meaning of these arguments, check the SharePoint connector [documentation](https://pathway.com/developers/api-docs/pathway-xpacks-sharepoint/#pathway.xpacks.connectors.sharepoint.read).
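For illustration, such an entry has the following shape; all values below are placeholders standing in for your own tenant's data:

```yaml
$sources:
  - !pw.xpacks.connectors.sharepoint.read
    url: "https://company.sharepoint.com/sites/MySite"
    tenant: "<TENANT_ID>"
    client_id: "<CLIENT_ID>"
    cert_path: "certificate.pem"
    thumbprint: "<THUMBPRINT>"
    root_path: "Shared Documents/data"
```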

## Running the Example

7 changes: 6 additions & 1 deletion examples/pipelines/demo-document-indexing/app.py
@@ -26,12 +26,17 @@ class App(BaseModel):
    port: int = 8000

    with_cache: bool = True
    cache_backend: InstanceOf[pw.persistence.Backend] = (
        pw.persistence.Backend.filesystem("./Cache")
    )
    terminate_on_error: bool = False

    def run(self) -> None:
        server = DocumentStoreServer(self.host, self.port, self.document_store)
        server.run(
            with_cache=self.with_cache,
            cache_backend=self.cache_backend,
            terminate_on_error=self.terminate_on_error,
        )

    model_config = ConfigDict(extra="forbid")
13 changes: 5 additions & 8 deletions examples/pipelines/demo-document-indexing/app.yaml
@@ -24,14 +24,6 @@ $sources:
# with_metadata: true
# refresh_interval: 30

$embedding_model: "mixedbread-ai/mxbai-embed-large-v1"

$embedder: !pw.xpacks.llm.embedders.SentenceTransformerEmbedder
@@ -62,5 +54,10 @@ document_store: !pw.xpacks.llm.document_store.DocumentStore
# host: "0.0.0.0"
# port: 8000

# Cache configuration
# with_cache: true
# cache_backend: !pw.persistence.Backend.filesystem
#   path: ".Cache"

# Set `terminate_on_error` to true if you want the program to terminate whenever any error is encountered
# terminate_on_error: false