Update llm app readmes (#7536)
Co-authored-by: berkecanrizai <[email protected]>
GitOrigin-RevId: 697024be46cecfa892e9a40fddbb4ebbed1f6fcd
2 people authored and Manul from Pathway committed Oct 29, 2024
1 parent 18af9a6 commit c31390e
Showing 18 changed files with 483 additions and 203 deletions.
101 changes: 90 additions & 11 deletions examples/pipelines/adaptive-rag/README.md
@@ -32,27 +32,106 @@ We also set `strict_prompt=True`. This adjusts the prompt with additional instructions.

We encourage you to check the implementation of `answer_with_geometric_rag_strategy_from_index`.

## Customizing the pipeline

The code can be modified by changing the `app.yaml` configuration file. To read more about the YAML files used in Pathway templates, read [our guide](https://pathway.com/developers/user-guide/llm-xpack/yaml-templates).

In the `app.yaml` file we define:
- input connectors
- LLM
- embedder
- index
- host and port to run the app
- run options (caching, cache folder)

Any of these can be replaced or, if no longer needed, removed. For the list of available components, check the
Pathway [LLM xpack](https://pathway.com/developers/user-guide/llm-xpack/overview), or you can implement your own.
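For instance, the skeleton of `app.yaml` wires these pieces together along the following lines. This is a simplified sketch of this template's configuration, not the complete file: the index and server settings are omitted, and the optional fields shown (`format`, `with_metadata`) are illustrative defaults.

```yaml
# Documents to index
$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

# LLM used to answer questions
$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  cache_strategy: !pw.udfs.DiskCache

# Question answerer combining the LLM with the document index
question_answerer: !pw.xpacks.llm.question_answering.AdaptiveRAGQuestionAnswerer
  llm: $llm
```

Each `$name` entry is a reusable variable that later entries (such as `question_answerer`) can reference.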

You can also check our other templates: [demo-question-answering](https://github.com/pathwaycom/llm-app/tree/main/examples/pipelines/demo-question-answering),
[Multimodal RAG](https://github.com/pathwaycom/llm-app/tree/main/examples/pipelines/gpt_4o_multimodal_rag) or
[Private RAG](https://github.com/pathwaycom/llm-app/tree/main/examples/pipelines/private-rag). Since all of these differ
only in the YAML configuration file, you can also use them as inspiration for your custom pipeline.

Here are some examples of what can be modified.

### LLM Model

You can choose any of the GPT-3.5 Turbo, GPT-4, or GPT-4 Turbo models offered by OpenAI.
You can find the whole list on their [models page](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo).

You simply need to change the `model` to the one you want to use:
```yaml
$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0.05
  capacity: 8
```
The default model is `gpt-3.5-turbo`.

You can also use a different provider by using a different class from the [Pathway LLM xpack](https://pathway.com/developers/user-guide/llm-xpack/overview);
for example, here is the configuration for a locally run Mistral model served through Ollama.

```yaml
$llm: !pw.xpacks.llm.llms.LiteLLMChat
  model: "ollama/mistral"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0
  top_p: 1
  api_base: "http://localhost:11434"
```

### Webserver

You can configure the host and the port of the webserver.
Here is the default configuration:
```yaml
host: "0.0.0.0"
port: 8000
```

### Cache

You can configure whether you want to enable the cache, to avoid repeated API calls, and where the cache is stored.
Default values:
```yaml
with_cache: True
cache_backend: !pw.persistence.Backend.filesystem
  path: ".Cache"
```

### Data sources

You can configure the data sources by changing `$sources` in `app.yaml`.
You can add as many data sources as you want. You can have several sources of the same kind, for instance, several local sources from different folders.
The sections below describe how to configure the local, Google Drive, and SharePoint sources, but you can use any input [connector](https://pathway.com/developers/user-guide/connecting-to-data/connectors) from the Pathway package.

By default, the app uses a local data source to read documents from the `data` folder.

#### Local Data Source

The local data source is configured using a map with the tag `!pw.io.fs.read`. Set `path` to point to the folder with the files to be indexed.
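For example, a minimal local source entry in `$sources` might look like this (the `format` and `with_metadata` fields shown are illustrative defaults, not the only options):

```yaml
$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true
```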

#### Google Drive Data Source

The Google Drive data source is enabled using a map with the tag `!pw.io.gdrive.read`. The map must contain two main parameters:
- `object_id`, containing the ID of the folder that needs to be indexed. It can be found from the URL in the web interface, where it's the last part of the address. For example, the publicly available demo folder in Google Drive has the URL `https://drive.google.com/drive/folders/1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs`. Consequently, the last part of this address is `1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs`, hence this is the `object_id` you would need to specify.
- `service_user_credentials_file`, containing the path to the credentials file for the Google [service account](https://cloud.google.com/iam/docs/service-account-overview). For more details on setting up the service account and getting credentials, you can refer to [this tutorial](https://pathway.com/developers/user-guide/connectors/gdrive-connector/#setting-up-google-drive).

In addition, to speed up the indexing process you may want to specify the `refresh_interval` parameter (an integer number of seconds). It sets the interval between two consecutive folder scans. If unset, it defaults to 30 seconds.

For the full list of the available parameters, please refer to the Google Drive connector [documentation](https://pathway.com/developers/api-docs/pathway-io/gdrive#pathway.io.gdrive.read).
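Putting it together, a Google Drive entry in `$sources` could look like this (the credentials path is a placeholder you need to adjust to your setup):

```yaml
$sources:
  - !pw.io.gdrive.read
    object_id: "1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs"
    service_user_credentials_file: "gdrive_credentials.json"
    refresh_interval: 30
```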

#### SharePoint Data Source

This data source requires a Scale or Enterprise [license key](https://pathway.com/pricing) - you can obtain a free Scale key on the [Pathway website](https://pathway.com/get-license).

To use it, set the map tag to `!pw.xpacks.connectors.sharepoint.read`, and provide values for `url`, `tenant`, `client_id`, `cert_path`, `thumbprint` and `root_path`. To learn the meaning of these arguments, check the SharePoint connector [documentation](https://pathway.com/developers/api-docs/pathway-xpacks-sharepoint/#pathway.xpacks.connectors.sharepoint.read).
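A sketch of such an entry follows; every value is a placeholder to be replaced with your own tenant's data:

```yaml
$sources:
  - !pw.xpacks.connectors.sharepoint.read
    url: "https://company.sharepoint.com/sites/MySite"
    tenant: "<TENANT_ID>"
    client_id: "<CLIENT_ID>"
    cert_path: "certificate.pem"
    thumbprint: "<THUMBPRINT>"
    root_path: "Shared Documents/data"
```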

## Running the app
Depending on the configuration, you may need to set environment variables with LLM provider keys. By default, this template uses the OpenAI API, so you need to set the `OPENAI_API_KEY` environment variable or create an `.env` file in this directory with your key: `OPENAI_API_KEY=sk-...`. If you modify the code to use another LLM provider, you may need to set a relevant API key.

### With Docker
In order to let the pipeline get updated with each change in local files, you need to mount the folder into the Docker container. The following commands show how to do that.
7 changes: 6 additions & 1 deletion examples/pipelines/adaptive-rag/app.py
@@ -26,12 +26,17 @@ class App(BaseModel):
    port: int = 8000

    with_cache: bool = True
    cache_backend: InstanceOf[pw.persistence.Backend] = (
        pw.persistence.Backend.filesystem("./Cache")
    )
    terminate_on_error: bool = False

    def run(self) -> None:
        server = QASummaryRestServer(self.host, self.port, self.question_answerer)
        server.run(
            with_cache=self.with_cache,
            cache_backend=self.cache_backend,
            terminate_on_error=self.terminate_on_error,
        )

    model_config = ConfigDict(extra="forbid")
7 changes: 6 additions & 1 deletion examples/pipelines/adaptive-rag/app.yaml
@@ -63,9 +63,14 @@ question_answerer: !pw.xpacks.llm.question_answering.AdaptiveRAGQuestionAnswerer
strict_prompt: true


# Change host and port by uncommenting these lines
# host: "0.0.0.0"
# port: 8000

# Cache configuration
# with_cache: true
# cache_backend: !pw.persistence.Backend.filesystem
#   path: ".Cache"

# Set `terminate_on_error` to true if you want the program to terminate whenever any error is encountered
# terminate_on_error: false
85 changes: 50 additions & 35 deletions examples/pipelines/demo-document-indexing/README.md
@@ -39,69 +39,84 @@ Finally, the embeddings are indexed with the capabilities of Pathway's machine-learning library.

This folder contains several objects:
- `main.py`, the pipeline code using Pathway and written in Python;
- `app.yaml`, the file containing the configuration of the pipeline, such as the embedding model, sources, or the server address;
- `requirements.txt`, the text file listing the pip dependencies for running this pipeline. It can be passed to `pip install -r ...` to install everything needed to launch the pipeline locally;
- `Dockerfile`, the Docker configuration for running the pipeline in the container;
- `docker-compose.yml`, the docker-compose configuration for running the pipeline along with the chat UI;
- `.env`, a short environment variables configuration file where the OpenAI key must be stored;
- `files-for-indexing/`, a folder with exemplary files that can be used for the test runs.

## Customizing the pipeline

The code can be modified by changing the `app.yaml` configuration file. To read more about the YAML files used in Pathway templates, read [our guide](https://pathway.com/developers/user-guide/llm-xpack/yaml-templates).

In the `app.yaml` file we define:
- input connectors
- embedder
- index

Any of these can be replaced or, if no longer needed, removed. For the list of available components, check the
Pathway [LLM xpack](https://pathway.com/developers/user-guide/llm-xpack/overview), or you can implement your own.

Here are some examples of what can be modified.

### Embedding Model

By default this template uses the locally run model `mixedbread-ai/mxbai-embed-large-v1`. If you wish, you can replace it with any other model by changing
`$embedder` in `app.yaml`. For example, to use the OpenAI embedder, set:
```yaml
$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DiskCache
```
If you choose a provider that requires an API key, remember to set the appropriate environment variables (you can also set them in the `.env` file).
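For reference, the default local embedder in this template is configured roughly like this, mirroring the `$embedding_model` and `$embedder` entries in `app.yaml` (the exact fields may differ slightly):

```yaml
$embedding_model: "mixedbread-ai/mxbai-embed-large-v1"

$embedder: !pw.xpacks.llm.embedders.SentenceTransformerEmbedder
  model: $embedding_model
```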

### Webserver

You can configure the host and the port of the webserver.
Here is the default configuration:
```yaml
host: "0.0.0.0"
port: 8000
```

### Cache

You can configure whether you want to enable the cache, to avoid repeated API calls, and where the cache is stored.
Default values:
```yaml
with_cache: True
cache_backend: !pw.persistence.Backend.filesystem
  path: ".Cache"
```

### Data sources

You can configure the data sources by changing `$sources` in `app.yaml`.
You can add as many data sources as you want. You can have several sources of the same kind, for instance, several local sources reading from different folders.
The sections below describe how to configure the local, Google Drive, and SharePoint sources, but you can use any input [connector](https://pathway.com/developers/user-guide/connecting-to-data/connectors) from the Pathway package.

By default, the app uses a local data source to read documents from the `data` folder.

#### Local Data Source

The local data source is configured using a map with the tag `!pw.io.fs.read`. Set `path` to point to the folder with the files to be indexed.
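For example, to index the `files-for-indexing/` folder shipped with this pipeline (the `format` and `with_metadata` values are illustrative defaults):

```yaml
$sources:
  - !pw.io.fs.read
    path: files-for-indexing
    format: binary
    with_metadata: true
```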

#### Google Drive Data Source

The Google Drive data source is enabled using a map with the tag `!pw.io.gdrive.read`. The map must contain two main parameters:
- `object_id`, containing the ID of the folder that needs to be indexed. It can be found from the URL in the web interface, where it's the last part of the address. For example, the publicly available demo folder in Google Drive has the URL `https://drive.google.com/drive/folders/1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs`. Consequently, the last part of this address is `1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs`, hence this is the `object_id` you would need to specify.
- `service_user_credentials_file`, containing the path to the credentials file for the Google [service account](https://cloud.google.com/iam/docs/service-account-overview). For more details on setting up the service account and getting credentials, you can refer to [this tutorial](https://pathway.com/developers/user-guide/connectors/gdrive-connector/#setting-up-google-drive).

In addition, to speed up the indexing process you may want to specify the `refresh_interval` parameter (an integer number of seconds). It sets the interval between two consecutive folder scans. If unset, it defaults to 30 seconds.

For the full list of the available parameters, please refer to the Google Drive connector [documentation](https://pathway.com/developers/api-docs/pathway-io/gdrive#pathway.io.gdrive.read).
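As a concrete sketch, a Google Drive entry in `$sources` could look like this (replace the credentials path with the one for your service account):

```yaml
$sources:
  - !pw.io.gdrive.read
    object_id: "1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs"
    service_user_credentials_file: "gdrive_credentials.json"
    refresh_interval: 30
```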

#### SharePoint Data Source

This data source requires a Scale or Enterprise [license key](https://pathway.com/pricing) - you can obtain a free Scale key on the [Pathway website](https://pathway.com/get-license).

To use it, set the map tag to `!pw.xpacks.connectors.sharepoint.read`, and provide values for `url`, `tenant`, `client_id`, `cert_path`, `thumbprint` and `root_path`. To learn the meaning of these arguments, check the SharePoint connector [documentation](https://pathway.com/developers/api-docs/pathway-xpacks-sharepoint/#pathway.xpacks.connectors.sharepoint.read).
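For illustration, such an entry has the following shape; all values below are placeholders standing in for your own tenant's data:

```yaml
$sources:
  - !pw.xpacks.connectors.sharepoint.read
    url: "https://company.sharepoint.com/sites/MySite"
    tenant: "<TENANT_ID>"
    client_id: "<CLIENT_ID>"
    cert_path: "certificate.pem"
    thumbprint: "<THUMBPRINT>"
    root_path: "Shared Documents/data"
```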

## Running the Example

7 changes: 6 additions & 1 deletion examples/pipelines/demo-document-indexing/app.py
@@ -26,12 +26,17 @@ class App(BaseModel):
    port: int = 8000

    with_cache: bool = True
    cache_backend: InstanceOf[pw.persistence.Backend] = (
        pw.persistence.Backend.filesystem("./Cache")
    )
    terminate_on_error: bool = False

    def run(self) -> None:
        server = DocumentStoreServer(self.host, self.port, self.document_store)
        server.run(
            with_cache=self.with_cache,
            cache_backend=self.cache_backend,
            terminate_on_error=self.terminate_on_error,
        )

    model_config = ConfigDict(extra="forbid")
13 changes: 5 additions & 8 deletions examples/pipelines/demo-document-indexing/app.yaml
@@ -24,14 +24,6 @@ $sources:
# with_metadata: true
# refresh_interval: 30

$embedding_model: "mixedbread-ai/mxbai-embed-large-v1"

$embedder: !pw.xpacks.llm.embedders.SentenceTransformerEmbedder
@@ -62,5 +54,10 @@ document_store: !pw.xpacks.llm.document_store.DocumentStore
# host: "0.0.0.0"
# port: 8000

# Cache configuration
# with_cache: true
# cache_backend: !pw.persistence.Backend.filesystem
#   path: ".Cache"

# Set `terminate_on_error` to true if you want the program to terminate whenever any error is encountered
# terminate_on_error: false