Documentation updates #828

Merged
merged 1 commit into from
Jan 7, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -10,7 +10,9 @@ models/*
convert_models/converted_models
recipes/common/bin/*
*/.venv/
**/venv/**
training/cloud/examples
training/instructlab/instructlab
vector_dbs/milvus/volumes/milvus/*
.idea
**/volumes/**
21 changes: 12 additions & 9 deletions model_servers/llamacpp_python/README.md
@@ -19,8 +19,9 @@ The [base image](../llamacpp_python/base/Containerfile) is the standard image th
To build the base model service image:

```bash
make -f Makefile build
make build
```

To pull the base model service image:

```bash
@@ -33,8 +34,9 @@ podman pull quay.io/ai-lab/llamacpp_python
The [Cuda image](../llamacpp_python/cuda/Containerfile) includes all the extra drivers necessary to run our model server with Nvidia GPUs. This will significantly speed up the model's response time over CPU-only deployments.

To build the Cuda variant image:

```bash
make -f Makefile build-cuda
make build-cuda
```

To pull the base model service image:
@@ -48,6 +50,7 @@ podman pull quay.io/ai-lab/llamacpp_python_cuda
To run the Cuda image with GPU acceleration, you need to install the correct [Cuda drivers](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#driver-installation) for your system along with the [Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#). Please use the links provided to find installation instructions for your system.

Once those are installed you can use the container toolkit CLI to discover your Nvidia device(s).

```bash
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
@@ -63,8 +66,8 @@ To build the Vulkan model service variant image:

| System Architecture | Command |
|---|---|
| amd64 | make -f Makefile build-vulkan-amd64 |
| arm64 | make -f Makefile build-vulkan-arm64 |
| amd64 | make build-vulkan-amd64 |
| arm64 | make build-vulkan-arm64 |

To pull the base model service image:

@@ -73,7 +76,6 @@ podman pull quay.io/ai-lab/llamacpp_python_vulkan
```



## Download Model(s)

There are many models to choose from these days, most of which can be found on [huggingface.co](https://huggingface.co). In order to use a model with the llamacpp_python model server, it must be in GGUF format. You can either download pre-converted GGUF models directly or convert them yourself with the [model converter utility](../../convert_models/) available in this repo.
@@ -88,7 +90,7 @@ Place all models in the [models](../../models/) directory.
You can use this snippet below to download the default model:

```bash
make -f Makefile download-model-granite
make download-model-granite
```

Or you can use the generic `download-models` target from the `/models` directory to download any model file from huggingface:
@@ -107,7 +109,7 @@ make MODEL_NAME=<model_name> MODEL_URL=<model_url> -f Makefile download-model
To deploy the LLM server you must specify a volume mount `-v` where your models are stored on the host machine and the `MODEL_PATH` for your model of choice. The model_server is most easily deployed by calling the make command: `make -f Makefile run`. As with all our make calls, you can pass any number of the following variables: `REGISTRY`, `IMAGE_NAME`, `MODEL_NAME`, `MODEL_PATH`, and `PORT`.

```bash
podman run --rm -it \
podman run --rm -d \
-p 8001:8001 \
-v Local/path/to/locallm/models:/locallm/models:ro \
-e MODEL_PATH=models/granite-7b-lab-Q4_K_M.gguf \
@@ -120,7 +122,7 @@ podman run --rm -it \
or with Cuda image

```bash
podman run --rm -it \
podman run --rm -d \
--device nvidia.com/gpu=all \
-p 8001:8001 \
-v Local/path/to/locallm/models:/locallm/models:ro \
@@ -130,6 +132,7 @@ podman run --rm -it \
-e MODEL_CHAT_FORMAT=openchat \
llamacpp_python
```
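
Because the containers above are started detached (`-d`), it can help to wait until the server actually answers before sending requests. The following is a minimal sketch, not part of the repository: `wait_for_http` is a hypothetical helper name, and it assumes the server is published on `localhost:8001` as in the run commands above and exposes the OpenAI-style `/v1/models` route.

```bash
# Sketch only: wait_for_http is a hypothetical helper, not a Makefile target.
# wait_for_http URL [ATTEMPTS]: retry once per second until URL returns success.
wait_for_http() {
  url="$1"
  attempts="${2:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; the response body is discarded.
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  echo "no response from $url after $attempts attempts" >&2
  return 1
}

# Example usage against a running container (assumed port 8001):
# wait_for_http http://localhost:8001/v1/models 60 && curl -s http://localhost:8001/v1/models
```

Large models can take a while to load, so the attempt count may need to be raised for slower machines.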

### Multiple Model Service:

To enable dynamic loading and unloading of different models present on your machine, you can start the model service with a `CONFIG_PATH` instead of a `MODEL_PATH`.
@@ -159,7 +162,7 @@ Here is an example `models_config.json` with two model options.
Now run the container with the specified config file.

```bash
podman run --rm -it -d \
podman run --rm -d \
-p 8001:8001 \
-v Local/path/to/locallm/models:/locallm/models:ro \
-e CONFIG_PATH=models/<config-filename> \
102 changes: 83 additions & 19 deletions recipes/natural_language_processing/rag/README.md
@@ -13,24 +13,27 @@ Our AI Application will connect to our Model Service via it's OpenAI compatible

## Try the RAG chat application

_COMING SOON to AI LAB_
The [Podman Desktop](https://podman-desktop.io) [AI Lab Extension](https://github.com/containers/podman-desktop-extension-ai-lab) includes this recipe among others. To try it out, open `Recipes Catalog` -> `RAG Chatbot` and follow the instructions to start the application.

If you prefer building and running the application from terminal, please run the following commands from this directory.

First, build the application's metadata and run the generated Kubernetes YAML, which will spin up a Pod along with a number of containers:

```
cd recipes/natural_language_processing/rag
make quadlet
podman kube play build/rag.yaml
```

The Pod is named `rag`, so you may use [Podman](https://podman.io) to manage the Pod and its containers:

```
podman pod list
podman ps
```

To stop and remove the Pod, run:

```
podman pod stop rag
podman pod rm rag
@@ -59,12 +62,10 @@ The recommended model can be downloaded using the code snippet below:

```bash
cd ../../../models
curl -sLO https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf
make download-model-granite
cd ../recipes/natural_language_processing/rag
```

_A full list of supported open models is forthcoming._

In addition to the LLM, RAG applications also require an embedding model to convert documents between natural language and vector representations. For this demo we will use [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5); it is a fairly standard model for this use case and has an MIT license.

The code snippet below can be used to pull a copy of the `BAAI/bge-base-en-v1.5` embedding model and store it in your `models/` directory.
@@ -82,25 +83,39 @@ To deploy the Vector Database service locally, simply use the existing ChromaDB


#### ChromaDB

```bash
podman pull chromadb/chroma
```

```bash
podman run --rm -d --name chroma -p 8000:8000 chroma
```

Check that the chroma pod is running with

```bash
podman run --rm -it -p 8000:8000 chroma
podman ps
podman logs chroma
```

#### Milvus

```bash
podman pull milvusdb/milvus:master-20240426-bed6363f
cd recipes/natural_language_processing/rag/app
mkdir -p volumes/milvus
```

```bash
podman run -it \
podman run --rm -d \
--name milvus-standalone \
--security-opt seccomp:unconfined \
-e ETCD_USE_EMBED=true \
-e ETCD_CONFIG_PATH=/milvus/configs/embedEtcd.yaml \
-e COMMON_STORAGETYPE=local \
-v $(pwd)/volumes/milvus:/var/lib/milvus \
-v $(pwd)/embedEtcd.yaml:/milvus/configs/embedEtcd.yaml \
-v $(pwd)/milvus-embedEtcd.yaml:/milvus/configs/embedEtcd.yaml \
-p 19530:19530 \
-p 9091:9091 \
-p 2379:2379 \
@@ -112,28 +127,47 @@ podman run -it \
milvusdb/milvus:master-20240426-bed6363f \
milvus run standalone 1> /dev/null
```

Note: For running the Milvus instance, make sure you have the `$(pwd)/volumes/milvus` directory and `$(pwd)/embedEtcd.yaml` file as shown in this repository. These are required by the database for its operations.

Example contents of milvus-embedEtcd.yaml are shown below. See
[milvus-io/milvus/configs](https://github.com/milvus-io/milvus/blob/master/configs/advanced/etcd.yaml) for more details.

```yaml
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://0.0.0.0:2379
quota-backend-bytes: 4294967296
auto-compaction-mode: revision
auto-compaction-retention: '1000'
```
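
If the config file is not already present in the working directory, it can be created from the shell. This is a minimal sketch: `write_embed_etcd_config` is a hypothetical helper name, and the file contents are copied from the example above rather than from any tuned deployment.

```bash
# Sketch only: write_embed_etcd_config is a hypothetical helper.
# write_embed_etcd_config [PATH]: write the example etcd settings to PATH
# (defaults to milvus-embedEtcd.yaml in the current directory).
write_embed_etcd_config() {
  target="${1:-milvus-embedEtcd.yaml}"
  # The quoted 'EOF' keeps the shell from expanding anything in the body.
  cat > "$target" <<'EOF'
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://0.0.0.0:2379
quota-backend-bytes: 4294967296
auto-compaction-mode: revision
auto-compaction-retention: '1000'
EOF
}

# Example usage, run from the directory used for the podman run command:
# write_embed_etcd_config "$(pwd)/milvus-embedEtcd.yaml"
```

The file needs to exist before the container starts, since the `podman run` command mounts it into `/milvus/configs/embedEtcd.yaml`.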

Check that the milvus pod is running with

```bash
podman ps
podman logs milvus-standalone
```
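
Beyond reading the logs, Milvus can be probed over HTTP. The sketch below is an illustration under assumptions: `milvus_healthy` is a hypothetical helper name, and it assumes port `9091` is published as in the run command above and serves Milvus's `/healthz` route.

```bash
# Sketch only: milvus_healthy is a hypothetical helper.
# milvus_healthy [HOST] [PORT]: succeed if the Milvus health route answers.
milvus_healthy() {
  host="${1:-127.0.0.1}"
  port="${2:-9091}"
  # curl's exit status becomes the function's exit status.
  curl -fsS -o /dev/null "http://${host}:${port}/healthz"
}

# Example usage:
# milvus_healthy && echo "milvus is up"
```

A non-zero exit status shortly after startup usually just means the service is still initializing, so retrying after a few seconds is reasonable.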

### Build the Model Service

The complete instructions for building and deploying the Model Service can be found in the [llamacpp_python model-service document](../model_servers/llamacpp_python/README.md).
The complete instructions for building and deploying the Model Service can be found in the [llamacpp_python model-service document](../../../model_servers/llamacpp_python/README.md).

The Model Service can be built with the following code snippet:
The llamacpp_python Model Service can be built with the following code snippet:

```bash
cd model_servers/llamacpp_python
podman build -t llamacppserver -f ./base/Containerfile .
cd ../../../model_servers/llamacpp_python
make IMAGE=llamacppserver build
```


### Deploy the Model Service

The complete instructions for building and deploying the Model Service can be found in the [llamacpp_python model-service document](../model_servers/llamacpp_python/README.md).
The complete instructions for building and deploying the Model Service can be found in the [llamacpp_python model-service document](../../..//model_servers/llamacpp_python/README.md).

The local Model Service relies on a volume mount to the localhost to access the model files. You can start your local Model Service using the following Podman command:

```
podman run --rm -it \
podman run --rm -d --name model-server \
-p 8001:8001 \
-v Local/path/to/locallm/models:/locallm/models \
-e MODEL_PATH=models/<model-filename> \
@@ -142,6 +176,13 @@ podman run --rm -it \
llamacppserver
```

Check that the model-server pod is running with

```bash
podman ps
podman logs model-server
```

### Build the AI Application

Now that the Model Service is running we want to build and deploy our AI Application. Use the provided Containerfile to build the AI Application image in the `rag-langchain/` directory.
@@ -153,23 +194,46 @@ make APP_IMAGE=rag build

### Deploy the AI Application

Make sure the Model Service and the Vector Database are up and running before starting this container image. When starting the AI Application container image we need to direct it to the correct `MODEL_ENDPOINT`. This could be any appropriately hosted Model Service (running locally or in the cloud) using an OpenAI compatible API. In our case the Model Service is running inside the Podman machine so we need to provide it with the appropriate address `10.88.0.1`. The same goes for the Vector Database. Make sure the `VECTORDB_HOST` is correctly set to `10.88.0.1` for communication within the Podman virtual machine.
Make sure the Model Service and the Vector Database are up and running before starting this container image. When starting the AI Application container image we need to direct it to the correct `MODEL_ENDPOINT`. This could be any appropriately hosted Model Service (running locally or in the cloud) using an OpenAI compatible API.

There also needs to be a volume mount into the `models/` directory so that the application can access the embedding model as well as a volume mount into the `data/` directory where it can pull documents from to populate the Vector Database.
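
Since the application depends on the other containers being up, a quick preflight check can catch a missing service before the app starts. The following is a minimal sketch: `container_running` and `preflight` are hypothetical helper names, and the container names (`model-server`, `chroma`, `milvus-standalone`) are the ones used with `--name` earlier in this document.

```bash
# Sketch only: container_running and preflight are hypothetical helpers.
# container_running NAME: succeed if a running container has exactly this name.
container_running() {
  # -F: literal match, -x: whole line, -q: quiet.
  podman ps --format '{{.Names}}' | grep -Fqx -- "$1"
}

# preflight NAME...: report each container and fail if any is not running.
preflight() {
  missing=0
  for c in "$@"; do
    if container_running "$c"; then
      echo "ok: $c"
    else
      echo "not running: $c" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage with the Chroma setup (swap in milvus-standalone for Milvus):
# preflight model-server chroma
```

This only confirms that the containers are running, not that the services inside them have finished starting, so the `podman logs` checks shown earlier are still worth a look.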

The following Podman command can be used to run your AI Application:

#### With Chroma Vector Database

```bash
podman run --rm -it -p 8501:8501 \
-e MODEL_ENDPOINT=http://10.88.0.1:8001 \
-e VECTORDB_HOST=10.88.0.1 \
podman run --rm --name rag-inference -d -p 8501:8501 \
-e MODEL_ENDPOINT=http://127.0.0.1:8001 \
-e VECTORDB_HOST=127.0.0.1 \
-v Local/path/to/locallm/models/:/rag/models \
rag
rag
```

#### With Milvus Standalone Vector Database

```bash
podman run --rm -d --name rag-inference -p 8501:8501 \
-e MODEL_ENDPOINT=http://127.0.0.1:8001 \
-e VECTORDB_VENDOR=milvus \
-e VECTORDB_HOST=127.0.0.1 \
-e VECTORDB_PORT=19530 \
-v Local/path/to/locallm/models/:/rag/models \
rag
```

Check that the rag inference pod is running with

```bash
podman ps
podman logs rag-inference
```

### Interact with the AI Application

Everything should now be up and running with the rag application available at [`http://localhost:8501`](http://localhost:8501). By using this recipe and getting this starting point established, users should now have an easier time customizing and building their own LLM enabled RAG applications.
Everything should now be up and running with the rag application available at [`http://localhost:8501`](http://localhost:8501).
There is a [sample text file](./sample-data/fake_meeting.txt) that can be uploaded in the UI and used to test the RAG capability.
By using this recipe and getting this starting point established, users should now have an easier time customizing and building their own LLM enabled RAG applications.

### Embed the AI Application in a Bootable Container Image
