From 3d1ccf295aa9f34392a012f921b57b13dd7f1d34 Mon Sep 17 00:00:00 2001
From: sallyom
Date: Tue, 3 Dec 2024 14:46:12 -0500
Subject: [PATCH] Documentation updates & .gitignore add

Signed-off-by: sallyom
---
 .gitignore                                    |   2 +
 model_servers/llamacpp_python/README.md       |  21 ++--
 .../natural_language_processing/rag/README.md | 102 ++++++++++++++----
 3 files changed, 97 insertions(+), 28 deletions(-)

diff --git a/.gitignore b/.gitignore
index 920bc902..0fdc1e3b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,7 +10,9 @@ models/*
convert_models/converted_models
recipes/common/bin/*
*/.venv/
+**/venv/**
training/cloud/examples
training/instructlab/instructlab
vector_dbs/milvus/volumes/milvus/*
.idea
+**/volumes/**
diff --git a/model_servers/llamacpp_python/README.md b/model_servers/llamacpp_python/README.md
index 72b98ad0..4e83ee29 100644
--- a/model_servers/llamacpp_python/README.md
+++ b/model_servers/llamacpp_python/README.md
@@ -19,8 +19,9 @@ The [base image](../llamacpp_python/base/Containerfile) is the standard image th
To build the base model service image:

```bash
-make -f Makefile build
+make build
```
+
To pull the base model service image:

```bash
podman pull quay.io/ai-lab/llamacpp_python
```

@@ -33,8 +34,9 @@
### Cuda image

The [Cuda image](../llamacpp_python/cuda/Containerfile) includes all the extra drivers necessary to run our model server with Nvidia GPUs. This will significantly speed up the model's response time over CPU-only deployments.

To build the Cuda variant image:
+
```bash
-make -f Makefile build-cuda
+make build-cuda
```

To pull the Cuda variant model service image:

```bash
podman pull quay.io/ai-lab/llamacpp_python_cuda
```

**IMPORTANT!**

@@ -48,6 +50,7 @@
To run the Cuda image with GPU acceleration, you need to install the correct [Cuda drivers](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#driver-installation) for your system along with the [Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#). Please use the links provided to find installation instructions for your system. Once those are installed, you can use the container toolkit CLI to discover your Nvidia device(s).
+
```bash
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```

@@ -63,8 +66,8 @@
To build the Vulkan model service variant image:

| System Architecture | Command |
|---|---|
-| amd64 | make -f Makefile build-vulkan-amd64 |
-| arm64 | make -f Makefile build-vulkan-arm64 |
+| amd64 | make build-vulkan-amd64 |
+| arm64 | make build-vulkan-arm64 |

@@ -73,7 +76,6 @@
To pull the Vulkan variant model service image:

```bash
podman pull quay.io/ai-lab/llamacpp_python_vulkan
```

-
## Download Model(s)

There are many models to choose from these days, most of which can be found on [huggingface.co](https://huggingface.co). In order to use a model with the llamacpp_python model server, it must be in GGUF format. You can either download pre-converted GGUF models directly or convert them yourself with the [model converter utility](../../convert_models/) available in this repo.

@@ -88,7 +90,7 @@
Place all models in the [models](../../models/) directory.
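If you prefer to fetch a GGUF file by hand rather than through the Makefile targets shown below, a plain `curl` download into `models/` also works. This is purely illustrative; the granite URL is the same default model referenced elsewhere in this repository, and any other GGUF file can be substituted:

```bash
# Illustrative manual download of a pre-converted GGUF model into models/
# (the same default granite model the Makefile target below fetches).
# Run from the root of this repository.
cd models
curl -sLO https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf
cd -
```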
You can use this snippet below to download the default model:

```bash
-make -f Makefile download-model-granite
+make download-model-granite
```

Or you can use the generic `download-model` target from the `/models` directory to download any model file from huggingface:

@@ -107,7 +109,7 @@ make MODEL_NAME= MODEL_URL= -f Makefile download-model

To deploy the LLM server, you must specify a volume mount `-v` to where your models are stored on the host machine and the `MODEL_PATH` for your model of choice. The model_server is most easily deployed by calling the make command: `make -f Makefile run`. As with all our make commands, you can pass any number of the following variables: `REGISTRY`, `IMAGE_NAME`, `MODEL_NAME`, `MODEL_PATH`, and `PORT`.

```bash
-podman run --rm -it \
+podman run --rm -d \
 -p 8001:8001 \
 -v Local/path/to/locallm/models:/locallm/models:ro \
 -e MODEL_PATH=models/granite-7b-lab-Q4_K_M.gguf \
@@ -120,7 +122,7 @@ podman run --rm -it \

or with the Cuda image:

```bash
-podman run --rm -it \
+podman run --rm -d \
 --device nvidia.com/gpu=all \
 -p 8001:8001 \
 -v Local/path/to/locallm/models:/locallm/models:ro \
@@ -130,6 +132,7 @@ podman run --rm -it \
 -e MODEL_CHAT_FORMAT=openchat \
 llamacpp_python
```
+
### Multiple Model Service:

To enable dynamic loading and unloading of different models present on your machine, you can start the model service with a `CONFIG_PATH` instead of a `MODEL_PATH`.

Here is an example `models_config.json` with two model options.

@@ -159,7 +162,7 @@
Now run the container with the specified config file.

```bash
-podman run --rm -it -d \
+podman run --rm -d \
 -p 8001:8001 \
 -v Local/path/to/locallm/models:/locallm/models:ro \
 -e CONFIG_PATH=models/ \
diff --git a/recipes/natural_language_processing/rag/README.md b/recipes/natural_language_processing/rag/README.md
index 3ed84e34..cae472d1 100644
--- a/recipes/natural_language_processing/rag/README.md
+++ b/recipes/natural_language_processing/rag/README.md
@@ -13,24 +13,27 @@
Our AI Application will connect to our Model Service via its OpenAI-compatible API.

## Try the RAG chat application

-_COMING SOON to AI LAB_
The [Podman Desktop](https://podman-desktop.io) [AI Lab Extension](https://github.com/containers/podman-desktop-extension-ai-lab) includes this recipe among others. To try it out, open `Recipes Catalog` -> `RAG Chatbot` and follow the instructions to start the application.

If you prefer building and running the application from the terminal, please run the following commands from this directory.

First, build the application's metadata and run the generated Kubernetes YAML, which will spin up a Pod along with a number of containers:
+
```
+cd recipes/natural_language_processing/rag
make quadlet
podman kube play build/rag.yaml
```

The Pod is named `rag`, so you may use [Podman](https://podman.io) to manage the Pod and its containers:
+
```
podman pod list
podman ps
```

To stop and remove the Pod, run:
+
```
podman pod stop rag
podman pod rm rag

@@ -59,12 +62,10 @@
The recommended model can be downloaded using the code snippet below:

```bash
cd ../../../models
-curl -sLO https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf
+make download-model-granite
cd ../recipes/natural_language_processing/rag
```

-_A full list of supported open models is forthcoming._
-
In addition to the LLM, RAG applications also require an embedding model to convert documents between natural language and vector representations.
For this demo we will use [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5), a fairly standard model for this use case with an MIT license. The code snippet below can be used to pull a copy of the `BAAI/bge-base-en-v1.5` embedding model and store it in your `models/` directory.

@@ -82,25 +83,39 @@
To deploy the Vector Database service locally, simply use the existing ChromaDB or Milvus image.

#### ChromaDB
+
```bash
podman pull chromadb/chroma
```
+
+```bash
+podman run --rm -d --name chroma -p 8000:8000 chromadb/chroma
+```
+
+Check that the chroma container is running with:
+
```bash
-podman run --rm -it -p 8000:8000 chroma
+podman ps
+podman logs chroma
```
+
#### Milvus
+
```bash
podman pull milvusdb/milvus:master-20240426-bed6363f
+cd recipes/natural_language_processing/rag/app
+mkdir -p volumes/milvus
```
+
```bash
-podman run -it \
+podman run --rm -d \
 --name milvus-standalone \
 --security-opt seccomp:unconfined \
 -e ETCD_USE_EMBED=true \
 -e ETCD_CONFIG_PATH=/milvus/configs/embedEtcd.yaml \
 -e COMMON_STORAGETYPE=local \
 -v $(pwd)/volumes/milvus:/var/lib/milvus \
- -v $(pwd)/embedEtcd.yaml:/milvus/configs/embedEtcd.yaml \
+ -v $(pwd)/milvus-embedEtcd.yaml:/milvus/configs/embedEtcd.yaml \
 -p 19530:19530 \
 -p 9091:9091 \
 -p 2379:2379 \
@@ -112,28 +127,47 @@ podman run -it \
 milvusdb/milvus:master-20240426-bed6363f \
 milvus run standalone 1> /dev/null
```
+
Note: For running the Milvus instance, make sure you have the `$(pwd)/volumes/milvus` directory and the `$(pwd)/milvus-embedEtcd.yaml` file as shown in this repository. These are required by the database for its operations.

+Example contents of `milvus-embedEtcd.yaml` are shown below. See
+[milvus-io/milvus/configs](https://github.com/milvus-io/milvus/blob/master/configs/advanced/etcd.yaml) for more details.
+
+```yaml
+listen-client-urls: http://0.0.0.0:2379
+advertise-client-urls: http://0.0.0.0:2379
+quota-backend-bytes: 4294967296
+auto-compaction-mode: revision
+auto-compaction-retention: '1000'
+```
+
+Check that the milvus-standalone container is running with:
+
+```bash
+podman ps
+podman logs milvus-standalone
+```

### Build the Model Service

-The complete instructions for building and deploying the Model Service can be found in the [the llamacpp_python model-service document](../model_servers/llamacpp_python/README.md).
+The complete instructions for building and deploying the Model Service can be found in [the llamacpp_python model-service document](../../../model_servers/llamacpp_python/README.md).

-The Model Service can be built with the following code snippet:
+The llamacpp_python Model Service can be built with the following code snippet:

```bash
-cd model_servers/llamacpp_python
-podman build -t llamacppserver -f ./base/Containerfile .
+cd ../../../model_servers/llamacpp_python
+make IMAGE=llamacppserver build
```

### Deploy the Model Service

-The complete instructions for building and deploying the Model Service can be found in the [the llamacpp_python model-service document](../model_servers/llamacpp_python/README.md).
+The complete instructions for building and deploying the Model Service can be found in [the llamacpp_python model-service document](../../../model_servers/llamacpp_python/README.md).

The local Model Service relies on a volume mount to the localhost to access the model files.
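Before launching the container, it can be worth confirming that the host directory you are about to mount actually contains the model files. The path below is the same placeholder used in the commands that follow; adjust it to wherever you stored your models:

```bash
# Illustrative sanity check of the host directory that will be volume-mounted
# into the model server container; it should contain the GGUF model downloaded
# earlier (and the embedding model, if you keep it in the same tree).
ls -lh Local/path/to/locallm/models/
```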
You can start your local Model Service using the following Podman command:
+
```
-podman run --rm -it \
+podman run --rm -d --name model-server \
 -p 8001:8001 \
 -v Local/path/to/locallm/models:/locallm/models \
 -e MODEL_PATH=models/ \
@@ -142,6 +176,13 @@ podman run --rm -it \
 llamacppserver
```

+Check that the model-server container is running with:
+
+```bash
+podman ps
+podman logs model-server
+```
+
### Build the AI Application

Now that the Model Service is running, we want to build and deploy our AI Application. Use the provided Containerfile to build the AI Application image in the `rag-langchain/` directory.

```bash
make APP_IMAGE=rag build
```

@@ -153,23 +194,46 @@
### Deploy the AI Application

-Make sure the Model Service and the Vector Database are up and running before starting this container image. When starting the AI Application container image we need to direct it to the correct `MODEL_ENDPOINT`. This could be any appropriately hosted Model Service (running locally or in the cloud) using an OpenAI compatible API. In our case the Model Service is running inside the Podman machine so we need to provide it with the appropriate address `10.88.0.1`. The same goes for the Vector Database. Make sure the `VECTORDB_HOST` is correctly set to `10.88.0.1` for communication within the Podman virtual machine.
+Make sure the Model Service and the Vector Database are up and running before starting this container image. When starting the AI Application container image, we need to direct it to the correct `MODEL_ENDPOINT`. This could be any appropriately hosted Model Service (running locally or in the cloud) using an OpenAI-compatible API.

There also needs to be a volume mount into the `models/` directory so that the application can access the embedding model, as well as a volume mount into the `data/` directory from which it can pull documents to populate the Vector Database.

The following Podman command can be used to run your AI Application:

+#### With Chroma Vector Database
+
```bash
-podman run --rm -it -p 8501:8501 \
--e MODEL_ENDPOINT=http://10.88.0.1:8001 \
--e VECTORDB_HOST=10.88.0.1 \
+podman run --rm --name rag-inference -d -p 8501:8501 \
+-e MODEL_ENDPOINT=http://127.0.0.1:8001 \
+-e VECTORDB_HOST=127.0.0.1 \
 -v Local/path/to/locallm/models/:/rag/models \
-rag
+rag
+```
+
+#### With Milvus Standalone Vector Database
+
+```bash
+podman run --rm -d --name rag-inference -p 8501:8501 \
+-e MODEL_ENDPOINT=http://127.0.0.1:8001 \
+-e VECTORDB_VENDOR=milvus \
+-e VECTORDB_HOST=127.0.0.1 \
+-e VECTORDB_PORT=19530 \
+-v Local/path/to/locallm/models/:/rag/models \
+rag
+```
+
+Check that the rag-inference container is running with:
+
+```bash
+podman ps
+podman logs rag-inference
+```

### Interact with the AI Application

-Everything should now be up an running with the rag application available at [`http://localhost:8501`](http://localhost:8501). By using this recipe and getting this starting point established, users should now have an easier time customizing and building their own LLM enabled RAG applications.
+Everything should now be up and running, with the rag application available at [`http://localhost:8501`](http://localhost:8501).
+There is a [sample text file](./sample-data/fake_meeting.txt) that can be uploaded in the UI and used to test the RAG capability.
+By using this recipe and getting this starting point established, users should now have an easier time customizing and building their own LLM-enabled RAG applications.

### Embed the AI Application in a Bootable Container Image