diff --git a/springai-vector/model/model.md b/springai-vector/model/model.md
index ebd89136..ddd8ad7d 100644
--- a/springai-vector/model/model.md
+++ b/springai-vector/model/model.md
@@ -16,14 +16,14 @@ Mac:
In this lab, you will:
-- Look at deploying Cohere AI Command-R models with Ollama and Oracle Cloud Infrastructure (OCI).
+- Deploy Cohere AI Command-R models with Ollama and Oracle Cloud Infrastructure (OCI).
- Run a basic test of your model's endpoint for Command-R.
### Prerequisites
* This lab requires the completion of the **Setup Dev Environment** tutorial.
-## Task 1. Using Cohere AI's Command-R model to support chat and embeddings with private LLMs
+## Task 1. Use Cohere AI's Command-R model to support chat and embeddings with private LLMs
Cohere Command-R is a family of LLMs optimized for conversational interaction and long-context tasks. Command-R delivers high precision on retrieval-augmented generation (RAG) with low latency and high throughput. You can get more details about the Command-R models on the [Command-R product page](https://cohere.com/command), and the full technical details are available in the [Model Details](https://docs.cohere.com/docs/command-r) section of its technical documentation.
@@ -62,12 +62,15 @@ Cohere Command-R is a family of LLMs optimized for conversational interaction an
7. At the end of the creation process, obtain the **Public IPv4 address** and, with your private key (the one you generated or uploaded during creation), connect to the instance:
```
+
ssh -i ./.key opc@[GPU_SERVER_IP]
+
```
8. Install and configure Docker to use GPUs:
```
+
sudo /usr/libexec/oci-growfs
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y dnf-utils zip unzip
@@ -75,58 +78,73 @@ Cohere Command-R is a family of LLMs optimized for conversational interaction an
sudo dnf remove -y runc
sudo dnf install -y docker-ce --nobest
sudo useradd docker_user
+
```
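At this point you can optionally confirm that the Docker Engine packages installed correctly and start the service; a quick sanity check (a later step restarts the service anyway):

```

# print the Docker client version; works even before the daemon is running
docker --version

# enable the Docker service and start it now
sudo systemctl enable --now docker

```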
9. We need to make sure that your operating system user has permission to run Docker containers. To do this, run the following command:
```
+
sudo visudo
+
```
And add this line at the end:
```
+
docker_user ALL=(ALL) NOPASSWD: /usr/bin/docker
+
```
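You can confirm the rule was saved before moving on; a quick check, assuming `visudo` edited the default `/etc/sudoers` file:

```

# the line you just added for docker_user should be printed back
sudo grep docker_user /etc/sudoers

```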
10. For convenience, we need to switch to our new user. For this, run:
```
+
sudo su - docker_user
+
```
11. Finally, let's add an alias so that typing `docker` in our shell executes Docker with admin privileges. To do this, edit the file that corresponds to your OS (`.bash_profile` on macOS, `.bashrc` on Linux) and insert this command at the end of the file:
```
+
alias docker="sudo /usr/bin/docker"
exit
+
```
12. We finalize our installation by executing:
```
+
sudo yum install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
+
```
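To verify that the NVIDIA runtime was registered, you can inspect the daemon configuration and the runtimes Docker reports; a minimal check (exact output varies by Docker version):

```

# the nvidia runtime should appear in the daemon configuration
sudo cat /etc/docker/daemon.json

# and in the list of runtimes known to the Docker daemon
sudo docker info | grep -i runtimes

```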
13. If you're on Ubuntu instead, run:
```
+
sudo apt-get install nvidia-container-toolkit=1.14.3-1 \
nvidia-container-toolkit-base=1.14.3-1 \
libnvidia-container-tools=1.14.3-1 \
libnvidia-container1=1.14.3-1
sudo apt-get install -y nvidia-docker2
+
```
14. Let's reboot and re-connect to the VM, then switch back to the `docker_user` account:
```
+
sudo reboot now
# after restart, run:
sudo su - docker_user
+
```
15. Run `docker` to check that everything is OK.
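If you also want to confirm that containers can see the GPUs, a common smoke test is running `nvidia-smi` inside a CUDA base image; a minimal sketch, assuming the `nvidia/cuda:12.2.0-base-ubuntu22.04` tag is still published on Docker Hub:

```

# should print the same GPU table that nvidia-smi shows on the host
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

```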
@@ -134,10 +152,12 @@ sudo su - docker_user
16. Let's run a Docker container from the `ollama/ollama` image and pull the `command-r` and `llama3` models for embeddings/completion:
```
+
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama serve
docker exec -it ollama ollama pull command-r
docker exec -it ollama ollama pull llama3
docker logs -f --tail 10 ollama
+
```
Both models, for embeddings and for completion, will run under the same server, and they are addressed by specifying the required model in each REST request.
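For example, still on the GPU server, you can list what this Ollama instance is serving and address each model by name in the request body; an illustrative sketch (the prompts are placeholders):

```

# list the models pulled into this Ollama instance
curl http://localhost:11434/api/tags

# embeddings request addressed to the command-r model
curl http://localhost:11434/api/embeddings -d '{
  "model": "command-r",
  "prompt": "Retrieval augmented generation"
}'

# completion request addressed to the llama3 model
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Say hello in one sentence."
}'

```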
@@ -173,19 +193,23 @@ Your configured ingress rule:
6. Configure the environment variables below directly, or update the `env.sh` file and run `source ./env.sh`:
```
+
export OLLAMA_URL=http://[GPU_SERVER_IP]:11434
export OLLAMA_EMBEDDINGS=command-r
export OLLAMA_MODEL=command-r
+
```
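Before running the generation test in the next step, you can optionally confirm the server is reachable from your development machine through the ingress rule configured above; a quick check:

```

# should return a JSON list of the pulled models (command-r, llama3)
curl ${OLLAMA_URL}/api/tags

```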
7. Test it from a shell by running:
```
+
curl ${OLLAMA_URL}/api/generate -d '{
"model": "command-r",
"prompt":"Who is Ayrton Senna?"
}'
+
```
You'll receive the response as a stream of sequential partial responses, delivering the content chunk by chunk instead of forcing API users to wait for the whole response to be generated before it is displayed to them.
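If you prefer a single consolidated JSON document instead of the streamed chunks, the generate endpoint also accepts a `stream` flag; a minimal variation of the test above:

```

# disable streaming to get one JSON response containing the full answer
curl ${OLLAMA_URL}/api/generate -d '{
  "model": "command-r",
  "prompt": "Who is Ayrton Senna?",
  "stream": false
}'

```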