Arm RAG
Signed-off-by: ChengZi <[email protected]>
zc277584121 committed Sep 23, 2024
1 parent 0e5daad commit 9ff10e5
Showing 1 changed file with 332 additions and 0 deletions.
332 changes: 332 additions & 0 deletions bootcamp/tutorials/integration/build_rag_on_arm.md
@@ -0,0 +1,332 @@
# Build RAG on ARM Architecture
## Before you begin
Arm CPUs are widely used in traditional ML and AI use cases. In this tutorial, you learn how to build a RAG application on Arm-based CPUs. You deploy the `Llama-3.1-8B` model on an AWS Arm-based server CPU using `llama.cpp`, and also use Zilliz, the fully managed Milvus cloud service built on Arm architecture.


### AWS Graviton
[AWS Graviton](https://aws.amazon.com/ec2/graviton/) is a family of processors designed to deliver the best price performance for your cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2).

You need an Arm server instance with at least four cores and 8 GB of RAM to run this example. Configure at least 32 GB of disk storage. The instructions have been tested on an AWS Graviton3 c7g.2xlarge instance running Ubuntu 22.04 LTS.


### Llama 3.1 model & llama.cpp

The [Llama-3.1-8B model](https://huggingface.co/cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf) from Meta belongs to the Llama 3.1 model family and is free to use for research and commercial purposes. Before you use the model, visit the Llama [website](https://llama.meta.com/llama-downloads/) and fill in the form to request access.

[llama.cpp](https://github.com/ggerganov/llama.cpp) is an open source C/C++ project that enables efficient LLM inference on a variety of hardware - both locally, and in the cloud. You can conveniently host a Llama 3.1 model using `llama.cpp`.


### GGUF model format
Traditionally, large language models (LLMs) rely on GPUs and full-precision 32-bit (FP32) or half-precision 16-bit (FP16) formats for training and inference. However, the introduction of the GGUF model format by the llama.cpp team leverages compression and quantization techniques to reduce dependency on FP32/FP16, allowing weights to be scaled down to 4-bit integers. This significantly lowers computational and memory requirements, making Arm CPUs an efficient platform for LLM inference.
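
To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It is a simplified illustration, not the actual GGUF layout or the llama.cpp implementation; the function names and the weight values are made up for this example.

```python
def quantize_block_4bit(weights):
    """Map a block of floats to signed 4-bit integers in [-8, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # One scale per block; fall back to 1.0 for an all-zero block.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the 4-bit integers and the shared scale."""
    return [q * scale for q in quants]


block = [0.12, -0.53, 0.98, -0.07, 0.33, -0.81, 0.0, 0.45]  # hypothetical weights
scale, quants = quantize_block_4bit(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(quants)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight now fits in 4 bits instead of 16 or 32, at the cost of a reconstruction error of at most half the block's scale.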


## Launch LLM Service on ARM
### Install dependencies

Install the following packages on your Arm-based server instance:

```bash
sudo apt update
sudo apt install make cmake -y
```

You also need to install `gcc` on your machine:

```bash
sudo apt install gcc g++ -y
sudo apt install build-essential -y
```

### Download and build llama.cpp

You are now ready to start building `llama.cpp`.

Clone the source repository for llama.cpp:

```bash
git clone https://github.com/ggerganov/llama.cpp
```

By default, `llama.cpp` builds for CPU only on Linux and Windows. You don't need to provide any extra switches to build it for the Arm CPU that you run it on.

Run `make` to build it:

```bash
cd llama.cpp
make GGML_NO_LLAMAFILE=1 -j$(nproc)
```

Check that `llama.cpp` has built correctly by running the help command:

```bash
./llama-cli -h
```

If `llama.cpp` has built correctly on your machine, you will see the help options being displayed. A snippet of the output is shown below:

```output
example usage:
text generation: ./llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
chat (conversation): ./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
```


### Dependencies and Environment
Install the required Python packages:

```bash
sudo apt install python-is-python3 python3-pip python3-venv -y
```

Create and activate a Python virtual environment:

```bash
python -m venv venv
source venv/bin/activate
```

Your terminal prompt now has the `(venv)` prefix indicating the virtual environment is active. Use this virtual environment for the remaining commands.

Install the required Python dependencies:

```shell
pip install --upgrade pymilvus openai requests langchain-huggingface huggingface_hub tqdm
```

You can now download the model using the Hugging Face CLI:

```bash
huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False
```
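Before you run this model, note what `Q4_0` in the model name denotes: the weights are stored with llama.cpp's `Q4_0` quantization, in which each weight is a 4-bit integer and each block of 32 weights shares one scale factor.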


### Re-quantize the model weights

To re-quantize optimally for Graviton3, run:

```bash
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8
```

This will output a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support.
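
If you script this step for more than one instance type, the choice of target format can be captured in a small helper. The sketch below only builds the argument list; the format names come from this section's note on Graviton generations, while the helper name and the output-file naming convention are assumptions made for illustration.

```python
# Target formats as listed in this tutorial's note on Graviton generations.
REQUANT_FORMAT = {
    "graviton2": "Q4_0_4_4",
    "graviton3": "Q4_0_8_8",
    "graviton4": "Q4_0_4_8",
}


def build_requantize_command(source_gguf, generation):
    """Return the llama-quantize argument list for the given Graviton generation."""
    target_format = REQUANT_FORMAT[generation.lower()]
    # Assumed naming convention: swap the Q4_0 suffix for the target format.
    output_gguf = source_gguf.replace("Q4_0.gguf", f"{target_format}.gguf")
    return [
        "./llama-quantize",
        "--allow-requantize",
        source_gguf,
        output_gguf,
        target_format,
    ]


command = build_requantize_command("dolphin-2.9.4-llama3.1-8b-Q4_0.gguf", "graviton3")
print(" ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the `llama.cpp` directory.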

> This requantization is optimal only for Graviton3. For Graviton2, requantization should optimally be done in `Q4_0_4_4` format, and for Graviton4, `Q4_0_4_8` is the optimal requantization format.
### Start the LLM Service
You can use the llama.cpp server program and submit requests through its OpenAI-compatible API. This lets applications query the LLM repeatedly without starting and stopping it. You can also reach the server over the network from another machine, provided it is started with a `--host` address that is reachable from that machine (by default it binds to 127.0.0.1).


Start the server from the command line. It listens on port 8080:

```shell
./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -n 2048 -t 64 -c 65536 --port 8080
```
```text
main: server is listening on 127.0.0.1:8080 - starting the main loop
```
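
As a quick check that the OpenAI-compatible endpoint responds, you can send a single chat request using only the Python standard library. This is a minimal sketch: the helper names, the prompt, and the `max_tokens` value are illustrative, and the sending step is left commented out so the snippet also runs when the server is not up.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": "not-used",  # placeholder value; the server already has its model loaded
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Say hello in one short sentence.")
print(request.get_method(), request.full_url)
# With the server running, uncomment the next line to send the request:
# print(send_chat_request(request))
```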

You have started the LLM service on your Arm-based CPU. Next, we will build a RAG application on top of it.

## Start to build RAG

### Prepare the data

We use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge in our RAG, which is a good data source for a simple RAG pipeline.

Download the zip file and extract documents to the folder `milvus_docs`.

```shell
wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
```

We load all markdown files from the folder `milvus_docs/en/faq`. For each document, we split the text on the string "# ", which roughly separates the main sections of each markdown file.


```python
from glob import glob

text_lines = []

for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()

    text_lines += file_text.split("# ")
```
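
The effect of splitting on "# " is easiest to see on a small sample. The sketch below uses a made-up markdown string and shows why chunks can start empty or end with leftover `#` characters; `clean_chunks` is a hypothetical helper, not part of the tutorial's pipeline, that tidies them up.

```python
sample_markdown = (
    "# FAQ\n\n#### Question one?\n\nAnswer one.\n\n#### Question two?\n\nAnswer two.\n"
)

# Splitting on "# " consumes only the last '#' of a heading marker,
# so the remaining '#' characters stay attached to the previous chunk.
raw_chunks = sample_markdown.split("# ")
print(raw_chunks)


def clean_chunks(chunks):
    """Strip whitespace and dangling '#' characters, then drop empty chunks."""
    cleaned = (chunk.strip().rstrip("#").strip() for chunk in chunks)
    return [chunk for chunk in cleaned if chunk]


print(clean_chunks(raw_chunks))
```

Applying a cleanup like `text_lines = clean_chunks(text_lines)` before embedding is optional; note that it would also remove a `#` that legitimately ends a chunk.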

### Prepare LLM and Embedding Model

We initialize the LLM client and prepare the embedding model.

For the LLM, we use the OpenAI SDK to request the Llama service launched before.
For the embedding model, we use a simple model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

```python
from openai import OpenAI

llm_client = OpenAI(base_url="http://localhost:8080/v1", api_key="no-key")


from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

```
Generate a test embedding and print its dimension and first few elements.

```python
test_embedding = embedding_model.embed_query("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
```

```text
384
[0.03061249852180481, 0.013831384479999542, -0.02084377221763134, 0.016327863559126854, -0.010231520049273968, -0.0479842908680439, -0.017313342541456223, 0.03728749603033066, 0.04588735103607178, 0.034405000507831573]
```

## Load data into Zilliz
### Create the Collection
We use [Zilliz Cloud](https://zilliz.com/cloud), which is a high-performance fully managed Milvus service deployed on ARM architecture, to store and retrieve the vector data.

We set the `uri` and `token` to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) of your Zilliz Cloud cluster.
```python
from pymilvus import MilvusClient

milvus_client = MilvusClient(
    uri="<your_zilliz_public_endpoint>", token="<your_zilliz_api_key>"
)

collection_name = "my_rag_collection"

```
Check if the collection already exists and drop it if it does.
```python
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)
```
Create a new collection with specified parameters.

If we don't specify any field information, Milvus will automatically create a default `id` field for primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.
```python
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
```
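
With the `IP` metric, a higher score means a closer match. For unit-length vectors the inner product equals the cosine similarity, which the small sketch below illustrates with made-up 2-dimensional vectors. Whether your embedding model returns unit-length vectors is an assumption worth verifying, for example by computing the norm of `test_embedding`.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(vector, vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector norms."""
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


query = normalize([3.0, 4.0])     # hypothetical "embeddings"
document = normalize([4.0, 3.0])

print(inner_product(query, document))
print(cosine_similarity(query, document))
```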
### Insert data
Iterate through the text lines, create embeddings, and then insert the data into Milvus.

Each row carries a new field, `text`, that is not defined in the collection schema. Milvus adds it automatically to the reserved JSON dynamic field, and you can treat it as a normal field at a high level.
```python
from tqdm import tqdm

data = []

text_embeddings = embedding_model.embed_documents(text_lines)

for i, (line, embedding) in enumerate(
    tqdm(zip(text_lines, text_embeddings), desc="Creating embeddings")
):
    data.append({"id": i, "vector": embedding, "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
```
```text
Creating embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 72/72 [00:18<00:00, 3.91it/s]
```
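
The FAQ corpus here is small enough for a single `insert` call. For a larger corpus you may prefer to send the rows in batches; the following sketch shows one way to do that. The helper names and the batch size of 100 are assumptions, and a stand-in insert function is used so the snippet runs without a Milvus connection.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def insert_in_batches(insert_fn, rows, batch_size=100):
    """Call `insert_fn` once per batch and return the number of rows sent."""
    sent = 0
    for batch in batched(rows, batch_size):
        insert_fn(batch)
        sent += len(batch)
    return sent


# Stand-in rows and a stand-in insert function for illustration only.
rows = [{"id": i, "vector": [0.0, 0.0], "text": f"chunk {i}"} for i in range(250)]
batch_sizes = []
total = insert_in_batches(lambda batch: batch_sizes.append(len(batch)), rows)
print(total, batch_sizes)
```

With the real client, the insert function would be something like `lambda batch: milvus_client.insert(collection_name=collection_name, data=batch)`.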
## Build RAG

### Retrieve data for a query

Let's specify a frequent question about Milvus.
```python
question = "How is data stored in milvus?"
```
Search for the question in the collection and retrieve the semantic top-3 matches.

```python
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        embedding_model.embed_query(question)
    ],  # Convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)
```
Let's take a look at the search results of the query:
```python
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))
```

```text
[
[
" Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",
0.6488019824028015
],
[
"How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###",
0.5974207520484924
],
[
"What is the maximum dataset size Milvus can handle?\n\n \nTheoretically, the maximum dataset size Milvus can handle is determined by the hardware it is run on, specifically system memory and storage:\n\n- Milvus loads all specified collections and partitions into memory before running queries. Therefore, memory size determines the maximum amount of data Milvus can query.\n- When new entities and and collection-related schema (currently only MinIO is supported for data persistence) are added to Milvus, system storage determines the maximum allowable size of inserted data.\n\n###",
0.5833579301834106
]
]
```
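
The scores drop off gradually, so you may want to discard weak matches before passing them to the LLM. The sketch below filters hits by a minimum inner-product score. The hits are stand-ins shaped like the entries of `search_res[0]`, with scores rounded from the output above, and the threshold of 0.59 is an arbitrary value chosen for illustration.

```python
def filter_hits(hits, min_score):
    """Keep (text, score) pairs whose inner-product score is at least `min_score`."""
    return [
        (hit["entity"]["text"], hit["distance"])
        for hit in hits
        if hit["distance"] >= min_score
    ]


# Stand-in hits with shortened texts; real hits come from `search_res[0]`.
sample_hits = [
    {"entity": {"text": "Where does Milvus store data?"}, "distance": 0.649},
    {"entity": {"text": "How does Milvus flush data?"}, "distance": 0.597},
    {"entity": {"text": "What is the maximum dataset size Milvus can handle?"}, "distance": 0.583},
]

for text, score in filter_hits(sample_hits, min_score=0.59):
    print(f"{score:.3f}  {text}")
```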
### Use LLM to get a RAG response

Convert the retrieved documents into a string format.
```python
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
```
Define the system and user prompts for the language model. The user prompt is assembled from the documents retrieved from Milvus.

```python
SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
```
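
Because `USER_PROMPT` is an f-string, it is fixed to the current `context` and `question`. If you plan to ask several questions, a template filled in by a function may be more convenient. The sketch below reuses the same prompt wording; the names `USER_PROMPT_TEMPLATE` and `build_user_prompt` and the sample passages are introduced here for illustration.

```python
USER_PROMPT_TEMPLATE = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""


def build_user_prompt(passages, question):
    """Join the retrieved passages and fill them into the prompt template."""
    context = "\n".join(passages)
    return USER_PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_user_prompt(
    ["Milvus stores inserted data in object storage.", "Metadata are stored in etcd."],
    "How is data stored in milvus?",
)
print(prompt)
```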
Use LLM to generate a response based on the prompts. We set the `model` parameter to `not-used` since it is a redundant parameter for the llama.cpp service.

```python
response = llm_client.chat.completions.create(
    model="not-used",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)

```
```text
Milvus stores data in two types: inserted data and metadata. Inserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends such as MinIO, AWS S3, Google Cloud Storage (GCS), Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage (COS). Metadata are generated within Milvus and each Milvus module has its own metadata that are stored in etcd.
```
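
The retrieval and generation steps can be combined into one function. The following is a sketch under the assumption that embedding, search, and generation are passed in as callables; `rag_answer` and the stand-in functions are hypothetical names, used so the snippet runs without the services started earlier.

```python
def rag_answer(question, embed, search, generate, top_k=3):
    """Embed the question, retrieve `top_k` passages, and ask the LLM to answer."""
    query_vector = embed(question)
    passages = search(query_vector, top_k)
    context = "\n".join(passages)
    user_prompt = (
        "Use the following pieces of information enclosed in <context> tags to provide "
        "an answer to the question enclosed in <question> tags.\n"
        f"<context>\n{context}\n</context>\n<question>\n{question}\n</question>\n"
    )
    return generate(user_prompt)


# Stand-in callables for illustration only.
def fake_embed(text):
    return [float(len(text)), 0.0]


def fake_search(vector, top_k):
    return ["Metadata are stored in etcd.", "Inserted data go to object storage."][:top_k]


def fake_generate(prompt):
    return f"(stub answer built from a prompt of {len(prompt)} characters)"


print(rag_answer("How is data stored in milvus?", fake_embed, fake_search, fake_generate))
```

In the real pipeline, `embed` would be `embedding_model.embed_query`, `search` would wrap `milvus_client.search` and return the `text` field of each hit, and `generate` would wrap `llm_client.chat.completions.create`.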
Congratulations! You have built a RAG application on top of Arm-based infrastructure.
