feat: Watsonx (#594)
Co-authored-by: niklub <[email protected]>
Co-authored-by: Caitlin Wheeless <[email protected]>
Co-authored-by: Micaela Kaplan <[email protected]>
5 people authored Aug 9, 2024
1 parent a50e31e commit 6f7637c
Showing 14 changed files with 1,002 additions and 2 deletions.
1 change: 1 addition & 0 deletions label_studio_ml/examples/watsonx_llm/.dockerignore
@@ -0,0 +1 @@
*.md
48 changes: 48 additions & 0 deletions label_studio_ml/examples/watsonx_llm/Dockerfile
@@ -0,0 +1,48 @@
# syntax=docker/dockerfile:1
ARG PYTHON_VERSION=3.11

FROM python:${PYTHON_VERSION}-slim AS python-base
ARG TEST_ENV

WORKDIR /app

ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
PORT=${PORT:-9090} \
PIP_CACHE_DIR=/.cache \
WORKERS=1 \
THREADS=8

# Update the base OS
RUN --mount=type=cache,target="/var/cache/apt",sharing=locked \
--mount=type=cache,target="/var/lib/apt/lists",sharing=locked \
set -eux; \
apt-get update; \
apt-get upgrade -y; \
apt-get install --no-install-recommends -y \
git; \
apt-get autoremove -y

# install base requirements
COPY requirements-base.txt .
RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
pip install -r requirements-base.txt

# install custom requirements
COPY requirements.txt .
RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
pip install -r requirements.txt

# install test requirements if needed
COPY requirements-test.txt .
# build only when TEST_ENV="true"
RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
if [ "$TEST_ENV" = "true" ]; then \
pip install -r requirements-test.txt; \
fi

COPY . .

EXPOSE 9090

CMD gunicorn --preload --bind :$PORT --workers $WORKERS --threads $THREADS --timeout 0 _wsgi:app
150 changes: 150 additions & 0 deletions label_studio_ml/examples/watsonx_llm/README.md
@@ -0,0 +1,150 @@


# Integrate WatsonX with Label Studio
WatsonX offers a suite of machine learning tools, including access to many LLMs, prompt
refinement interfaces, and datastores via WatsonX.data. When you integrate WatsonX with Label Studio, you get
access to these models and can automatically keep your annotated data up to date in your WatsonX.data tables.

To run the integration, you'll need to pull this repo and host it locally or in the cloud. Then, you can link the model
to your Label Studio project under the **Model** section of the project settings. To use the WatsonX.data integration,
set up a webhook in the settings under **Webhooks**, using the following structure for the link:
`<link to your hosted container>/data/upload`, and set the triggers to `ANNOTATION_CREATED` and `ANNOTATION_UPDATED`. For more
on webhooks, see [our documentation](https://labelstud.io/guide/webhooks).
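
If you want to check the webhook wiring before pointing Label Studio at it, the minimal sketch below posts a hand-built payload to the `/data/upload` route using the `requests` library. The URL assumes the default local setup and every field value is a placeholder; only the payload shape (`action`, `task`, `annotation`) mirrors what the webhook handler in this example reads.

```python
# Minimal sketch: simulate the webhook call Label Studio makes on ANNOTATION_CREATED.
# The URL and all field values are placeholders; only the payload shape matters here.
import requests

payload = {
    "action": "ANNOTATION_CREATED",
    "task": {
        "id": 1,
        "data": {"text": "Example context shown to the annotator"},
    },
    "annotation": {
        "completed_by": 1,
        "updated_by": 1,
        "result": [
            {"from_name": "response", "value": {"text": ["Example generated response"]}},
        ],
    },
}

resp = requests.post("http://localhost:9090/data/upload", json=payload, timeout=30)
print(resp.status_code)
```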

See the configuration notes at the bottom for details on how to set up your environment variables to get the system to work.

## Setting up your label_config
For this project, we recommend starting with the labeling config defined below, but you can always edit or expand it to
meet your needs! Crucially, there must be a `<TextArea>` tag for the model to insert its response into.

```xml
<View>
  <Style>
    .lsf-main-content.lsf-requesting .prompt::before { content: ' loading...'; color: #808080; }

    .text-container {
      background-color: white;
      border-radius: 10px;
      box-shadow: 0px 4px 6px rgba(0, 0, 0, 0.1);
      padding: 20px;
      font-family: 'Courier New', monospace;
      line-height: 1.6;
      font-size: 16px;
    }
  </Style>
  <Header value="Context:"/>
  <View className="text-container">
    <Text name="context" value="$text"/>
  </View>
  <Header value="Prompt:"/>
  <View className="prompt">
    <TextArea name="prompt"
              toName="context"
              rows="4"
              editable="true"
              maxSubmissions="1"
              showSubmitButton="false"
              placeholder="Type your prompt here then Shift+Enter..."
    />
  </View>
  <Header value="Response:"/>
  <TextArea name="response"
            toName="context"
            rows="4"
            editable="true"
            maxSubmissions="1"
            showSubmitButton="false"
            smart="false"
            placeholder="Generated response will appear here..."
  />
  <Header value="Overall response quality:"/>
  <Rating name="rating" toName="context"/>
</View>
```


## Setting up WatsonX.data
To use your WatsonX.data integration, follow the steps below.
1. First, get the host and port information for the engine you'll be using. Navigate to the Infrastructure Manager from the left sidebar of your WatsonX.data page and change to list view by clicking the symbol in the upper right-hand corner. From there, click the name of the engine you'll be using. This brings up a pop-up window where you can see the host and port information under "host". The port is the part after the `:` at the end of the URL.
2. Next, make sure your catalog is set up. To create a new catalog, follow [these instructions](https://dataplatform.cloud.ibm.com/docs/content/wsj/catalog/create-catalog.html?context=wx&locale=en).
3. Once your catalog is set up, make sure that the correct schema is also set up. Navigate to your Data Manager and select `create` to create a new schema.
4. With all of this information, you're ready to update the environment variables listed at the bottom of this page and get started with your WatsonX.data integration! A quick connectivity check is sketched below.
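
As a quick sanity check of the host, port, catalog, and schema you just collected, the minimal sketch below opens a Presto connection the same way the webhook app in this example does. It assumes the `WATSONX_*` variables described in the Configuration section are already exported in your environment; the fallback port value is only a placeholder.

```python
# Minimal connectivity check for the WatsonX.data Presto engine.
# Assumes the WATSONX_* variables from the Configuration section are set in the environment.
import os

import prestodb

conn = prestodb.dbapi.connect(
    host=os.getenv("WATSONX_ENG_HOST"),
    port=int(os.getenv("WATSONX_ENG_PORT", "443")),  # placeholder default port
    user=os.getenv("WATSONX_ENG_USERNAME", "ibmlhapikey"),
    catalog=os.getenv("WATSONX_CATALOG"),
    schema=os.getenv("WATSONX_SCHEMA"),
    http_scheme="https",
    auth=prestodb.auth.BasicAuthentication(
        os.getenv("WATSONX_ENG_USERNAME", "ibmlhapikey"),
        os.getenv("WATSONX_API_KEY"),
    ),
)
cur = conn.cursor()
cur.execute("SHOW TABLES")  # any lightweight query works for a smoke test
print(cur.fetchall())
conn.close()
```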


## Running with Docker (recommended)

1. Start the Machine Learning backend on `http://localhost:9090` with the prebuilt image:

```bash
docker-compose up
```

2. Validate that the backend is running:

```bash
$ curl http://localhost:9090/
{"status":"UP"}
```

3. Create a project in Label Studio. Then from the **Model** page in the project settings, [connect the model](https://labelstud.io/guide/ml#Connect-the-model-to-Label-Studio). The default URL is `http://localhost:9090`.


## Building from source (advanced)

To build the ML backend from source, you have to clone the repository and build the Docker image:

```bash
docker-compose build
```

## Running without Docker (advanced)

To run the ML backend without Docker, you have to clone the repository and install all dependencies using pip:

```bash
python -m venv ml-backend
source ml-backend/bin/activate
pip install -r requirements.txt
```

Then you can start the ML backend:

```bash
label-studio-ml start ./dir_with_your_model
```

## Configuration

Parameters can be set in `docker-compose.yml` before running the container.

The following common parameters are available:
- `BASIC_AUTH_USER` - Specify the basic auth user for the model server.
- `BASIC_AUTH_PASS` - Specify the basic auth password for the model server.
- `LOG_LEVEL` - Set the log level for the model server.
- `WORKERS` - Specify the number of workers for the model server.
- `THREADS` - Specify the number of threads for the model server.

The following parameters allow you to link the WatsonX models to Label Studio:

- `LABEL_STUDIO_URL` - Specify the URL of your Label Studio instance. Note that this might need to be `http://host.docker.internal:8080` if you are running Label Studio on another Docker container.
- `LABEL_STUDIO_API_KEY` - Specify the API key for authenticating your Label Studio instance. You can find this by logging into Label Studio and [going to the **Account & Settings** page](https://labelstud.io/guide/user_account#Access-token). A quick connection check using this key is sketched after this list.
- `WATSONX_API_KEY` - Specify the API key for authenticating into WatsonX. You can generate it by following [these instructions](https://www.ibm.com/docs/en/watsonx/watsonxdata/1.0.x?topic=started-generating-api-keys).
- `WATSONX_PROJECT_ID` - Specify the ID of the WatsonX project from which you will run the model. The project must have WML capabilities. You can find the ID in the `General` section of your project, which is accessible by clicking on the project from the WatsonX homepage.
- `WATSONX_MODELTYPE` - Specify the name of the WatsonX model you'd like to use. A full list can be found in [IBM's documentation](https://ibm.github.io/watsonx-ai-python-sdk/fm_model.html#TextModels:~:text=CODELLAMA_34B_INSTRUCT_HF).
- `DEFAULT_PROMPT` - If you want the model to automatically predict on new data samples, you'll need to provide a default prompt or the path to a default prompt file.
- `USE_INTERNAL_PROMPT` - If using a default prompt, set to 0. Otherwise, set to 1.
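
Before relying on the integration, you may want to confirm that `LABEL_STUDIO_URL` and `LABEL_STUDIO_API_KEY` are valid. The sketch below is a minimal check that mirrors the `connect_ls()` helper in `data_transfer_app.py`; it assumes both variables are already set in your environment.

```python
# Minimal sketch: verify the Label Studio connection the webhook app will use.
# Assumes LABEL_STUDIO_URL and LABEL_STUDIO_API_KEY are set in the environment.
import os

from label_studio_sdk.client import LabelStudio

client = LabelStudio(
    base_url=os.getenv("LABEL_STUDIO_URL"),
    api_key=os.getenv("LABEL_STUDIO_API_KEY"),
)

# Listing users is the same call the webhook handler makes to resolve annotator emails.
print([user.email for user in client.users.list()])
```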

The following parameters allow you to use the webhook connection to transfer data from Label Studio to WatsonX.data:

- `WATSONX_ENG_USERNAME` - MUST be `ibmlhapikey` for the integration to work.

To get the host and port information below, you can follow the steps under [Pre-requisites](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-con-presto-serv#conn-to-prestjava).

- `WATSONX_ENG_HOST` - the host information for your WatsonX.data Engine
- `WATSONX_ENG_PORT` - the port information for your WatsonX.data Engine
- `WATSONX_CATALOG` - the name of the catalog for the table you'll insert your data into. Must be created in the WatsonX.data platform.
- `WATSONX_SCHEMA` - the name of the schema for the table you'll insert your data into. Must be created in the WatsonX.data platform.
- `WATSONX_TABLE` - the name of the table you'll insert your data into. Does not need to be created in advance; the webhook handler creates it on the fly (see the sketch below).
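
For reference, the webhook handler builds the table schema from your annotation data: integer fields become `bigint`, everything else becomes `varchar`, and `ID` is placed first. The sketch below only illustrates the SQL that ends up being generated for a hypothetical record; the real column names come from your task data and labeling config, and the table name is a placeholder for `WATSONX_TABLE`.

```python
# Illustration only: the SQL the webhook handler generates for a hypothetical record.
# Column names, values, and the table name are placeholders.
sample = {"ID": 1, "text": "Example context", "prompt": "Summarize this", "response": "A summary", "rating": 4}

columns = sorted(sample)                              # keys are sorted ...
columns.insert(0, columns.pop(columns.index("ID")))   # ... with ID moved to the front
types = {k: "bigint" if isinstance(sample[k], int) else "varchar" for k in columns}

table = "my_annotations"  # placeholder for WATSONX_TABLE
create = f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(f'{k} {types[k]}' for k in columns)})"
insert = f"INSERT INTO {table} VALUES {tuple(sample[k] for k in columns)}"
print(create)
print(insert)
```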

16 changes: 16 additions & 0 deletions label_studio_ml/examples/watsonx_llm/_wsgi.py
@@ -0,0 +1,16 @@
from werkzeug.middleware.dispatcher import DispatcherMiddleware
from flask import Flask

from data_wsgi import application as data
import model_wsgi as model

"""
Here, we create a Flask app to serve as a wrapper for both the ml-backend model api and the webhook api. By doing this,
we can host both behind the same endpoint, with the model accessible at <host_url>/ and the webhook accessible at
<host_url>/data/. We set app.wsgi_app in this way so that we can run our tests.
"""
app = Flask(__name__)

app.wsgi_app = DispatcherMiddleware(model.application.wsgi_app, {
'/data': data.wsgi_app
})
174 changes: 174 additions & 0 deletions label_studio_ml/examples/watsonx_llm/data_transfer_app.py
@@ -0,0 +1,174 @@
import csv
import logging
import os
import prestodb
import traceback
from flask import Flask, request, jsonify, Response
from label_studio_sdk.client import LabelStudio
from typing import List

logger = logging.getLogger(__name__)
_server = Flask(__name__)


def init_app():
return _server


@_server.route('/health', methods=['GET'])
@_server.route('/', methods=['GET'])
def health():
    return jsonify({
        'status': 'UP'
    })


@_server.route('/upload', methods=['POST'])
def upload_to_watsonx():
# First, collect data from the request object passed by label studio
input_request = request.json
action = input_request["action"]
annotation = input_request["annotation"]
task = input_request["task"]

# Connect to Label Studio
client = connect_ls()
data = get_data(annotation, task, client)

# Then, connect to WatsonX.data via prestodb
eng_username = os.getenv("WATSONX_ENG_USERNAME")
eng_password = os.getenv("WATSONX_API_KEY")
eng_host = os.getenv("WATSONX_ENG_HOST")
eng_port = os.getenv("WATSONX_ENG_PORT")
catalog = os.getenv("WATSONX_CATALOG")
schema = os.getenv("WATSONX_SCHEMA")
table = os.getenv("WATSONX_TABLE")

if None in [eng_username, eng_password, eng_host, eng_port, catalog, schema, table]:
raise Exception("You must provide the required WATSONX variables in your docker-compose.yml file!")

try:
with prestodb.dbapi.connect(host=eng_host, port=eng_port, user=eng_username, catalog=catalog,
schema=schema, http_scheme='https',
auth=prestodb.auth.BasicAuthentication(eng_username, eng_password)) as conn:

cur = conn.cursor()
# dynamically create table schema
table_create, table_info_keys = create_table(table, data)
cur.execute(table_create)

if action == "ANNOTATION_CREATED":
# upload new annotation to watsonx
values = tuple([data[key] for key in table_info_keys])
insert_command = f"""INSERT INTO {table} VALUES {values}"""
logger.debug(insert_command)
cur.execute(insert_command)

elif action == "ANNOTATION_UPDATED":
# update existing annotation in watsonx by deleting the old one and uploading a new one
delete = f"""DELETE from {table} WHERE ID={data["ID"]}"""
logger.debug(delete)
cur.execute(delete)
values = tuple([data[key] for key in table_info_keys])
insert_command = f"""INSERT INTO {table} VALUES {values}"""
                logger.debug(insert_command)
cur.execute(insert_command)

elif action == "ANNOTATIONS_DELETED":
# delete existing annotation in watsonx
delete = f"""DELETE from {table} WHERE ID={data["ID"]}"""
logger.debug(delete)
cur.execute(delete)

            conn.commit()
        # return a response so Flask treats the request as successfully handled
        return jsonify({"status": "ok"})
    except Exception as e:
        logger.debug(traceback.format_exc())
        logger.debug(e)
        return Response(status=500)


def connect_ls():
try:
base_url = os.getenv("LABEL_STUDIO_URL")
api_key = os.getenv("LABEL_STUDIO_API_KEY")

if None in [base_url, api_key]:
raise Exception(
"You must provide your LABEL_STUDIO_URL and LABEL_STUDIO_API_KEY in your docker-compose.yml file!")

client = LabelStudio(
base_url=base_url,
api_key=api_key
)

return client

except Exception as e:
logger.debug(traceback.format_exc())
logger.debug(e)


def get_data(annotation, task, client):
"""Collect the data to be uploaded to WatsonX.data"""
info = {}

try:
users = client.users.list()
id = task["id"]
annotator_complete = annotation["completed_by"]
annotator_update = annotation["updated_by"]
annotator_complete = next((x.email for x in users if x.id == annotator_complete), "")
annotator_update = next((x.email for x in users if x.id == annotator_update), "")
info.update({"ID": int(id), "completed_by": annotator_complete, "updated_by": annotator_update})
for key, value in task["data"].items():
if isinstance(value, List):
value = value[0]
elif isinstance(value, str) and value.isnumeric():
value = int(value)

if isinstance(value, str):
value = value.strip("\"")
info.update({key: value})

for result in annotation["result"]:
logger.debug(result)
val_dict_key = list(result["value"].keys())[0]
value = result["value"][val_dict_key]
key = result["from_name"]
if isinstance(value, List):
value = value[0]
elif isinstance(value, str) and value.isnumeric():
value = int(value)

if isinstance(value, str):
value = value.strip("\"")
info.update({key: value})

logger.debug(f"INFO {info}")
return info
except Exception as e:
logger.debug(traceback.format_exc())
logger.debug(e)


def create_table(table, data):
"""
Create the command for building a new table
"""
table_info = {}
for key, value in data.items():
if isinstance(value, int):
table_info.update({key: "bigint"})
else:
table_info.update({key: "varchar"})

table_info_keys = sorted(table_info.keys())
table_info_keys.insert(0, table_info_keys.pop(table_info_keys.index("ID")))
nl = ",\n"
strings = [f"{key} {table_info[key]}" for key in table_info_keys]
table_create = f"""
CREATE TABLE IF NOT EXISTS {table} ({nl.join(strings)})
"""
logger.debug(table_create)
return table_create, table_info_keys