feat: Watsonx (#594)
Co-authored-by: niklub <[email protected]>
Co-authored-by: Caitlin Wheeless <[email protected]>
Co-authored-by: Micaela Kaplan <[email protected]>
5 people authored Aug 9, 2024
1 parent a50e31e commit 6f7637c
Showing 14 changed files with 1,002 additions and 2 deletions.
1 change: 1 addition & 0 deletions label_studio_ml/examples/watsonx_llm/.dockerignore
@@ -0,0 +1 @@
*.md
48 changes: 48 additions & 0 deletions label_studio_ml/examples/watsonx_llm/Dockerfile
@@ -0,0 +1,48 @@
# syntax=docker/dockerfile:1
ARG PYTHON_VERSION=3.11

FROM python:${PYTHON_VERSION}-slim AS python-base
ARG TEST_ENV

WORKDIR /app

ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
PORT=${PORT:-9090} \
PIP_CACHE_DIR=/.cache \
WORKERS=1 \
THREADS=8

# Update the base OS
RUN --mount=type=cache,target="/var/cache/apt",sharing=locked \
--mount=type=cache,target="/var/lib/apt/lists",sharing=locked \
set -eux; \
apt-get update; \
apt-get upgrade -y; \
apt-get install --no-install-recommends -y \
git; \
apt-get autoremove -y

# install base requirements
COPY requirements-base.txt .
RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
pip install -r requirements-base.txt

# install custom requirements
COPY requirements.txt .
RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
pip install -r requirements.txt

# install test requirements if needed
COPY requirements-test.txt .
# build only when TEST_ENV="true"
RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
if [ "$TEST_ENV" = "true" ]; then \
pip install -r requirements-test.txt; \
fi

COPY . .

EXPOSE 9090

CMD gunicorn --preload --bind :$PORT --workers $WORKERS --threads $THREADS --timeout 0 _wsgi:app
150 changes: 150 additions & 0 deletions label_studio_ml/examples/watsonx_llm/README.md
@@ -0,0 +1,150 @@


# Integrate WatsonX with Label Studio
WatsonX offers a suite of machine learning tools, including access to many LLMs, prompt
refinement interfaces, and datastores via WatsonX.data. When you integrate WatsonX with Label Studio, you get
access to these models and can automatically keep your annotated data up to date in your WatsonX.data tables.

To run the integration, you'll need to pull this repo and host it locally or in the cloud. Then, you can link the model
to your Label Studio project under the **Model** section of the project settings. To use the WatsonX.data integration,
set up a webhook in the settings under **Webhooks**, using the following structure for the link:
`<link to your hosted container>/data/upload`, and set the triggers to `ANNOTATION_CREATED` and `ANNOTATION_UPDATED`. For more
on webhooks, see [our documentation](https://labelstud.io/guide/webhooks).
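
If you want to check the webhook wiring before pointing Label Studio at it, the minimal sketch below posts a hand-built payload to the `/data/upload` route using the `requests` library. The URL assumes the default local setup and every field value is a placeholder; only the payload shape (`action`, `task`, `annotation`) mirrors what the webhook handler in this example reads.

```python
# Minimal sketch: simulate the webhook call Label Studio makes on ANNOTATION_CREATED.
# The URL and all field values are placeholders; only the payload shape matters here.
import requests

payload = {
    "action": "ANNOTATION_CREATED",
    "task": {
        "id": 1,
        "data": {"text": "Example context shown to the annotator"},
    },
    "annotation": {
        "completed_by": 1,
        "updated_by": 1,
        "result": [
            {"from_name": "response", "value": {"text": ["Example generated response"]}},
        ],
    },
}

resp = requests.post("http://localhost:9090/data/upload", json=payload, timeout=30)
print(resp.status_code)
```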

See the configuration notes at the bottom for details on how to set up your environment variables to get the system to work.

## Setting up your label_config
For this project, we recommend starting with the labeling config defined below, but you can always edit or expand it to
meet your needs! Crucially, there must be a `<TextArea>` tag for the model to insert its response into.

```xml
<View>
  <Style>
    .lsf-main-content.lsf-requesting .prompt::before { content: ' loading...'; color: #808080; }

    .text-container {
      background-color: white;
      border-radius: 10px;
      box-shadow: 0px 4px 6px rgba(0, 0, 0, 0.1);
      padding: 20px;
      font-family: 'Courier New', monospace;
      line-height: 1.6;
      font-size: 16px;
    }
  </Style>
  <Header value="Context:"/>
  <View className="text-container">
    <Text name="context" value="$text"/>
  </View>
  <Header value="Prompt:"/>
  <View className="prompt">
    <TextArea name="prompt"
              toName="context"
              rows="4"
              editable="true"
              maxSubmissions="1"
              showSubmitButton="false"
              placeholder="Type your prompt here then Shift+Enter..."
    />
  </View>
  <Header value="Response:"/>
  <TextArea name="response"
            toName="context"
            rows="4"
            editable="true"
            maxSubmissions="1"
            showSubmitButton="false"
            smart="false"
            placeholder="Generated response will appear here..."
  />
  <Header value="Overall response quality:"/>
  <Rating name="rating" toName="context"/>
</View>
```


## Setting up WatsonX.data
To use your WatsonX.data integration, follow the steps below.
1. First, get the host and port information for the engine you'll be using. Navigate to the Infrastructure Manager from the left sidebar of your WatsonX.data page and change to list view by clicking the symbol in the upper right-hand corner. From there, click the name of the engine you'll be using. This brings up a pop-up window where you can see the host and port information under "host". The port is the part after the `:` at the end of the URL.
2. Next, make sure your catalog is set up. To create a new catalog, follow [these instructions](https://dataplatform.cloud.ibm.com/docs/content/wsj/catalog/create-catalog.html?context=wx&locale=en).
3. Once your catalog is set up, make sure that the correct schema is also set up. Navigate to your Data Manager and select `create` to create a new schema.
4. With all of this information, you're ready to update the environment variables listed at the bottom of this page and get started with your WatsonX.data integration! A quick connectivity check is sketched below.
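
As a quick sanity check of the host, port, catalog, and schema you just collected, the minimal sketch below opens a Presto connection the same way the webhook app in this example does. It assumes the `WATSONX_*` variables described in the Configuration section are already exported in your environment; the fallback port value is only a placeholder.

```python
# Minimal connectivity check for the WatsonX.data Presto engine.
# Assumes the WATSONX_* variables from the Configuration section are set in the environment.
import os

import prestodb

conn = prestodb.dbapi.connect(
    host=os.getenv("WATSONX_ENG_HOST"),
    port=int(os.getenv("WATSONX_ENG_PORT", "443")),  # placeholder default port
    user=os.getenv("WATSONX_ENG_USERNAME", "ibmlhapikey"),
    catalog=os.getenv("WATSONX_CATALOG"),
    schema=os.getenv("WATSONX_SCHEMA"),
    http_scheme="https",
    auth=prestodb.auth.BasicAuthentication(
        os.getenv("WATSONX_ENG_USERNAME", "ibmlhapikey"),
        os.getenv("WATSONX_API_KEY"),
    ),
)
cur = conn.cursor()
cur.execute("SHOW TABLES")  # any lightweight query works for a smoke test
print(cur.fetchall())
conn.close()
```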


## Running with Docker (recommended)

1. Start the Machine Learning backend on `http://localhost:9090` with the prebuilt image:

```bash
docker-compose up
```

2. Validate that the backend is running:

```bash
$ curl http://localhost:9090/
{"status":"UP"}
```

3. Create a project in Label Studio. Then from the **Model** page in the project settings, [connect the model](https://labelstud.io/guide/ml#Connect-the-model-to-Label-Studio). The default URL is `http://localhost:9090`.


## Building from source (advanced)

To build the ML backend from source, you have to clone the repository and build the Docker image:

```bash
docker-compose build
```

## Running without Docker (advanced)

To run the ML backend without Docker, you have to clone the repository and install all dependencies using pip:

```bash
python -m venv ml-backend
source ml-backend/bin/activate
pip install -r requirements.txt
```

Then you can start the ML backend:

```bash
label-studio-ml start ./dir_with_your_model
```

## Configuration

Parameters can be set in `docker-compose.yml` before running the container.

The following common parameters are available:
- `BASIC_AUTH_USER` - Specify the basic auth user for the model server.
- `BASIC_AUTH_PASS` - Specify the basic auth password for the model server.
- `LOG_LEVEL` - Set the log level for the model server.
- `WORKERS` - Specify the number of workers for the model server.
- `THREADS` - Specify the number of threads for the model server.

The following parameters allow you to link the WatsonX models to Label Studio:

- `LABEL_STUDIO_URL` - Specify the URL of your Label Studio instance. Note that this might need to be `http://host.docker.internal:8080` if you are running Label Studio on another Docker container.
- `LABEL_STUDIO_API_KEY` - Specify the API key for authenticating your Label Studio instance. You can find this by logging into Label Studio and [going to the **Account & Settings** page](https://labelstud.io/guide/user_account#Access-token). A quick connection check using this key is sketched after this list.
- `WATSONX_API_KEY` - Specify the API key for authenticating into WatsonX. You can generate it by following [these instructions](https://www.ibm.com/docs/en/watsonx/watsonxdata/1.0.x?topic=started-generating-api-keys).
- `WATSONX_PROJECT_ID` - Specify the ID of the WatsonX project from which you will run the model. The project must have WML capabilities. You can find the ID in the `General` section of your project, which is accessible by clicking on the project from the WatsonX homepage.
- `WATSONX_MODELTYPE` - Specify the name of the WatsonX model you'd like to use. A full list can be found in [IBM's documentation](https://ibm.github.io/watsonx-ai-python-sdk/fm_model.html#TextModels:~:text=CODELLAMA_34B_INSTRUCT_HF).
- `DEFAULT_PROMPT` - If you want the model to automatically predict on new data samples, you'll need to provide a default prompt or the path to a default prompt file.
- `USE_INTERNAL_PROMPT` - If using a default prompt, set to 0. Otherwise, set to 1.
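
Before relying on the integration, you may want to confirm that `LABEL_STUDIO_URL` and `LABEL_STUDIO_API_KEY` are valid. The sketch below is a minimal check that mirrors the `connect_ls()` helper in `data_transfer_app.py`; it assumes both variables are already set in your environment.

```python
# Minimal sketch: verify the Label Studio connection the webhook app will use.
# Assumes LABEL_STUDIO_URL and LABEL_STUDIO_API_KEY are set in the environment.
import os

from label_studio_sdk.client import LabelStudio

client = LabelStudio(
    base_url=os.getenv("LABEL_STUDIO_URL"),
    api_key=os.getenv("LABEL_STUDIO_API_KEY"),
)

# Listing users is the same call the webhook handler makes to resolve annotator emails.
print([user.email for user in client.users.list()])
```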

The following parameters allow you to use the webhook connection to transfer data from Label Studio to WatsonX.data:

- `WATSONX_ENG_USERNAME` - MUST be `ibmlhapikey` for the integration to work.

To get the host and port information below, you can follow the steps under [Pre-requisites](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-con-presto-serv#conn-to-prestjava).

- `WATSONX_ENG_HOST` - the host information for your WatsonX.data Engine
- `WATSONX_ENG_PORT` - the port information for your WatsonX.data Engine
- `WATSONX_CATALOG` - the name of the catalog for the table you'll insert your data into. Must be created in the WatsonX.data platform.
- `WATSONX_SCHEMA` - the name of the schema for the table you'll insert your data into. Must be created in the WatsonX.data platform.
- `WATSONX_TABLE` - the name of the table you'll insert your data into. Does not need to be created in advance; the webhook handler creates it on the fly (see the sketch below).
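
For reference, the webhook handler builds the table schema from your annotation data: integer fields become `bigint`, everything else becomes `varchar`, and `ID` is placed first. The sketch below only illustrates the SQL that ends up being generated for a hypothetical record; the real column names come from your task data and labeling config, and the table name is a placeholder for `WATSONX_TABLE`.

```python
# Illustration only: the SQL the webhook handler generates for a hypothetical record.
# Column names, values, and the table name are placeholders.
sample = {"ID": 1, "text": "Example context", "prompt": "Summarize this", "response": "A summary", "rating": 4}

columns = sorted(sample)                              # keys are sorted ...
columns.insert(0, columns.pop(columns.index("ID")))   # ... with ID moved to the front
types = {k: "bigint" if isinstance(sample[k], int) else "varchar" for k in columns}

table = "my_annotations"  # placeholder for WATSONX_TABLE
create = f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(f'{k} {types[k]}' for k in columns)})"
insert = f"INSERT INTO {table} VALUES {tuple(sample[k] for k in columns)}"
print(create)
print(insert)
```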

16 changes: 16 additions & 0 deletions label_studio_ml/examples/watsonx_llm/_wsgi.py
@@ -0,0 +1,16 @@
from werkzeug.middleware.dispatcher import DispatcherMiddleware
from flask import Flask

from data_wsgi import application as data
import model_wsgi as model

"""
Here, we create a Flask app to serve as a wrapper for both the ml-backend model api and the webhook api. By doing this,
we can host both behind the same endpoint, with the model accessible at <host_url>/ and the webhook accessible at
<host_url>/data/. We set app.wsgi_app in this way so that we can run our tests.
"""
app = Flask(__name__)

app.wsgi_app = DispatcherMiddleware(model.application.wsgi_app, {
'/data': data.wsgi_app
})
174 changes: 174 additions & 0 deletions label_studio_ml/examples/watsonx_llm/data_transfer_app.py
@@ -0,0 +1,174 @@
import csv
import logging
import os
import prestodb
import traceback
from flask import Flask, request, jsonify, Response
from label_studio_sdk.client import LabelStudio
from typing import List

logger = logging.getLogger(__name__)
_server = Flask(__name__)


def init_app():
return _server


@_server.route('/health', methods=['GET'])
@_server.route('/', methods=['GET'])
def health():
    return jsonify({
        'status': 'UP'
    })


@_server.route('/upload', methods=['POST'])
def upload_to_watsonx():
# First, collect data from the request object passed by label studio
input_request = request.json
action = input_request["action"]
annotation = input_request["annotation"]
task = input_request["task"]

# Connect to Label Studio
client = connect_ls()
data = get_data(annotation, task, client)

# Then, connect to WatsonX.data via prestodb
eng_username = os.getenv("WATSONX_ENG_USERNAME")
eng_password = os.getenv("WATSONX_API_KEY")
eng_host = os.getenv("WATSONX_ENG_HOST")
eng_port = os.getenv("WATSONX_ENG_PORT")
catalog = os.getenv("WATSONX_CATALOG")
schema = os.getenv("WATSONX_SCHEMA")
table = os.getenv("WATSONX_TABLE")

if None in [eng_username, eng_password, eng_host, eng_port, catalog, schema, table]:
raise Exception("You must provide the required WATSONX variables in your docker-compose.yml file!")

try:
with prestodb.dbapi.connect(host=eng_host, port=eng_port, user=eng_username, catalog=catalog,
schema=schema, http_scheme='https',
auth=prestodb.auth.BasicAuthentication(eng_username, eng_password)) as conn:

cur = conn.cursor()
# dynamically create table schema
table_create, table_info_keys = create_table(table, data)
cur.execute(table_create)

if action == "ANNOTATION_CREATED":
# upload new annotation to watsonx
values = tuple([data[key] for key in table_info_keys])
insert_command = f"""INSERT INTO {table} VALUES {values}"""
logger.debug(insert_command)
cur.execute(insert_command)

elif action == "ANNOTATION_UPDATED":
# update existing annotation in watsonx by deleting the old one and uploading a new one
delete = f"""DELETE from {table} WHERE ID={data["ID"]}"""
logger.debug(delete)
cur.execute(delete)
values = tuple([data[key] for key in table_info_keys])
insert_command = f"""INSERT INTO {table} VALUES {values}"""
                logger.debug(insert_command)
cur.execute(insert_command)

elif action == "ANNOTATIONS_DELETED":
# delete existing annotation in watsonx
delete = f"""DELETE from {table} WHERE ID={data["ID"]}"""
logger.debug(delete)
cur.execute(delete)

            conn.commit()
        # return a response so Flask treats the request as successfully handled
        return jsonify({"status": "ok"})
    except Exception as e:
        logger.debug(traceback.format_exc())
        logger.debug(e)
        return Response(status=500)


def connect_ls():
try:
base_url = os.getenv("LABEL_STUDIO_URL")
api_key = os.getenv("LABEL_STUDIO_API_KEY")

if None in [base_url, api_key]:
raise Exception(
"You must provide your LABEL_STUDIO_URL and LABEL_STUDIO_API_KEY in your docker-compose.yml file!")

client = LabelStudio(
base_url=base_url,
api_key=api_key
)

return client

except Exception as e:
logger.debug(traceback.format_exc())
logger.debug(e)


def get_data(annotation, task, client):
"""Collect the data to be uploaded to WatsonX.data"""
info = {}

try:
users = client.users.list()
id = task["id"]
annotator_complete = annotation["completed_by"]
annotator_update = annotation["updated_by"]
annotator_complete = next((x.email for x in users if x.id == annotator_complete), "")
annotator_update = next((x.email for x in users if x.id == annotator_update), "")
info.update({"ID": int(id), "completed_by": annotator_complete, "updated_by": annotator_update})
for key, value in task["data"].items():
if isinstance(value, List):
value = value[0]
elif isinstance(value, str) and value.isnumeric():
value = int(value)

if isinstance(value, str):
value = value.strip("\"")
info.update({key: value})

for result in annotation["result"]:
logger.debug(result)
val_dict_key = list(result["value"].keys())[0]
value = result["value"][val_dict_key]
key = result["from_name"]
if isinstance(value, List):
value = value[0]
elif isinstance(value, str) and value.isnumeric():
value = int(value)

if isinstance(value, str):
value = value.strip("\"")
info.update({key: value})

logger.debug(f"INFO {info}")
return info
except Exception as e:
logger.debug(traceback.format_exc())
logger.debug(e)


def create_table(table, data):
"""
Create the command for building a new table
"""
table_info = {}
for key, value in data.items():
if isinstance(value, int):
table_info.update({key: "bigint"})
else:
table_info.update({key: "varchar"})

table_info_keys = sorted(table_info.keys())
table_info_keys.insert(0, table_info_keys.pop(table_info_keys.index("ID")))
nl = ",\n"
strings = [f"{key} {table_info[key]}" for key in table_info_keys]
table_create = f"""
CREATE TABLE IF NOT EXISTS {table} ({nl.join(strings)})
"""
logger.debug(table_create)
return table_create, table_info_keys