Add integration tests for PyTorch, TGI and TEI DLCs #79

Open
wants to merge 82 commits into
base: main
Changes from 70 commits
Commits
a036a98
Add `tests/local` structure
alvarobartt Aug 26, 2024
beed550
Add `tests/local/training/test_trl.py` (WIP)
alvarobartt Aug 26, 2024
2427601
Update `tests/local/training/test_trl.py`
alvarobartt Aug 27, 2024
e18b8d5
Rename `tests/local` to `tests/pytorch`
alvarobartt Aug 27, 2024
698613a
Add `tests/pytorch/inference/test_transformers.py`
alvarobartt Aug 27, 2024
7ce8ec8
Update `test_transformers.py`
alvarobartt Aug 28, 2024
f00b801
Update and rename to `test_huggingface_inference_toolkit.py`
alvarobartt Aug 28, 2024
224cbca
Add `tests/requirements.txt`
alvarobartt Aug 28, 2024
dd0cd1f
Skip `tests/pytorch/training` if `not CUDA_AVAILABLE`
alvarobartt Aug 28, 2024
da1845f
Handle `CUDA_AVAILABLE` in `tests/pytorch/inference`
alvarobartt Aug 28, 2024
d139796
Add `docker` in `tests/requirements.txt`
alvarobartt Aug 28, 2024
3367f91
Remove `volumes` mounted for local testing
alvarobartt Aug 28, 2024
dd96f7a
Add `pytest.init` configuration file
alvarobartt Aug 28, 2024
f87f9d2
Add `.github/actions/pytorch-dlcs-tests.yml`
alvarobartt Aug 28, 2024
926960d
Add `.github/workflows/run-pytorch-dlcs-tests.yml`
alvarobartt Aug 28, 2024
e2712ac
Update `tests/pytorch/training/test_trl.py` (WIP)
alvarobartt Aug 28, 2024
440a353
Fix `tests/pytorch/training/test_trl.py`
alvarobartt Aug 28, 2024
3e3071d
Fix `tests/pytorch/inference/test_huggingface_inference_toolkit.py`
alvarobartt Aug 28, 2024
893d046
Add background log-streaming via `threading`
alvarobartt Aug 28, 2024
e6097d5
Move `stream_logs` to `tests/utils.py`
alvarobartt Aug 28, 2024
b4edbc3
Add `tests/tgi/test_tgi.py` (WIP)
alvarobartt Aug 28, 2024
b8e3b93
Add `transformers` to `tests/requirements.txt`
alvarobartt Aug 28, 2024
d5c4c50
Fix decoding of `container.logs()`
alvarobartt Aug 28, 2024
6ec0dca
Update `tests/tgi/test_tgi.py`
alvarobartt Aug 28, 2024
db72a57
Add `.github/workflows/run-tgi-dlc-tests.yml`
alvarobartt Aug 28, 2024
82e433a
Update `.github/workflows`
alvarobartt Aug 28, 2024
ce31efd
Update `tests/tgi/test_tgi.py`
alvarobartt Aug 28, 2024
09adb69
Fix decoding of `container_logs`
alvarobartt Aug 28, 2024
19ef319
Use relative imports in `tests`
alvarobartt Aug 28, 2024
ef0e437
Add `tests/tei`
alvarobartt Aug 28, 2024
d08a52c
Update runner groups for CPU and GPU instances
alvarobartt Aug 30, 2024
17f9ca4
Update `.github/workflows`
alvarobartt Aug 30, 2024
84834a1
Update `uses` path in `.github/workflows/test-huggingface-dlcs.yml`
alvarobartt Aug 30, 2024
6ec0e1c
Add missing `type` to `inputs`
alvarobartt Aug 30, 2024
05e1e18
Add missing quotes around `python-version`
alvarobartt Aug 30, 2024
02b149e
Update `diffusers` model in `tests`
alvarobartt Aug 30, 2024
640bd04
Update `.github/workflows/test-huggingface-dlcs.yml`
alvarobartt Aug 30, 2024
1797a0d
Upgrade `actions/checkout` and `actions/setup-python`
alvarobartt Sep 1, 2024
91156b4
Use smaller `sentence-transformer` model for TEI tests
alvarobartt Sep 1, 2024
a8b83e4
Fix port-binding of `ports` in `test_tei.py`
alvarobartt Sep 1, 2024
a62c677
Replace `CMD` in `healthcheck` with `/bin/bash`
alvarobartt Sep 1, 2024
61827ea
Add `os.makedirs` before volume mount
alvarobartt Sep 1, 2024
ae11f99
Use `CMD` instead of `/bin/bash` (revert)
alvarobartt Sep 1, 2024
6473e64
Add `detach=True` and then `wait` for container to end
alvarobartt Sep 1, 2024
9438030
Update `test_trl.py`
alvarobartt Sep 1, 2024
e1caeaa
Ensure that `tmp_path` exists and has right permissions
alvarobartt Sep 1, 2024
903e10e
Write empty default file in `tmp_path` (debug)
alvarobartt Sep 1, 2024
8fae6d7
Add `torch` dependency in `requirements.txt`
alvarobartt Sep 1, 2024
292db5d
Add `uv` in `.github/workflows/run-tests-action.yml`
alvarobartt Sep 1, 2024
1edabbc
Set `PATH` before using `uv` after installation
alvarobartt Sep 1, 2024
741a57c
Update `.github/workflows/run-tests-action.yml`
alvarobartt Sep 1, 2024
4cb570c
Update `.github/workflows/run-tests-action.yml`
alvarobartt Sep 1, 2024
5a291af
Remove `torch` dependency and torch-related code
alvarobartt Sep 1, 2024
c089784
Remove wrong `uv sync` (not a Python project)
alvarobartt Sep 1, 2024
89f9c81
Remove `transformers` dependency
alvarobartt Sep 1, 2024
da8b854
Remove `NUM_SHARD` as not required
alvarobartt Sep 1, 2024
56e06d0
Comment `healthcheck` and `platform` (debug)
alvarobartt Sep 1, 2024
bd7e210
Add `transformers` dependency in `tests/requirements.txt` (revert)
alvarobartt Sep 2, 2024
83e2c95
Add `docker` checks for debugging
alvarobartt Sep 2, 2024
fa3b178
Remove `runtime=nvidia` and enable interactive mode (`docker run -it …
alvarobartt Sep 2, 2024
438c9ad
Remove manual mock file creation for debugging
alvarobartt Sep 2, 2024
38abf36
Revert `docker` checks in `run-tests-action.yml`
alvarobartt Sep 2, 2024
4224bc7
Remove `tty` and `stdin_open` interactive mode
alvarobartt Sep 2, 2024
beef705
Update `tmp_path` with `--basetmp` (debug)
alvarobartt Sep 2, 2024
9446a3e
Fix `TGI_DLC` environment variable value
alvarobartt Sep 2, 2024
99d353c
Check `container.status` to prevent extra healtchecks
alvarobartt Sep 2, 2024
c99e0ed
Add `nvidia-ml-py` to set `USE_FLASH_ATTENTION` based on compute cap
alvarobartt Sep 2, 2024
4212a58
Add `jinja2` dependency in `tests/requirements.txt`
alvarobartt Sep 2, 2024
3909567
Update `trigger` in `.github/workflows/test-huggingface-dlcs.yml`
alvarobartt Sep 2, 2024
7c4bf87
Merge branch 'main' into add-integration-tests
alvarobartt Sep 2, 2024
7ce5aeb
Apply suggestions from code review
alvarobartt Sep 2, 2024
349df29
Add missing `tei-dlc` after removing defaults
alvarobartt Sep 2, 2024
eeb711d
Remove `GPUtil` and `nvidia-ml-py` in favour of `subprocess` on `nvid…
alvarobartt Sep 3, 2024
6b55963
Fix integration tests
alvarobartt Sep 3, 2024
35bc4d8
Rename `run-tests-action.yml` to `run-tests-reusable.yml`
alvarobartt Sep 3, 2024
b71a392
Add `options` and update `name` in `run-tests-reusable.yml`
alvarobartt Sep 3, 2024
cb7ddb6
Update `.github/workflows` to be more granular
alvarobartt Sep 9, 2024
d654b94
Set `type: choice` to use `options`
alvarobartt Sep 9, 2024
0fc8ef5
Update name for `test-pytorch-{inference,training}-dlcs.yml`
alvarobartt Sep 9, 2024
34281bb
Fix `.github/workflows/run-tests-reusable.yml`
alvarobartt Sep 9, 2024
4768af1
Add missing `type: ignore`
alvarobartt Sep 9, 2024
9f6dcc0
Update `tei-dlc` on CPU and update port mapping
alvarobartt Sep 9, 2024
53 changes: 53 additions & 0 deletions .github/workflows/run-tests-action.yml
@@ -0,0 +1,53 @@
name: Action to Run Hugging Face DLCs Tests

on:
  workflow_call:
    inputs:
      group:
        description: "The GitHub Runners Group to run on."
        required: true
        type: string
      training-dlc:
        description: "The URI of the Hugging Face PyTorch DLC for Training (GPU only)."
        required: false
        type: string
      inference-dlc:
        description: "The URI of the Hugging Face PyTorch DLC for Inference (CPU and GPU)."
        required: true
        type: string
      tgi-dlc:
        description: "The URI of the Hugging Face TGI DLC (GPU only)."
        required: false
        type: string

jobs:
  run-tests:
    runs-on:
      group: ${{ inputs.group }}

    steps:
      - name: Check out the repository
        uses: actions/[email protected]

      - name: Set up Python
        uses: actions/[email protected]
        with:
          python-version: "3.10"

      - name: Set up uv
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
          export PATH=$HOME/.cargo/bin:$PATH
          uv --version

      - name: Install dependencies
        run: |
          uv venv --python 3.10
          uv pip install -r tests/requirements.txt
Member

should we add a "cache"?

Member Author

AFAIK the VMs are ephemeral, so the cache would be destroyed after each job finishes, and uv is already pretty fast (it downloads the dependencies in under 10 seconds).


      - name: Run Hugging Face DLCs Tests
        run: uv run pytest -s tests/ --basetemp=${{ runner.temp }}
        env:
          TRAINING_DLC: ${{ inputs.training-dlc }}
          INFERENCE_DLC: ${{ inputs.inference-dlc }}
          TGI_DLC: ${{ inputs.tgi-dlc }}
39 changes: 39 additions & 0 deletions .github/workflows/test-huggingface-dlcs.yml
@@ -0,0 +1,39 @@
name: Test Hugging Face DLCs

on:
  push:
    branches:
      - main
  pull_request:
    types:
      - synchronize
      - ready_for_review
    branches:
      - main
    paths:
      - tests/*
      - pytest.ini
      - .github/workflows/run-tests-action.yml
      - .github/workflows/test-huggingface-dlcs.yml
  workflow_dispatch:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  dlcs-on-cpu:
    name: Run Hugging Face DLCs Tests on CPU
    uses: huggingface/Google-Cloud-Containers/.github/workflows/run-tests-action.yml@add-integration-tests
    with:
      group: aws-general-8-plus
      inference-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cpu.2-2.transformers.4-44.ubuntu2204.py311

  dlcs-on-gpu:
    name: Run Hugging Face DLCs Tests on GPU
    uses: huggingface/Google-Cloud-Containers/.github/workflows/run-tests-action.yml@add-integration-tests
    with:
      group: aws-g4dn-2xlarge
      training-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.transformers.4-42.ubuntu2204.py310
      inference-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311
      tgi-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
Member

Mhm is there a better way to specify those? Feels like we can easily forget updating them?
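
One possible answer to the question above, offered only as a sketch and not as what this PR adopted: keep the default image URIs in a single Python mapping and resolve them through one helper, so that a version bump touches one place. The names `DEFAULT_DLCS` and `resolve_dlc` are hypothetical; the URIs are copied from the workflow above. The training image is left out on purpose, because its URI in the workflow (`...training-cu121.transformers.4-42...`) differs from the default used in `test_trl.py` (`...training-cu121.2-3.transformers.4-42...`), which is the kind of drift the comment is worried about.

```python
import os
from typing import Mapping, Optional

_REGISTRY = "us-docker.pkg.dev/deeplearning-platform-release/gcr.io"

# Hypothetical single source of truth, keyed by (environment variable, needs CUDA).
# The image names are copied from the workflow in this PR.
DEFAULT_DLCS = {
    ("INFERENCE_DLC", False): f"{_REGISTRY}/huggingface-pytorch-inference-cpu.2-2.transformers.4-44.ubuntu2204.py311",
    ("INFERENCE_DLC", True): f"{_REGISTRY}/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311",
    ("TGI_DLC", True): f"{_REGISTRY}/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310",
}


def resolve_dlc(
    name: str,
    cuda_available: bool,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the URI set in the environment, else the default, else None."""
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        # An empty string (e.g. an unset workflow input) falls through to the default.
        return value
    return DEFAULT_DLCS.get((name, cuda_available))
```

A test could then call `resolve_dlc("INFERENCE_DLC", CUDA_AVAILABLE)` and skip itself when the result is `None`.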

5 changes: 5 additions & 0 deletions pytest.ini
@@ -0,0 +1,5 @@
[pytest]
log_cli = true
log_cli_level = INFO
log_format = %(asctime)s %(levelname)s %(message)s
log_date_format = %Y-%m-%d %H:%M:%S
Empty file added tests/__init__.py
Empty file.
3 changes: 3 additions & 0 deletions tests/constants.py
@@ -0,0 +1,3 @@
import GPUtil

CUDA_AVAILABLE = len(GPUtil.getAvailable()) > 0
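
The commit list shows `GPUtil` later being dropped in favour of `subprocess`; that commit title is truncated, so the exact command it uses is not known here. The following is a sketch of one dependency-free way to compute the same flag, assuming `nvidia-smi -L` (which lists the visible GPUs) is an acceptable probe. The function name `cuda_available` is hypothetical.

```python
import shutil
import subprocess


def cuda_available() -> bool:
    """Best-effort check: True only if `nvidia-smi -L` runs and lists a GPU."""
    if shutil.which("nvidia-smi") is None:
        # No NVIDIA driver tooling on PATH, so assume a CPU-only machine.
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Each detected device is printed on a line starting with "GPU <index>:".
    return result.returncode == 0 and "GPU" in result.stdout


CUDA_AVAILABLE = cuda_available()
```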
Empty file added tests/pytorch/__init__.py
Empty file.
Empty file.
146 changes: 146 additions & 0 deletions tests/pytorch/inference/test_huggingface_inference_toolkit.py
@@ -0,0 +1,146 @@
import logging
import os
import threading
import time

import docker
import pytest
import requests

from docker.types.containers import DeviceRequest

from ...constants import CUDA_AVAILABLE
from ...utils import stream_logs

MAX_RETRIES = 10


# Tests below are only on some combinations of models and tasks, since most of those
# tests are already available within https://github.com/huggingface/huggingface-inference-toolkit
# as `huggingface-inference-toolkit` is the inference engine powering the PyTorch DLCs for Inference
@pytest.mark.parametrize(
    ("hf_model_id", "hf_task", "prediction_payload"),
    [
        (
            "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
            "text-classification",
            {
                "instances": ["I love this product", "I hate this product"],
                "parameters": {"top_k": 2},
            },
        ),
        (
            "BAAI/bge-base-en-v1.5",
            "sentence-embeddings",
            {"instances": ["I love this product"]},
        ),
        (
            "lambdalabs/miniSD-diffusers",
            "text-to-image",
            {
                "instances": ["A cat holding a sign that says hello world"],
                "parameters": {
                    "negative_prompt": "",
                    "num_inference_steps": 2,
                    "guidance_scale": 0.7,
                },
            },
        ),
    ],
)
def test_transformers(
    caplog: pytest.LogCaptureFixture,
    hf_model_id: str,
    hf_task: str,
    prediction_payload: dict,
) -> None:
    caplog.set_level(logging.INFO)

    client = docker.from_env()

    logging.info(f"Starting container for {hf_model_id}...")
    container = client.containers.run(
        os.getenv(
            "INFERENCE_DLC",
            "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cpu.2-2.transformers.4-44.ubuntu2204.py311"
            if not CUDA_AVAILABLE
            else "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311",
        ),
        ports={"8080": 8080},
        environment={
            "HF_MODEL_ID": hf_model_id,
            "HF_TASK": hf_task,
            "AIP_MODE": "PREDICTION",
            "AIP_HTTP_PORT": "8080",
            "AIP_PREDICT_ROUTE": "/predict",
            "AIP_HEALTH_ROUTE": "/health",
        },
        healthcheck={
            "test": ["CMD", "curl", "-s", "http://localhost:8080/health"],
            "interval": int(30 * 1e9),
            "timeout": int(30 * 1e9),
            "retries": 3,
            "start_period": int(30 * 1e9),
        },
        platform="linux/amd64",
        detach=True,
        # Extra `device_requests` related to the CUDA devices if any
        device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])]
        if CUDA_AVAILABLE
        else None,
    )

    # Start log streaming in a separate thread
    log_thread = threading.Thread(target=stream_logs, args=(container,))
    log_thread.daemon = True
    log_thread.start()

    logging.info(f"Container {container.id} started...")  # type: ignore
    container_healthy = False
    for attempt in range(MAX_RETRIES):
        # Refresh the cached attributes, as `container.status` is not updated otherwise
        container.reload()  # type: ignore
        # If the container failed to start properly, then the health check will fail
        if container.status == "exited":  # type: ignore
            container_healthy = False
            break

        try:
            logging.info(
                f"Trying to connect to http://localhost:8080/health [retry {attempt + 1}/{MAX_RETRIES}]..."
            )
            response = requests.get("http://localhost:8080/health")
            assert response.status_code == 200
            container_healthy = True
            break
        except requests.exceptions.ConnectionError:
            time.sleep(30)

    if not container_healthy:
        logging.error("Container is not healthy after several retries...")
        container.stop()  # type: ignore
    assert container_healthy

    container_failed = False
    try:
        logging.info("Sending prediction request to http://localhost:8080/predict...")
        start_time = time.perf_counter()
        response = requests.post(
            "http://localhost:8080/predict",
            json=prediction_payload,
        )
        end_time = time.perf_counter()
        assert response.status_code in [200, 201]
        assert "predictions" in response.json()
        logging.info(f"Prediction request took {end_time - start_time:.2f}s")
    except Exception as e:
        logging.error(
            f"Error while sending prediction request with exception: {e}"  # type: ignore
        )
        container_failed = True
    finally:
        if log_thread.is_alive():
            log_thread.join(timeout=5)
        logging.info(f"Stopping container {container.id}...")  # type: ignore
        container.stop()  # type: ignore
        container.remove()  # type: ignore

    assert not container_failed
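
The retry loop in the test above mixes HTTP polling, container-state checks, and sleeping. As a sketch only, the same pattern could be pulled into a small helper that takes the probe and the exit check as callables, which makes it reusable across the DLC tests and testable without Docker. The name `wait_until_healthy` and its parameters are hypothetical and not part of this PR.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    has_exited: Callable[[], bool],
    max_retries: int = 10,
    delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True, the container exits, or retries run out."""
    for _attempt in range(max_retries):
        if has_exited():
            # The container died, so further health checks cannot succeed.
            return False
        try:
            if probe():
                return True
        except OSError:
            # `requests.exceptions.ConnectionError` derives from `OSError`,
            # so a refused connection lands here and is retried.
            pass
        sleep(delay)
    return False
```

In the test, `probe` could wrap `requests.get("http://localhost:8080/health").status_code == 200`, and `has_exited` could call `container.reload()` before comparing `container.status` with `"exited"`.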
Empty file.
142 changes: 142 additions & 0 deletions tests/pytorch/training/test_trl.py
@@ -0,0 +1,142 @@
import logging
import os
import pytest
import threading

import docker
from docker.types.containers import DeviceRequest
from pathlib import PosixPath

from ...constants import CUDA_AVAILABLE
from ...utils import stream_logs


MODEL_ID = "sshleifer/tiny-gpt2"


@pytest.mark.skipif(not CUDA_AVAILABLE, reason="CUDA is not available")
def test_trl(caplog: pytest.LogCaptureFixture, tmp_path: PosixPath) -> None:
    """Adapted from https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py"""
    caplog.set_level(logging.INFO)

    client = docker.from_env()

    logging.info("Running the container for TRL...")
    container = client.containers.run(
        os.getenv(
            "TRAINING_DLC",
            "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310",
        ),
        command=[
            "trl",
            "sft",
            f"--model_name_or_path={MODEL_ID}",
            "--dataset_text_field=text",
            "--report_to=none",
            "--learning_rate=1e-5",
            "--per_device_train_batch_size=8",
            "--gradient_accumulation_steps=1",
            "--output_dir=/opt/huggingface/trained_model",
            "--logging_steps=1",
            "--num_train_epochs=-1",
            "--max_steps=10",
            "--gradient_checkpointing",
        ],
        environment={
            "TRL_USE_RICH": "0",
            "ACCELERATE_LOG_LEVEL": "INFO",
            "TRANSFORMERS_LOG_LEVEL": "INFO",
            "TQDM_POSITION": "-1",
        },
        platform="linux/amd64",
        detach=True,
        # Mount the volume from the `tmp_path` to the `/opt/huggingface/trained_model`
        volumes={
            tmp_path: {
                "bind": "/opt/huggingface/trained_model",
                "mode": "rw",
            }
        },
        # Extra `device_requests` related to the CUDA devices
        device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
    )

    # Start log streaming in a separate thread
    log_thread = threading.Thread(target=stream_logs, args=(container,))
    log_thread.daemon = True
    log_thread.start()

    # Wait for the container to finish
    container.wait()  # type: ignore

    # Remove the container
    container.remove()  # type: ignore

    assert tmp_path.exists()
    assert (tmp_path / "model.safetensors").exists()


@pytest.mark.skipif(not CUDA_AVAILABLE, reason="CUDA is not available")
def test_trl_peft(caplog: pytest.LogCaptureFixture, tmp_path: PosixPath) -> None:
    """Adapted from https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py"""
    caplog.set_level(logging.INFO)

    client = docker.from_env()

    logging.info("Running the container for TRL...")
    container = client.containers.run(
        os.getenv(
            "TRAINING_DLC",
            "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310",
        ),
        command=[
            "trl",
            "sft",
            f"--model_name_or_path={MODEL_ID}",
            "--dataset_text_field=text",
            "--report_to=none",
            "--learning_rate=1e-5",
            "--per_device_train_batch_size=8",
            "--gradient_accumulation_steps=1",
            "--output_dir=/opt/huggingface/trained_model",
            "--logging_steps=1",
            "--num_train_epochs=-1",
            "--max_steps=10",
            "--gradient_checkpointing",
            "--use_peft",
            "--lora_r=64",
            "--lora_alpha=16",
        ],
        environment={
            "TRL_USE_RICH": "0",
            "ACCELERATE_LOG_LEVEL": "INFO",
            "TRANSFORMERS_LOG_LEVEL": "INFO",
            "TQDM_POSITION": "-1",
        },
        platform="linux/amd64",
        detach=True,
        # Mount the volume from the `tmp_path` to the `/opt/huggingface/trained_model`
        volumes={
            tmp_path: {
                "bind": "/opt/huggingface/trained_model",
                "mode": "rw",
            }
        },
        # Extra `device_requests` related to the CUDA devices
        device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
    )

    # Start log streaming in a separate thread
    log_thread = threading.Thread(target=stream_logs, args=(container,))
    log_thread.daemon = True
    log_thread.start()

    # Wait for the container to finish
    container.wait()  # type: ignore

    # Remove the container
    container.remove()  # type: ignore

    assert tmp_path.exists()
    assert (tmp_path / "adapter_config.json").exists()
    assert (tmp_path / "adapter_model.safetensors").exists()
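
`test_trl` and `test_trl_peft` differ only in the three LoRA flags and in the files they expect on disk. As an illustration, and not something this PR does, the shared `trl sft` argument list could come from one helper so the two tests cannot drift apart. The name `build_sft_command` is hypothetical; every flag is copied from the tests above.

```python
from typing import List


def build_sft_command(model_id: str, output_dir: str, use_peft: bool = False) -> List[str]:
    """Build the `trl sft` argument list shared by both tests (hypothetical helper)."""
    command = [
        "trl",
        "sft",
        f"--model_name_or_path={model_id}",
        "--dataset_text_field=text",
        "--report_to=none",
        "--learning_rate=1e-5",
        "--per_device_train_batch_size=8",
        "--gradient_accumulation_steps=1",
        f"--output_dir={output_dir}",
        "--logging_steps=1",
        "--num_train_epochs=-1",
        "--max_steps=10",
        "--gradient_checkpointing",
    ]
    if use_peft:
        # LoRA flags copied from `test_trl_peft`
        command += ["--use_peft", "--lora_r=64", "--lora_alpha=16"]
    return command
```

Each test would then pass `command=build_sft_command(MODEL_ID, "/opt/huggingface/trained_model", use_peft=...)` to `client.containers.run`, and the two functions could be merged with `pytest.mark.parametrize` over `use_peft` and the expected output files.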