2 hardware agnostic front and backend #5

Open · wants to merge 36 commits into base: master
Commits
070db5c
'Add AMD support for TorchServe'
smedegaard Nov 1, 2024
ce19723
Update README.md with rocm flags
smedegaard Nov 11, 2024
0fad8e2
add rocm to CONTRIBUTING.md
smedegaard Nov 11, 2024
3247498
WorkerLifeCycle uses SystemInfo to get X_VISIBLE_DEVICES
smedegaard Nov 12, 2024
bae9b2c
AppleUtil adds Accelerator `number_of_cores` times
smedegaard Nov 12, 2024
88f3cb8
fix typo in README.md
smedegaard Nov 13, 2024
8e4d24c
remove mention of java version from README.md
smedegaard Nov 13, 2024
ff4daa8
revert unnecessary changes
samutamm Nov 14, 2024
0bc3e3c
Fix import errors in AppleUtils
jakki-amd Nov 14, 2024
1e635e1
remove rocm support from dockerfile.dev to simplify
samutamm Nov 14, 2024
1647826
fix missing newline
samutamm Nov 14, 2024
0dc5145
revert unnecessary changes
samutamm Nov 14, 2024
f905d0e
'improve formatting for amd_support.md'
Nov 14, 2024
9a515b8
Fix AppleUtils tests
jakki-amd Nov 18, 2024
9d30159
fixes 11. parse-metrics-failed-collecting-amd-gpu-metrics (#24)
smedegaard Nov 20, 2024
8cdf54b
extend testMetricManager
Nov 20, 2024
bd95835
Merge pull request #25 from nod-ai/9-extend-java-testmetricmanager
eppane Nov 21, 2024
e5d382f
Add latest ROCM support
Nov 14, 2024
607d836
Merge pull request #26 from nod-ai/19-add-support-for-latest-torch-rocm
jakki-amd Nov 21, 2024
f2d17d5
PR 24 system_metrics bugfix
Nov 22, 2024
49bc051
Format files
jakki-amd Nov 22, 2024
4bff6d3
Update docs/hardware_support/amd_support.md
smedegaard Nov 26, 2024
b9a1627
typo in docs/hardware_support/amd_support.md
smedegaard Nov 26, 2024
964e5f1
Update docs/hardware_support/amd_support.md
smedegaard Nov 26, 2024
61da32e
Update docs/hardware_support/amd_support.md
smedegaard Nov 26, 2024
0a4d628
remove pyrsmi and nvgpu deps
Nov 26, 2024
aa96f2f
metric collector revert gpu arg name
Nov 26, 2024
a26eefb
fix number of metrics assertion in testMetricManager
Nov 26, 2024
f0b1dfb
'move Intel docs under Hardware Support' (#31)
smedegaard Nov 27, 2024
d330494
Fix docstring
jakki-amd Nov 27, 2024
cbdfe25
Add Dockerfile.rocm
jakki-amd Nov 28, 2024
8330233
Remove sharing lock from bind mounts
jakki-amd Nov 28, 2024
9e5afd0
Update Dockerfile.rocm
jakki-amd Nov 29, 2024
8f35524
Revert Dockerfile changes
jakki-amd Nov 29, 2024
f5ce2ec
Update documentation for Docker support
jakki-amd Nov 29, 2024
f03d0fd
Merge branch 'master' into 2-hardware-agnostic-front-and-backend
jakki-amd Nov 29, 2024
6 changes: 6 additions & 0 deletions .gitignore
@@ -45,3 +45,9 @@ instances.yaml.backup
# cpp
cpp/_build
cpp/third-party

# projects
.tool-versions
**/*/.classpath
**/*/.settings
**/*/.project
57 changes: 25 additions & 32 deletions CONTRIBUTING.md
@@ -11,18 +11,7 @@ Your contributions will fall into two categories:
- Search for your issue here: https://github.com/pytorch/serve/issues (look for the "good first issue" tag if you're a first-time contributor)
- Pick an issue and comment on it to indicate that you want to work on the feature.
- To ensure your changes don't break any of the existing features, run the sanity suite as follows from the serve directory:
- [Install dependencies](#Install-TorchServe-for-development) (if not already installed)
- Install `pre-commit` to your Git flow:
```bash
pre-commit install
```

@@ -60,26 +49,30 @@ pytest -k test/pytest/test_mnist_template.py

If you plan to develop with TorchServe and change some source code, you must install it from source.

1. Clone the repository, including third-party modules, with `git clone --recurse-submodules --remote-submodules git@github.com:pytorch/serve.git`
2. Ensure that you have `python3` installed and that the user has access to site-packages, or that `~/.local/bin` is added to the `PATH` environment variable.
3. Run the following script from the top of the source directory. NOTE: This script force re-installs `torchserve`, `torch-model-archiver` and `torch-workflow-archiver` if existing installations are found.

#### For Debian Based Systems/MacOS

```
python ./ts_scripts/install_dependencies.py --environment=dev
python ./ts_scripts/install_from_src.py --environment=dev
```
##### Installing Dependencies for Accelerator Support
Use the optional `--rocm` or `--cuda` flag with `install_dependencies.py` for installing accelerator-specific dependencies.

Possible values are:
- rocm: `rocm61`, `rocm60`
- cuda: `cu111`, `cu102`, `cu101`, `cu92`

For example: `python ./ts_scripts/install_dependencies.py --environment=dev --rocm=rocm61`

#### For Windows

Refer to the documentation [here](docs/torchserve_on_win_native.md).

For information about the model archiver, see [detailed documentation](model-archiver/README.md).

### What to Contribute?

10 changes: 8 additions & 2 deletions README.md
@@ -22,7 +22,10 @@ curl http://127.0.0.1:8080/predictions/bert -T input.txt

```bash
# Install dependencies
python ./ts_scripts/install_dependencies.py

# Include dependencies for accelerator support with the relevant optional flags
python ./ts_scripts/install_dependencies.py --rocm=rocm61
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
@@ -36,7 +39,10 @@ pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archi

```bash
# Install dependencies
python ./ts_scripts/install_dependencies.py

# Include dependencies for accelerator support with the relevant optional flags
python ./ts_scripts/install_dependencies.py --rocm=rocm61
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
7 changes: 6 additions & 1 deletion docs/contents.rst
@@ -16,7 +16,6 @@
model_zoo
request_envelopes
server
snapshot
intel_extension_for_pytorch <https://github.com/pytorch/serve/tree/master/examples/intel_extension_for_pytorch>
torchserve_on_win_native
@@ -27,6 +26,12 @@
Security
FAQs

.. toctree::
:maxdepth: 0
:caption: Hardware Support:

hardware_support/hardware_support

.. toctree::
:maxdepth: 0
:caption: Service APIs:
67 changes: 67 additions & 0 deletions docs/hardware_support/amd_support.md
@@ -0,0 +1,67 @@
# AMD Support

TorchServe can be run on any combination of operating system and device that is
[supported by ROCm](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html).

## Supported Versions of ROCm

The current stable `major.patch` version of ROCm and the previous patch version will be supported. For example, versions `N.2` and `N.1`, where `N` is the current major version.

## Installation

- Make sure you have **python >= 3.8** installed on your system.
- Clone the repo
```bash
git clone git@github.com:pytorch/serve.git
```

- `cd` into the cloned folder

```bash
cd serve
```

- Create a virtual environment for Python

```bash
python -m venv venv
```

- Activate the virtual environment. If you use another shell (fish, csh, PowerShell), use the relevant activation script from `venv/bin/`
```bash
source venv/bin/activate
```

- Install the dependencies needed for ROCm support.

```bash
python ./ts_scripts/install_dependencies.py --rocm=rocm61
python ./ts_scripts/install_from_src.py
```
- Enable `amd-smi` in the Python virtual environment
```bash
sudo chown -R $USER:$USER /opt/rocm/share/amd_smi/
pip install -e /opt/rocm/share/amd_smi/
```
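
To confirm the bindings are importable, a quick check; the `amdsmi` Python module name is an assumption based on the ROCm package layout:

```bash
python -c "import amdsmi; print('amd-smi bindings available')"
```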

### Selecting Accelerators Using `HIP_VISIBLE_DEVICES`

If you have multiple accelerators on the system where you are running TorchServe, you can select which accelerators are visible to TorchServe
by setting the environment variable `HIP_VISIBLE_DEVICES` to a string of 0-indexed, comma-separated integers representing the IDs of the accelerators.

If you have 8 accelerators but only want TorchServe to see the last four of them, run `export HIP_VISIBLE_DEVICES=4,5,6,7`.

>ℹ️ **Not setting** `HIP_VISIBLE_DEVICES` will cause TorchServe to use all available accelerators on the system it is running on.

> ⚠️ You can run into trouble if you set `HIP_VISIBLE_DEVICES` to an empty string,
> e.g. `export HIP_VISIBLE_DEVICES=` or `export HIP_VISIBLE_DEVICES=""`.
> Use `unset HIP_VISIBLE_DEVICES` if you want to remove its effect.

> ⚠️ Setting both `CUDA_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES` may cause unintended behaviour and should be avoided.
> Doing so may cause an exception in the future.
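
As a minimal sketch, restricting TorchServe to the last four of eight accelerators before starting it (the model store path and model name are illustrative placeholders):

```bash
# Make only accelerators 4-7 visible to TorchServe; they will be renumbered 0-3
export HIP_VISIBLE_DEVICES=4,5,6,7

# Start TorchServe as usual
torchserve --start --ncs --model-store model_store --models mnist=mnist.mar
```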

## Example Usage

After installing TorchServe with the required dependencies for ROCm, you should be ready to serve your model.

For a simple example, refer to `serve/examples/image_classifier/mnist/`.
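
A minimal end-to-end sketch based on that example; the file names assume the standard layout of the `mnist` example in the repository:

```bash
# Package the example model into a model archive
torch-model-archiver --model-name mnist --version 1.0 \
  --model-file examples/image_classifier/mnist/mnist.py \
  --serialized-file examples/image_classifier/mnist/mnist_cnn.pt \
  --handler examples/image_classifier/mnist/mnist_handler.py

# Serve the archive and send a test image
mkdir -p model_store && mv mnist.mar model_store/
torchserve --start --ncs --model-store model_store --models mnist=mnist.mar
curl http://127.0.0.1:8080/predictions/mnist -T examples/image_classifier/mnist/test_data/0.png
```
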
docs/hardware_support/apple_silicon_support.md
@@ -1,19 +1,19 @@
# Apple Silicon Support

## What is supported
* TorchServe CI jobs now include M1 hardware to ensure support; see the GitHub [documentation](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories) on GitHub-hosted runners for public repositories.
  - [Regression Tests](https://github.com/pytorch/serve/blob/master/.github/workflows/regression_tests_cpu.yml)
  - [Regression binaries Test](https://github.com/pytorch/serve/blob/master/.github/workflows/regression_tests_cpu_binaries.yml)
* For [Docker](https://docs.docker.com/desktop/install/mac-install/), ensure Docker for Apple silicon is installed, then follow the [setup steps](https://github.com/pytorch/serve/tree/master/docker).

## Experimental Support

* For GPU jobs on Apple Silicon, [MPS](https://pytorch.org/docs/master/notes/mps.html) is now auto-detected and enabled. To prevent TorchServe from using MPS, users have to set `deviceType: "cpu"` in model-config.yaml (see the sketch after this list).
  * This is an experimental feature and NOT ALL models are guaranteed to work.
* Number of GPUs now reports GPUs on Apple Silicon
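
For reference, a minimal model-config.yaml sketch that opts a model out of MPS; only the `deviceType` field and its value are taken from the bullet above, so treat the rest as illustrative:

```yaml
# model-config.yaml: pin this model to CPU instead of MPS
deviceType: "cpu"
```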

### Testing
* [Pytests](https://github.com/pytorch/serve/tree/master/test/pytest/test_device_config.py) that check for MPS on macOS M1 devices
* Models that have been tested and work: Resnet-18, Densenet161, Alexnet
* Models that have been tested and DO NOT work: MNIST

@@ -31,10 +31,10 @@ Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store:
Initial Models: resnet-18=resnet-18.mar
Log dir:
Metrics dir:
Netty threads: 0
Netty client threads: 0
Default workers per model: 16
@@ -48,7 +48,7 @@ Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store:
CPP log config: N/A
Model config: N/A
2024-04-08T14:18:02,380 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
@@ -69,17 +69,17 @@ serve % curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_cla
}
...
```
#### Conda Example

```
(myenv) serve % pip list | grep torch
torch 2.2.1
torchaudio 2.2.1
torchdata 0.7.1
torchtext 0.17.1
torchvision 0.17.1
(myenv3) serve % conda install -c pytorch-nightly torchserve torch-model-archiver torch-workflow-archiver
(myenv3) serve % pip list | grep torch
torch 2.2.1
torch-model-archiver 0.10.0b20240312
torch-workflow-archiver 0.2.12b20240312
@@ -119,11 +119,11 @@ System metrics command: default
2024-03-12T15:58:54,702 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: densenet161, count: 10
Model server started.
...
(myenv3) serve % curl http://127.0.0.1:8080/predictions/densenet161 -T examples/image_classifier/kitten.jpg
{
"tabby": 0.46661922335624695,
"tiger_cat": 0.46449029445648193,
"Egyptian_cat": 0.0661405548453331,
"lynx": 0.001292439759708941,
"plastic_bag": 0.00022909720428287983
}
7 changes: 7 additions & 0 deletions docs/hardware_support/hardware_support.rst
@@ -0,0 +1,7 @@
.. toctree::
:caption: Hardware Support:

amd_support
apple_silicon_support
linux_aarch64
nvidia_mps
File renamed without changes.
File renamed without changes.
4 changes: 2 additions & 2 deletions frontend/build.gradle
@@ -37,8 +37,8 @@ def javaProjects() {

configure(javaProjects()) {
apply plugin: 'java-library'
sourceCompatibility = JavaVersion.VERSION_17
targetCompatibility = JavaVersion.VERSION_17

defaultTasks 'jar'

org/pytorch/serve/device/Accelerator.java
@@ -0,0 +1,90 @@
package org.pytorch.serve.device;

import java.text.MessageFormat;
import org.pytorch.serve.device.interfaces.IAcceleratorUtility;

/**
 * Represents a single hardware accelerator visible to TorchServe, holding its
 * static identity (id, vendor, model) and the most recently sampled
 * utilization and memory metrics.
 */
public class Accelerator {
public final Integer id;
public final AcceleratorVendor vendor;
public final String model;
public IAcceleratorUtility acceleratorUtility;
public Float usagePercentage;
public Float memoryUtilizationPercentage;
public Integer memoryAvailableMegabytes;
public Integer memoryUtilizationMegabytes;

public Accelerator(String acceleratorName, AcceleratorVendor vendor, Integer gpuId) {
this.model = acceleratorName;
this.vendor = vendor;
this.id = gpuId;
this.usagePercentage = (float) 0.0;
this.memoryUtilizationPercentage = (float) 0.0;
this.memoryAvailableMegabytes = 0;
this.memoryUtilizationMegabytes = 0;
}

// Getters
public Integer getMemoryAvailableMegaBytes() {
return memoryAvailableMegabytes;
}

public AcceleratorVendor getVendor() {
return vendor;
}

public String getAcceleratorModel() {
return model;
}

public Integer getAcceleratorId() {
return id;
}

public Float getUsagePercentage() {
return usagePercentage;
}

public Float getMemoryUtilizationPercentage() {
return memoryUtilizationPercentage;
}

public Integer getMemoryUtilizationMegabytes() {
return memoryUtilizationMegabytes;
}

// Setters
public void setMemoryAvailableMegaBytes(Integer memoryAvailable) {
this.memoryAvailableMegabytes = memoryAvailable;
}

public void setUsagePercentage(Float acceleratorUtilization) {
this.usagePercentage = acceleratorUtilization;
}

public void setMemoryUtilizationPercentage(Float memoryUtilizationPercentage) {
this.memoryUtilizationPercentage = memoryUtilizationPercentage;
}

public void setMemoryUtilizationMegabytes(Integer memoryUtilizationMegabytes) {
this.memoryUtilizationMegabytes = memoryUtilizationMegabytes;
}

// Other Methods
public String utilizationToString() {
final String message =
MessageFormat.format(
"gpuId::{0} utilization.gpu::{1} % utilization.memory::{2} % memory.used::{3} MiB",
id,
usagePercentage,
memoryUtilizationPercentage,
memoryUtilizationMegabytes);

return message;
}

public void updateDynamicAttributes(Accelerator updated) {
this.usagePercentage = updated.usagePercentage;
this.memoryUtilizationPercentage = updated.memoryUtilizationPercentage;
this.memoryUtilizationMegabytes = updated.memoryUtilizationMegabytes;
}
}
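
A usage sketch for the class above; the `AcceleratorVendor.AMD` constant and the device name are assumptions for illustration:

```java
import org.pytorch.serve.device.Accelerator;
import org.pytorch.serve.device.AcceleratorVendor;

public class AcceleratorExample {
    public static void main(String[] args) {
        // Long-lived handle for accelerator 0 (vendor constant assumed to exist).
        Accelerator gpu = new Accelerator("Instinct MI300X", AcceleratorVendor.AMD, 0);

        // A metrics collector would normally produce a freshly sampled instance...
        Accelerator sample = new Accelerator("Instinct MI300X", AcceleratorVendor.AMD, 0);
        sample.setUsagePercentage(42.5f);
        sample.setMemoryUtilizationPercentage(12.0f);
        sample.setMemoryUtilizationMegabytes(2048);

        // ...and its dynamic fields are copied onto the long-lived handle.
        gpu.updateDynamicAttributes(sample);
        System.out.println(gpu.utilizationToString());
    }
}
```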