Building PyTorch for ROCm

Jithun Nair edited this page Sep 4, 2024 · 78 revisions

This is a quick guide to setting up PyTorch with ROCm support.

Building PyTorch on ROCm on Ubuntu Docker

General remarks

The following steps set up PyTorch with ROCm support inside a docker container. They assume a .deb-based system. See ROCm install for supported operating systems and general information on the ROCm software stack. If your host system doesn't have docker installed, refer to docker install. Adding your user to the docker group is recommended so that you can run docker as a non-root user; refer to the docker documentation for how to do this.

An install of the latest released ROCm version is recommended.

  1. Follow the instructions on the ROCm installation page to install the baseline ROCm driver:
    https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html

Option 1 (Recommended) : Use docker image with PyTorch pre-installed

  1. Pull the latest public PyTorch docker image:
    docker pull rocm/pytorch:latest

Optionally, you can use one of these docker images.

This option provides a docker image with PyTorch pre-installed. Users can launch the docker container and train/run deep learning models directly. This docker image runs on gfx900 (Vega10-type GPU - MI25, Vega56, Vega64,...), gfx906 (Vega20-type GPU - MI50, MI60), gfx908 (MI100) and gfx90a (MI200) GPUs (see https://rocm.github.io/ROCmInstall.html).

  2. Start a docker container using the downloaded image:
    docker run -it --privileged --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
    This will automatically download the image if it does not exist on the host. You can also pass the -v argument to mount data directories into the container.
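As an illustration of the -v option mentioned above, the following is a minimal sketch that assembles the same docker run command with optional volume mounts. The function name docker_run_command and its mounts parameter are hypothetical helpers introduced here, and the example paths are placeholders; only the docker flags themselves come from this page.

```python
import shlex


def docker_run_command(image="rocm/pytorch:latest", mounts=None):
    """Build the `docker run` argument list used in this section.

    `mounts` is a hypothetical convenience parameter: a dict mapping host
    directories to container directories, each emitted as a -v argument.
    """
    cmd = [
        "docker", "run", "-it", "--privileged",
        "--device=/dev/kfd", "--device=/dev/dri",
        "--group-add", "video",
        "--ipc=host", "--shm-size", "8G",
    ]
    for host_dir, container_dir in (mounts or {}).items():
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    cmd.append(image)  # the image name must come after all options
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute your own data directory.
    print(shlex.join(docker_run_command(mounts={"/path/to/datasets": "/data"})))
```

The printed line can be pasted into a shell. Building the arguments as a list rather than a single string avoids quoting problems if a path contains spaces.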

Option 2: Install PyTorch using PyTorch ROCm base docker image

  1. Obtain docker image:
    docker pull rocm/pytorch:latest-base

  2. Start a docker container using the downloaded image:
    docker run -it -u root -v $HOME:/data --privileged --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base
    Note: This will mount your host home directory on /data in the container.
    Note: We start the docker container as the root user so that GPUs are visible inside the container regardless of the driver installation on bare metal.

  3. Clone PyTorch repository:
    cd ~
    git clone https://github.com/pytorch/pytorch.git
    cd pytorch
    git submodule update --init --recursive

  4. Build PyTorch for ROCm:
    To compile PyTorch for your uarch, export PYTORCH_ROCM_ARCH=<uarch> with the uarch(s) of interest, e.g. "gfx900", "gfx906", "gfx908" etc. To specify multiple uarchs, use a semicolon-separated list, e.g. export PYTORCH_ROCM_ARCH="gfx906;gfx908;gfx90a;gfx1030". To see which AMD uarch you have, run /opt/rocm/bin/rocm_agent_enumerator (you might need to install the rocminfo package). Then build with:
    .ci/pytorch/build.sh
    Note: On older versions of the code, try .jenkins/pytorch/build.sh.
    This will first hipify the PyTorch sources and then compile them. The build needs 16 GB of RAM to be available to the docker container.
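The semicolon-separated PYTORCH_ROCM_ARCH value described in step 4 can be derived from the rocm_agent_enumerator output. The following is a minimal sketch, not part of the PyTorch build tooling. It assumes the tool prints one target name per line and that gfx000 denotes the host CPU agent rather than a GPU; the function names are hypothetical.

```python
import os
import subprocess


def parse_agent_enumerator(output):
    """Return unique gfx targets from enumerator output, in first-seen order.

    Assumption: one target name per line; gfx000 is treated as the CPU
    agent and skipped.
    """
    archs = []
    for line in output.splitlines():
        name = line.strip()
        if name.startswith("gfx") and name != "gfx000" and name not in archs:
            archs.append(name)
    return archs


def rocm_arch_value(archs):
    """Join targets into the semicolon-separated PYTORCH_ROCM_ARCH format."""
    return ";".join(archs)


def detect_archs(tool="/opt/rocm/bin/rocm_agent_enumerator"):
    """Run the enumerator if it is installed; return [] otherwise."""
    if not os.path.isfile(tool):
        return []
    result = subprocess.run([tool], capture_output=True, text=True, check=False)
    return parse_agent_enumerator(result.stdout)


if __name__ == "__main__":
    archs = detect_archs()
    if archs:
        print(f'export PYTORCH_ROCM_ARCH="{rocm_arch_value(archs)}"')
    else:
        print("No GPU targets detected; set PYTORCH_ROCM_ARCH manually.")
```

Building only for the targets actually present keeps compile time down compared with listing every supported uarch.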

Option 3: Install using PyTorch upstream docker file

  1. Clone PyTorch repository on the host:
    cd ~
    git clone https://github.com/pytorch/pytorch.git
    cd pytorch
    git submodule update --init --recursive

  2. Build PyTorch docker image:
    cd .ci/docker
    ./build.sh pytorch-linux-bionic-rocm<version>-py3.6 (e.g. ./build.sh pytorch-linux-bionic-rocm3.10-py3.6)
    This should complete with the message "Successfully built <image_id>".

  3. Start a docker container using the new image:
    docker run -it -v $HOME:/data --privileged --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G <image_id>
    Note: This will mount your host home directory on /data in the container.

From here on, follow steps 3-4 of Option 2.

Building PyTorch on ROCm on Ubuntu or "CentOS Stream 9" bare metal (without docker)

Step 1: Install ROCm by following the AMD ROCm installation page. The kernel-mode driver installation should be included.

Step 2: A shell script is provided to build PyTorch on ROCm. It only works on ROCm 5.0 and newer versions.

NOTE: This script needs to be run by a user that has sudo permissions.

PyTorch Build Script

A few modifications might be needed for the script according to your specific system and config:

  1. Environment variables: please refer to comments titled 'Required Variables' and 'Optional Variables'.
  2. Repo and branch name: please refer to comment titled 'Git clone Pytorch from a branch'.

This script has been tested in clean Ubuntu and CentOS Stream 9 environments. If existing packages on your system cause conflicts, you will have to resolve them yourself.

Test the PyTorch installation

To validate the PyTorch installation, run:

  1. Test Command
    cd ~ && python3 -c 'import torch' 2>/dev/null && echo "Success" || echo "Failure"

  2. Running unit tests in PyTorch
    Run the following command from the pytorch home directory:
    .ci/pytorch/test.sh
    This runs all CI unit tests and skips tests as appropriate for your system, based on the ROCm version and, e.g., single or multi GPU configuration. No tests will fail if the compilation and installation are correct. Additionally, this step installs torchvision, which most PyTorch scripts use to load models. E.g., running the PyTorch examples requires torchvision. This step takes 3+ hours to complete since it runs all PyTorch unit tests.

Alternatively, to run a smaller set of PyTorch unit tests as a sanity check of your PyTorch installation, run the following from the test subdirectory:
PYTORCH_TEST_WITH_ROCM=1 python3 test_nn.py --verbose
PYTORCH_TEST_WITH_ROCM=1 python3 test_torch.py --verbose
PYTORCH_TEST_WITH_ROCM=1 python3 test_cuda.py --verbose

Similarly, you can run other individual test suites by replacing test_nn.py above with any other test set.
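The individual test commands above can also be driven from a small script. The sketch below is an illustrative wrapper, not part of the PyTorch test infrastructure: the function names are hypothetical, it uses the current interpreter (sys.executable) in place of python3, and it assumes it is started from the pytorch home directory so that the test subdirectory resolves.

```python
import os
import subprocess
import sys


def rocm_test_command(test_file):
    """Return (argv, env) for running one PyTorch test file on ROCm."""
    # Copy the current environment and add the variable used above.
    env = dict(os.environ, PYTORCH_TEST_WITH_ROCM="1")
    argv = [sys.executable, test_file, "--verbose"]
    return argv, env


def run_tests(test_files, cwd="test"):
    """Run each test file in turn; return a dict of file name -> exit code."""
    results = {}
    for test_file in test_files:
        argv, env = rocm_test_command(test_file)
        results[test_file] = subprocess.run(argv, env=env, cwd=cwd).returncode
    return results
```

For example, calling run_tests(["test_nn.py", "test_torch.py", "test_cuda.py"]) runs the three suites listed above one after another, and a non-zero exit code in the returned dict marks a suite that did not pass.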

  3. Running pytorch-micro-benchmarking
    Another quick way to test your PyTorch installation is to clone https://github.com/ROCmSoftwarePlatform/pytorch-micro-benchmarking and run:
    python3 micro_benchmarking_pytorch.py --network resnet50
    To test PyTorch compile mode, run:
    python3 micro_benchmarking_pytorch.py --network resnet50 --compile
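The Test Command in step 1 only confirms that torch can be imported. A slightly more informative check, sketched below under the assumption that ROCm builds expose the GPU through the torch.cuda API and report their HIP version in torch.version.hip, also shows whether a GPU is visible. The function name rocm_gpu_report is a hypothetical helper.

```python
def rocm_gpu_report():
    """Return a dict describing whether PyTorch is installed and sees a GPU."""
    report = {
        "torch_installed": False,
        "hip_version": None,
        "gpu_available": False,
        "devices": [],
    }
    try:
        import torch
    except ImportError:
        return report  # PyTorch is not importable; leave the defaults
    report["torch_installed"] = True
    # torch.version.hip is None on builds without ROCm support.
    report["hip_version"] = getattr(torch.version, "hip", None)
    report["gpu_available"] = torch.cuda.is_available()
    if report["gpu_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in rocm_gpu_report().items():
        print(f"{key}: {value}")
```

If torch_installed is True but gpu_available is False, the device flags passed to docker run or the host driver installation are plausible places to look first.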

Try PyTorch examples

  1. Clone the PyTorch examples repository:
    git clone https://github.com/pytorch/examples.git

  2. Run individual example: MNIST
    cd examples/mnist
    Follow instructions in README.md, in this case:
    pip3 install -r requirements.txt
    python3 main.py

  3. Run individual example: Try ImageNet training
    cd ../imagenet
    Follow instructions in README.md.