Performance analysis of PyTorch

Jithun Nair edited this page Nov 4, 2019 · 8 revisions

Foreword

The PyTorch port to ROCm is under active development, especially with regard to performance. We are focusing our efforts on server-grade accelerators (MI25/MI60/...), but the following applies to all supported AMD hardware.

Performance analysis

We supply a small microbenchmarking script for PyTorch training on ROCm. To use it, download micro_benchmarking_pytorch.py, fp16util.py, shufflenet.py, and shufflenet_v2.py into the same directory.

To execute: python micro_benchmarking_pytorch.py --network <network name> [--batch-size <batch size>] [--iterations <number of iterations>] [--fp16 <0 or 1>] [--dataparallel|--distributed_dataparallel] [--device_ids <comma-separated list (no spaces) of 0-indexed GPU indices to run the dataparallel/distributed_dataparallel API on>]

Possible network names include alexnet, densenet121, inception_v3, resnet50, resnet101, SqueezeNet, and vgg16.

The defaults are 10 training iterations, fp16 off (i.e., 0), and a batch size of 64.
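When sweeping over several networks or batch sizes, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of the benchmark itself: the helper name build_benchmark_command is hypothetical, and it only uses the flags listed above. Options left as None are omitted so that the script's own defaults apply.

```python
def build_benchmark_command(network, batch_size=None, iterations=None,
                            fp16=None, parallel=None, device_ids=None):
    """Return the argv list for one micro_benchmarking_pytorch.py run.

    Hypothetical helper: only the flags documented on this page are used.
    Arguments left as None are not passed, so the script defaults apply.
    """
    cmd = ["python", "micro_benchmarking_pytorch.py", "--network", network]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if iterations is not None:
        cmd += ["--iterations", str(iterations)]
    if fp16 is not None:
        # The script expects 0 or 1, so booleans are converted to integers.
        cmd += ["--fp16", str(int(fp16))]
    if parallel is not None:
        if parallel not in ("dataparallel", "distributed_dataparallel"):
            raise ValueError("parallel must be 'dataparallel' or "
                             "'distributed_dataparallel'")
        cmd.append("--" + parallel)
    if device_ids is not None:
        # Comma-separated, no spaces, as required by --device_ids.
        cmd += ["--device_ids", ",".join(str(i) for i in device_ids)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to actually launch the benchmark.
    print(" ".join(build_benchmark_command("resnet50", batch_size=128, fp16=1)))
```

Returning a list rather than a single string means the result can be handed directly to subprocess.run without shell quoting concerns.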

--distributed_dataparallel spawns multiple sub-processes and adjusts world_size and rank accordingly. This option is supported on Python 3.6 only.
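To illustrate what adjusting world_size and rank can look like, here is a small sketch of one plausible mapping. It assumes one sub-process per GPU listed in --device_ids, with ranks assigned in list order; the function name plan_workers is hypothetical, and the benchmark script's actual logic may differ.

```python
def plan_workers(device_ids_arg):
    """Map a --device_ids string to per-process settings.

    Assumption (not confirmed by the script): one process per listed GPU,
    with rank equal to the position of the GPU in the list.
    """
    device_ids = [int(token) for token in device_ids_arg.split(",")]
    world_size = len(device_ids)
    return [
        {"rank": rank, "world_size": world_size, "device_id": device_id}
        for rank, device_id in enumerate(device_ids)
    ]


if __name__ == "__main__":
    for worker in plan_workers("0,2,3"):
        print(worker)
```

Under this assumption, every process sees the same world_size, while rank identifies the process and device_id identifies the GPU it drives.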

Performance tuning

If performance on a specific card and/or model is found to be lacking, some gains can typically be made by tuning MIOpen. To do so, export MIOPEN_FIND_ENFORCE=3 before running the model. When untuned configurations are encountered, the run takes additional time and the results are written to a local performance database. More information can be found in the MIOpen documentation.
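If the benchmark is launched from Python rather than from a shell, the same variable can be set for the child process only. This is a minimal sketch; the helper name miopen_tuning_env is hypothetical, and the only setting it applies is the MIOPEN_FIND_ENFORCE=3 value mentioned above.

```python
import os


def miopen_tuning_env(base_env=None):
    """Return a copy of the environment with MIOpen tuning enabled.

    Hypothetical helper: copies base_env (or os.environ when omitted) and
    sets MIOPEN_FIND_ENFORCE=3 without modifying the original mapping.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MIOPEN_FIND_ENFORCE"] = "3"
    return env


if __name__ == "__main__":
    env = miopen_tuning_env()
    print("MIOPEN_FIND_ENFORCE =", env["MIOPEN_FIND_ENFORCE"])
    # To launch a tuned run, pass the environment to the child process, e.g.:
    # subprocess.run(["python", "micro_benchmarking_pytorch.py",
    #                 "--network", "resnet50"], env=env, check=True)
```

Working on a copy keeps the tuning setting scoped to the benchmark run, so other processes started later from the same interpreter are unaffected.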