Unknown error -2005270521 was caught when testing torch-directml resnet50 demo #672

Open
@Basicname

Issue description:

Running python3 PyTorch/cv/resnet50/train.py produces the following output:

(torchml) tim@tim-pc:~/DirectML$ python3 PyTorch/cv/resnet50/train.py
Dropped Escape call with ulEscapeCode : 0x03007703
Loading the training dataset from: /home/tim/DirectML/PyTorch/cv/data/cifar-10-python
        Train data X [N, C, H, W]:
                shape=torch.Size([32, 3, 224, 224]),
                dtype=torch.float32
        Train data Y:
                shape=torch.Size([32]),
                dtype=torch.int64
Loading the testing dataset from: /home/tim/DirectML/PyTorch/cv/data/cifar-10-python
        Test data X [N, C, H, W]:
                shape=torch.Size([32, 3, 224, 224]),
                dtype=torch.float32
        Test data Y:
                shape=torch.Size([32]),
                dtype=torch.int64
Finished moving resnet50 to device: privateuseone:0 in 2.6226043701171875e-06s.
Epoch 1
-------------------------------
D3D12: Removing Device.
Traceback (most recent call last):
  File "/home/tim/DirectML/PyTorch/cv/resnet50/train.py", line 39, in <module>
    main()
  File "/home/tim/DirectML/PyTorch/cv/resnet50/train.py", line 34, in main
    train(args.path, args.batch_size, args.epochs, args.learning_rate,
  File "/home/tim/DirectML/PyTorch/cv/classification/train_classification.py", line 131, in main
    train(training_dataloader,
  File "/home/tim/DirectML/PyTorch/cv/classification/train_classification.py", line 84, in train
    batch_loss = loss(model(X), y)
  File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torchvision/models/resnet.py", line 269, in _forward_impl
    x = self.bn1(x)
  File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py", line 175, in forward
    return F.batch_norm(
  File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/functional.py", line 2482, in batch_norm
    return torch.batch_norm(
RuntimeError: Unknown error -2005270521

After several tries, I found that the error occurs inside torch.batch_norm(). I then built a simple torch.nn.Sequential() model that only runs batch normalization, and that works as expected.
Also, my iGPU has 8 GB of VRAM, and only ~3.5 GB is in use before the program crashes, so there is still plenty of free memory.
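
One way to read the number in the RuntimeError, assuming it is a signed 32-bit Windows HRESULT (the log does not say so explicitly, but the "D3D12: Removing Device." line just before the traceback is consistent with that): reinterpret it as unsigned and compare it against the DXGI device-loss codes from the Windows SDK headers. The helper names to_hresult and describe below are made up for illustration; this is a sketch for decoding the value, not a diagnosis of the root cause.

```python
# Sketch: decode the signed integer from "Unknown error -2005270521".
# Assumption: the value is a signed 32-bit HRESULT surfaced by the backend.

# DXGI device-loss codes (values as defined in the Windows SDK headers).
DXGI_DEVICE_ERRORS = {
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
}


def to_hresult(code: int) -> int:
    """Reinterpret a signed 32-bit integer as an unsigned 32-bit HRESULT."""
    return code & 0xFFFFFFFF


def describe(code: int) -> str:
    """Format the code as hex and attach a DXGI name if it is a known one."""
    hr = to_hresult(code)
    name = DXGI_DEVICE_ERRORS.get(hr, "unrecognized HRESULT")
    return f"0x{hr:08X} ({name})"


if __name__ == "__main__":
    print(describe(-2005270521))  # prints "0x887A0007 (DXGI_ERROR_DEVICE_RESET)"
```

If that reading is right, the failure is a GPU device reset during the batch_norm call rather than an out-of-memory condition, which would match the VRAM observation above.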

System details:

Python version: 3.10.0
WSL version: 2.3.26.0
WSL kernel version: 5.15.167.4-1
GPU: AMD Radeon 780M
DirectX version: 12.1
PyTorch package versions:

torch                    2.2.1
torch-directml           0.2.1.dev240521
torchvision              0.17.1

I also tried:

torch                    2.4.1
torch-directml           0.2.5.dev240914
torchvision              0.19.1

and the same error occurred.
