Issue description:
When running python3 PyTorch/cv/resnet50/train.py, the output is:
(torchml) tim@tim-pc:~/DirectML$ python3 PyTorch/cv/resnet50/train.py
Dropped Escape call with ulEscapeCode : 0x03007703
Loading the training dataset from: /home/tim/DirectML/PyTorch/cv/data/cifar-10-python
Train data X [N, C, H, W]:
shape=torch.Size([32, 3, 224, 224]),
dtype=torch.float32
Train data Y:
shape=torch.Size([32]),
dtype=torch.int64
Loading the testing dataset from: /home/tim/DirectML/PyTorch/cv/data/cifar-10-python
Test data X [N, C, H, W]:
shape=torch.Size([32, 3, 224, 224]),
dtype=torch.float32
Test data Y:
shape=torch.Size([32]),
dtype=torch.int64
Finished moving resnet50 to device: privateuseone:0 in 2.6226043701171875e-06s.
Epoch 1
-------------------------------
D3D12: Removing Device.
Traceback (most recent call last):
File "/home/tim/DirectML/PyTorch/cv/resnet50/train.py", line 39, in <module>
main()
File "/home/tim/DirectML/PyTorch/cv/resnet50/train.py", line 34, in main
train(args.path, args.batch_size, args.epochs, args.learning_rate,
File "/home/tim/DirectML/PyTorch/cv/classification/train_classification.py", line 131, in main
train(training_dataloader,
File "/home/tim/DirectML/PyTorch/cv/classification/train_classification.py", line 84, in train
batch_loss = loss(model(X), y)
File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torchvision/models/resnet.py", line 285, in forward
return self._forward_impl(x)
File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torchvision/models/resnet.py", line 269, in _forward_impl
x = self.bn1(x)
File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py", line 175, in forward
return F.batch_norm(
File "/home/tim/anaconda3/envs/torchml/lib/python3.10/site-packages/torch/nn/functional.py", line 2482, in batch_norm
return torch.batch_norm(
RuntimeError: Unknown error -2005270521
After several tries, I found that the error occurs when torch.batch_norm() runs, so I built a simple torch.nn.Sequential() model that only performs batch normalization, and that works as expected.
Also, my iGPU has 8 GB of VRAM, and only about 3.5 GB is in use before the program crashes, so there is still plenty of free memory.
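For reference, the number in "RuntimeError: Unknown error -2005270521" can be read as a Windows HRESULT. The sketch below assumes torch-directml passes the raw HRESULT through as a signed 32-bit integer, which I have not confirmed. The helper name decode_hresult and the small lookup table are only for illustration; the table holds a few DXGI codes from the Windows headers and is not complete.

```python
# Assumption: the integer in the RuntimeError is a signed 32-bit HRESULT.
# The table below lists a few DXGI error codes; it is not exhaustive.
DXGI_ERRORS = {
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
    0x887A0020: "DXGI_ERROR_DRIVER_INTERNAL_ERROR",
}


def decode_hresult(code: int) -> tuple[str, str]:
    """Return the unsigned hex form of a signed HRESULT and its DXGI name, if known."""
    unsigned = code & 0xFFFFFFFF  # reinterpret the signed value as unsigned 32-bit
    return f"0x{unsigned:08X}", DXGI_ERRORS.get(unsigned, "unknown")


print(decode_hresult(-2005270521))
# ('0x887A0007', 'DXGI_ERROR_DEVICE_RESET')
```

If that reading is right, the code is a DXGI device-reset error, which would be consistent with the "D3D12: Removing Device." line in the log and points at a GPU device failure rather than a plain out-of-memory condition.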
System details:
Python version: 3.10.0
WSL version: 2.3.26.0
WSL kernel version: 5.15.167.4-1
GPU: AMD Radeon 780M
DirectX version: 12.1
PyTorch version:
torch 2.2.1
torch-directml 0.2.1.dev240521
torchvision 0.17.1
I also tried:
torch 2.4.1
torch-directml 0.2.5.dev240914
torchvision 0.19.1
and the same error occurred.