
PI_ERROR_INVALID_QUEUE after copying device 0 tensor to device 1 #745

Closed
daisyden opened this issue Aug 12, 2024 · 9 comments

@daisyden (Contributor)

🐛 Describe the bug

import torch
a = torch.empty(3, device=torch.device('xpu:0'))
a.fill_(1.1)
b = a.to(device='xpu:1')
a.device
b.device
print(b.cpu())
print(b)

Report:

tensor([1.1000, 1.1000, 1.1000])
Traceback (most recent call last):
  File "/home/gta/daisyden/pytorch4/test/aa.py", line 8, in <module>
    print(b)
  File "/home/gta/miniforge3/envs/daisy_pytorch4/lib/python3.10/site-packages/torch/_tensor.py", line 464, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/home/gta/miniforge3/envs/daisy_pytorch4/lib/python3.10/site-packages/torch/_tensor_str.py", line 714, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/home/gta/miniforge3/envs/daisy_pytorch4/lib/python3.10/site-packages/torch/_tensor_str.py", line 631, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/gta/miniforge3/envs/daisy_pytorch4/lib/python3.10/site-packages/torch/_tensor_str.py", line 363, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/gta/miniforge3/envs/daisy_pytorch4/lib/python3.10/site-packages/torch/_tensor_str.py", line 152, in __init__
    nonzero_finite_vals = torch.masked_select(
RuntimeError: Native API failed. Native API returns: -36 (PI_ERROR_INVALID_QUEUE) -36 (PI_ERROR_INVALID_QUEUE)

Versions

latest version

@daisyden daisyden changed the title PI_ERROR_INVALID_QUEUE after copy device 0 tensor to device 1 PI_ERROR_INVALID_QUEUE after copying device 0 tensor to device 1 Aug 12, 2024
@fengyuan14 (Contributor) commented Aug 12, 2024

This is a SYCL runtime issue.

Per the latest SYCL spec, we are recommended to use info::kernel_device_specific::work_group_size instead of info::device::max_work_group_size. But a new issue was found: after querying info::kernel_device_specific::work_group_size, the kernel cannot be launched successfully on PVC Tile 1 and fails with a runtime error.
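For context, a minimal, hypothetical SYCL C++ sketch of the two queries being discussed (the kernel name and program structure are placeholders, not the actual torch-xpu-ops code): the device-wide query needs only the device, while the kernel-specific query needs an executable kernel bundle and is evaluated per kernel and per device.

#include <sycl/sycl.hpp>
#include <iostream>

// Placeholder kernel name for illustration only.
class demo_kernel;

int main() {
  sycl::queue q;
  sycl::device dev = q.get_device();
  sycl::context ctx = q.get_context();

  // Older, device-wide upper bound.
  size_t dev_max = dev.get_info<sycl::info::device::max_work_group_size>();

  // Recommended, kernel-specific bound: requires the executable kernel bundle.
  auto kid = sycl::get_kernel_id<demo_kernel>();
  auto bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(ctx, {kid});
  sycl::kernel k = bundle.get_kernel(kid);
  size_t krn_max = k.get_info<sycl::info::kernel_device_specific::work_group_size>(dev);

  std::cout << "device max: " << dev_max << ", kernel max: " << krn_max << "\n";

  // Launch the named kernel so the kernel id above resolves to a real kernel.
  q.parallel_for<demo_kernel>(sycl::range<1>{krn_max}, [=](sycl::id<1>) {}).wait();
}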

@chuanqi129 chuanqi129 modified the milestones: PT2.6, PT2.5 Aug 13, 2024
@daisyden daisyden mentioned this issue Aug 13, 2024
@daisyden (Contributor, Author) commented Aug 13, 2024

Duplicate of #339.

@fengyuan14 (Contributor)

The issue is common to all platforms with more than one device. The most important and most common case for us is the client case, where a client platform/desktop has an iGPU and a dGPU.

@fengyuan14 (Contributor)

intel/llvm#15127

@ddkalamk

@fengyuan14 can we please apply the available workaround to fix this problem?

i.e. change
https://github.com/intel/torch-xpu-ops/blob/main/src/comm/DeviceProperties.h#L19C3-L20C79
auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(ctx, {kid});

to

auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(ctx, {dev}, {kid});
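
To make the difference concrete, here is a hypothetical, self-contained sketch (placeholder kernel name, not the actual DeviceProperties.h code): the two-argument overload builds the bundle for every device in the context, while the three-argument overload restricts it to the single device the kernel will actually run on.

#include <sycl/sycl.hpp>
#include <vector>

// Placeholder kernel name for illustration only.
class probe_kernel;

int main() {
  sycl::queue q;
  sycl::device dev = q.get_device();
  sycl::context ctx = q.get_context();

  auto kid = sycl::get_kernel_id<probe_kernel>();

  // Current form: the bundle targets every device in ctx.
  //   auto kbundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(ctx, {kid});
  // Proposed workaround: restrict the bundle to the device we will launch on.
  auto kbundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
      ctx, std::vector<sycl::device>{dev}, {kid});

  sycl::kernel k = kbundle.get_kernel(kid);
  size_t wg = k.get_info<sycl::info::kernel_device_specific::work_group_size>(dev);

  // Launch the named kernel so the kernel id above resolves to a real kernel.
  q.parallel_for<probe_kernel>(sycl::range<1>{wg}, [=](sycl::id<1>) {}).wait();
}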

@ddkalamk

@daisyden @fengyuan14
Test results after applying the fix:

(pt_src) [ddkalamk@pcl-pvc01 pytorch]$ cat test2.py
import torch
print("PyTorch version: ", torch.__version__)
a = torch.empty(3, device=torch.device('xpu:0'))
a.fill_(1.1)
b = a.to(device='xpu:1')
a.device
b.device
print(b.cpu())
print(b)

(pt_src) [ddkalamk@pcl-pvc01 pytorch]$ python -u test2.py
PyTorch version:  2.5.0a0+git8693322
tensor([1.1000, 1.1000, 1.1000])
tensor([1.1000, 1.1000, 1.1000], device='xpu:1')

@fengyuan14 (Contributor) commented Sep 13, 2024

Hi @ddkalamk, we have a PR for it on the main branch: #769. We have been busy with the PT2.5 release recently, but will land the PR as soon as possible.

@ddkalamk

Sounds good, thanks.

@chuanqi129 chuanqi129 modified the milestones: PT2.5, PT2.6 Oct 14, 2024
@fengyuan14 (Contributor)

The workaround has been merged into the main branch: #769.
