init #9
base: main
Conversation
@@ -2693,15 +2726,14 @@ def all_reduce(tensor, op=ReduceOp.SUM, group=None, async_op=False):
            return _IllegalWork()
        else:
            return None
No need to change
@@ -4059,15 +4091,14 @@ def reduce_scatter_tensor(output, input, op=ReduceOp.SUM, group=None, async_op=False):
            return _IllegalWork()
        else:
            return None
No need to change
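For context, a minimal sketch of the pattern shown in the two hunks above: inside a coalescing block the collective is only queued, so the helper hands back a placeholder work object to async callers and `None` otherwise. The `_IllegalWork` stub and helper name below are simplified assumptions for illustration, not the PR's implementation.

```python
class _IllegalWork:
    """Stand-in for the placeholder returned while coalescing (simplified assumption)."""

    def wait(self, timeout=None):
        # Waiting on a not-yet-issued collective is an error by design.
        raise RuntimeError("Illegal to wait on work from a coalescing context")


def _finish_coalesced_call(async_op: bool):
    # Pattern from the all_reduce / reduce_scatter_tensor hunks: return a
    # placeholder work handle for async callers, otherwise return None.
    return _IllegalWork() if async_op else None


print(_finish_coalesced_call(async_op=False))  # None
```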
            raise RuntimeError("Distributed package doesn't have XCCL built in")
        if backend_options is not None:
            assert isinstance(
                backend_options, ProcessGroupXCCL.Options
`ProcessGroupXCCL` doesn't have `Options`, right?
Yes, no Options yet
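As a hedged sketch (not the PR's code): if `ProcessGroupXCCL` does not expose an `Options` class yet, the isinstance assert could be written to tolerate that. The helper name and the guard below are assumptions for illustration only.

```python
def _check_backend_options(backend_options, backend_cls):
    # Only validate against backend_cls.Options when that class actually
    # exists; ProcessGroupXCCL reportedly has no Options yet.
    opts_cls = getattr(backend_cls, "Options", None)
    if backend_options is not None and opts_cls is not None:
        assert isinstance(backend_options, opts_cls), (
            f"Expected backend_options of type {opts_cls!r}, "
            f"got {type(backend_options)!r}"
        )
```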
@@ -1752,10 +1775,9 @@ def _new_process_group_helper(
                "created, please use a different group name"
            )

-    if device_id is not None and (device_id.index is None or device_id.type != "cuda"):
+    if device_id is not None and device_id.index is None:
I think this is a common problem for non-CUDA devices; should we separate this change into another PR?
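For illustration, a minimal sketch of what the relaxed check above amounts to, assuming the same error semantics as the existing CUDA-only path; the helper name and message text are not from the PR.

```python
import torch


def _validate_device_id(device_id) -> None:
    # Relaxed form from the hunk above: an explicit device_id just needs an
    # index, regardless of whether it is cuda, xpu, or another accelerator.
    if device_id is not None and device_id.index is None:
        raise ValueError(
            f"device_id must include an index, e.g. torch.device('xpu:0'); got {device_id}"
        )


_validate_device_id(torch.device("xpu", 0))  # passes: has an index
_validate_device_id(None)                    # passes: no device given
```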
@@ -340,7 +354,7 @@ def register_backend(
            "`cuda`. Please specify it via the `devices` argument of "
            "`register_backend`."
        )
-        Backend.backend_capability[name.lower()] = ["cpu", "cuda"]
+        Backend.backend_capability[name.lower()] = ["cpu", "cuda", "xpu"]
Are you sure `name` in this API is applicable to the `xpu` device?
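To make the question concrete, a hedged usage sketch of the default path this hunk changes: registering a backend without a `devices` argument now advertises cpu, cuda, and xpu capability. The backend name and creator stub below are placeholders, not a real backend.

```python
import torch.distributed as dist


def _dummy_creator(store, rank, world_size, timeout):
    # Placeholder; a real third-party backend would construct a ProcessGroup here.
    raise NotImplementedError


# No `devices` argument: with this change the capability defaults to all three.
dist.Backend.register_backend("dummy_xccl", _dummy_creator)
print(dist.Backend.backend_capability["dummy_xccl"])  # ['cpu', 'cuda', 'xpu']
```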
@@ -657,6 +663,7 @@ class TORCH_API ProcessGroup : public torch::CustomClassHolder {
      // TODO: HACK for backend name to get sequence number for that backend.
      if (backendType == ProcessGroup::BackendType::GLOO ||
          backendType == ProcessGroup::BackendType::NCCL ||
+         backendType == ProcessGroup::BackendType::XCCL ||
Does XCCL have this support right now?
@@ -1104,6 +1104,9 @@ if(USE_XPU)
      message(WARNING "Failed to include ATen XPU implementation target")
    else()
      target_link_libraries(torch_xpu PRIVATE torch_xpu_ops)
+     if(USE_C10D_XCCL)
Add comments
@@ -327,7 +341,7 @@ def register_backend(
        Backend.backend_list.append(name.lower())
        if devices is not None:
            for device in devices:
-               if device != "cpu" and device != "cuda":
+               if device != "cpu" and device != "cuda" and device != "xpu":
Double-check this change.
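As a quick check of what this relaxation permits, a hedged sketch: with `xpu` accepted, a third-party backend can be registered for that device explicitly. The name `fake_xccl` and the stub are illustrative assumptions.

```python
import torch.distributed as dist


def _fake_xccl_creator(store, rank, world_size, timeout):
    raise NotImplementedError  # illustrative stub, never actually constructed here


# Before this change, devices=["xpu"] went down the cpu/cuda-only warning path.
dist.Backend.register_backend("fake_xccl", _fake_xccl_creator, devices=["xpu"])
print(dist.Backend.backend_capability["fake_xccl"])  # ['xpu']
```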
See pytorch#140725 (comment)

Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`:

```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: pytorch#141296
Approved by: https://github.com/huydhn
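A hedged repro sketch of the scenario described in that commit message, using only public APIs; the matrix multiply here is a simplified stand-in for the custom Metal kernel mentioned there.

```python
import torch

if torch.backends.mps.is_available():
    x = torch.randn(1024, 1024, device="mps")
    y = x @ x                  # enqueue work on the MPS command buffer
    torch.mps.synchronize()    # on affected builds this wait reportedly never returned
    print(y.sum().item())
```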
Fixes #ISSUE_NUMBER