implement send and recv using collective_permute #9373
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -326,6 +326,28 @@ def test_all_to_all_single(self, use_dynamo): | |
expected.sort().values), | ||
f"Got {val}, expected {expected}") | ||
|
||
@staticmethod | ||
def _send_recv(): | ||
dist.init_process_group("xla", init_method='xla://') | ||
device = torch_xla.device() | ||
world_size = xr.world_size() | ||
cutoff = world_size // 2 | ||
Review comment: I think this test will hang if the world size is not even. For example, if the world size is 3, then index 0 will send to 1 and 1 will recv from 0, but index 2 will try to recv from 1 without an associated send.
Reply: Good point. I'll update the test so that it is more defensive.
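One way to make the pairing defensive, as a minimal sketch in plain Python. The helper name `send_recv_role` is hypothetical and not part of this PR; it only illustrates the rank arithmetic. Whether an idle rank is safe alongside the collective_permute-based implementation depends on the backend and is an assumption here.

```python
from typing import Optional, Tuple


def send_recv_role(index: int, world_size: int) -> Tuple[str, Optional[int]]:
    """Return ("send", dst), ("recv", src) or ("idle", None) for a rank.

    Hypothetical helper for illustration only. The first world_size // 2
    ranks send to the next world_size // 2 ranks. With an odd world size
    the last rank has no partner, so it stays idle instead of calling recv.
    """
    cutoff = world_size // 2
    if index < cutoff:
        return ("send", index + cutoff)
    if index < 2 * cutoff:
        return ("recv", index - cutoff)
    return ("idle", None)
```

With a world size of 3 this gives rank 0 a send to rank 1, rank 1 a recv from rank 0, and leaves rank 2 idle, which avoids the unmatched recv described above.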
||
index = xr.global_ordinal() | ||
tensor = torch.tensor([index + 1], dtype=torch.float, device=device) | ||
if index < cutoff: | ||
dist.send(tensor, index + cutoff) | ||
else: | ||
dist.recv(tensor, index - cutoff) | ||
return tensor.cpu() | ||
|
||
def test_send_recv(self): | ||
Review comment: The original test separated send and receive. While the combined test is more compact, it might be harder to debug because it will not be obvious which side has the issue. I think keeping a test for the total interaction is valid, but is there a way to replicate the other two tests that existed previously?
Reply: send and recv don't work independently. The original test was a "dry run": it checked the IR but didn't execute. If it did execute, it would fail.
||
"""Send tensors on first N/2 devices to second N/2 devices.""" | ||
results = pjrt.run_multiprocess(self._send_recv) | ||
world_size = tpu.num_expected_global_devices() | ||
for ordinal, value in results.items(): | ||
expected = ordinal + 1 if ordinal < world_size // 2 else ordinal + 1 - world_size // 2 | ||
np.testing.assert_array_equal(value, [expected]) | ||
|
||
|
||
if __name__ == '__main__': | ||
absltest.main() |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -748,9 +748,6 @@ def collective_permute(value: torch.Tensor, | |
pairs: List[List[int]]) -> torch.Tensor: | ||
"""Performs a XLA `CollectivePermute()` operation on the input tensor. | ||
|
||
WARNING: This function is not very reliable, may produce wrong results under | ||
certain inputs. Use it at your own risk. | ||
|
||
Comment on lines -751 to -753
Review comment: As discussed in #8815, there's no context for this ancient warning. Given the age, lack of details, and lack of any other reported bugs, I think it's best to remove it. If we get a specific bug report then we can act on that.
||
See: https://www.tensorflow.org/xla/operation_semantics#collectivepermute | ||
|
||
Args: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -235,41 +235,36 @@ def gather(self, *args): | |
def scatter(self, *args): | ||
raise NotImplementedError | ||
|
||
# Dummy channel id maker. Different backend (TPU, GPU, etc) should replace | ||
# the maker with their specific one. See unit test in | ||
# test/test_torch_distributed_xla_backend.py for an example. | ||
def make_send_channel_id(self, dst_rank, tag): | ||
raise NotImplementedError | ||
|
||
# Call site e.g. | ||
# https://github.com/pytorch/pytorch/blob/release/1.10/torch/distributed/distributed_c10d.py#L877 | ||
def send(self, tensors, dst_rank, tag=0): | ||
Review comment: If we're warning users to use collective_permute, but send still ends up using a collective permute, should the warning itself be clearer that this is happening under the hood?
Reply: I could word this better. The real advice is to restructure your code so that each process calls collective_permute with all of the send-recv pairs.
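The restructuring suggested in the reply can be sketched as follows: instead of one send or recv per process, every process builds the full list of source-target pairs and passes it to a single collective_permute call. The helper name `half_shift_pairs` is hypothetical; the commented call uses the `xm.collective_permute(value, pairs)` signature shown in this diff.

```python
from typing import List


def half_shift_pairs(world_size: int) -> List[List[int]]:
    """Build [source, target] pairs sending rank i to rank i + world_size // 2."""
    cutoff = world_size // 2
    return [[src, src + cutoff] for src in range(cutoff)]


# Every process would then issue the same call, for example:
#   result = xm.collective_permute(tensor, pairs=half_shift_pairs(world_size))
print(half_shift_pairs(4))  # [[0, 2], [1, 3]]
```

Because the pair list depends only on the world size, all processes trace the same operation rather than one op per send-recv pair.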
||
logging.warning( | ||
"Individual send/recv ops are inefficient on an XLA device. Consider using xla_model.collective_permute()." | ||
) | ||
Comment on lines +241 to +243
Review comment: Does it print this every time we trace?
Reply: Probably. I'm not sure how to make it print only once. I will look into it.
Reply: I checked around and couldn't find a built-in way to do this through
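One possible way to emit the warning only once per process, sketched with the standard library only. `warn_once` is a hypothetical name, not an existing torch_xla or `logging` function; it relies on `functools.lru_cache` memoizing calls by argument.

```python
import functools
import logging


@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    """Log `message` at WARNING level only the first time it is seen.

    lru_cache memoizes the call per distinct message, so repeated calls
    with the same string return the cached None without logging again.
    """
    logging.warning(message)
```

Calling `warn_once(...)` from both send and recv with the same string would then log a single line per process, no matter how many times the ops are traced.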
||
results = [] | ||
for t in tensors: | ||
channel_id = self.make_send_channel_id(dst_rank, tag) | ||
# The input will be returned as result. | ||
input_as_result = xm.send(t, channel_id) | ||
# Make the sent tensor depend on the token, such that the `send` | ||
# op can actually be built into the computation graph. | ||
result_t = xm.collective_permute( | ||
t, pairs=[[xr.global_ordinal(), dst_rank]]) | ||
# Every process must have the same IR, otherwise they deadlock. But in | ||
# the receiving process the provided tensor receives the result, while | ||
# in the sending process it is unchanged. The solution used here is to | ||
# have every process copy a linear combination of the two tensors, but | ||
# send/recv use different coefficients to achieve different outcomes. | ||
Comment on lines +250 to +252
Review comment: This took a couple of reads until I understood what was going on here. My understanding is that by having both [...] If this understanding is correct, could you add a little bit more here to make it more apparent?
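The linear-combination trick in the code comment above can be illustrated with a toy simulation in plain Python. `simulated_collective_permute` is a hypothetical stand-in for the real op, written under the assumption (taken from the XLA operation semantics) that a rank which is not the target of any pair receives zeros.

```python
from typing import List


def simulated_collective_permute(values: List[float],
                                 pairs: List[List[int]]) -> List[float]:
    """Pure-Python stand-in for CollectivePermute over one scalar per rank.

    Ranks that are not a target of any pair receive 0.0.
    """
    out = [0.0] * len(values)
    for src, dst in pairs:
        out[dst] = values[src]
    return out


values = [1.0, 2.0]  # rank 0 holds 1.0, rank 1 holds 2.0
permuted = simulated_collective_permute(values, pairs=[[0, 1]])

# send side (rank 0): coefficients (0, 1) keep the local tensor unchanged.
after_send = permuted[0] * 0.0 + values[0] * 1.0
# recv side (rank 1): coefficients (1, 0) overwrite it with the received value.
after_recv = permuted[1] * 1.0 + values[1] * 0.0
```

Both sides run the same sequence of operations (permute, scale, add, copy); only the constant coefficients differ, so the sender keeps its own value while the receiver takes the value sent from rank 0.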
||
with torch.no_grad(): | ||
t.copy_(input_as_result) | ||
results.append(input_as_result) | ||
t.copy_(result_t * 0.0 + t * 1.0) | ||
results.append(result_t) | ||
return _ret_work(results) | ||
|
||
# Dummy channel id maker. Different backend (TPU, GPU, etc) should replace | ||
# the maker with their specific one. See unit test in | ||
# test/test_torch_distributed_xla_backend.py for an example. | ||
def make_recv_channel_id(self, src_rank, tag): | ||
raise NotImplementedError | ||
|
||
# Call site e.g. | ||
# https://github.com/pytorch/pytorch/blob/release/1.10/torch/distributed/distributed_c10d.py#L913 | ||
def recv(self, out_tensors, src_rank, tag=0): | ||
Review comment: Do we need the warning on the recv end too, so each host has it?
Review comment: We should not assume someone reading "recv" will have read the documentation for "send". I think we should add documentation here. I would then add a note to the comments on "send" and "recv" describing the IR each one is expected to produce.
||
results = [] | ||
for ot in out_tensors: | ||
channel_id = self.make_recv_channel_id(src_rank, tag) | ||
result = xm.recv(ot, channel_id) | ||
results.append(result) | ||
result_t = xm.collective_permute( | ||
ot, pairs=[[src_rank, xr.global_ordinal()]]) | ||
with torch.no_grad(): | ||
ot.copy_(result_t * 1.0 + ot * 0.0) | ||
results.append(result_t) | ||
return _ret_work(results) | ||
|
||
def recv_anysource(self, *args): | ||
|
Review comment: Last time we checked, we also noticed that https://github.com/pytorch/xla/blob/master/test/test_mp_collective_permute.py didn't work on the CPU, but send/recv did. We might want to double check it.
Is test/test_torch_distributed_xla_backend.py tested for CPU and Neuron? Would it be possible to test it and see if the change is compatible?
Reply: It is, but it just checks that the expected IR is emitted. It doesn't run anything. And in this case it wasn't a reliable test because, at least for TPU, that IR does not actually run.
test_mp_collective_permute is run for both TPU and Neuron. I don't think it works for CPU but neither do send/recv. The success of test_mp_collective_permute indicates this change should work for Neuron, but to be more certain I could add a test that covers a pipeline-like transfer in addition to the existing test of a permutation-like transfer.
The most direct test would be something like what's in test_collective_ops_tpu.py, which runs the ops to completion, for Neuron.
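The two transfer shapes mentioned above could be expressed as pair lists for collective_permute. This is a sketch under stated assumptions: "permutation-like" is taken to mean a ring in which every rank is both a source and a target, and "pipeline-like" an open chain. Neither definition is spelled out in the thread, and both helper names are hypothetical.

```python
from typing import List


def ring_pairs(world_size: int) -> List[List[int]]:
    """Permutation-like: every rank sends to its right neighbor, wrapping around."""
    return [[rank, (rank + 1) % world_size] for rank in range(world_size)]


def pipeline_pairs(world_size: int) -> List[List[int]]:
    """Pipeline-like: an open chain, so rank 0 receives nothing and the
    last rank sends nothing."""
    return [[rank, rank + 1] for rank in range(world_size - 1)]
```

A pipeline-like test would differ from the ring case mainly at the endpoints, where one rank only sends and another only receives.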
Reply: This would be great. Any chance we can move it outside of this file and make it general? I can help test it out if so. Otherwise, I'll need to follow up on whether we can port this entire file to Neuron. I see tpu.num_expected_global_devices and pjrt.run_multiprocess, but haven't seen/used these before.