fix #925: ht.nonzero() returns tuple of 1-D arrays instead of n-D arrays #937
Conversation
different python and pytorch versions
for more information, see https://pre-commit.ci
delete example with different split axis
Removal of old logo
GPU cluster tests are currently disabled on this Pull Request.
Hi @ClaudiaComito. This is the solution that I could come up with at first glance. It works well for the arrays I tested. But I still have to test different edge cases like 1-D arrays, etc. Just thought I could get your input before proceeding further.
My fix failed for 1-D arrays. I think I fixed it now!
I will update the documentation and the tests a little bit later.
Hey @Mystic-Slice thanks a lot for diving into this right away!
Note that we only use torch operations internally. Our tensors might reside on GPUs, and a numpy operation would require copying the tensor to CPU.
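To illustrate the point (this is a hypothetical sketch, not Heat code): torch operations run on whatever device the tensor lives on, whereas a numpy call would first force a device-to-host copy.

```python
import torch

# torch.nonzero executes wherever the tensor lives; with numpy we would
# need x.cpu().numpy() first, copying the whole tensor off the GPU.
x = torch.tensor([[0, 1], [2, 0]])
# x = x.to("cuda")              # with a GPU, the indices stay on the GPU too
idx = torch.nonzero(x)          # no host<->device transfer needed
print(idx.tolist())             # [[0, 1], [1, 0]]
```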
Well done @Mystic-Slice, we're getting close, thanks a lot! The tests will be more work because many existing functions rely on the previous nonzero format.
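For reference, the two output formats can be contrasted with numpy as a stand-in (an illustrative sketch, not the Heat implementation): the old Heat behavior matched `torch.nonzero` (one 2-D array of index rows), the new one matches `np.nonzero` (a tuple of 1-D arrays, one per dimension).

```python
import numpy as np

x = np.array([[0, 7], [8, 0]])
old_style = np.argwhere(x)   # old format: shape (n_nonzero, ndim)
new_style = np.nonzero(x)    # new format: tuple of 1-D index arrays
print(old_style.tolist())                 # [[0, 1], [1, 0]]
print([a.tolist() for a in new_style])    # [[0, 1], [1, 0]]
```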
@Mystic-Slice I changed the base branch of this PR, because the change you implemented is needed in our indexing overhaul PR #938. I've taken care of resolving conflicts (hence the commits). I hope you don't mind.
Not at all.
Absolutely!
@ClaudiaComito Hi, I made the fixes, but I am not able to get rid of a few errors that arise in testing. Can you please take a look?
@Mystic-Slice, the DNDarray metadata always has to be correct. The information doesn't get lost; it gets used in all subsequent operations, and the returned tuple is a tuple of DNDarrays whose metadata will be based on the original one.
heat/core/indexing.py
Outdated
comm=x.comm,
balanced=False,
lcl_nonzero = lcl_nonzero.transpose(0, 1)
`gout` must reflect the transposed global shape.
@Mystic-Slice thanks for the changes. Just two more changes are needed as far as I'm concerned.
Don't worry about the tests, they fail because our getitem/setitem is transitioning (among other things) from the old nonzero format to the new one, so basically every operation fails. We will take care of this in the parent branch/PR.
Ahh okok. I was so lost because the changes I made had nothing to do with the errors I got. Thanks for the clarification.
I made this change because when the DNDarray is converted into a tuple, the metadata is not really being passed on to the tuple members. This fix solves it.
I've been losing (lengthy) comments so I'll submit this review now even if it's not quite finished and fix it later. Thanks @Mystic-Slice !
heat/core/indexing.py
Outdated
""" | ||
Return a :class:`~heat.core.dndarray.DNDarray` containing the indices of the elements that are non-zero.. (using ``torch.nonzero``) | ||
Return a Tuple of :class:`~heat.core.dndarray.DNDarray`s, one for each dimension of a, |
"... one for each dimension of x
"
heat/core/indexing.py
Outdated
""" | ||
Return a :class:`~heat.core.dndarray.DNDarray` containing the indices of the elements that are non-zero.. (using ``torch.nonzero``) | ||
Return a Tuple of :class:`~heat.core.dndarray.DNDarray`s, one for each dimension of a, | ||
containing the indices of the non-zero elements in that dimension. (using ``torch.nonzero``) |
I know it was there before you started working on it, but I would remove "(using ``torch.nonzero``)" now.
heat/core/indexing.py
Outdated
If ``x`` is split then the result is split in the 0th dimension. However, this :class:`~heat.core.dndarray.DNDarray`
can be UNBALANCED as it contains the indices of the non-zero elements on each node.
Returns an array with one entry for each dimension of ``x``, containing the indices of the non-zero elements in that dimension.
The values in ``x`` are always tested and returned in row-major, C-style order.
The values in ``x`` are always tested and returned in column-major, F-style order.
No, they are still tested in row-major, C-style order, otherwise nonzero would return indices starting from the last dimension. Here's a good reference: Internal memory layout of an ndarray.
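A quick check of the scan order, using numpy as a reference: elements are visited in row-major (C) order, so the first returned index array comes out sorted.

```python
import numpy as np

x = np.array([[3, 0, 0],
              [0, 4, 0],
              [5, 6, 0]])
rows, cols = np.nonzero(x)
# Row indices are non-decreasing because the scan walks row by row.
print(rows.tolist())  # [0, 1, 2, 2]
print(cols.tolist())  # [0, 1, 0, 1]
```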
heat/core/indexing.py
Outdated
gout = list(lcl_nonzero.size())
gout[0] = x.comm.allreduce(gout[0], MPI.SUM)
I had written a lengthy detailed explanation of what's happening here, and then I lost it. Doh. Anyway.
The original `torch.nonzero` output shape is (number_of_nonzero_elements, number_of_dimensions). The process-local `lcl_nonzero` only knows how many nonzero elements are on that process, but the global output DNDarray must know the total.
That `allreduce` call replaces the local value of `gout[0]` (local number of rows = local number of nonzero elements) with the sum of `gout[0]` over all processes, in order to synchronize `gout` to the global (albeit memory-distributed) number of rows of the output.
Problem: `lcl_nonzero` has been transposed, so it's no longer the rows but the columns that represent the number of nonzero elements.
My suggestion:
- transpose before the `if split` check
- adjust line 71 (`lcl_nonzero[..., x.split] += displs[x.comm.rank]`) to correct the row (not the column) corresponding to the split dimension
- the MPI collective call at line 78 should sum along the dimension containing the number of nonzero elements, now the columns
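That flow can be sketched on a single process (an assumption-laden sketch: names like `lcl_nonzero` and `displs` mirror the PR, but the allreduce is simulated by summing per-process counts in plain Python):

```python
import torch

def local_nonzero(chunk, split, offset):
    lcl_nonzero = torch.nonzero(chunk)         # (n_local_nonzero, ndim)
    lcl_nonzero = lcl_nonzero.transpose(0, 1)  # transpose before the split check
    lcl_nonzero[split] += offset               # correct the ROW of the split dim
    return lcl_nonzero

# two "processes", each holding half of a 4x2 array split along dimension 0
chunks = [torch.tensor([[1, 0], [0, 2]]), torch.tensor([[0, 0], [3, 4]])]
displs = [0, 2]  # global row offset of each process-local chunk
parts = [local_nonzero(c, 0, d) for c, d in zip(chunks, displs)]

# simulated allreduce: sum the per-process counts, now along the columns
gout = [parts[0].shape[0], sum(p.shape[1] for p in parts)]
global_idx = torch.cat(parts, dim=1)
print(gout)                  # [2, 4]
print(global_idx.tolist())   # [[0, 1, 3, 3], [0, 1, 0, 1]]
```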
@ClaudiaComito Thanks for the detailed review. I have made the changes.
Thanks @Mystic-Slice, one more change needed and then I think we're good to go.
heat/core/indexing.py
Outdated
[
DNDarray(
dim_indices,
gshape=tuple(dim_indices.size()),
`gshape` needs to be the global size of the array (you calculated it with the `allreduce` call); `dim_indices` are local slices of the array.
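In other words (a hypothetical two-process example): the global shape comes from the summed counts, never from the local slice length.

```python
# Each returned 1-D index array must carry the GLOBAL nonzero count as its
# gshape, even though each process only holds a local slice of it.
local_counts = [3, 2]            # nonzero elements found on each process
gshape = (sum(local_counts),)    # global shape, shared by all processes
print(gshape)                    # (5,)
```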
Ahh makes sense....will make the change
@Mystic-Slice can you update the changelog? I'll merge this one into the parent branch afterwards. Thanks a lot!
Done :)
Great, thanks a lot @Mystic-Slice , I will merge this into the parent branch and we can take care of the tests with the rest.
Merged eb297fb into helmholtz-analytics:914_adv-indexing-outshape-outsplit
Description
Replaced torch's nonzero function with the NumPy version. Works as described in the issue ticket. Testing still to be done.
Issue/s resolved: #925
Due Diligence
Does this change modify the behaviour of other functions? If so, which?
No.