Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #925: ht.nonzero() returns tuple of 1-D arrays instead of n-D arrays #937

Conversation

Mystic-Slice
Copy link
Collaborator

@Mystic-Slice Mystic-Slice commented Mar 23, 2022

Description

Switched out the torch's non-zero function with the NumPy version. Works as described in the issue ticket. Testing to be done.

Issue/s resolved: #925

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

no

skip ci

@mtar
Copy link
Collaborator

mtar commented Mar 23, 2022

GPU cluster tests are currently disabled on this Pull Request.

@Mystic-Slice
Copy link
Collaborator Author

Hi @ClaudiaComito. This is the solution that I could come up with at first glance. It works well for the arrays I tested. But I still have to test different edge-cases like 1-D arrays, etc... Just thought I could get your input before proceeding further.
I also don't know if I handled the split, gout, and other things correctly. Let me know what you think.

@Mystic-Slice
Copy link
Collaborator Author

My fix failed for 1-D arrays. I think I fixed it now!

@Mystic-Slice Mystic-Slice changed the title fix #925: Nonzero function works like the numpy counterpart fix #925: ht.nonzero() returns tuple of 1-D arrays instead of n-D arrays Mar 23, 2022
@Mystic-Slice
Copy link
Collaborator Author

I will update the documentation and the tests a little bit later.

Copy link
Contributor

@ClaudiaComito ClaudiaComito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Mystic-Slice thanks a lot for diving into this right away!

Note that we only use torch operations internally. Our tensors might reside on GPUs, and a numpy operation would require copying the tensor to CPU.

heat/core/indexing.py Outdated Show resolved Hide resolved
heat/core/indexing.py Outdated Show resolved Hide resolved
heat/core/indexing.py Outdated Show resolved Hide resolved
@Mystic-Slice Mystic-Slice reopened this Mar 23, 2022
Copy link
Contributor

@ClaudiaComito ClaudiaComito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done @Mystic-Slice, we're getting close, thanks a lot! The tests will be more work because many existing functions rely on the previous nonzero format.

heat/core/indexing.py Outdated Show resolved Hide resolved
heat/core/tests/test_indexing.py Outdated Show resolved Hide resolved
@ClaudiaComito ClaudiaComito changed the base branch from main to 914_adv-indexing-outshape-outsplit March 25, 2022 08:43
@ClaudiaComito
Copy link
Contributor

@Mystic-Slice I changed the base branch of this PR, because the change you implemented is needed in our indexing overhaul PR #938.

I've taken care of resolving conflicts (hence the commits). I hope you don't mind.

@Mystic-Slice
Copy link
Collaborator Author

Not at all.
I still can make the required changes to this branch right?

@ClaudiaComito
Copy link
Contributor

Not at all. I still can make the required changes to this branch right?

Absolutely!

@Mystic-Slice
Copy link
Collaborator Author

@ClaudiaComito
I don't know if we even need to keep track of gout (shape of the output), split, etc...
Because anyways we convert it into a tuple and we lose all this information.
What do you think?

@Mystic-Slice
Copy link
Collaborator Author

@ClaudiaComito Hi, I made the fixes. But I am not able to get rid of a few errors that arise in testing. Can you please take a look?

@ClaudiaComito
Copy link
Contributor

@ClaudiaComito I don't know if we even need to keep track of gout (shape of the output), split, etc... Because anyways we convert it into a tuple and we lose all this information. What do you think?

@Mystic-Slice , the DNDarray metadata always have to be correct. The information doesn't get lost, it gets used in all subsequent operations, and the returned tuple is a tuple of DNDarrays whose metadata will be based on the original one.

comm=x.comm,
balanced=False,
lcl_nonzero = lcl_nonzero.transpose(0, 1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gout must reflect the transposed global shape

Copy link
Contributor

@ClaudiaComito ClaudiaComito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mystic-Slice thanks for the changes. Just two more changes are needed as far as I'm concerned.

Don't worry about the tests, they fail because our getitem/setitem is transitioning (among other things) from the old nonzero format to the new one, so basically every operation fails. We will take care of this in the parent branch/PR.

@Mystic-Slice
Copy link
Collaborator Author

@Mystic-Slice thanks for the changes. Just two more changes are needed as far as I'm concerned.

Don't worry about the tests, they fail because our getitem/setitem is transitioning (among other things) from the old nonzero format to the new one, so basically every operation fails. We will take care of this in the parent branch/PR.

Ahh okok. I was so lost becoz the changes I made had nothing to do with the errors I got. Thnkx for the clarification.
I have made the changes you requested.

@Mystic-Slice
Copy link
Collaborator Author

I made this change because when the DNDarray is converted into a tuple, the meta-data is not really being passed on to the tuple members. This fix solves it.

Copy link
Contributor

@ClaudiaComito ClaudiaComito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been losing (lengthy) comments so I'll submit this review now even if it's not quite finished and fix it later. Thanks @Mystic-Slice !

"""
Return a :class:`~heat.core.dndarray.DNDarray` containing the indices of the elements that are non-zero.. (using ``torch.nonzero``)
Return a Tuple of :class:`~heat.core.dndarray.DNDarray`s, one for each dimension of a,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... one for each dimension of x"

"""
Return a :class:`~heat.core.dndarray.DNDarray` containing the indices of the elements that are non-zero.. (using ``torch.nonzero``)
Return a Tuple of :class:`~heat.core.dndarray.DNDarray`s, one for each dimension of a,
containing the indices of the non-zero elements in that dimension. (using ``torch.nonzero``)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it was there before you started working on it, but I would remove "(using torch.nonzero)" now.

If ``x`` is split then the result is split in the 0th dimension. However, this :class:`~heat.core.dndarray.DNDarray`
can be UNBALANCED as it contains the indices of the non-zero elements on each node.
Returns an array with one entry for each dimension of ``x``, containing the indices of the non-zero elements in that dimension.
The values in ``x`` are always tested and returned in row-major, C-style order.
The values in ``x`` are always tested and returned in column-major, F-style order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they are still tested in row-major, C-style order, otherwise nonzero would return indices starting from the last dimension. Here's a good reference: Internal memory layout of an ndarray.

Comment on lines 77 to 78
gout = list(lcl_nonzero.size())
gout[0] = x.comm.allreduce(gout[0], MPI.SUM)
Copy link
Contributor

@ClaudiaComito ClaudiaComito Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had written a lengthy detailed explanation of what's happening here, and then I lost it. Doh. Anyway.

The original torch.nonzero output shape is (number_of_nonzero_elements, number_of_dimensions). The process-local lcl_nonzero only know about how many nonzero elements are on process. But the global output DNDarray must know the total.

That allreduce call replaces the local value of gout[0] (local number of rows = local number of nonzero elements) with the sum of all gout[0] on all processes, in order to synchronize gout to the global (albeit memory-distributed) number of rows of the output.

Problem: lcl_nonzero has been transposed, so it's no longer the rows, but the columns that represent the number of nonzero elements.

My suggestion:

  • transpose before the if split check
  • adjust line 71 lcl_nonzero[..., x.split] += displs[x.comm.rank] to correct the row (not the column) corresponding to the split dimension
  • the MPI collective call at line 78 should sum along the dimension containing the number of nonzero elements, now the columns

@Mystic-Slice
Copy link
Collaborator Author

@ClaudiaComito Thanks for the detailed review. I have made the changes.

Copy link
Contributor

@ClaudiaComito ClaudiaComito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Mystic-Slice, one more change needed and then I think we're good to go.

[
DNDarray(
dim_indices,
gshape=tuple(dim_indices.size()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gshape needs to be the global size of the array (you calculated it with the allreduce call), dim_indices are local slices of the array.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh makes sense....will make the change

Copy link
Contributor

@ClaudiaComito ClaudiaComito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mystic-Slice can you update the changelog? I'll merge this one into the parent branch afterwards. Thanks a lot!

@Mystic-Slice
Copy link
Collaborator Author

@Mystic-Slice can you update the changelog? I'll merge this one into the parent branch afterwards. Thanks a lot!

Done :)

Copy link
Contributor

@ClaudiaComito ClaudiaComito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks a lot @Mystic-Slice , I will merge this into the parent branch and we can take care of the tests with the rest.

@ClaudiaComito ClaudiaComito merged commit eb297fb into helmholtz-analytics:914_adv-indexing-outshape-outsplit Apr 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ht.nonzero() should return tuple of 1-D arrays instead of an n-D array
5 participants