Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specific communicator for neighborhood communication #1588

Open
wants to merge 10 commits into
base: index-map-pgm
Choose a base branch
from

Conversation

MarcelKoch
Copy link
Member

@MarcelKoch MarcelKoch commented Apr 4, 2024

This PR adds a communicator that only handles neighborhood all-to-all communication. It implements the new interface collective_communicator, which provides different implementations for a selected set of collective mpi routines. Currently, this only includes the non-blocking all-to-all.

The communication uses a fixed pattern, i.e. the send/recv sizes are fixed when the neighborhood communicator is constructed. I would have liked to decouple that, but this would require some knowledge of how the sizes are stored at the interface level. If someone has an idea for that, please let me know.

This is the first part of splitting up #1546.

The neighborhood all-to-all has a bug in OpenMPI < v4.1.0, which makes it necessary to disable the neighborhood communicator in this case. As replacement, there is also a dense all-to-all communicator.

Todo:

  • documentation

PR Stack:

@MarcelKoch MarcelKoch self-assigned this Apr 4, 2024
@ginkgo-bot ginkgo-bot added reg:build This is related to the build system. reg:testing This is related to testing. mod:core This is related to the core module. labels Apr 4, 2024
@MarcelKoch MarcelKoch requested a review from pratikvn April 4, 2024 10:41
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from b42ab92 to 8f104fd Compare April 4, 2024 11:00
@MarcelKoch MarcelKoch modified the milestone: Ginkgo 1.8.0 Apr 5, 2024
@MarcelKoch MarcelKoch requested a review from upsj April 19, 2024 09:19
@MarcelKoch MarcelKoch mentioned this pull request Apr 19, 2024
7 tasks
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from 8f104fd to a0824a8 Compare April 19, 2024 14:39
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from a0824a8 to 8ad3f2f Compare April 22, 2024 11:11
@MarcelKoch MarcelKoch mentioned this pull request Apr 30, 2024
6 tasks
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from 8ad3f2f to 26678b3 Compare April 30, 2024 13:41
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from 26678b3 to 006d67d Compare April 30, 2024 15:20
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from 006d67d to b295b11 Compare May 2, 2024 10:04
@MarcelKoch
Copy link
Member Author

So right now, I'm leaning to just deactivate the neighborhood communicator if the openmpi version isn't sufficient. The CollectiveCommunicator interface is meant to provide different implementations of the all-to-all call, so using a dense all-to-all here feels odd. I think I would rather have a derived class that only does the dense all-to-all and use that instead.

@pratikvn
Copy link
Member

Yes, but the fix is missing in older versions of 4.0.x as well. I am not entirely sure which exact version of OpenMPI our container uses, but it is 4.0.x, and 4.0.7 is still missing the fix. I think the fix was added only in 4.1.x.

But in general, I agree. I think we can just disable the neighborhood communicator for older versions of OpenMPI.

Copy link
Member

@pratikvn pratikvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. Some missing tests and doc updates needed

core/distributed/dense_communicator.cpp Show resolved Hide resolved
include/ginkgo/core/distributed/dense_communicator.hpp Outdated Show resolved Hide resolved
@MarcelKoch MarcelKoch requested review from upsj and removed request for upsj August 27, 2024 12:05
Copy link
Member

@upsj upsj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, mostly interface and implementation

  • What is the moved-from state of a Communicator? Should it match that of an MPI communicator wrapper, and be MPI_NULL, or preserve the MPI communicator?

* @return a request handle
*/
template <typename SendType, typename RecvType>
request i_all_to_all_v(std::shared_ptr<const Executor> exec,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably also be [[nodiscard]]?

Comment on lines +64 to +67
virtual request i_all_to_all_v(std::shared_ptr<const Executor> exec,
const void* send_buffer,
MPI_Datatype send_type, void* recv_buffer,
MPI_Datatype recv_type) const = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this an internal implementation detail like we do with LinOp::apply_impl? That will make future generic logging of these events easier.

Comment on lines +75 to +78
std::fill(other.send_sizes_.begin(), other.send_sizes_.end(), 0);
std::fill(other.send_offsets_.begin(), other.send_offsets_.end(), 0);
std::fill(other.recv_sizes_.begin(), other.recv_sizes_.end(), 0);
std::fill(other.recv_offsets_.begin(), other.recv_offsets_.end(), 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is move-assigning the base supposed to preserve the communicator? Otherwise we should probably resize them to 0

Comment on lines +154 to +161
std::equal(a.send_sizes_.begin(), a.send_sizes_.end(),
b.send_sizes_.begin()) &&
std::equal(a.recv_sizes_.begin(), a.recv_sizes_.end(),
b.recv_sizes_.begin()) &&
std::equal(a.send_offsets_.begin(), a.send_offsets_.end(),
b.send_offsets_.begin()) &&
std::equal(a.recv_offsets_.begin(), a.recv_offsets_.end(),
b.recv_offsets_.begin());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is std::vector::operator== sufficient here? Also do you need this somewhere beyond tests?

Comment on lines +223 to +230
std::equal(a.send_sizes_.begin(), a.send_sizes_.end(),
b.send_sizes_.begin()) &&
std::equal(a.recv_sizes_.begin(), a.recv_sizes_.end(),
b.recv_sizes_.begin()) &&
std::equal(a.send_offsets_.begin(), a.send_offsets_.end(),
b.send_offsets_.begin()) &&
std::equal(a.recv_offsets_.begin(), a.recv_offsets_.end(),
b.recv_offsets_.begin());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

operator==?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:ready-for-review This PR is ready for review mod:core This is related to the core module. reg:build This is related to the build system. reg:testing This is related to testing.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants