Add DistributedTree Nearest query with callback #737

Open
wants to merge 53 commits into base: master
Changes from 37 commits
53 commits
8223d33
Add DistributedTree nearest example
masterleinad Aug 31, 2022
a6cf7e6
Add DistributedTree intersects example
masterleinad Aug 31, 2022
5cf9a5a
Add DistributedTree intersects with callback example
masterleinad Aug 31, 2022
6ce6cd2
avoid global_mpi_rank
masterleinad Aug 31, 2022
a552d22
Make sure communicateResults is batched by ranks
masterleinad Sep 2, 2022
8587a9d
Restoring behavior without calling callback
masterleinad Sep 2, 2022
3928d3f
Convert to lambda
masterleinad Sep 2, 2022
a637f27
Avoid casts
masterleinad Sep 2, 2022
ff91ac3
Compiling
masterleinad Sep 2, 2022
6aee1b5
Compiling and running for openmp with 1 MPI rank
masterleinad Sep 7, 2022
1e6ba19
Almost there
masterleinad Sep 8, 2022
68bc0d5
Fix up?
masterleinad Sep 8, 2022
eda8f6b
Remove output
masterleinad Sep 8, 2022
8379b23
Reimplement old funcionality with the new implementation
masterleinad Sep 9, 2022
9857b23
Fix up
masterleinad Sep 9, 2022
484db22
Add example
masterleinad Sep 9, 2022
526d581
Indentation
masterleinad Sep 9, 2022
c4ad321
Delete other examples
masterleinad Sep 9, 2022
c666cfb
Avoid global_mpi_rank
masterleinad Sep 9, 2022
82505da
Move QueriesWithIndices out of function
masterleinad Sep 9, 2022
86b426a
Fix memory spaces
masterleinad Sep 9, 2022
e0d0d79
Indentation
masterleinad Sep 9, 2022
0ac9cc4
Use MPIEXEC_PREFLAGS in installed examples
masterleinad Sep 12, 2022
38bf593
add more tests
masterleinad Sep 14, 2022
5a26209
Implement new overload in terms of the old one
masterleinad Sep 15, 2022
980404e
Indentation
masterleinad Sep 15, 2022
44ee75d
Cleanup
masterleinad Sep 15, 2022
b5c622d
More cleanup
masterleinad Sep 15, 2022
1c4beaf
InlinePrintCallback -> PrintAndInsert
masterleinad Sep 27, 2022
8d40931
Guard printf for SYCL
masterleinad Sep 27, 2022
b72b0c8
CTAD
masterleinad Sep 27, 2022
9402639
Improve comments
masterleinad Sep 27, 2022
d26ce07
Don't use lambdas
masterleinad Oct 11, 2022
3f66b82
Don't use Kokkos::pair<int, int> for distributed tree tests
masterleinad Oct 12, 2022
7ac6480
Fix execution spaces
masterleinad Oct 13, 2022
2d19d6e
Fix warnings for SYCL
masterleinad Oct 14, 2022
e2a899e
Fix more warnings
masterleinad Oct 14, 2022
74f4a30
Merge remote-tracking branch 'upstream/master' into distributed_knn_c…
masterleinad Jan 10, 2023
7452785
Don't add flags for CUDA-11.0.03-Clang build
masterleinad Jan 10, 2023
85ccd16
Fix compilation in ArborX_DetailsDistributedTreeImpl.hpp
masterleinad Jan 10, 2023
e6f43d5
Avoid spurious whiteline change
masterleinad Jan 10, 2023
591b89e
Add ArborX::DistributedTree::query::nearest_callback profiling section
masterleinad Jan 10, 2023
0baa3fd
Avoid Indices alias
masterleinad Jan 10, 2023
3ffb9a7
Rephrase strategy
masterleinad Jan 10, 2023
901cd91
postprocess_callback->execute_callback
masterleinad Jan 10, 2023
0601e4c
Assert that the callback is only called once per match
masterleinad Jan 10, 2023
ec544e0
Use camel case
masterleinad Jan 10, 2023
071f409
Intersection -> Match
masterleinad Jan 10, 2023
0d11bf1
Don't use ARBORX_ASSERT in device code
masterleinad Jan 11, 2023
dac4018
Also set MPI options for the gcc-12 build
masterleinad Jan 11, 2023
775e55b
Revert "Don't use Kokkos::pair<int, int> for distributed tree tests"
masterleinad Jan 24, 2023
fc59eca
Merge remote-tracking branch 'upstream/master' into distributed_knn_c…
masterleinad Jan 24, 2023
498d407
Fix warning
masterleinad Jan 26, 2023
12 changes: 12 additions & 0 deletions .jenkins/continuous.groovy
@@ -89,6 +89,8 @@ pipeline {
-D CMAKE_CXX_COMPILER=$KOKKOS_DIR/bin/nvcc_wrapper \
-D CMAKE_CXX_EXTENSIONS=OFF \
-D CMAKE_PREFIX_PATH="$KOKKOS_DIR;$ARBORX_DIR" \
-D MPIEXEC_PREFLAGS="--allow-run-as-root" \
-D MPIEXEC_MAX_NUMPROCS=4 \
examples \
'''
sh 'make VERBOSE=1'
@@ -148,6 +150,8 @@ pipeline {
-D CMAKE_CXX_COMPILER=$KOKKOS_DIR/bin/nvcc_wrapper \
-D CMAKE_CXX_EXTENSIONS=OFF \
-D CMAKE_PREFIX_PATH="$KOKKOS_DIR;$ARBORX_DIR" \
-D MPIEXEC_PREFLAGS="--allow-run-as-root" \
-D MPIEXEC_MAX_NUMPROCS=4 \
examples \
'''
sh 'make VERBOSE=1'
@@ -205,6 +209,8 @@ pipeline {
-D CMAKE_CXX_COMPILER=clang++ \
-D CMAKE_CXX_EXTENSIONS=OFF \
-D CMAKE_PREFIX_PATH="$KOKKOS_DIR;$ARBORX_DIR" \
-D MPIEXEC_PREFLAGS="--allow-run-as-root" \
-D MPIEXEC_MAX_NUMPROCS=4 \
masterleinad marked this conversation as resolved.
examples \
'''
sh 'make VERBOSE=1'
@@ -266,6 +272,8 @@ pipeline {
-D CMAKE_CXX_COMPILER=clang++ \
-D CMAKE_CXX_EXTENSIONS=OFF \
-D CMAKE_PREFIX_PATH="$KOKKOS_DIR;$ARBORX_DIR" \
-D MPIEXEC_PREFLAGS="--allow-run-as-root" \
-D MPIEXEC_MAX_NUMPROCS=4 \
examples \
'''
sh 'make VERBOSE=1'
@@ -329,6 +337,8 @@ pipeline {
-D CMAKE_CXX_EXTENSIONS=OFF \
-D CMAKE_BUILD_TYPE=RelWithDebInfo \
-D CMAKE_PREFIX_PATH="$KOKKOS_DIR;$ARBORX_DIR" \
-D MPIEXEC_PREFLAGS="--allow-run-as-root" \
-D MPIEXEC_MAX_NUMPROCS=4 \
examples \
'''
sh 'make VERBOSE=1'
@@ -391,6 +401,8 @@ pipeline {
-D CMAKE_CXX_FLAGS="-Wno-unknown-cuda-version" \
-D CMAKE_PREFIX_PATH="$KOKKOS_DIR;$ARBORX_DIR;$ONE_DPL_DIR" \
-D ONEDPL_PAR_BACKEND=serial \
-D MPIEXEC_PREFLAGS="--allow-run-as-root" \
-D MPIEXEC_MAX_NUMPROCS=4 \
examples \
'''
sh 'make VERBOSE=1'
4 changes: 4 additions & 0 deletions examples/CMakeLists.txt
@@ -13,6 +13,10 @@ add_subdirectory(simple_intersection)

add_subdirectory(molecular_dynamics)

if(ARBORX_ENABLE_MPI)
add_subdirectory(distributed_tree)
endif()

find_package(Boost COMPONENTS program_options)
if(Boost_FOUND)
add_subdirectory(viz)
3 changes: 3 additions & 0 deletions examples/distributed_tree/CMakeLists.txt
@@ -0,0 +1,3 @@
add_executable(ArborX_DistributedTree_KNNCallback.exe distributed_knn_callback.cpp)
target_link_libraries(ArborX_DistributedTree_KNNCallback.exe ArborX::ArborX)
add_test(NAME ArborX_DistributedTree_KNNCallback_Example COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} ${MPIEXEC_MAX_NUMPROCS} ${MPIEXEC_PREFLAGS} ./ArborX_DistributedTree_KNNCallback.exe ${MPIEXEC_POSTFLAGS})
139 changes: 139 additions & 0 deletions examples/distributed_tree/distributed_knn_callback.cpp
@@ -0,0 +1,139 @@
/****************************************************************************
* Copyright (c) 2017-2022 by the ArborX authors *
* All rights reserved. *
* *
* This file is part of the ArborX library. ArborX is *
* distributed under a BSD 3-clause license. For the licensing terms see *
* the LICENSE file in the top-level directory. *
* *
* SPDX-License-Identifier: BSD-3-Clause *
****************************************************************************/

#include <ArborX.hpp>

#include <Kokkos_Core.hpp>

#include <cstdarg>
#include <cstdio>
#include <iostream>
#include <random>
#include <vector>

#include <mpi.h>

using ExecutionSpace = Kokkos::DefaultExecutionSpace;
using MemorySpace = ExecutionSpace::memory_space;

namespace Example
{
template <class Points>
struct Nearest
{
Points points;
int k;
int mpi_rank;
};
template <class Points>
Nearest(Points const &, int, int) -> Nearest<Points>;

struct IndexAndRank
{
int index;
int rank;
};

template <typename DeviceType>
struct PrintAndInsert
{
Kokkos::View<ArborX::Point *, DeviceType> points;
int mpi_rank;

PrintAndInsert(Kokkos::View<ArborX::Point *, DeviceType> const &points_,
int mpi_rank_)
: points(points_)
, mpi_rank(mpi_rank_)
{}

template <typename Predicate, typename OutputFunctor>
KOKKOS_FUNCTION void operator()(Predicate const &predicate,
int primitive_index,
OutputFunctor const &out) const
{
#ifndef KOKKOS_ENABLE_SYCL
auto data = ArborX::getData(predicate);
auto const &point = points(primitive_index);
printf("Intersection for query %d from MPI rank %d on MPI rank %d for "
Contributor

Do we need to guard it with SYCL?

Contributor

I don't like our saying "intersection". Commenting so we can get back to it.

"point %f,%f,%f with index %d\n",
data.index, data.rank, mpi_rank, point[0], point[1], point[2],
primitive_index);
#endif

out({primitive_index, mpi_rank});
}
};

} // namespace Example

template <class Points>
struct ArborX::AccessTraits<Example::Nearest<Points>, ArborX::PredicatesTag>
{
static KOKKOS_FUNCTION std::size_t size(Example::Nearest<Points> const &x)
{
return x.points.extent(0);
}
static KOKKOS_FUNCTION auto get(Example::Nearest<Points> const &x, int i)
{
return attach(ArborX::nearest(x.points(i), x.k),
Example::IndexAndRank{i, x.mpi_rank});
}
using memory_space = MemorySpace;
};

int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
Kokkos::initialize(argc, argv);
{
MPI_Comm comm = MPI_COMM_WORLD;
int comm_rank;
MPI_Comm_rank(comm, &comm_rank);
int comm_size;
MPI_Comm_size(comm, &comm_size);
ArborX::Point lower_left_corner = {static_cast<float>(comm_rank),
static_cast<float>(comm_rank),
static_cast<float>(comm_rank)};
ArborX::Point center = {static_cast<float>(comm_rank) + .5f,
static_cast<float>(comm_rank) + .5f,
static_cast<float>(comm_rank) + .5f};
std::vector points = {lower_left_corner, center};
auto points_device = Kokkos::create_mirror_view_and_copy(
MemorySpace{},
Kokkos::View<ArborX::Point *, Kokkos::HostSpace,
Kokkos::MemoryUnmanaged>(points.data(), points.size()));

ExecutionSpace exec;
ArborX::DistributedTree<MemorySpace> tree(comm, exec, points_device);

Kokkos::View<Example::IndexAndRank *, MemorySpace> values("values", 0);
Kokkos::View<int *, MemorySpace> offsets("offsets", 0);
tree.query(exec, Example::Nearest{points_device, 3, comm_rank},
Example::PrintAndInsert<MemorySpace>(points_device, comm_rank),
values, offsets);

auto host_values =
Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, values);
auto host_offsets =
Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, offsets);
for (unsigned int i = 0; i + 1 < host_offsets.size(); ++i)
{
std::cout << "Results for query " << i << " on MPI rank " << comm_rank
<< '\n';
for (int j = host_offsets(i); j < host_offsets(i + 1); ++j)
std::cout << "point " << host_values(j).index << ", rank "
<< host_values(j).rank << std::endl;
}
}
Kokkos::finalize();
MPI_Finalize();
return 0;
}
150 changes: 146 additions & 4 deletions src/details/ArborX_DetailsDistributedTreeImpl.hpp
@@ -209,6 +209,15 @@ struct DistributedTreeImpl
&distances);
}

template <typename DistributedTree, typename ExecutionSpace,
typename Predicates, typename OutputView, typename OffsetView,
typename Callback>
static std::enable_if_t<Kokkos::is_view<OutputView>{} &&
Kokkos::is_view<OffsetView>{}>
queryDispatch(NearestPredicateTag, DistributedTree const &tree,
ExecutionSpace const &space, Predicates const &queries,
Callback const &callback, OutputView &out, OffsetView &offset);

template <typename DistributedTree, typename ExecutionSpace,
typename Predicates, typename IndicesAndRanks, typename Offset>
static std::enable_if_t<Kokkos::is_view<IndicesAndRanks>{} &&
@@ -217,7 +226,6 @@ struct DistributedTreeImpl
ExecutionSpace const &space, Predicates const &queries,
IndicesAndRanks &values, Offset &offset)
{
// FIXME avoid zipping when distributed nearest callbacks become available
Kokkos::View<int *, ExecutionSpace> indices(
"ArborX::DistributedTree::query::nearest::indices", 0);
Kokkos::View<int *, ExecutionSpace> ranks(
@@ -358,8 +366,8 @@ void DistributedTreeImpl<DeviceType>::deviseStrategy(

// Accumulate total leave count in the local trees until it reaches k which
// is the number of neighbors queried for. Stop if local trees get
// empty because it means that they are no more leaves and there is no point
// on forwarding queries to leafless trees.
// empty because that means that there are no more leaves and there is no
// point in forwarding queries to leafless trees.
using Access = AccessTraits<Predicates, PredicatesTag>;
auto const n_queries = Access::size(queries);
Kokkos::View<int *, DeviceType> new_offset(
@@ -607,7 +615,6 @@ DistributedTreeImpl<DeviceType>::queryDispatchImpl(
Kokkos::Profiling::popRegion();
}
}

masterleinad marked this conversation as resolved.
Kokkos::Profiling::popRegion();
}

@@ -671,6 +678,141 @@ DistributedTreeImpl<DeviceType>::queryDispatch(
Kokkos::Profiling::popRegion();
}

template <typename Query>
struct QueriesWithIndices
{
Query query;
int query_id;
int primitive_index;
};

template <typename DeviceType>
template <typename DistributedTree, typename ExecutionSpace,
typename Predicates, typename OutputView, typename OffsetView,
typename Callback>
std::enable_if_t<Kokkos::is_view<OutputView>{} && Kokkos::is_view<OffsetView>{}>
DistributedTreeImpl<DeviceType>::queryDispatch(
NearestPredicateTag, DistributedTree const &tree,
ExecutionSpace const &space, Predicates const &queries,
Callback const &callback, OutputView &out, OffsetView &offset)
{
using Indices = Kokkos::View<int *, ExecutionSpace>;
masterleinad marked this conversation as resolved.
Indices indices("ArborX::DistributedTree::query::nearest::indices", 0);
masterleinad marked this conversation as resolved.
Kokkos::View<int *, ExecutionSpace> ranks(
"ArborX::DistributedTree::query::nearest::ranks", 0);

// The strategy for executing the callback only for the primitives that are
// closest for every predicate is as follows:
// - Find the ranks and indices for the nearest queries using the overload not
// taking a callback.
// - Send the predicate-primitive pairs to the process where the match was
// found.
// - Execute the callback on the process owning the primitives.
// - Send the result back to the process owning the predicates.
masterleinad marked this conversation as resolved.

// Find the ranks and indices for the nearest queries using the overload not
// taking a callback.
queryDispatchImpl(NearestPredicateTag{}, tree, space, queries, indices,
offset, ranks);
Comment on lines +672 to +673

Contributor

This function is deprecated; we need to use the one that returns pairs of (index, rank).

Collaborator Author

No, the Impl version queryDispatchImpl is not deprecated (and I'd rather use this one internally).

Kokkos::Profiling::pushRegion(
"ArborX::DistributedTree::query::nearest::postprocess_callback");
masterleinad marked this conversation as resolved.

// Send the predicate-primitive pairs to the process where the match was
// found.
auto comm = tree.getComm();
int comm_rank;
MPI_Comm_rank(comm, &comm_rank);

using Access = AccessTraits<Predicates, PredicatesTag>;
using Query = typename AccessTraitsHelper<Access>::type;

Kokkos::View<QueriesWithIndices<Query> *, typename DeviceType::memory_space>
exported_queries_with_indices(
Kokkos::view_alloc(
space, Kokkos::WithoutInitializing,
"ArborX::DistributedTree::query::exported_queries_with_indices"),
ranks.size());
Kokkos::parallel_for(
"ArborX::DistributedTree::query::zip_queries_and_primitives",
Kokkos::RangePolicy<ExecutionSpace>(space, 0, Access::size(queries)),
KOKKOS_LAMBDA(int q) {
using index_type = typename OffsetView::value_type;
for (index_type i = offset(q); i < offset(q + 1); ++i)
exported_queries_with_indices(i) = {Access::get(queries, q), q,
indices(i)};
});

Distributor<DeviceType> distributor(comm);
auto const n_imports = distributor.createFromSends(space, ranks);

Kokkos::View<QueriesWithIndices<Query> *, typename DeviceType::memory_space>
imported_queries_with_indices(
Kokkos::view_alloc(
space, Kokkos::WithoutInitializing,
"ArborX::DistributedTree::query::imported_queries_with_indices"),
n_imports);

sendAcrossNetwork(space, distributor, exported_queries_with_indices,
imported_queries_with_indices);

// Execute the callback on the process owning the primitives.
OutputView remote_out(
Kokkos::view_alloc(space, Kokkos::WithoutInitializing,
"ArborX::DistributedTree::query::remote_out"),
n_imports);
KokkosExt::reallocWithoutInitializing(space, indices, n_imports);
Kokkos::parallel_for(
"ArborX::DistributedTree::query::execute_callbacks",
Kokkos::RangePolicy<ExecutionSpace>(space, 0,
imported_queries_with_indices.size()),
KOKKOS_LAMBDA(int i) {
callback(imported_queries_with_indices(i).query,
imported_queries_with_indices(i).primitive_index,
[&](typename OutputView::value_type const &value) {
remote_out(i) = value;
indices(i) = imported_queries_with_indices(i).query_id;
});
});
Contributor

This seems wrong for a few reasons.

First, it assumes that the callback will output a single element for each match. In practice, we don't know how many elements the callback will spit out per query. It could be 0, or it could be more than 1. So we need to do two passes here, similar to CrsGraphWrapperImpl. I'm not sure what the easiest way to do that is without massive code duplication.

Second, it assumes an inline callback and not a post callback. We could try to support a post callback, but its meaning would have to change, as we cannot execute it on all of the results coming from different ranks.

Collaborator Author

We could also just assume that the callback only returns a single element to get started, right?

Contributor

We could. The thing is, this condition is undetectable, so the failure for a user would likely be undefined behavior or a segfault, which in the distributed setting would be tricky to debug.

Collaborator Author

I added an assertion.
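
The two-pass idea raised in this thread (count first, then fill) can be illustrated with a minimal serial sketch. It uses std::vector instead of Kokkos views and is not the CrsGraphWrapperImpl code; `EmitMultiples` and `twoPassFill` are hypothetical names introduced only for this illustration. The sketch assumes the callback has no side effects other than calling `out`, because it is invoked twice per match.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical callback: emits zero, one, or two values depending on the match.
struct EmitMultiples
{
  template <typename Out>
  void operator()(int match, Out const &out) const
  {
    for (int k = 0; k < match % 3; ++k)
      out(10 * match + k);
  }
};

// Two-pass fill: count the outputs per match, turn the counts into offsets
// with a prefix sum, then run the callback again to store the values.
template <typename Callback>
void twoPassFill(std::vector<int> const &matches, Callback const &callback,
                 std::vector<int> &values, std::vector<int> &offsets)
{
  std::size_t const n = matches.size();
  offsets.assign(n + 1, 0);

  // Pass 1: the output functor only counts how often it is called.
  for (std::size_t i = 0; i < n; ++i)
  {
    int count = 0;
    callback(matches[i], [&count](int) { ++count; });
    offsets[i + 1] = count;
  }

  // offsets[0] stays 0, so an inclusive scan yields the start of each range.
  std::partial_sum(offsets.begin(), offsets.end(), offsets.begin());
  values.resize(offsets.back());

  // Pass 2: the output functor writes into the range reserved for match i.
  for (std::size_t i = 0; i < n; ++i)
  {
    int pos = offsets[i];
    callback(matches[i], [&values, &pos](int v) { values[pos++] = v; });
    // Both passes must have produced the same number of outputs.
    assert(pos == offsets[i + 1]);
  }
}
```

Calling `twoPassFill(matches, EmitMultiples{}, values, offsets)` leaves `values` and `offsets` in the same CRS-style layout the example program reads back, without assuming one output per match. The cost is that every callback runs twice, so a callback that prints (like PrintAndInsert) would print twice.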


// Send the result back to the process owning the predicates.
Distributor<DeviceType> back_distributor(comm);
auto const &dest = distributor.get_sources();
auto const &off = distributor.get_source_offsets();

Kokkos::View<int const *, Kokkos::HostSpace> host_destinations(dest.data(),
dest.size());
Kokkos::View<int const *, Kokkos::HostSpace> host_offsets(off.data(),
off.size());
typename DeviceType::memory_space memory_space;
Kokkos::View<int *, DeviceType> destinations(
Kokkos::view_alloc(space, Kokkos::WithoutInitializing,
"ArborX::DistributedTree::query::destinations"),
dest.size());
Kokkos::deep_copy(space, destinations, host_destinations);
Kokkos::View<int *, DeviceType> offsets(
Kokkos::view_alloc(space, Kokkos::WithoutInitializing,
"ArborX::DistributedTree::query::offsets"),
off.size());
Kokkos::deep_copy(space, offsets, host_offsets);
auto const n_imports_back =
back_distributor.createFromSends(space, destinations, offsets);
KokkosExt::reallocWithoutInitializing(space, out, n_imports_back);
Kokkos::View<int *, DeviceType> query_ids(
Kokkos::view_alloc(space, Kokkos::WithoutInitializing,
"ArborX::DistributedTree::query::nearest::query_ids"),
n_imports_back);

Comment on lines +752 to +770

Contributor

This could be simplified by using UnmanagedView from the distributor data pointers, and doing an immediate deep_copy.

Collaborator Author

Isn't that what I'm doing already?

// FIXME does combining communication here help?
sendAcrossNetwork(space, back_distributor, remote_out, out);
sendAcrossNetwork(space, back_distributor, indices, query_ids);

auto const permutation = ArborX::Details::sortObjects(space, query_ids);
ArborX::Details::applyPermutation(space, permutation, out);

Kokkos::Profiling::popRegion();
}
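
The final reordering step above (sort the returned query ids, then apply the same permutation to the results) can be sketched serially as follows. This is a stand-in using std::vector and the standard library, not the ArborX implementation; `sortedPermutation` and `reorder` are hypothetical helpers named only for this illustration, and the use of a stable sort is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Returns the permutation that lists the positions of `keys` in ascending key
// order. The sort is stable, so entries with equal keys keep their arrival
// order.
std::vector<int> sortedPermutation(std::vector<int> const &keys)
{
  std::vector<int> permutation(keys.size());
  std::iota(permutation.begin(), permutation.end(), 0);
  std::stable_sort(permutation.begin(), permutation.end(),
                   [&keys](int a, int b) { return keys[a] < keys[b]; });
  return permutation;
}

// Reorders `values` so that the new entry i is the old entry permutation[i].
template <typename T>
void reorder(std::vector<int> const &permutation, std::vector<T> &values)
{
  assert(permutation.size() == values.size());
  std::vector<T> reordered(values.size());
  for (std::size_t i = 0; i < values.size(); ++i)
    reordered[i] = values[permutation[i]];
  values = std::move(reordered);
}
```

With query ids {2, 0, 1, 0} arriving alongside results, `sortedPermutation` followed by `reorder` groups the results for query 0 first, then query 1, then query 2, so that results for each query are contiguous.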

template <typename DeviceType>
template <typename ExecutionSpace, typename View, typename... OtherViews>
void DistributedTreeImpl<DeviceType>::sortResults(ExecutionSpace const &space,