#2240: Add Rabenseifner and Recursive doubling allreduce algorithms for ObjGroup #2272

JacobDomagala
Contributor

Fixes #2240

@JacobDomagala JacobDomagala self-assigned this Apr 15, 2024
@JacobDomagala JacobDomagala linked an issue Apr 15, 2024 that may be closed by this pull request

github-actions bot commented Apr 15, 2024

Pipelines results

PR tests (gcc-12, ubuntu, mpich)

Build for 16da9e7 (2024-04-16 12:06:29 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-12, ubuntu, mpich, verbose)

Build for 5ce7cd9 (2024-06-18 16:35:02 UTC)

FAILED: tests/CMakeFiles/allreduce.dir/perf/allreduce.cc.o 
/usr/bin/ccache /usr/lib/ccache/g++ -DJSON_USE_IMPLICIT_CONVERSIONS=1 -DVT_NO_COLOR_ENABLED -I/vt/lib/CLI -I/vt/lib/json/include -I/vt/lib/brotli/c/include -I/vt/lib/libfort/lib -I/build/vt/release -I/vt/src -isystem /vt/lib/fmt/include -isystem /vt/lib/EngFormat-Cpp/include -isystem /build/checkpoint/install/include -O3 -DNDEBUG -fdiagnostics-color=always -std=c++17 -MD -MT tests/CMakeFiles/allreduce.dir/perf/allreduce.cc.o -MF tests/CMakeFiles/allreduce.dir/perf/allreduce.cc.o.d -o tests/CMakeFiles/allreduce.dir/perf/allreduce.cc.o -c /vt/tests/perf/allreduce.cc
/vt/tests/perf/allreduce.cc:49:10: fatal error: Kokkos_Core.hpp: No such file or directory
   49 | #include <Kokkos_Core.hpp>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.


Build log


PR tests (gcc-12, ubuntu, mpich, verbose, kokkos)

Build for 57b8cab (2024-07-18 15:27:27 UTC)

Compilation - successful

Testing - passed

Build log


@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch 2 times, most recently from 17cfcbd to 16da9e7 Compare April 16, 2024 12:06
@JacobDomagala
Contributor Author

Results of running allreduce on std::vector<int32_t> with 65536 elements

RUNNING TEST: test_reduce (Number of runs = 25) ...

Test results for test_reduce running on 16 nodes:
[7] Results for test_reduce (avg: 1.730ms stdev: 0.526ms min: 1.359ms max: 4.086ms)
[10] Results for test_reduce (avg: 1.733ms stdev: 0.532ms min: 1.359ms max: 4.120ms)
[6] Results for test_reduce (avg: 1.734ms stdev: 0.532ms min: 1.360ms max: 4.118ms)
[0] Results for test_reduce (avg: 1.759ms stdev: 0.527ms min: 1.395ms max: 4.117ms)
[13] Results for test_reduce (avg: 1.768ms stdev: 0.535ms min: 1.396ms max: 4.178ms)
[1] Results for test_reduce (avg: 1.767ms stdev: 0.536ms min: 1.392ms max: 4.175ms)
[14] Results for test_reduce (avg: 1.770ms stdev: 0.537ms min: 1.398ms max: 4.186ms)
[15] Results for test_reduce (avg: 1.762ms stdev: 0.524ms min: 1.397ms max: 4.144ms)
[2] Results for test_reduce (avg: 1.729ms stdev: 0.531ms min: 1.358ms max: 4.116ms)
[9] Results for test_reduce (avg: 1.768ms stdev: 0.536ms min: 1.394ms max: 4.178ms)
[8] Results for test_reduce (avg: 1.771ms stdev: 0.536ms min: 1.398ms max: 4.180ms)
[4] Results for test_reduce (avg: 1.735ms stdev: 0.529ms min: 1.358ms max: 4.097ms)
[3] Results for test_reduce (avg: 1.729ms stdev: 0.527ms min: 1.357ms max: 4.087ms)
[5] Results for test_reduce (avg: 1.756ms stdev: 0.538ms min: 1.380ms max: 4.173ms)
[11] Results for test_reduce (avg: 1.734ms stdev: 0.532ms min: 1.363ms max: 4.120ms)
[12] Results for test_reduce (avg: 1.733ms stdev: 0.531ms min: 1.360ms max: 4.110ms)


RUNNING TEST: test_allreduce_rabenseifner (Number of runs = 25) ...

Test results for test_allreduce_rabenseifner running on 16 nodes:
[7] Results for test_allreduce_rabenseifner (avg: 1.227ms stdev: 0.082ms min: 1.084ms max: 1.430ms)
[4] Results for test_allreduce_rabenseifner (avg: 1.226ms stdev: 0.078ms min: 1.084ms max: 1.398ms)
[6] Results for test_allreduce_rabenseifner (avg: 1.230ms stdev: 0.082ms min: 1.087ms max: 1.441ms)
[0] Results for test_allreduce_rabenseifner (avg: 1.248ms stdev: 0.081ms min: 1.104ms max: 1.442ms)
[13] Results for test_allreduce_rabenseifner (avg: 1.249ms stdev: 0.082ms min: 1.109ms max: 1.447ms)
[1] Results for test_allreduce_rabenseifner (avg: 1.249ms stdev: 0.084ms min: 1.104ms max: 1.471ms)
[14] Results for test_allreduce_rabenseifner (avg: 1.248ms stdev: 0.080ms min: 1.109ms max: 1.444ms)
[5] Results for test_allreduce_rabenseifner (avg: 1.244ms stdev: 0.081ms min: 1.098ms max: 1.439ms)
[9] Results for test_allreduce_rabenseifner (avg: 1.250ms stdev: 0.081ms min: 1.107ms max: 1.449ms)
[15] Results for test_allreduce_rabenseifner (avg: 1.250ms stdev: 0.081ms min: 1.108ms max: 1.451ms)
[2] Results for test_allreduce_rabenseifner (avg: 1.226ms stdev: 0.082ms min: 1.085ms max: 1.429ms)
[8] Results for test_allreduce_rabenseifner (avg: 1.251ms stdev: 0.081ms min: 1.108ms max: 1.447ms)
[11] Results for test_allreduce_rabenseifner (avg: 1.231ms stdev: 0.080ms min: 1.089ms max: 1.420ms)
[3] Results for test_allreduce_rabenseifner (avg: 1.226ms stdev: 0.081ms min: 1.083ms max: 1.429ms)
[12] Results for test_allreduce_rabenseifner (avg: 1.227ms stdev: 0.082ms min: 1.087ms max: 1.413ms)
[10] Results for test_allreduce_rabenseifner (avg: 1.228ms stdev: 0.079ms min: 1.087ms max: 1.410ms)


RUNNING TEST: test_allreduce_recursive_doubling (Number of runs = 25) ...

Test results for test_allreduce_recursive_doubling running on 16 nodes:
[12] Results for test_allreduce_recursive_doubling (avg: 1.888ms stdev: 0.163ms min: 1.699ms max: 2.383ms)
[4] Results for test_allreduce_recursive_doubling (avg: 1.884ms stdev: 0.163ms min: 1.697ms max: 2.375ms)
[6] Results for test_allreduce_recursive_doubling (avg: 1.886ms stdev: 0.160ms min: 1.699ms max: 2.383ms)
[0] Results for test_allreduce_recursive_doubling (avg: 1.930ms stdev: 0.160ms min: 1.744ms max: 2.426ms)
[13] Results for test_allreduce_recursive_doubling (avg: 1.926ms stdev: 0.167ms min: 1.705ms max: 2.426ms)
[15] Results for test_allreduce_recursive_doubling (avg: 1.932ms stdev: 0.162ms min: 1.743ms max: 2.424ms)
[2] Results for test_allreduce_recursive_doubling (avg: 1.884ms stdev: 0.163ms min: 1.694ms max: 2.380ms)
[1] Results for test_allreduce_recursive_doubling (avg: 1.930ms stdev: 0.163ms min: 1.739ms max: 2.427ms)
[14] Results for test_allreduce_recursive_doubling (avg: 1.931ms stdev: 0.162ms min: 1.744ms max: 2.428ms)
[9] Results for test_allreduce_recursive_doubling (avg: 1.927ms stdev: 0.164ms min: 1.739ms max: 2.425ms)
[5] Results for test_allreduce_recursive_doubling (avg: 1.931ms stdev: 0.163ms min: 1.741ms max: 2.426ms)
[8] Results for test_allreduce_recursive_doubling (avg: 1.934ms stdev: 0.162ms min: 1.748ms max: 2.425ms)
[11] Results for test_allreduce_recursive_doubling (avg: 1.924ms stdev: 0.164ms min: 1.732ms max: 2.425ms)
[3] Results for test_allreduce_recursive_doubling (avg: 1.884ms stdev: 0.163ms min: 1.695ms max: 2.380ms)
[10] Results for test_allreduce_recursive_doubling (avg: 1.887ms stdev: 0.164ms min: 1.699ms max: 2.381ms)
[7] Results for test_allreduce_recursive_doubling (avg: 1.882ms stdev: 0.165ms min: 1.696ms max: 2.379ms)
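For readers unfamiliar with the algorithm behind test_allreduce_rabenseifner: in its usual textbook form it is a reduce-scatter by recursive halving followed by an allgather by recursive doubling, so each rank only sends a shrinking slice of the vector in the first phase. The sketch below simulates that pattern in a single process. It is not the code from this PR; `simulateRabenseifner` and `Buffers` are hypothetical names, and it assumes a power-of-two number of ranks and a sum reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Buffers = std::vector<std::vector<std::int32_t>>;

// Single-process simulation of a Rabenseifner-style allreduce (sum).
// ranks[r] is the local vector of simulated rank r; all vectors have the
// same length. Assumption: the number of ranks is a power of two.
inline void simulateRabenseifner(Buffers& ranks) {
  std::size_t const num_ranks = ranks.size();
  assert(num_ranks > 0 && (num_ranks & (num_ranks - 1)) == 0);
  std::size_t const num_elems = ranks[0].size();

  // Half-open window [lo[r], hi[r]) that rank r is currently responsible for.
  std::vector<std::size_t> lo(num_ranks, 0);
  std::vector<std::size_t> hi(num_ranks, num_elems);

  // Phase 1: reduce-scatter by recursive halving. Each step halves the
  // window and adds the partner's partial sums for the half that is kept.
  for (std::size_t dist = num_ranks / 2; dist >= 1; dist /= 2) {
    Buffers const before = ranks; // values "sent" during this step
    for (std::size_t r = 0; r < num_ranks; ++r) {
      std::size_t const partner = r ^ dist;
      std::size_t const mid = lo[r] + (hi[r] - lo[r]) / 2;
      if ((r & dist) == 0) {
        hi[r] = mid; // keep the lower half
      } else {
        lo[r] = mid; // keep the upper half
      }
      for (std::size_t i = lo[r]; i < hi[r]; ++i) {
        ranks[r][i] += before[partner][i];
      }
    }
  }

  // Phase 2: allgather by recursive doubling. Each step copies the
  // partner's fully reduced window, doubling the window that is owned.
  for (std::size_t dist = 1; dist < num_ranks; dist *= 2) {
    auto const lo_before = lo;
    auto const hi_before = hi;
    for (std::size_t r = 0; r < num_ranks; ++r) {
      std::size_t const partner = r ^ dist;
      // The partner's window is disjoint from ours and is not written in
      // this step, so reading it in place is safe.
      for (std::size_t i = lo_before[partner]; i < hi_before[partner]; ++i) {
        ranks[r][i] = ranks[partner][i];
      }
      lo[r] = std::min(lo_before[r], lo_before[partner]);
      hi[r] = std::max(hi_before[r], hi_before[partner]);
    }
  }
}
```

Calling `simulateRabenseifner` on a `Buffers` with one vector per simulated rank leaves every vector holding the element-wise sum. A real implementation would also need to handle a non-power-of-two number of ranks, which this sketch skips.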

@JacobDomagala
Contributor Author

Results of running allreduce on std::vector<int32_t> with 2 elements


Test results for test_reduce running on 16 nodes:
[14] Results for test_reduce (avg: 0.147ms stdev: 0.030ms min: 0.128ms max: 0.263ms)
[11] Results for test_reduce (avg: 0.146ms stdev: 0.029ms min: 0.128ms max: 0.260ms)
[15] Results for test_reduce (avg: 0.146ms stdev: 0.030ms min: 0.128ms max: 0.264ms)
[0] Results for test_reduce (avg: 0.144ms stdev: 0.029ms min: 0.125ms max: 0.259ms)
[13] Results for test_reduce (avg: 0.147ms stdev: 0.029ms min: 0.127ms max: 0.261ms)
[6] Results for test_reduce (avg: 0.146ms stdev: 0.030ms min: 0.127ms max: 0.263ms)
[8] Results for test_reduce (avg: 0.147ms stdev: 0.030ms min: 0.127ms max: 0.268ms)
[9] Results for test_reduce (avg: 0.146ms stdev: 0.030ms min: 0.127ms max: 0.265ms)
[4] Results for test_reduce (avg: 0.146ms stdev: 0.030ms min: 0.127ms max: 0.267ms)
[10] Results for test_reduce (avg: 0.147ms stdev: 0.030ms min: 0.128ms max: 0.266ms)
[5] Results for test_reduce (avg: 0.146ms stdev: 0.030ms min: 0.127ms max: 0.265ms)
[3] Results for test_reduce (avg: 0.145ms stdev: 0.030ms min: 0.127ms max: 0.265ms)
[7] Results for test_reduce (avg: 0.145ms stdev: 0.030ms min: 0.127ms max: 0.263ms)
[2] Results for test_reduce (avg: 0.146ms stdev: 0.029ms min: 0.128ms max: 0.259ms)
[12] Results for test_reduce (avg: 0.147ms stdev: 0.030ms min: 0.129ms max: 0.268ms)
[1] Results for test_reduce (avg: 0.146ms stdev: 0.030ms min: 0.127ms max: 0.262ms)


RUNNING TEST: test_allreduce_rabenseifner (Number of runs = 25) ...

Test results for test_allreduce_rabenseifner running on 16 nodes:
[11] Results for test_allreduce_rabenseifner (avg: 0.143ms stdev: 0.011ms min: 0.135ms max: 0.184ms)
[5] Results for test_allreduce_rabenseifner (avg: 0.143ms stdev: 0.011ms min: 0.135ms max: 0.183ms)
[13] Results for test_allreduce_rabenseifner (avg: 0.143ms stdev: 0.011ms min: 0.135ms max: 0.183ms)
[0] Results for test_allreduce_rabenseifner (avg: 0.141ms stdev: 0.011ms min: 0.133ms max: 0.181ms)
[2] Results for test_allreduce_rabenseifner (avg: 0.142ms stdev: 0.011ms min: 0.133ms max: 0.183ms)
[15] Results for test_allreduce_rabenseifner (avg: 0.143ms stdev: 0.011ms min: 0.135ms max: 0.184ms)
[6] Results for test_allreduce_rabenseifner (avg: 0.142ms stdev: 0.012ms min: 0.134ms max: 0.184ms)
[8] Results for test_allreduce_rabenseifner (avg: 0.143ms stdev: 0.012ms min: 0.134ms max: 0.185ms)
[4] Results for test_allreduce_rabenseifner (avg: 0.142ms stdev: 0.012ms min: 0.134ms max: 0.183ms)
[3] Results for test_allreduce_rabenseifner (avg: 0.141ms stdev: 0.012ms min: 0.133ms max: 0.183ms)
[7] Results for test_allreduce_rabenseifner (avg: 0.142ms stdev: 0.012ms min: 0.134ms max: 0.183ms)
[9] Results for test_allreduce_rabenseifner (avg: 0.142ms stdev: 0.011ms min: 0.134ms max: 0.183ms)
[12] Results for test_allreduce_rabenseifner (avg: 0.144ms stdev: 0.011ms min: 0.136ms max: 0.184ms)
[14] Results for test_allreduce_rabenseifner (avg: 0.143ms stdev: 0.011ms min: 0.135ms max: 0.184ms)
[1] Results for test_allreduce_rabenseifner (avg: 0.142ms stdev: 0.011ms min: 0.134ms max: 0.182ms)
[10] Results for test_allreduce_rabenseifner (avg: 0.143ms stdev: 0.012ms min: 0.135ms max: 0.184ms)

RUNNING TEST: test_allreduce_recursive_doubling (Number of runs = 25) ...

Test results for test_allreduce_recursive_doubling running on 16 nodes:
[11] Results for test_allreduce_recursive_doubling (avg: 0.117ms stdev: 0.061ms min: 0.092ms max: 0.391ms)
[5] Results for test_allreduce_recursive_doubling (avg: 0.117ms stdev: 0.061ms min: 0.093ms max: 0.391ms)
[13] Results for test_allreduce_recursive_doubling (avg: 0.116ms stdev: 0.061ms min: 0.092ms max: 0.391ms)
[0] Results for test_allreduce_recursive_doubling (avg: 0.114ms stdev: 0.061ms min: 0.090ms max: 0.387ms)
[2] Results for test_allreduce_recursive_doubling (avg: 0.115ms stdev: 0.061ms min: 0.090ms max: 0.389ms)
[15] Results for test_allreduce_recursive_doubling (avg: 0.116ms stdev: 0.061ms min: 0.092ms max: 0.388ms)
[8] Results for test_allreduce_recursive_doubling (avg: 0.116ms stdev: 0.061ms min: 0.091ms max: 0.389ms)
[4] Results for test_allreduce_recursive_doubling (avg: 0.116ms stdev: 0.061ms min: 0.091ms max: 0.388ms)
[6] Results for test_allreduce_recursive_doubling (avg: 0.116ms stdev: 0.061ms min: 0.091ms max: 0.390ms)
[1] Results for test_allreduce_recursive_doubling (avg: 0.115ms stdev: 0.061ms min: 0.091ms max: 0.387ms)
[14] Results for test_allreduce_recursive_doubling (avg: 0.116ms stdev: 0.061ms min: 0.092ms max: 0.390ms)
[7] Results for test_allreduce_recursive_doubling (avg: 0.115ms stdev: 0.061ms min: 0.091ms max: 0.389ms)
[9] Results for test_allreduce_recursive_doubling (avg: 0.116ms stdev: 0.061ms min: 0.091ms max: 0.389ms)
[3] Results for test_allreduce_recursive_doubling (avg: 0.115ms stdev: 0.061ms min: 0.090ms max: 0.388ms)
[10] Results for test_allreduce_recursive_doubling (avg: 0.116ms stdev: 0.061ms min: 0.092ms max: 0.389ms)
[12] Results for test_allreduce_recursive_doubling (avg: 0.117ms stdev: 0.061ms min: 0.093ms max: 0.393ms)
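For context on test_allreduce_recursive_doubling: in the textbook form of recursive doubling, every rank exchanges its whole vector with the partner `rank ^ mask` for mask = 1, 2, 4, ..., so the result is complete after log2(P) steps. The sketch below simulates that exchange pattern in a single process. It is not the code from this PR; `simulateRecursiveDoubling` is a hypothetical name, and it assumes a power-of-two number of ranks and a sum reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Single-process simulation of a recursive-doubling allreduce (sum).
// ranks[r] is the local vector of simulated rank r; all vectors have the
// same length. Assumption: the number of ranks is a power of two.
inline void simulateRecursiveDoubling(
  std::vector<std::vector<std::int32_t>>& ranks
) {
  std::size_t const num_ranks = ranks.size();
  assert(num_ranks > 0 && (num_ranks & (num_ranks - 1)) == 0);

  for (std::size_t mask = 1; mask < num_ranks; mask <<= 1) {
    // Snapshot so that every rank "sends" the value it held before this step.
    auto const before = ranks;
    for (std::size_t r = 0; r < num_ranks; ++r) {
      std::size_t const partner = r ^ mask;
      // Combine the full partner vector into the local one.
      for (std::size_t i = 0; i < ranks[r].size(); ++i) {
        ranks[r][i] += before[partner][i];
      }
    }
  }
}
```

Because the full vector travels in every step, the per-step cost grows with the vector length, while the number of steps stays at log2(P).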

@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch 2 times, most recently from 53da893 to e3fa49b Compare May 7, 2024 16:11
@JacobDomagala
Contributor Author

Still missing:

  • cache the allreduce ObjGroup that we created for allreduce messages
  • move the allreduce logic to proper header files (for now it resides in the performance test file, to speed up recompiling the tests)
  • write unit tests

@ppebay ppebay changed the title 2240: Add Rabenseifner and Recursive doubling allreduce algorithms for ObjGroup #2240: Add Rabenseifner and Recursive doubling allreduce algorithms for ObjGroup May 16, 2024
@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch 2 times, most recently from 80d6712 to f01bad5 Compare May 21, 2024 16:04
@JacobDomagala
Contributor Author

JacobDomagala commented May 27, 2024

Regarding the issue with the Rabenseifner algorithm, I was thinking we could introduce a wrapper for the various data types. We could add specializations for common known types (e.g., std::vector, Kokkos::View).

If users want to use their own custom wrapper, they should provide size, at, and set functions (and probably a few more that allow for data splitting). If they want to use the Rabenseifner algorithm, we could add a constexpr check for that, and if it fails, we fall back to reduce->bcast or Recursive Doubling.

#include <cstddef>
#include <vector>

#ifdef VT_KOKKOS_ENABLED
#include <Kokkos_Core.hpp>
#endif

template <typename Container>
class DataHandler {
public:
  using Scalar = float;

  static size_t size(const Container& data);
  static Scalar& at(Container& data, size_t idx);
  static void set(Container& data, size_t idx, const Scalar& value);
  static Container split(Container& data, size_t start, size_t end);
};

template <typename T>
class DataHandler<std::vector<T>> {
public:
  using Scalar = T;
  static size_t size(const std::vector<T>& data) { return data.size(); }
  static T at(const std::vector<T>& data, size_t idx) { return data[idx]; }
  static T& at(std::vector<T>& data, size_t idx) { return data[idx]; }
  static void set(std::vector<T>& data, size_t idx, const T& value) {
    data[idx] = value;
  }
  static std::vector<T> split(std::vector<T>& data, size_t start, size_t end) {
    return std::vector<T>{data.begin() + start, data.begin() + end};
  }
};

#ifdef VT_KOKKOS_ENABLED
template <typename T, typename... Props>
class DataHandler<Kokkos::View<T*, Props...>> {
public:
  using Scalar = T;
  static size_t size(const Kokkos::View<T*, Props...>& data) {
    return data.extent(0);
  }
  static T at(const Kokkos::View<T*, Props...>& data, size_t idx) {
    return data(idx);
  }
  static T& at(Kokkos::View<T*, Props...>& data, size_t idx) {
    return data(idx);
  }
  static void
  set(Kokkos::View<T*, Props...>& data, size_t idx, const T& value) {
    data(idx) = value;
  }
};
#endif // VT_KOKKOS_ENABLED
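One way the constexpr check mentioned above could look is a detection trait on `DataHandler<Container>::split`, with the algorithm chosen via `if constexpr`. This is only a sketch under assumptions: the `DataHandler` below is a reduced stand-in for the one proposed in this comment (its primary template is deliberately left empty), and `HasSplit`, `Algorithm`, and `chooseAlgorithm` are hypothetical names that do not come from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <type_traits>
#include <utility>
#include <vector>

// Reduced stand-in for the proposed DataHandler. The primary template
// provides nothing, so unknown containers are treated as non-splittable.
template <typename Container>
struct DataHandler {};

template <typename T>
struct DataHandler<std::vector<T>> {
  static std::size_t size(std::vector<T> const& data) { return data.size(); }

  // Returns a copy of the elements in [start, end).
  static std::vector<T>
  split(std::vector<T>& data, std::size_t start, std::size_t end) {
    assert(start <= end && end <= data.size());
    auto const first = data.begin() + static_cast<std::ptrdiff_t>(start);
    auto const last = data.begin() + static_cast<std::ptrdiff_t>(end);
    return std::vector<T>(first, last);
  }
};

// Detection trait: true when DataHandler<Container>::split(data, start, end)
// is a well-formed expression.
template <typename Container, typename = void>
struct HasSplit : std::false_type {};

template <typename Container>
struct HasSplit<
  Container,
  std::void_t<decltype(DataHandler<Container>::split(
    std::declval<Container&>(), std::size_t{}, std::size_t{}
  ))>
> : std::true_type {};

enum class Algorithm { Rabenseifner, RecursiveDoubling };

// Picks Rabenseifner only when the container can be split; otherwise falls
// back to recursive doubling (reduce->bcast would be another fallback).
template <typename Container>
constexpr Algorithm chooseAlgorithm() {
  if constexpr (HasSplit<Container>::value) {
    return Algorithm::Rabenseifner;
  } else {
    return Algorithm::RecursiveDoubling;
  }
}

static_assert(chooseAlgorithm<std::vector<int>>() == Algorithm::Rabenseifner);
static_assert(chooseAlgorithm<std::list<int>>() == Algorithm::RecursiveDoubling);
```

With this shape, a user-provided `DataHandler` specialization opts into Rabenseifner simply by defining `split`; a specialization without it, like the Kokkos::View one above, would take the fallback path.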

@lifflander
Collaborator

Let's go with the DataHandler approach.

@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch 5 times, most recently from 081570b to 84e8f72 Compare June 4, 2024 16:45
@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch from 84e8f72 to 34615ad Compare June 7, 2024 05:55
@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch 2 times, most recently from 5ce7cd9 to 168f39a Compare June 18, 2024 17:35
@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch from b354603 to 5441ffc Compare July 6, 2024 10:36
@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch 3 times, most recently from 5c4fed6 to eafa364 Compare July 17, 2024 21:40
…nd fix compile issues related to using Kokkos::View for allreduce
@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch from eafa364 to f5e685b Compare July 18, 2024 13:32
@JacobDomagala JacobDomagala force-pushed the 2240-use-a-proper-all-reduce-algorithm-for-proxyallreduce branch from f5e685b to 57b8cab Compare July 18, 2024 15:27
@JacobDomagala
Contributor Author

Closing this PR as #2337 contains updated code.

Development

Successfully merging this pull request may close these issues.

Use a proper all-reduce algorithm for ObjGroup
2 participants