Cannot implement Python without MPI_Probe #3

Closed
dalcinl opened this issue Jun 25, 2024 · 9 comments

Comments

@dalcinl
Collaborator

dalcinl commented Jun 25, 2024

To implement the equivalent of pickle-based mpi4py communication, I really need some sort of probe functionality. The minimum required would be an implementation of MPI_Probe that correctly sets the status field such that querying for source and message count works as expected.
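To illustrate why probe matters for pickled objects: the receiver knows neither the sender nor the serialized size up front, so it has to ask before it can allocate a buffer. The sketch below uses a toy in-process `Mailbox` class (hypothetical, standing in for the MPI library; it is not the mpicd or mpi4py API) whose `probe` reports source and byte count without consuming the message.

```python
import pickle
from collections import deque


class Mailbox:
    """Toy stand-in for an MPI endpoint (hypothetical, not a real MPI API)."""

    def __init__(self):
        self._queue = deque()  # pending (source, tag, payload) messages

    def send(self, source, tag, payload):
        self._queue.append((source, tag, bytes(payload)))

    def probe(self):
        # Like MPI_Probe: describe the next message without consuming it.
        # A real probe would block; this toy raises IndexError when empty.
        source, tag, payload = self._queue[0]
        return {"source": source, "tag": tag, "count": len(payload)}

    def recv_into(self, buf, source, tag):
        # Like MPI_Recv: consume the matching message into a caller-owned buffer.
        for i, (src, t, payload) in enumerate(self._queue):
            if src == source and t == tag:
                del self._queue[i]
                buf[: len(payload)] = payload
                return
        raise LookupError("no matching message")


def recv_object(mailbox):
    # Probe first, then size the buffer from the reported byte count.
    status = mailbox.probe()
    buf = bytearray(status["count"])
    mailbox.recv_into(buf, status["source"], status["tag"])
    return status["source"], pickle.loads(buf)
```

If the status did not carry a correct source and count, `recv_object` could neither size the buffer nor direct the receive at the right sender, which is the requirement stated above.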

@dalcinl
Collaborator Author

dalcinl commented Jun 25, 2024

Well, on second thought, I could get away with an extra message to communicate the message size. Doing this, I'm seeing much higher intranode latency than with mpi4py (which is expected), but it may also be that I'm not using Cython to wrap mpicd, so Python overhead may be weighing in.
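A minimal sketch of that size-first workaround, assuming a plain stream socket as a stand-in for the MPI send/receive calls (the helper names here are hypothetical and not part of mpicd): the sender transmits a fixed-width length header, then the pickled payload, so the receiver never needs a probe.

```python
import pickle
import socket
import struct

HEADER = struct.Struct("!Q")  # 8-byte big-endian payload length


def send_object(sock, obj):
    payload = pickle.dumps(obj)
    sock.sendall(HEADER.pack(len(payload)))  # message 1: the size
    sock.sendall(payload)                    # message 2: the pickled bytes


def recv_exact(sock, nbytes):
    # Stream sockets may return short reads, so loop until nbytes arrive.
    buf = bytearray()
    while len(buf) < nbytes:
        chunk = sock.recv(nbytes - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        buf.extend(chunk)
    return bytes(buf)


def recv_object(sock):
    # Read the fixed-size header first, then exactly that many payload bytes.
    (size,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return pickle.loads(recv_exact(sock, size))
```

With `a, b = socket.socketpair()`, calling `send_object(a, obj)` followed by `recv_object(b)` returns an equal object. The cost is one extra message per object, which is consistent with the added latency noted above.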

@jtronge
Owner

jtronge commented Jul 1, 2024

I pushed some code that implements MPI_Probe. Let me know if you're able to use that or if something breaks.

@dalcinl
Collaborator Author

dalcinl commented Jul 2, 2024

Is the following println! a debug leftover?

diff --git a/mpicd/src/context.rs b/mpicd/src/context.rs
index 86a1d66..75b54e7 100644
--- a/mpicd/src/context.rs
+++ b/mpicd/src/context.rs
@@ -134,7 +134,7 @@ impl Communicator for Context {
             } else {
                 (encode_tag(0, 0, tag), PROBE_TAG_MASK)
             };
-            println!("tag: {:x}, tag_mask: {:x}", tag, tag_mask);
+            //println!("tag: {:x}, tag_mask: {:x}", tag, tag_mask);
             // Note the loop count is arbitrary -- maybe need to make this configurable?
             for _ in 0..8192 {
                 ucp_worker_progress(handle.system.worker);

@dalcinl
Collaborator Author

dalcinl commented Jul 2, 2024

@jtronge Slightly off-topic... I cannot build examples without the following patch:

diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt
index 7fc63ee..89b7708 100644
--- a/examples/CMakeLists.txt
+++ b/examples/CMakeLists.txt
@@ -7,5 +7,5 @@ add_executable(osu_bw osu_bw.c)
 add_executable(probe probe.c)
 
 foreach(BIN hello_world datatype0 datatype1 ring regions osu_bw probe)
-    target_link_libraries(${BIN} PUBLIC mpicd-capi)
+    target_link_libraries(${BIN} PUBLIC mpicd_capi)
 endforeach()

@dalcinl
Collaborator Author

dalcinl commented Jul 2, 2024

Any idea what could be wrong with the run below?

$ mpiexec -n 2 python examples/pingpong.py -m 100000 -n 4000000
# MPI PingPong Test
# Size [B]  Bandwidth [MB/s] | Time Mean [s] ± StdDev [s]  Samples
    131072           1209.57 | 1.0836262e-04 ± 1.0114e-05     1000
    262144            569.76 | 4.6009702e-04 ± 2.2190e-05     1000
    524288            597.38 | 8.7765173e-04 ± 6.5160e-06     1000
   1048576            614.36 | 1.7067841e-03 ± 1.2250e-05     1000
thread '<unnamed>' panicked at mpicd-capi/src/p2p.rs:157:51:
missing matching message for probe: NoProbeMessage
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
[1719906419.934067] [kw61149:573459:0]          ucp_ep.c:1504 UCX  ERROR ep 0x7f40dba07000: error 'Connection reset by remote peer' on tcp/enp107s0 will not be handled since no error callback is installed

@jtronge
Owner

jtronge commented Jul 2, 2024

> Is the following println! a debug leftover?

Oh sorry, yes I forgot to remove that.

> @jtronge Slightly off-topic... I cannot build examples without the following patch:

Hmm, what version of cmake and C compiler are you running? I just tried that on my side and it gave me a linker error.

> Any idea what could be wrong with the run below?

There might be a bug in my code. I'll try your example and see if I can fix that.

@jtronge
Owner

jtronge commented Jul 2, 2024

Ok, I fixed a bug in my probe code and I got your example to run on my side.

@dalcinl
Collaborator Author

dalcinl commented Jul 2, 2024

> Hmm, what version of cmake and C compiler are you running? I just tried that on my side and it gave me a linker error.

I'm using cmake 3.28.2 and rustc/cargo 1.79.0.
The library is named libmpicd_capi.so, so I'm not sure why you are using mpicd-capi.
Maybe this is some oddity of the corrosion stuff you are using?

@jtronge
Owner

jtronge commented Jul 2, 2024

I think you're right about corrosion causing the problem here. I found this issue: corrosion-rs/corrosion#501. Updating the corrosion version seems to allow underscores on my end now.

@dalcinl dalcinl closed this as completed Jul 3, 2024