
Plans for MPI_THREAD_MULTIPLE? #17

Open · rzambre opened this issue Jul 16, 2019 · 3 comments

Comments

rzambre commented Jul 16, 2019

https://github.com/tensorflow/networking/blob/master/tensorflow_networking/mpi/mpi_utils.cc#L56

I see the use of MPI_THREAD_MULTIPLE has been commented out. From my understanding of the current design of exchanging data with MPI, we do not require MPI_THREAD_MULTIPLE since a dedicated thread is responsible for communication.

Are there future plans to have multiple threads perform communication simultaneously (once MPI implementations better support MPI_THREAD_MULTIPLE, of course)? If so, is it more likely that we would have multiple dedicated communication threads, or might the computation threads also perform communication themselves?
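For reference, the negotiation around that commented-out line would look roughly like this. This is only a minimal sketch using standard MPI calls, not the actual mpi_utils.cc code:

```c++
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  int provided = MPI_THREAD_SINGLE;
  // Request full thread support; the library may grant a lower level.
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE) {
    // With a single dedicated communication thread, MPI_THREAD_FUNNELED
    // (or MPI_THREAD_SERIALIZED) is sufficient, so this need not be fatal.
    std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available (provided=%d)\n",
                 provided);
  }
  MPI_Finalize();
  return 0;
}
```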


jbedorf commented Jul 28, 2019

In an earlier version I indeed used MPI_THREAD_MULTIPLE to have multiple computation threads perform their own communication, thereby reducing the load on the dedicated communication thread. It turned out to be too unstable at that point in time, as the various MPI distributions would give random errors and deadlocks. It would be worthwhile to explore this again in a future version, once the code has been converted to support the TF 2.0 C API.
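To illustrate the pattern being described (each computation thread driving its own point-to-point communication under MPI_THREAD_MULTIPLE), a minimal sketch could look like the following. The message size, ranks, and tags are placeholders, and this is not the earlier implementation:

```c++
#include <mpi.h>
#include <thread>
#include <vector>

// Each thread owns one message; under MPI_THREAD_MULTIPLE both threads
// may call MPI concurrently without external locking.
void exchange(int rank, int tag) {
  std::vector<float> buf(1024, static_cast<float>(tag));
  if (rank == 0) {
    MPI_Send(buf.data(), static_cast<int>(buf.size()), MPI_FLOAT,
             /*dest=*/1, tag, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(buf.data(), static_cast<int>(buf.size()), MPI_FLOAT,
             /*source=*/0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
}

int main(int argc, char** argv) {
  int provided = MPI_THREAD_SINGLE, rank = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (provided >= MPI_THREAD_MULTIPLE && rank < 2) {
    std::thread t0(exchange, rank, 0), t1(exchange, rank, 1);
    t0.join();
    t1.join();
  }
  MPI_Finalize();
  return 0;
}
```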


rzambre commented Jul 30, 2019

I see. Do you remember which MPI libraries you experimented with?

With multiple threads participating in communication, there exists a design space to explore, e.g. using separate communicators, tags, etc. to expose parallel communication to the MPI library. Is there a communication kernel mini-application or microbenchmark that captures the communication pattern of TensorFlow? That would serve well for exploring the performance of the different strategies in this design space of parallel MPI communication.
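As one hypothetical point in that design space, each communication thread could be given its own duplicated communicator so the library can treat the per-thread traffic as independent streams. A rough sketch (thread count and reduction are arbitrary placeholders):

```c++
#include <mpi.h>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
  const int nthreads = 4;  // arbitrary for illustration
  int provided = MPI_THREAD_SINGLE, rank = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Duplicate communicators up front, before any threads start, since
  // the duplication itself is done serially here.
  std::vector<MPI_Comm> comms(nthreads);
  for (int t = 0; t < nthreads; ++t)
    MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);

  if (provided >= MPI_THREAD_MULTIPLE) {
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
      workers.emplace_back([&, t] {
        float val = static_cast<float>(rank);
        // Concurrent collectives are permitted because each thread uses
        // its own private communicator.
        MPI_Allreduce(MPI_IN_PLACE, &val, 1, MPI_FLOAT, MPI_SUM, comms[t]);
      });
    }
    for (auto& w : workers) w.join();
  }

  for (int t = 0; t < nthreads; ++t) MPI_Comm_free(&comms[t]);
  MPI_Finalize();
  return 0;
}
```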


rzambre commented Oct 22, 2019

If a mini-app isn't available, I would be happy to help write one that captures the communication pattern of TensorFlow.
