Add MPI to driver code #34

Closed
wants to merge 8 commits into from

Conversation

tomdeakin
Contributor

Add the ability to run BabelStream across multiple nodes with MPI. The input size is assumed to be per MPI rank, so the benchmark behaves like weak scaling. The reported bandwidth is that of the entire parallel run.

All changes were made in the driver code, and no changes should be required in the different implementations.

Changes involved (see the sketch below):
1. Initialising MPI and getting the rank and size
2. Guarding std::cout so only rank 0 prints
3. Adding barriers between kernels
4. Performing a reduction for the dot product
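
A minimal sketch of these four changes, for illustration only (variable names here are hypothetical; USE_MPI is the preprocessor define discussed later in this thread):

#ifdef USE_MPI
#include <mpi.h>
#endif
#include <iostream>

int main(int argc, char *argv[])
{
  int rank = 0, procs = 1;
#ifdef USE_MPI
  // 1. Initialise MPI and query this rank's id and the total number of ranks
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &procs);
#endif

  // 2. Guard std::cout so only rank 0 prints
  if (rank == 0)
    std::cout << "Running on " << procs << " rank(s)" << std::endl;

  // 3. Barrier between kernels so timings are not skewed by stragglers
#ifdef USE_MPI
  MPI_Barrier(MPI_COMM_WORLD);
#endif

  // ... kernels run here, leaving a per-rank partial dot product in sum ...
  double sum = 0.0;

  // 4. Dot product: reduce the per-rank partial sums onto rank 0
#ifdef USE_MPI
  double global_sum = 0.0;
  MPI_Reduce(&sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  sum = global_sum;

  MPI_Finalize();
#endif
  return 0;
}
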
@tomdeakin tomdeakin requested a review from jrprice June 20, 2017 15:45
Contributor

@jrprice jrprice left a comment

Looks fine overall. Successfully ran across 256 nodes with decent efficiency.

I've made some minor suggestions w.r.t. output, but nothing particularly important.

What are we going to do about the Makefiles? Do we want some standard way of enabling MPI? For OpenMP you can do it without touching the Makefile, but other models like CUDA need some work.

I can imagine a couple of possible ways of doing this. There could be some variable the user can set:

make -f OpenMP.make MPI=1

Or we could have a special target in each Makefile:

make -f OpenMP.make mpi

Or something else. Not really sure what's best, just brainstorming here.

main.cpp Outdated
std::streamsize ss = std::cout.precision();
std::cout << std::setprecision(1) << std::fixed
<< "Array size: " << ARRAY_SIZE*sizeof(T)*1.0E-6 << " MB"
<< " (=" << ARRAY_SIZE*sizeof(T)*1.0E-9 << " GB)" << std::endl;
std::cout << "Total size: " << 3.0*ARRAY_SIZE*sizeof(T)*1.0E-6 << " MB"
<< " (=" << 3.0*ARRAY_SIZE*sizeof(T)*1.0E-9 << " GB)" << std::endl;
std::cout.precision(ss);
}
Contributor

Maybe worth making these printouts include the phrase "per rank" when in MPI mode, to make things crystal clear?

Could also show the actual total size, but not sure how interesting that is.

Contributor Author

Is there a nice way to alter the string depending on the USE_MPI preprocessor define?

Contributor

Not really, I think this is about as good as it can get:

std::cout << std::setprecision(1) << std::fixed
#ifdef USE_MPI
  << "Array size (per MPI rank): "
#else
  << "Array size: "
#endif
  << "...";

Contributor Author

Fixed in cb92f94

@@ -179,6 +237,10 @@ void run()
check_solution<T>(num_times, a, b, c, sum);
Contributor

@jrprice jrprice Jun 27, 2017

I think this needs to be in an #ifdef USE_MPI \n if(rank == 0) section as well - otherwise you get massively spammed if verification fails.

On a related note, we really need to sort out verification for the reduction as per #20, since this almost always fails now when using a large number of MPI ranks.
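
For illustration, the suggested guard might look like this (a sketch only; rank is the variable introduced by the driver changes, and the check_solution call is the one from the diff above):

#ifdef USE_MPI
  // Only rank 0 verifies and prints, to avoid one error report per rank
  if (rank == 0)
#endif
    check_solution<T>(num_times, a, b, c, sum);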

Contributor Author

@tomdeakin tomdeakin Jun 27, 2017

Verification printing fixed in 5a64ce1
#20 still needs to be addressed, although all this MPI code assumes T is double, as I've used MPI_DOUBLE for messages. We should probably support MPI_FLOAT too.
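
One way to support both would be a small type-mapping helper (a sketch; mpi_type_of is a hypothetical name, not part of this PR):

#include <mpi.h>

// Map the element type T to the matching MPI datatype
template <typename T> MPI_Datatype mpi_type_of();
template <> MPI_Datatype mpi_type_of<float>()  { return MPI_FLOAT;  }
template <> MPI_Datatype mpi_type_of<double>() { return MPI_DOUBLE; }

// Usage, e.g.:
// MPI_Reduce(&sum, &global_sum, 1, mpi_type_of<T>(), MPI_SUM, 0, MPI_COMM_WORLD);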

@@ -187,6 +249,7 @@ void run()
<< std::left << std::setw(12) << "Average" << std::endl;

std::cout << std::fixed;
}
Contributor

This is fine, but might be interesting to see these results per rank as well as the aggregate bandwidth. This would make it easier to see how well things scale when running on lots of nodes.

Contributor Author

One could always run the benchmark with -np 1 (or run the non-MPI benchmark). We could also run the benchmark on a single node first, and then print out the efficiency. This would probably need a bit of an overhaul in the way the driver works, though. Not sure if it's worth the pain.

Contributor

I just meant that it might be nice to print out the average per-rank bandwidths; I'm not talking about actually re-running on a single node.

e.g. If I run on a single P100 I get 550 GB/s. When I run on a cluster of 2000 P100s, I get some very large number, but do I still get 550 GB/s per rank?

To be honest I'm just being lazy - it's trivial for the user to open a calculator and do that division themselves. Feel free to ignore this comment :-)
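
The division in question is just the aggregate divided by the rank count (a sketch; total_bytes and total_time are hypothetical names for figures the driver already has):

// Average per-rank bandwidth derived from the aggregate numbers
double aggregate = 1.0E-9 * total_bytes / total_time; // GB/s across all ranks
double per_rank = aggregate / (double)procs;
if (rank == 0)
  std::cout << "Average per-rank bandwidth: " << per_rank << " GB/s" << std::endl;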

-double average = std::accumulate(timings[i].begin()+1, timings[i].end(), 0.0) / (double)(num_times - 1);
+double average = std::accumulate(timings[i].begin()+1, timings[i].end(), 0.0);

#ifdef USE_MPI
Contributor

Indentation inside this block looks a little off.

Contributor Author

Fixed in a57fdec

@tomdeakin
Contributor Author

Re the Makefile changes: I guess the ideal solution will depend on how extensive the required changes are. The MPI=yes option looks nicer than having a separate build target, but the latter would be better if substantial changes are required for some models. I'll just try to fold in the MPI=yes option and see how far I get. We can always change it later...

@tomdeakin
Contributor Author

ef22cbc implements MPI=yes for the CUDA implementation. It builds the implementation to an object file, and then compiles this in with the main driver. If MPI=yes is set on the make invocation, the compiler is switched to mpicxx and the USE_MPI define is added to EXTRA_FLAGS.

Does this approach look OK? If so, I'll start to roll out similar changes to the other models.
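
Roughly, the Makefile logic described above would be (a sketch of the approach, not the exact contents of ef22cbc):

ifeq ($(MPI), yes)
  # Switch to the MPI compiler wrapper and enable the MPI code paths
  CXX = mpicxx
  EXTRA_FLAGS += -DUSE_MPI
endif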

@tomdeakin
Contributor Author

Actually, piggybacking on EXTRA_FLAGS doesn't work if it's also set on the command line, since make variables set on the CLI override assignments (including +=) made inside the Makefile... Need to rethink this!

@jrprice
Contributor

jrprice commented Jun 29, 2017

> Does this approach look OK? If so, I'll start to roll out similar changes to the other models.

Yeah looks fine to me, bar the EXTRA_FLAGS issue you raise. I guess it'll just need an extra internal flags variable, which is annoying.
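
For example (a sketch; INTERNAL_FLAGS is a hypothetical name for that extra variable, kept out of reach of command-line assignments):

INTERNAL_FLAGS =
ifeq ($(MPI), yes)
  CXX = mpicxx
  INTERNAL_FLAGS += -DUSE_MPI
endif

# Both sets of flags are applied, so EXTRA_FLAGS can still be set on the
# command line without discarding -DUSE_MPI
%.o: %.cpp
	$(CXX) $(INTERNAL_FLAGS) $(EXTRA_FLAGS) -c $< -o $@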

@tomdeakin
Contributor Author

Closing as we'd prefer a standalone MPI implementation. Opening a new issue to this effect.

@tomdeakin tomdeakin closed this Nov 25, 2021