mpirun protocol - distributed training with @remote decorator #4998
Conversation
src/sagemaker/remote_function/runtime_environment/mpi_utils_remote.py
Is there any related documentation for the remote decorator that should be updated for this change? See https://github.com/aws/sagemaker-python-sdk/tree/master/doc
The classes are properly commented, so the reference documentation stays aligned with these changes.
Issue #, if available:
Description of changes: Introduced the mpirun protocol for distributed training across multiple instances (instance_count > 1) with the remote decorator. The mpirun protocol is an alternative to the torchrun protocol, which was introduced in the merged PR #4984.
Testing done: Unit tests covering mpirun on a single node with one GPU, a single node with multiple GPUs, and multiple nodes with multiple GPUs. Added new test cases for mpi_utils_remote.py.
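As a rough sketch of what the mpirun protocol involves under the hood, a multi-node launch ultimately boils down to composing an mpirun invocation over the cluster hosts. The helper name, host names, and flag set below are hypothetical and for illustration only; they are not the SDK's actual implementation:

```python
# Illustrative sketch: assembling a multi-node mpirun command for a
# remote-function training job. build_mpirun_command is a hypothetical
# helper, not part of the SageMaker Python SDK.

def build_mpirun_command(entry_point, hosts, procs_per_node):
    """Compose an mpirun invocation spanning the given hosts."""
    total_procs = len(hosts) * procs_per_node
    return [
        "mpirun",
        # Allot procs_per_node slots on each host, e.g. "algo-1:4,algo-2:4".
        "--host", ",".join(f"{h}:{procs_per_node}" for h in hosts),
        "-np", str(total_procs),
        "python", entry_point,
    ]

cmd = build_mpirun_command("train.py", ["algo-1", "algo-2"], procs_per_node=4)
print(" ".join(cmd))
# → mpirun --host algo-1:4,algo-2:4 -np 8 python train.py
```

In a real job the host list would come from the training cluster's resource configuration rather than being hard-coded.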
Merge Checklist
Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General
Tests
- I used `unique_name_from_base` to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.