Remove AWS Pytorch channel in examples #453
Left comments
@junpuf why is this marked as a draft? Can we mark it ready for review?
LGTM. In the future let's remove the aws-ofi-nccl and hwloc dependencies.
Issue #, if available:
#444
Description of changes:
The AWS PyTorch Conda channel is being deprecated and its future development will stop, so this PR removes its usage from the examples.
Testing
Test infra is 2 p4d nodes using pcluster with FSx and Slurm, built using the DLAMI (the same AMI used by the HyperPod AMI). The following examples were covered:
10.FSDP
16.pytorch-cpu-ddp
17.SM-modelparallelv2
For 17.SM-modelparallelv2, it seems that
pytorch="2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0"
declares a dependency on aws-ofi-nccl >=1.7.1,<2.0 (probably because the build recipe was copied from the AWS conda channel). As a workaround, I supplied the 2 binaries this PyTorch package requires (details below) in a new bin directory inside the example directory.
20.FSDP-Mamba
Simply removed commented lines that referenced the AWS PyTorch conda channel; no test necessary.
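For readers unfamiliar with the spec and constraint quoted for 17.SM-modelparallelv2, here is a minimal Python sketch of how a conda-style name=version=build string splits into parts and how a range such as >=1.7.1,<2.0 can be checked. The helper names (split_spec, satisfies) are hypothetical and not part of conda, and the comparison assumes purely numeric dotted versions. Real conda version ordering handles more cases than this.

```python
import operator

# Longest prefixes first so ">=" is matched before ">".
_OPS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def split_spec(spec: str) -> dict:
    """Split a conda-style 'name=version=build' string into its three parts."""
    name, version, build = spec.strip('"').split("=", 2)
    return {"name": name, "version": version, "build": build}


def _parse(version: str) -> tuple:
    # Assumption: versions are numeric and dot-separated (e.g. "1.7.1").
    return tuple(int(part) for part in version.split("."))


def _compare(op, left: tuple, right: tuple) -> bool:
    # Pad with zeros so that "2.0" and "2.0.0" compare as equal.
    width = max(len(left), len(right))
    left = left + (0,) * (width - len(left))
    right = right + (0,) * (width - len(right))
    return op(left, right)


def satisfies(version: str, constraint: str) -> bool:
    """Return True if version meets every comma-separated clause."""
    candidate = _parse(version)
    for clause in constraint.split(","):
        clause = clause.strip()
        for prefix, op in _OPS:
            if clause.startswith(prefix):
                if not _compare(op, candidate, _parse(clause[len(prefix):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


if __name__ == "__main__":
    spec = 'pytorch="2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0"'
    name, rest = spec.split("=", 1)
    print(split_spec(f"{name}={rest.strip(chr(34))}"))
    print(satisfies("1.7.1", ">=1.7.1,<2.0"))
```

Under these assumptions, any aws-ofi-nccl version from 1.7.1 up to but not including 2.0 satisfies the declared range, which is why the package cannot be installed unless something provides that dependency.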
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.