Issues: aws-samples/awsome-distributed-training
Issues list
FSDP EKS Example failing with: module 'torch.library' has no attribute 'register_fake'
#491
opened Nov 12, 2024 by
nghtm
FSDP sample fails with CUDA initialization error on HyperPod EKS
#467
opened Oct 28, 2024 by
shimomut
17.SM-modelparallelv2 uses pytorch binary that depends on deprecated conda packages
#457
opened Oct 9, 2024 by
junpuf
Organize SM-modelparallelv2 per orchestrator
enhancement
New feature or request
stale
#436
opened Sep 20, 2024 by
mhuguesaws
Pin NCCL and EFA version in FSDP
enhancement
stale
#435
opened Sep 20, 2024 by
mhuguesaws
Esm-1nv model not getting trained for more than 1 epochs
stale
#405
opened Aug 9, 2024 by
Tamizhisai
Add Ubuntu 22.04 support for ansible roles
enhancement
stale
#82
opened Dec 19, 2023 by
mhuguesaws