
[Bug]: test_vmap fails on multi-node runs on hardware accelerators #1627

Open

JuanPedroGHM opened this issue Aug 19, 2024 · 5 comments · May be fixed by #1738
@JuanPedroGHM (Member) commented on Aug 19, 2024

What happened?

When running on more than one node with GPUs, test_vmap fails. This needs further investigation.

Code snippet triggering the error

On Horeka, the test fails when run on 2 accelerated (GPU) nodes with 3 or 4 ranks per node:

# 3 ranks per node
HEAT_TEST_USE_DEVICE=gpu mpirun --report-bindings -N 3 pytest heat/core/tests/test_vmap.py

# 4 ranks per node
HEAT_TEST_USE_DEVICE=gpu mpirun --report-bindings -N 4 pytest heat/core/tests/test_vmap.py

Error message or erroneous outcome

The result of the test does not match the expected outcome.

FAILED heat/core/tests/test_vmap.py::TestVmap::test_vmap - AssertionError: False is not true
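
For context, a minimal sketch (not the actual test code) of the kind of comparison the failing assertion makes, assuming ht.vmap mirrors torch.vmap's interface of lifting a per-sample function to act batch-wise over the split dimension of a DNDarray:

import heat as ht
import torch

def f(sample: torch.Tensor) -> torch.Tensor:
    # per-sample computation on one slice along the batch dimension
    return sample * sample.sum()

# one batch dimension, split across all MPI ranks; in the failing runs the test
# suite places these arrays on the GPUs via HEAT_TEST_USE_DEVICE=gpu
x = ht.random.randn(4 * ht.MPI_WORLD.size, 5, split=0)

distributed = ht.vmap(f)(x)                            # applied rank-locally over the split dimension
reference = torch.vmap(f)(ht.resplit(x, None).larray)  # same computation on the gathered tensor

# the AssertionError above comes from a comparison of this form evaluating to False
assert ht.allclose(distributed, ht.array(reference, split=0))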

Version

main (development branch)

Python version

3.11.2

PyTorch version

2.2.2

CUDA version

12.2

MPI version

OpenMPI 4.1, 5.0
mpi4py 3.1.6, 4.0.0
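
For reference, a small snippet (not part of the original report) showing how these versions can be collected on every rank when reproducing the failure:

import sys
import heat as ht
import torch
import mpi4py
from mpi4py import MPI

# each rank prints its own environment; heat, torch and mpi4py expose these attributes directly
print(
    f"rank {ht.MPI_WORLD.rank}: heat {ht.__version__}, torch {torch.__version__}, "
    f"CUDA {torch.version.cuda}, mpi4py {mpi4py.__version__}, "
    f"python {sys.version.split()[0]}, MPI library: {MPI.Get_library_version().splitlines()[0]}"
)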
@JuanPedroGHM added the bug, testing, and manipulations labels on Aug 19, 2024
@mrfh92 (Collaborator) commented on Aug 20, 2024

@JuanPedroGHM That's interesting. Actually, I have just used vmap on up to 12 GPU-nodes without any problems. Is this problem related to OpenMPI >= 4.1 specifically?

@JuanPedroGHM (Member, Author) commented

No, this is with OpenMPI 4.1 and mpi4py 3.1.6. I updated the description of the issue with the specific dependencies and configuration where it fails.

github-actions bot (Contributor) commented

This issue is stale because it has been open for 60 days with no activity.

@github-actions bot added the stale label on Oct 21, 2024
@mrfh92 (Collaborator) commented on Dec 4, 2024

I can reproduce this with Heat 1.6-dev, PyTorch 2.5.1+cu124, mpi4py 4.0.1, and OpenMPI 4.1.2.

@mrfh92 self-assigned this on Dec 4, 2024
github-actions bot (Contributor) commented on Dec 4, 2024

Labels: bug (Something isn't working), manipulations, testing (Implementation of tests, or test-related issues)

Development: successfully merging a pull request may close this issue (may be fixed by #1738).

2 participants