support pytorch lightning 1.7 #196
base: main
JiahaoYao
commented
Aug 11, 2022
- Fixes issue #194, "[Code] add pytorch-lightning compatibility for 1.7.x"
Checking the CI failure about
It might be related to these two lines:
|
Interesting: the other tests hang in CI, but they pass on the servers. |
It seems that the memory is not enough for the
Does that mean there is an OOM issue for the Tune tests? |
[pytest on push/test_linux_ray_master_3] ✅ Success - Install package
[pytest on push/test_linux_ray_master_3] ⭐ Run Test with Pytest
[pytest on push/test_linux_ray_master_3] 🐳 docker exec cmd=[bash --noprofile --norc -e -o pipefail /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning/workflow/4] user=
| /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning/ray_lightning/tests /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning
| ============================= test session starts ==============================
| platform linux -- Python 3.7.13, pytest-7.1.2, pluggy-1.0.0 -- /opt/hostedtoolcache/Python/3.7.13/x64/bin/python
| cachedir: .pytest_cache
| rootdir: /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning
collected 7 items
|
| test_tune.py::test_tune_iteration_ddp PASSED [ 14%]
| test_tune.py::test_tune_iteration_horovod PASSED [ 28%]
| test_tune.py::test_checkpoint_ddp PASSED [ 42%]
| test_tune.py::test_checkpoint_horovod PASSED [ 57%]
| test_tune.py::test_checkpoint_ddp_gpu SKIPPED (test requires multi-G...) [ 71%]
| test_tune.py::test_checkpoint_horovod_gpu SKIPPED (test requires mul...) [ 85%]
| test_tune.py::test_tune_iteration_ddp_gpu SKIPPED (test requires mul...) [100%]
|
| =============================== warnings summary ===============================
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| LooseVersion(torch.__version__) >= LooseVersion('1.5.0') and
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| LooseVersion(torch.__version__) <= LooseVersion('1.6.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| _SYNC_BN_V3 = LooseVersion(torch.__version__) >= LooseVersion('1.6.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| _SYNC_BN_V4 = LooseVersion(torch.__version__) >= LooseVersion('1.9.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| tensorboard.__version__
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| ) < LooseVersion("1.15"):
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/util/placement_group.py:80: DeprecationWarning: placement_group parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
| ).remote(self)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/_private/ray_option_utils.py:291: DeprecationWarning: Setting 'object_store_memory' for actors is deprecated since it doesn't actually reserve the required object store memory. Use object spilling that's enabled by default (https://docs.ray.io/en/master/ray-core/objects/object-spilling.html) instead to bypass the object store memory size limitation.
| stacklevel=1,
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
| return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group_bundle_index parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
| return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group_capture_child_tasks parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
| return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
| ============================== slowest durations ===============================
| 12.07s call ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 11.92s call ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 9.61s call ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 7.93s call ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| 3.66s setup ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 3.55s setup ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 3.01s teardown ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 2.97s teardown ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 2.77s teardown ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 2.64s teardown ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| 2.56s setup ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 2.46s setup ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|
| (6 durations < 0.005s hidden. Use -vv to show these durations.)
| ============= 4 passed, 3 skipped, 32 warnings in 68.14s (0:01:08) =============
[pytest on push/test_linux_ray_master_3] ✅ Success - Test with Pytest |
I just tested this PR and it worked fine on my cluster (training on 12 GPUs). |
Any ideas on how to fix this in the CI test?
Requested labels: ubuntu-latest
Job defined at: ray-project/ray_lightning/.github/workflows/test.yaml@refs/pull/196/merge
Waiting for a runner to pick up this job...
Could this be the typo issue mentioned here (https://github.com/orgs/community/discussions/31587)? |
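To rule out a label typo quickly, the `runs-on` value can be compared against a list of known runner labels. The following is only a minimal sketch: `KNOWN_LABELS` is an assumed, non-exhaustive set, and `check_runs_on` is a hypothetical helper, not part of GitHub Actions or ray_lightning.

```python
import difflib

# Assumption: a small, non-exhaustive set of GitHub-hosted runner labels.
KNOWN_LABELS = {"ubuntu-latest", "windows-latest", "macos-latest"}


def check_runs_on(label):
    """Return (is_known, suggestions) for a runs-on label.

    Hypothetical helper: suggestions holds the closest known label, if any.
    """
    if label in KNOWN_LABELS:
        return True, []
    # get_close_matches returns the best fuzzy matches above its default cutoff.
    return False, difflib.get_close_matches(label, sorted(KNOWN_LABELS), n=1)


if __name__ == "__main__":
    # The label requested in the job log above.
    print(check_runs_on("ubuntu-latest"))
    # A deliberately misspelled label, for comparison.
    print(check_runs_on("ubuntu-lastest"))
```

By this check, the `ubuntu-latest` label requested in the log is spelled correctly, so if the job keeps waiting for a runner, the cause is probably something other than a misspelled label.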
Ready to run the CI test.
The CI error does not seem to be related to the PR:
|
@JiahaoYao Any plan to finish this PR? |
The hanging issue still remains for the release test:
== Status ==
Current time: 2022-10-03 17:49:05 (running for 00:14:54.03)
Memory usage on this node: 1.7/6.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/3.56 GiB heap, 0.0/1.78 GiB objects
Result logdir: /home/runner/ray_results/tune_mnist
Number of trials: 1/1 (1 RUNNING)
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
| Trial name | status | loc | batch_size | layer_1 | layer_2 | lr |
|-------------------------+----------+-----------------+--------------+-----------+-----------+------------|
| train_mnist_97d14_00000 | RUNNING | 10.1.0.228:5153 | 32 | 64 | 64 | 0.00670904 |
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
== Status ==
Current time: 2022-10-03 17:49:10 (running for 00:14:59.03)
Memory usage on this node: 1.7/6.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/3.56 GiB heap, 0.0/1.78 GiB objects
Result logdir: /home/runner/ray_results/tune_mnist
Number of trials: 1/1 (1 RUNNING)
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
| Trial name | status | loc | batch_size | layer_1 | layer_2 | lr |
|-------------------------+----------+-----------------+--------------+-----------+-----------+------------|
| train_mnist_97d14_00000 | RUNNING | 10.1.0.228:5153 | 32 | 64 | 64 | 0.00670904 |
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+ |
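The status output above shows the single trial still RUNNING after roughly 15 minutes. One way to make a hang like this fail fast in CI, instead of waiting for the whole job to time out, is to launch the test as a subprocess with a time budget. This is a minimal sketch under assumptions: `run_with_timeout` and the stand-in command are hypothetical and not part of ray_lightning or this PR.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and report whether it finished within timeout_s seconds.

    Returns ("ok", returncode) on completion, or ("timeout", None) on a hang.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising TimeoutExpired.
        return ("timeout", None)
    return ("ok", completed.returncode)


if __name__ == "__main__":
    # Hypothetical stand-in for a hanging test: sleeps longer than the budget.
    hanging = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(hanging, timeout_s=1.0))
```

Note that only the direct child process is killed on timeout. Worker processes started by Ray would likely need separate cleanup, which this sketch does not attempt.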
Any updates on this? :) |
The approval was a mistake. Sorry for any inconvenience caused.