This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

support pytorch lightning 1.7 #196

Open
wants to merge 38 commits into
base: main
from

Conversation

JiahaoYao
Contributor

setup.py Outdated Show resolved Hide resolved
@sxjscience

Checking the CI failure about

E       TypeError: __init__() got an unexpected keyword argument 'checkpoint_callback'

It might be related to these two lines:
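The two referenced lines are not reproduced in this thread. As background: the `checkpoint_callback` argument of `Trainer` was deprecated in PyTorch Lightning 1.5 in favour of `enable_checkpointing` and removed in 1.7, which matches the `TypeError` above. A minimal sketch of one way to adapt old call sites, assuming the keyword arguments are built as a dict before being passed to `Trainer`; the helper name `translate_trainer_kwargs` is hypothetical and is not part of this PR or of Lightning:

```python
def translate_trainer_kwargs(kwargs):
    """Return a copy of kwargs with the removed `checkpoint_callback` flag renamed.

    Hypothetical helper for illustration only; not part of ray_lightning.
    """
    updated = dict(kwargs)
    if "checkpoint_callback" in updated:
        value = updated.pop("checkpoint_callback")
        # Keep an explicitly passed `enable_checkpointing` if both are present.
        updated.setdefault("enable_checkpointing", value)
    return updated


old_style = {"max_epochs": 1, "checkpoint_callback": False}
new_style = translate_trainer_kwargs(old_style)
print(new_style)  # {'max_epochs': 1, 'enable_checkpointing': False}
```

The result could then be passed as `Trainer(**new_style)` on Lightning 1.7; the original dict is left unchanged.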

@JiahaoYao
Contributor Author

Interesting: the other tests hang in CI, but they pass on the servers.

@JiahaoYao
Contributor Author

It seems that the ubuntu-latest runner does not have enough memory:

  • in test_tune.py, the test after test_horovod never starts
  • the release test hangs when it gets to running ray_ddp_example.py with Tune

Does that mean Tune is hitting an OOM issue?

@JiahaoYao
Contributor Author

[pytest on push/test_linux_ray_master_3]   ✅  Success - Install package
[pytest on push/test_linux_ray_master_3] ⭐  Run Test with Pytest
[pytest on push/test_linux_ray_master_3]   🐳  docker exec cmd=[bash --noprofile --norc -e -o pipefail /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning/workflow/4] user=
| /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning/ray_lightning/tests /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning
| ============================= test session starts ==============================
| platform linux -- Python 3.7.13, pytest-7.1.2, pluggy-1.0.0 -- /opt/hostedtoolcache/Python/3.7.13/x64/bin/python
| cachedir: .pytest_cache
| rootdir: /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning
collected 7 items
|
| test_tune.py::test_tune_iteration_ddp PASSED                             [ 14%]
| test_tune.py::test_tune_iteration_horovod PASSED                         [ 28%]
| test_tune.py::test_checkpoint_ddp PASSED                                 [ 42%]
| test_tune.py::test_checkpoint_horovod PASSED                             [ 57%]
| test_tune.py::test_checkpoint_ddp_gpu SKIPPED (test requires multi-G...) [ 71%]
| test_tune.py::test_checkpoint_horovod_gpu SKIPPED (test requires mul...) [ 85%]
| test_tune.py::test_tune_iteration_ddp_gpu SKIPPED (test requires mul...) [100%]
|
| =============================== warnings summary ===============================
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     LooseVersion(torch.__version__) >= LooseVersion('1.5.0') and
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     LooseVersion(torch.__version__) <= LooseVersion('1.6.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     _SYNC_BN_V3 = LooseVersion(torch.__version__) >= LooseVersion('1.6.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     _SYNC_BN_V4 = LooseVersion(torch.__version__) >= LooseVersion('1.9.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     tensorboard.__version__
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     ) < LooseVersion("1.15"):
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/util/placement_group.py:80: DeprecationWarning: placement_group parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
|     ).remote(self)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/_private/ray_option_utils.py:291: DeprecationWarning: Setting 'object_store_memory' for actors is deprecated since it doesn't actually reserve the required object store memory. Use object spilling that's enabled by default (https://docs.ray.io/en/master/ray-core/objects/object-spilling.html) instead to bypass the object store memory size limitation.
|     stacklevel=1,
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
|     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group_bundle_index parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
|     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group_capture_child_tasks parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
|     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
| ============================== slowest durations ===============================
| 12.07s call     ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 11.92s call     ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 9.61s call     ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 7.93s call     ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| 3.66s setup    ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 3.55s setup    ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 3.01s teardown ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 2.97s teardown ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 2.77s teardown ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 2.64s teardown ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| 2.56s setup    ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 2.46s setup    ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|
| (6 durations < 0.005s hidden.  Use -vv to show these durations.)
| ============= 4 passed, 3 skipped, 32 warnings in 68.14s (0:01:08) =============
[pytest on push/test_linux_ray_master_3]   ✅  Success - Test with Pytest

@marcosrdac

I just tested this PR and it worked fine on my cluster (training on 12 GPUs).

@JiahaoYao
Contributor Author

JiahaoYao commented Sep 29, 2022

Any ideas on how to fix this in the CI test?

Requested labels: ubuntu-latest
Job defined at: ray-project/ray_lightning/.github/workflows/test.yaml@refs/pull/196/merge
Waiting for a runner to pick up this job...

Could this be the typo issue mentioned here (https://github.com/orgs/community/discussions/31587)?

Contributor Author

@JiahaoYao JiahaoYao left a comment


Ready to run the CI test.

@sxjscience

The CI error does not seem to be related to the PR:

        if not _JSONARGPARSE_SIGNATURES_AVAILABLE:
            raise ModuleNotFoundError(
>               f"{_JSONARGPARSE_SIGNATURES_AVAILABLE}. Try `pip install -U 'jsonargparse[signatures]'`."
            )
E           ModuleNotFoundError: Requirement 'jsonargparse[signatures]>=4.12.0' not met, DistributionNotFound: The 'docstring-parser>=0.15; extra == "signatures"' distribution was not found and is required by jsonargparse. Try `pip install -U 'jsonargparse[signatures]'`.

/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/pytorch_lightning/cli.py:73: ModuleNotFoundError
=============================== warnings summary ===============================
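The traceback above reports that the `docstring-parser` distribution required by `jsonargparse[signatures]` was not found. A minimal sketch of how one might check for missing distributions before running the tests, assuming Python 3.8+ where `importlib.metadata` is in the standard library (the CI log shows Python 3.7, which would need the `importlib_metadata` backport instead). The helper name `find_missing` is hypothetical:

```python
from importlib import metadata


def find_missing(distributions):
    """Return the names in `distributions` that are not installed."""
    missing = []
    for name in distributions:
        try:
            # Raises PackageNotFoundError when the distribution is absent.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Distribution names taken from the error message above.
print(find_missing(["jsonargparse", "docstring-parser"]))
```

An empty list would suggest both distributions are present in the current environment; any name printed is a candidate for `pip install`.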

@sxjscience

sxjscience commented Oct 3, 2022

@JiahaoYao Any plan to finish this PR?

@JiahaoYao
Contributor Author

The hanging issue still remains for the release test:

== Status ==
Current time: 2022-10-03 17:49:05 (running for 00:14:54.03)
Memory usage on this node: 1.7/6.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/3.56 GiB heap, 0.0/1.78 GiB objects
Result logdir: /home/runner/ray_results/tune_mnist
Number of trials: 1/1 (1 RUNNING)
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
| Trial name              | status   | loc             |   batch_size |   layer_1 |   layer_2 |         lr |
|-------------------------+----------+-----------------+--------------+-----------+-----------+------------|
| train_mnist_97d14_00000 | RUNNING  | 10.1.0.228:5153 |           32 |        64 |        64 | 0.00670904 |
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
== Status ==
Current time: 2022-10-03 17:49:10 (running for 00:14:59.03)
Memory usage on this node: 1.7/6.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/3.56 GiB heap, 0.0/1.78 GiB objects
Result logdir: /home/runner/ray_results/tune_mnist
Number of trials: 1/1 (1 RUNNING)
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
| Trial name              | status   | loc             |   batch_size |   layer_1 |   layer_2 |         lr |
|-------------------------+----------+-----------------+--------------+-----------+-----------+------------|
| train_mnist_97d14_00000 | RUNNING  | 10.1.0.228:5153 |           32 |        64 |        64 | 0.00670904 |
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
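To read the status block above more quantitatively, here is a small sketch that extracts the memory and CPU figures and turns them into utilisation fractions. It assumes the two status lines keep exactly the format shown in this log; the function name `usage_fractions` is hypothetical, and the regular expressions are tailored to this output rather than to any documented Ray format:

```python
import re

# Two lines copied from the Tune status output above.
STATUS = """\
Memory usage on this node: 1.7/6.8 GiB
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/3.56 GiB heap, 0.0/1.78 GiB objects
"""


def usage_fractions(status):
    """Return used/total fractions for node memory and requested CPUs."""
    mem = re.search(r"Memory usage on this node: ([\d.]+)/([\d.]+) GiB", status)
    cpu = re.search(r"Resources requested: ([\d.]+)/([\d.]+) CPUs", status)
    mem_used, mem_total = (float(x) for x in mem.groups())
    cpu_used, cpu_total = (float(x) for x in cpu.groups())
    return {"memory": mem_used / mem_total, "cpu": cpu_used / cpu_total}


fractions = usage_fractions(STATUS)
print({k: round(v, 2) for k, v in fractions.items()})  # {'memory': 0.25, 'cpu': 1.0}
```

With these numbers, node memory is about a quarter used while every available CPU is already requested. That reading may point toward CPU availability rather than OOM, though this log alone does not establish the cause of the hang.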

@aga-relation

Any updates on this? :)


@enochkan enochkan left a comment


The approval was a mistake. Sorry for any inconvenience caused.


5 participants