This repository was archived by the owner on Nov 3, 2023. It is now read-only.

ray_horovod multi pid process in the run #182

Open
@JiahaoYao

Description


Suspicion: this is probably an optimizer issue. Optimizers such as Adam store first- and second-order moment estimates for each parameter. Would running the loop in multiple processes like this mess up that state?
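To make the concern concrete, here is a minimal sketch of the state an Adam-style optimizer carries between steps. It is a plain-Python illustration for a single scalar parameter, not PyTorch's implementation; the function name `adam_step` and the `state` dictionary layout are assumptions made for this example.

```python
import math


def adam_step(param, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative only).

    `state` is mutated in place: it holds the step counter and the running
    first- and second-moment estimates that persist across calls.
    """
    state["step"] += 1
    # First moment: exponential moving average of the gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # Second moment: exponential moving average of the squared gradient.
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialisation of m and v.
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


state = {"step": 0, "m": 0.0, "v": 0.0}
w = adam_step(1.0, grad=2.0, state=state)
print(w, state)
```

Every call advances `step`, `m`, and `v`, so an update that ran more often than intended on the same state would change later steps. Whether that actually happens in this issue is not established here.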

Also, if we add print statements to the run function (loops/):

        print("run entry")
        # Uncomment to see which call path reached this run():
        # import traceback
        # traceback.print_stack()
        if self.skip:
            return self.on_skip()

        self.reset()

        self.on_run_start(*args, **kwargs)

        # Debug instrumentation: log the PID and stall on the fourth iteration.
        import os
        from time import sleep
        print(f"{os.getpid()}")
        count = 0
        while not self.done:
            try:
                self.on_advance_start(*args, **kwargs)
                self.advance(*args, **kwargs)
                self.on_advance_end()
                self._restarting = False
                print(f"i am in the {count} round, pid: {os.getpid()}")
                if count == 3:
                    sleep(100)
                count += 1
            except StopIteration:
                break
        self._restarting = False

        output = self.on_run_end()

We then see three concurrent threads going through this function. The output looks like this:

    self.advance(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
    self.advance(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
    self.advance(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 196, in run
    traceback.print_stack()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/workers/default_worker.py", line 238, in <module>
    ray.worker.global_worker.main_loop()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 451, in main_loop
    self.core_worker.run_task_loop()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 675, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/horovod/ray/worker.py", line 61, in execute
    return func(self.executable)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/horovod/ray/runner.py", line 622, in <lambda>
    f = lambda w: fn(*args, **kwargs)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_horovod_launcher.py", line 111, in _func
    return self._wrapping_function(function, model_ref, new_args, kwargs,
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_horovod_launcher.py", line 174, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1358, in _run_train
    self.fit_loop.run()

In summary:

Some of the entries come from self.fit_loop.run() (this is expected),

and some come from self.optimizer_loop.run(split_batch, optimizers, batch_idx) (this is not expected).
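In the traceback, optimizer_loop.run is reached through a chain of run() calls that all pass through loops/base.py. One way to tell nested calls inside a single process apart from genuinely separate processes or threads is to tag each "run entry" message with the loop name, PID, and thread ID. The sketch below uses a hypothetical `Loop` class as a stand-in; it is not Lightning's loop implementation, and the loop names are only borrowed from the traceback.

```python
import os
import threading


class Loop:
    """Hypothetical stand-in for a loop whose run() is shared by all loops."""

    def __init__(self, name, child=None, iterations=1):
        self.name = name
        self.child = child
        self.iterations = iterations

    def run(self, log):
        # Record who entered run(): loop name, process ID, thread ID.
        log.append((self.name, os.getpid(), threading.get_ident()))
        for _ in range(self.iterations):
            if self.child is not None:
                # The child loop goes through the very same run() method.
                self.child.run(log)


optimizer_loop = Loop("optimizer_loop")
batch_loop = Loop("batch_loop", child=optimizer_loop)
epoch_loop = Loop("epoch_loop", child=batch_loop, iterations=2)
fit_loop = Loop("fit_loop", child=epoch_loop)

entries = []
fit_loop.run(entries)
for name, pid, tid in entries:
    print(f"run entry: loop={name} pid={pid} thread={tid}")
```

If every entry shows the same PID and thread ID, the repeated messages are nested calls within one worker. Differing PIDs would point to separate worker processes instead.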
