This repository was archived by the owner on Nov 3, 2023. It is now read-only.
ray_horovod
multi pid process in the run
#182
Open
Description
Suspicion: this is probably an optimizer issue. Optimizers such as Adam store first- and second-moment estimates of the gradients. Could that state get messed up in the process?
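To make the concern concrete, here is a minimal sketch of why extra passes through an optimizer step would matter. It assumes a toy scalar Adam (`TinyAdam` and `train` are hypothetical names for illustration, not PyTorch or Lightning code) and only shows that the stored moments and step counter depend on how many times `step` is called.

```python
import math


class TinyAdam:
    """Toy scalar Adam for illustration only (not the PyTorch implementation)."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # first-moment estimate (running mean of gradients)
        self.v = 0.0  # second-moment estimate (running mean of squared gradients)
        self.t = 0    # step counter used for bias correction

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def train(steps_per_batch, batches=3):
    """Minimize w**2, calling step() steps_per_batch times for each batch."""
    opt, w = TinyAdam(), 1.0
    for _ in range(batches):
        for _ in range(steps_per_batch):
            w = opt.step(w, grad=2 * w)  # gradient of w**2 is 2*w
    return w, opt


w_once, opt_once = train(steps_per_batch=1)
w_twice, opt_twice = train(steps_per_batch=2)
print(f"one step per batch:  w={w_once:.4f}, t={opt_once.t}")
print(f"two steps per batch: w={w_twice:.4f}, t={opt_twice.t}")
```

If the optimizer loop really were entered more often than intended on the same optimizer instance, the state would drift like this. Whether that is what happens here is not established by the sketch.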
Also, if we add print statements to the run function (loops/ ), like this:
    print(f"run entry")
    #import traceback
    #traceback.print_stack()
    if self.skip:
        return self.on_skip()
    self.reset()
    self.on_run_start(*args, **kwargs)
    import os
    print(f'{os.getpid()}')
    count = 0
    while not self.done:
        try:
            self.on_advance_start(*args, **kwargs)
            self.advance(*args, **kwargs)
            self.on_advance_end()
            self._restarting = False
            import os
            print(f'i am in the {count} round, pid: {os.getpid()}')
            from time import sleep
            if count == 3:
                sleep(100)
            count += 1
        except StopIteration:
            break
    self._restarting = False
    output = self.on_run_end()
With these prints in place, we see three concurrent threads going through this function. The output looks like this:
        self.advance(*args, **kwargs)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
        self._outputs = self.epoch_loop.run(self._data_fetcher)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
        self.advance(*args, **kwargs)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
        batch_output = self.batch_loop.run(batch, batch_idx)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
        self.advance(*args, **kwargs)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
        outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 196, in run
        traceback.print_stack()
      File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/workers/default_worker.py", line 238, in <module>
        ray.worker.global_worker.main_loop()
      File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 451, in main_loop
        self.core_worker.run_task_loop()
      File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 675, in actor_method_executor
        return method(__ray_actor, *args, **kwargs)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
        return method(self, *_args, **_kwargs)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/horovod/ray/worker.py", line 61, in execute
        return func(self.executable)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/horovod/ray/runner.py", line 622, in <lambda>
        f = lambda w: fn(*args, **kwargs)
      File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_horovod_launcher.py", line 111, in _func
        return self._wrapping_function(function, model_ref, new_args, kwargs,
      File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_horovod_launcher.py", line 174, in _wrapping_function
        results = function(*args, **kwargs)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
        results = self._run(model, ckpt_path=self.ckpt_path)
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
        results = self._run_stage()
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
        return self._run_train()
      File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1358, in _run_train
        self.fit_loop.run()
At a high level, the calls into run come from two places:
some come from self.fit_loop.run() (this is expected),
and some come from self.optimizer_loop.run(split_batch, optimizers, batch_idx) (this is not expected).
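One way to narrow this down is to tag each debug line with the process id, the thread id, and the concrete loop class. The traceback reaches `run` in `loops/base.py` through several nested `advance` calls, so it may help to check whether the repeated "run entry" lines come from different processes, different threads, or just nested loops in one process. The sketch below uses hypothetical stand-in classes (`Loop`, `FitLoop`, `EpochLoop`, `OptimizerLoop` here are toy classes, not the Lightning ones) and only illustrates the tagging idea.

```python
import os
import threading


class Loop:
    """Hypothetical stand-in for a base class whose run() is shared by all loops."""

    def __init__(self, child=None, iterations=1):
        self.child = child
        self.iterations = iterations

    def run(self, log):
        # Record who entered run(): process id, thread id, concrete class name.
        log.append((os.getpid(), threading.get_ident(), type(self).__name__))
        for _ in range(self.iterations):
            self.advance(log)

    def advance(self, log):
        # A parent loop drives its child loop through the same inherited run().
        if self.child is not None:
            self.child.run(log)


class FitLoop(Loop):
    pass


class EpochLoop(Loop):
    pass


class OptimizerLoop(Loop):
    pass


entries = []
FitLoop(child=EpochLoop(child=OptimizerLoop())).run(entries)
for pid, tid, name in entries:
    print(f"run entry pid={pid} thread={tid} loop={name}")
```

In this toy setup the three entries share one pid and one thread id and differ only in the class name. If the real output showed the same pattern, the entry from the optimizer loop would be a nested call rather than a separate thread or process. If the pids or thread ids differ, the concurrency concern stands.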