This repository was archived by the owner on Nov 3, 2023. It is now read-only.
ray_ddp gpu issue #179
Open
Description
ray::ImplicitFunc.train() (pid=27359, ip=172.31.59.24, repr=_inner_train)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trainable.py", line 360, in train
result = self.step()
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 404, in step
self._report_thread_runner_error(block=True)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
raise e
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
self._entrypoint()
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
return self._trainable_func(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
output = fn()
File "test_tune.py", line 37, in _inner_train
trainer.fit(model)
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 62, in launch
ray_output = self.run_function_on_workers(
File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 224, in run_function_on_workers
results = process_results(self._futures, self.tune_queue)
File "/home/ray/default/ray_lightning/ray_lightning/util.py", line 62, in process_results
ray.get(ready)
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=27475, ip=172.31.59.24, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7f2c3c105610>)
File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 356, in execute
return fn(*args, **kwargs)
File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 256, in _wrapping_function
self._strategy.set_cuda_device_if_used()
File "/home/ray/default/ray_lightning/ray_lightning/ray_ddp.py", line 233, in set_cuda_device_if_used
torch.cuda.set_device(self.root_device)
File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
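Running trainer.fit(model) with ray_ddp on GPU gives CUDA error: invalid device ordinal. The error is raised by torch.cuda.set_device(self.root_device) in set_cuda_device_if_used.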
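For anyone debugging this: "invalid device ordinal" generally means the index passed to torch.cuda.set_device is not smaller than the number of GPUs visible to that process. Ray normally sets CUDA_VISIBLE_DEVICES for each worker, so a physical GPU id such as 3 may only be reachable as ordinal 0 or 1 inside the worker. Whether that is the cause here is not confirmed by the traceback. The sketch below is only an illustration of the remapping; `local_cuda_ordinal` is a hypothetical helper, not part of ray_lightning or PyTorch.

```python
import os
from typing import Optional


def local_cuda_ordinal(physical_gpu_id: int,
                       visible_devices: Optional[str] = None) -> int:
    """Map a physical GPU id to the ordinal CUDA expects in this process.

    Hypothetical helper for illustration; assumes CUDA_VISIBLE_DEVICES
    holds a comma-separated list of integer GPU ids.
    """
    if visible_devices is None:
        visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible_devices is None:
        # No masking in effect: physical ids and ordinals coincide.
        return physical_gpu_id

    visible = [d.strip() for d in visible_devices.split(",") if d.strip()]
    try:
        # The ordinal is the position of the physical id in the mask.
        return visible.index(str(physical_gpu_id))
    except ValueError:
        raise ValueError(
            f"GPU {physical_gpu_id} is not visible to this process "
            f"(CUDA_VISIBLE_DEVICES={visible_devices!r})"
        ) from None


if __name__ == "__main__":
    # With the mask "2,3", physical GPU 3 is ordinal 1 inside the process.
    print(local_cuda_ordinal(3, "2,3"))
```

Inside a worker, one could compare the index held in root_device against torch.cuda.device_count() and the ids returned by ray.get_gpu_ids() before calling torch.cuda.set_device, to see whether the index is out of range for that process.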