This repository was archived by the owner on Nov 3, 2023. It is now read-only.

ray_ddp gpu issue #179

Open
@JiahaoYao

ray::ImplicitFunc.train() (pid=27359, ip=172.31.59.24, repr=_inner_train)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trainable.py", line 360, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 404, in step
    self._report_thread_runner_error(block=True)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
    raise e
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
    self._entrypoint()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
    return self._trainable_func(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
    output = fn()
  File "test_tune.py", line 37, in _inner_train
    trainer.fit(model)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 62, in launch
    ray_output = self.run_function_on_workers(
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 224, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/ray/default/ray_lightning/ray_lightning/util.py", line 62, in process_results
    ray.get(ready)
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=27475, ip=172.31.59.24, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7f2c3c105610>)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 356, in execute
    return fn(*args, **kwargs)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 256, in _wrapping_function
    self._strategy.set_cuda_device_if_used()
  File "/home/ray/default/ray_lightning/ray_lightning/ray_ddp.py", line 233, in set_cuda_device_if_used
    torch.cuda.set_device(self.root_device)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

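The run fails in set_cuda_device_if_used (ray_lightning/ray_ddp.py, line 233): torch.cuda.set_device(self.root_device) raises RuntimeError: CUDA error: invalid device ordinal.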
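
For anyone debugging the same traceback, here is a minimal diagnostic sketch. It assumes, without confirmation from this issue, that the device index passed to torch.cuda.set_device is outside the range of GPUs visible to the worker process. CUDA numbers devices relative to CUDA_VISIBLE_DEVICES, so an index that is valid on the host can be invalid inside a worker that only sees a subset of the GPUs. The helper names visible_cuda_devices and check_device_ordinal are hypothetical and are not part of ray_lightning, Ray, or PyTorch.

```python
import os


def visible_cuda_devices(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES, or None if it is unset.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # An empty string means no devices are visible, which yields [].
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


def check_device_ordinal(index, device_count):
    """Return `index` if it is a valid ordinal among `device_count` devices.

    Raises ValueError with a descriptive message otherwise, so the failure
    is reported before the CUDA runtime rejects the ordinal.
    """
    if not 0 <= index < device_count:
        raise ValueError(
            f"device index {index} is out of range: "
            f"{device_count} device(s) visible "
            f"(CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')!r})"
        )
    return index
```

Inside a worker, one could call check_device_ordinal(self.root_device.index, torch.cuda.device_count()) just before torch.cuda.set_device, and log visible_cuda_devices() alongside it. That would show whether the index and the visible device list disagree, though it does not by itself identify why.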
