Error in RayStrategy.root_device when using multi GPU node #192
Description
Problem statement
When starting a hyperparameter search on a multi-GPU node (4 GPUs), I run into a mismatch of visible CUDA devices. Below is the full code to recreate the error (it is the same as the example found here, modified to use the GPU and with a different local_dir).
Code to recreate
from ray import tune
from ray_lightning import RayStrategy
from ray_lightning.examples.ray_ddp_example import MNISTClassifier
from ray_lightning.tune import TuneReportCallback, get_tune_resources
import pytorch_lightning as pl


def train_mnist(config):
    # Create your PTL model.
    model = MNISTClassifier(config)

    # Create the Tune Reporting Callback
    metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
    callbacks = [TuneReportCallback(metrics, on="validation_end")]

    trainer = pl.Trainer(
        max_epochs=4,
        callbacks=callbacks,
        strategy=RayStrategy(use_gpu=True),
    )
    trainer.fit(model)


config = {
    "layer_1": tune.choice([32, 64, 128]),
    "layer_2": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

# Make sure to pass in ``resources_per_trial`` using the ``get_tune_resources`` utility.
analysis = tune.run(
    train_mnist,
    metric="loss",
    mode="min",
    config=config,
    num_samples=10,
    resources_per_trial=get_tune_resources(use_gpu=True),
    local_dir='~/scratch/raytune',
    name="tune_mnist")

print("Best hyperparameters found were: ", analysis.best_config)
Modified ray_lightning code
To highlight the error I am experiencing, I have modified the code in ray_lightning.ray_ddp.RayStrategy.root_device
from
if cuda_visible_str and cuda_visible_str != "NoDevFiles":
    cuda_visible_list = [
        int(dev) for dev in cuda_visible_str.split(",")
    ]
    device_id = cuda_visible_list.index(gpu_id)
    return torch.device("cuda", device_id)
to
if cuda_visible_str and cuda_visible_str != "NoDevFiles":
    cuda_visible_list = [
        int(dev) for dev in cuda_visible_str.split(",")
    ]
    try:
        device_id = cuda_visible_list.index(gpu_id)
    except ValueError as err:
        raise ValueError(f'cuda_visible_str -> "{cuda_visible_str}", cuda_visible_list -> "{cuda_visible_list}", gpu_id -> "{gpu_id}"') from err
    return torch.device("cuda", device_id)
Error log output
Below is the output from an error.txt log in one of the trials:
Failure # 1 (occurred at 2022-08-05_17-02-02)
ray::ImplicitFunc.train() (pid=195537, ip=10.10.8.7, repr=train_mnist)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/trainable.py", line 360, in train
    result = self.step()
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 404, in step
    self._report_thread_runner_error(block=True)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
    raise e
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 277, in run
    self._entrypoint()
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
    return self._trainable_func(
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
    output = fn()
  File "/mnt/iusers01/gb01/c38028ml/Dev/git/phd-torchscripts/PhD/torchscripts/pcconv/tests/raytune_test.py", line 24, in train_mnist
    trainer.fit(model)
  File "/mnt/iusers01/gb01/c38028ml/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/mnt/iusers01/gb01/c38028ml/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 719, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/ray_launcher.py", line 55, in launch
    ray_output = self.run_function_on_workers(
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/ray_launcher.py", line 229, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/util.py", line 64, in process_results
    ray.get(ready)
ray.exceptions.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=195612, ip=10.10.8.7, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x2b876ccc0bb0>)
ValueError: '2' is not in list

The above exception was the direct cause of the following exception:

ray::RayExecutor.execute() (pid=195612, ip=10.10.8.7, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x2b876ccc0bb0>)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/ray_launcher.py", line 267, in _wrapping_function
    self._strategy._worker_setup(process_idx=global_rank)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/ray_ddp.py", line 155, in _worker_setup
    self._process_group_backend = self._get_process_group_backend()
  File "/mnt/iusers01/gb01/c38028ml/.local/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 163, in _get_process_group_backend
    or get_default_process_group_backend_for_device(self.root_device)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/ray_ddp.py", line 258, in root_device
    raise ValueError(f'cuda_visible_str -> "{cuda_visible_str}", cuda_visible_list -> "{cuda_visible_list}", gpu_id -> "{gpu_id}"') from err
ValueError: cuda_visible_str -> "2", cuda_visible_list -> "[2]", gpu_id -> "2"
Other trials fail with a similar error, differing only in the gpu_id reported, e.g.
ValueError: cuda_visible_str -> "0", cuda_visible_list -> "[0]", gpu_id -> "0"
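For what it's worth, list.index raises exactly this ValueError when the element is not found, and the quoting in '2' is not in list makes me suspect gpu_id may arrive as a string while cuda_visible_list holds ints; that is only my assumption and I have not verified it in the ray_lightning source. A minimal illustration of the behaviour:
# Minimal sketch of list.index behaviour; the string/int mismatch is an
# assumption on my part, not something confirmed in ray_lightning.
cuda_visible_list = [2]              # parsed from CUDA_VISIBLE_DEVICES="2"

print(cuda_visible_list.index(2))    # 0 -- an int gpu_id maps to ordinal 0 as intended
try:
    cuda_visible_list.index("2")     # a string gpu_id is never found among ints
except ValueError as err:
    print(err)                       # "'2' is not in list" -- same message as in the trace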
Extra information
This code is submitted as a job on an HPC (SGE) cluster, and I can confirm that the main script sees the environment variable ${CUDA_VISIBLE_DEVICES} set to "0,1,2,3".
It seems that, for some reason, each trial process only has access to a single CUDA device, so indexing into cuda_visible_list fails.
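To double-check what each process sees, a small diagnostic along these lines could be dropped into both the driver script and train_mnist (a sketch; it relies only on the environment variable and ray.get_gpu_ids()):
import os
import ray

def log_gpu_visibility(tag):
    # Diagnostic sketch: print what the current process is allowed to see.
    # Ray sets CUDA_VISIBLE_DEVICES per worker, so inside a trial it should
    # contain only the GPU(s) assigned to that trial.
    print(f"[{tag}] CUDA_VISIBLE_DEVICES =",
          os.environ.get("CUDA_VISIBLE_DEVICES"))
    # ray.get_gpu_ids() returns the GPU ids Ray has reserved for the current
    # worker (it may be empty when called outside a Ray task or actor).
    print(f"[{tag}] ray.get_gpu_ids() =", ray.get_gpu_ids())
Calling log_gpu_visibility("driver") in the main script and log_gpu_visibility("trial") at the top of train_mnist should confirm whether each trial really sees only one device.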
To try to find a solution, I made the following change to the code, from:
device_id = cuda_visible_list.index(gpu_id)
to:
if len(cuda_visible_list) == 1:
    device_id = cuda_visible_list[0]
else:
    device_id = cuda_visible_list.index(gpu_id)
This resulted in the following error:
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/ray_launcher.py", line 271, in _wrapping_function
    set_cuda_device_if_used(trainer.strategy)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/util.py", line 102, in set_cuda_device_if_used
    torch.cuda.set_device(strategy.root_device)
  File "/opt/apps/apps/binapps/pytorch/1.11.0/python3.9/gpu-cuda11.3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 313, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
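My current understanding (an assumption on my part, not something confirmed by the maintainers) is that CUDA_VISIBLE_DEVICES renumbers the visible devices from 0 within each process, so a worker that only sees physical GPU 2 has to address it as torch ordinal 0; returning cuda_visible_list[0] (i.e. 2) then trips the invalid-device-ordinal check. A sketch of the mapping I would expect root_device to perform:
import os
import torch

def root_device_sketch(gpu_id):
    # Sketch only -- not the library's implementation. Maps the gpu_id that
    # Ray assigned to this worker onto a torch ordinal that is valid inside
    # the worker's own CUDA_VISIBLE_DEVICES context.
    cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if cuda_visible_str and cuda_visible_str != "NoDevFiles":
        cuda_visible_list = [int(dev) for dev in cuda_visible_str.split(",")]
        # CUDA renumbers the visible devices starting from 0, so the right
        # ordinal is the position of gpu_id in the visible list; int() guards
        # against gpu_id arriving as a string.
        return torch.device("cuda", cuda_visible_list.index(int(gpu_id)))
    # Fall back to the raw id when CUDA_VISIBLE_DEVICES is not set.
    return torch.device("cuda", int(gpu_id))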