This repository was archived by the owner on Nov 3, 2023. It is now read-only.

Error in RayStrategy.root_device when using a multi-GPU node #192

Open

@m-lyon

Problem statement

When starting a hyperparameter search on a multi-GPU node (4 GPUs), I run into a mismatch of visible CUDA devices. Below is the full code to recreate the error (it is the same as the example found here, with modifications to use the GPU and to change local_dir).

Code to recreate

from ray import tune

from ray_lightning import RayStrategy
from ray_lightning.examples.ray_ddp_example import MNISTClassifier
from ray_lightning.tune import TuneReportCallback, get_tune_resources

import pytorch_lightning as pl


def train_mnist(config):

    # Create your PTL model.
    model = MNISTClassifier(config)

    # Create the Tune Reporting Callback
    metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
    callbacks = [TuneReportCallback(metrics, on="validation_end")]

    trainer = pl.Trainer(
        max_epochs=4,
        callbacks=callbacks,
        strategy=RayStrategy(use_gpu=True),
    )
    trainer.fit(model)

config = {
    "layer_1": tune.choice([32, 64, 128]),
    "layer_2": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

# Make sure to pass in ``resources_per_trial`` using the ``get_tune_resources`` utility.
analysis = tune.run(
    train_mnist,
    metric="loss",
    mode="min",
    config=config,
    num_samples=10,
    resources_per_trial=get_tune_resources(use_gpu=True),
    local_dir='~/scratch/raytune',
    name="tune_mnist")

print("Best hyperparameters found were: ", analysis.best_config)

Modified ray_lightning code

To highlight the error I am experiencing, I have modified the code within ray_lightning.ray_ddp.RayStrategy.root_device from

                if cuda_visible_str and cuda_visible_str != "NoDevFiles":
                    cuda_visible_list = [
                        int(dev) for dev in cuda_visible_str.split(",")
                    ]
                    device_id = cuda_visible_list.index(gpu_id)
                    return torch.device("cuda", device_id)

to

                if cuda_visible_str and cuda_visible_str != "NoDevFiles":
                    cuda_visible_list = [
                        int(dev) for dev in cuda_visible_str.split(",")
                    ]
                    try:
                        device_id = cuda_visible_list.index(gpu_id)
                    except ValueError as err:
                        raise ValueError(f'cuda_visible_str -> "{cuda_visible_str}", cuda_visible_list -> "{cuda_visible_list}", gpu_id -> "{gpu_id}",') from err
                    return torch.device("cuda", device_id)

Error log output

Below is the output from an error.txt log in one of the trials:

Failure # 1 (occurred at 2022-08-05_17-02-02)
ray::ImplicitFunc.train() (pid=195537, ip=10.10.8.7, repr=train_mnist)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/trainable.py", line 360, in train
    result = self.step()
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 404, in step
    self._report_thread_runner_error(block=True)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
    raise e
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 277, in run
    self._entrypoint()
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
    return self._trainable_func(
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
    output = fn()
  File "/mnt/iusers01/gb01/c38028ml/Dev/git/phd-torchscripts/PhD/torchscripts/pcconv/tests/raytune_test.py", line 24, in train_mnist
    trainer.fit(model)
  File "/mnt/iusers01/gb01/c38028ml/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/mnt/iusers01/gb01/c38028ml/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 719, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/ray_launcher.py", line 55, in launch
    ray_output = self.run_function_on_workers(
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/ray_launcher.py", line 229, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/util.py", line 64, in process_results
    ray.get(ready)
ray.exceptions.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=195612, ip=10.10.8.7, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x2b876ccc0bb0>)
ValueError: '2' is not in list

The above exception was the direct cause of the following exception:

ray::RayExecutor.execute() (pid=195612, ip=10.10.8.7, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x2b876ccc0bb0>)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/ray_launcher.py", line 267, in _wrapping_function
    self._strategy._worker_setup(process_idx=global_rank)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/ray_ddp.py", line 155, in _worker_setup
    self._process_group_backend = self._get_process_group_backend()
  File "/mnt/iusers01/gb01/c38028ml/.local/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 163, in _get_process_group_backend
    or get_default_process_group_backend_for_device(self.root_device)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/ray_ddp.py", line 258, in root_device
    raise ValueError(f'cuda_visible_str -> "{cuda_visible_str}", cuda_visible_list -> "{cuda_visible_list}", gpu_id -> "{gpu_id}"') from err
ValueError: cuda_visible_str -> "2", cuda_visible_list -> "[2]", gpu_id -> "2"

Other trials fail with a similar error, differing only in the gpu_id reported, e.g.

ValueError: cuda_visible_str -> "0", cuda_visible_list -> "[0]", gpu_id -> "0"

Extra information

This code is submitted as a job on an HPC (SGE) cluster, and I can confirm that in the main script the environment variable ${CUDA_VISIBLE_DEVICES} is equal to "0,1,2,3".

It seems that, for some reason, each trial process only has access to one CUDA device, so the cuda_visible_list.index(gpu_id) lookup fails.
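
As an independent sanity check, a bare Tune run like the sketch below can confirm what each trial process is actually given (ray.get_gpu_ids() and a plain resources_per_trial dict are standard Ray APIs; the exact report call may differ slightly between Ray versions):

import os

import ray
from ray import tune


def check_visible_devices(config):
    # Print what Ray has exposed to this trial process.
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("ray.get_gpu_ids():", ray.get_gpu_ids())
    tune.report(done=1)


tune.run(
    check_visible_devices,
    num_samples=4,
    resources_per_trial={"cpu": 1, "gpu": 1})

On this node I would expect each trial to print a single device ID, consistent with the errors above.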

To try to find a solution, I made the following change to the code:

Change from:

device_id = cuda_visible_list.index(gpu_id)

to:

if len(cuda_visible_list) == 1:
    device_id = cuda_visible_list[0]
else:
    device_id = cuda_visible_list.index(gpu_id)

This resulted in the following error:

  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/launchers/ray_launcher.py", line 271, in _wrapping_function
    set_cuda_device_if_used(trainer.strategy)
  File "/mnt/iusers01/gb01/c38028ml/.conda/envs/torch/lib/python3.9/site-packages/ray_lightning/util.py", line 102, in set_cuda_device_if_used
    torch.cuda.set_device(strategy.root_device)
  File "/opt/apps/apps/binapps/pytorch/1.11.0/python3.9/gpu-cuda11.3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 313, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
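
My reading of this second failure (an assumption on my part, not verified against the ray_lightning internals) is that the worker's CUDA_VISIBLE_DEVICES is restricted to a single device (e.g. "2"), so PyTorch enumerates that GPU as ordinal 0 and torch.cuda.set_device(2) is out of range; returning cuda_visible_list[0] hands back the physical device ID rather than the relative ordinal. The original ValueError also quotes gpu_id as '2', which suggests it arrives as a string while cuda_visible_list contains ints, in which case normalising the types before the lookup might be enough. A sketch I have not tested:

                if cuda_visible_str and cuda_visible_str != "NoDevFiles":
                    cuda_visible_list = [
                        int(dev) for dev in cuda_visible_str.split(",")
                    ]
                    # Untested: cast gpu_id to int so the lookup succeeds; the
                    # relative index is then the ordinal PyTorch expects (0 here).
                    device_id = cuda_visible_list.index(int(gpu_id))
                    return torch.device("cuda", device_id)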

Labels

bug (Something isn't working)