Error in RayStrategy.root_device when using multi GPU node #192
Hi @m-lyon, I am trying to look into the problem. I wonder if it is possible to remove
I am worried about that
== Status ==
Current time: 2022-08-05 16:13:21 (running for 00:01:46.10)
Memory usage on this node: 22.1/186.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 8.0/48 CPUs, 4.0/4 GPUs, 0.0/120.88 GiB heap, 0.0/55.8 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /home/ray/scratch/raytune/tune_mnist
Number of trials: 10/10 (6 PENDING, 4 RUNNING)
+-------------------------+----------+--------------------+--------------+-----------+-----------+-------------+
| Trial name | status | loc | batch_size | layer_1 | layer_2 | lr |
|-------------------------+----------+--------------------+--------------+-----------+-----------+-------------|
| train_mnist_f3985_00000 | RUNNING | 172.31.91.173:2066 | 64 | 128 | 128 | 0.000301458 |
| train_mnist_f3985_00001 | RUNNING | 172.31.91.173:2132 | 32 | 64 | 128 | 0.0120597 |
| train_mnist_f3985_00002 | RUNNING | 172.31.91.173:2134 | 64 | 128 | 256 | 0.00052504 |
| train_mnist_f3985_00003 | RUNNING | 172.31.91.173:2136 | 32 | 64 | 128 | 0.00301416 |
| train_mnist_f3985_00004 | PENDING | | 128 | 64 | 64 | 0.0749624 |
| train_mnist_f3985_00005 | PENDING | | 64 | 32 | 128 | 0.000440828 |
| train_mnist_f3985_00006 | PENDING | | 32 | 32 | 256 | 0.00521079 |
| train_mnist_f3985_00007 | PENDING | | 32 | 64 | 64 | 0.00171421 |
| train_mnist_f3985_00008 | PENDING | | 64 | 128 | 64 | 0.00174926 |
| train_mnist_f3985_00009 | PENDING | | 128 | 64 | 64 | 0.00142737 |
+-------------------------+----------+--------------------+--------------+-----------+-----------+-------------+
Epoch 0: 12%|█▏ | 108/937 [00:00<00:06, 129.34it/s, loss=0.85, v_num=0]
Epoch 0: 11%|█ | 99/937 [00:00<00:07, 118.54it/s, loss=0.561, v_num=0]
Epoch 0: 7%|▋ | 135/1874 [00:01<00:14, 124.06it/s, loss=0.432, v_num=0]
Epoch 0: 7%|▋ | 133/1874 [00:01<00:14, 121.89it/s, loss=0.466, v_num=0]
Epoch 0: 13%|█▎ | 122/937 [00:00<00:06, 130.38it/s, loss=0.719, v_num=0]
Epoch 0: 12%|█▏ | 113/937 [00:00<00:06, 120.26it/s, loss=0.49, v_num=0]
Epoch 0: 8%|▊ | 150/1874 [00:01<00:13, 125.89it/s, loss=0.446, v_num=0]
Epoch 0: 8%|▊ | 149/1874 [00:01<00:13, 124.88it/s, loss=0.427, v_num=0]
Epoch 0: 15%|█▍ | 136/937 [00:01<00:06, 131.28it/s, loss=0.661, v_num=0]
Epoch 0: 14%|█▎ | 127/937 [00:01<00:06, 122.10it/s, loss=0.425, v_num=0]
Epoch 0: 9%|▉ | 166/1874 [00:01<00:13, 128.13it/s, loss=0.47, v_num=0]
Epoch 0: 9%|▉ | 165/1874 [00:01<00:13, 127.30it/s, loss=0.469, v_num=0]
Epoch 0: 15%|█▍ | 137/937 [00:01<00:06, 131.29it/s, loss=0.666, v_num=0]
Epoch 0: 9%|▉ | 166/1874 [00:01<00:13, 127.45it/s, loss=0.469, v_num=0]
Epoch 0: 9%|▉ | 166/1874 [00:01<00:13, 127.40it/s, loss=0.461, v_num=0]
Epoch 0: 14%|█▎ | 128/937 [00:01<00:06, 122.19it/s, loss=0.432, v_num=0]
Epoch 0: 9%|▉ | 167/1874 [00:01<00:13, 128.27it/s, loss=0.47, v_num=0]
Epoch 0: 9%|▉ | 167/1874 [00:01<00:13, 128.24it/s, loss=0.47, v_num=0]
Epoch 0: 16%|█▌ | 152/937 [00:01<00:05, 132.45it/s, loss=0.594, v_num=0]
Epoch 0: 15%|█▌ | 143/937 [00:01<00:06, 124.40it/s, loss=0.429, v_num=0]
Hi @m-lyon, I can successfully run the jobs.
This is my GPU usage:
(base) ray@ip-172-31-91-173:~/default$ nvidia-smi
Fri Aug 5 16:12:58 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |
| N/A 42C P0 26W / 70W | 503MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |
| N/A 43C P0 27W / 70W | 511MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |
| N/A 43C P0 26W / 70W | 501MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 42C P0 27W / 70W | 489MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 6710 C 501MiB |
| 1 N/A N/A 6708 C 509MiB |
| 2 N/A N/A 6393 C 499MiB |
| 3 N/A N/A 6709 C 487MiB |
+-----------------------------------------------------------------------------+
issue_192.mp4
I did a run where I set
I suppose if you're unable to recreate the error then I'm going to have to do some further digging into the codebase to try and find the cause of the problem. To be honest, I'm unsure why
Hi @m-lyon, may I ask several questions:
The reason, in other words, is that Ray Tune will separate the CUDA visible environments per trial:
In this case, it should use
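A minimal sketch of that per-trial mapping, assuming Ray has already restricted CUDA_VISIBLE_DEVICES for the worker process (the helper name and structure are illustrative, not ray_lightning's actual code):

```python
import os

import ray
import torch


def local_root_device() -> torch.device:
    """Map the GPU that Ray assigned to this worker onto a local torch device."""
    gpu_ids = ray.get_gpu_ids()  # e.g. ["2"] (older Ray versions return ints)
    if not gpu_ids:
        return torch.device("cpu")
    # PyTorch numbers devices relative to CUDA_VISIBLE_DEVICES, so the position
    # of the assigned GPU inside that list is the local device index.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")
    return torch.device("cuda", visible.index(str(gpu_ids[0])))
```

If the trial really only sees a single GPU, that index is 0, which is what `cuda:0` expresses.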
Also, I wonder if it is possible for you to try a clean and empty AWS or GCP machine. Sometimes an HPC (SGE) server might have a different environment setup.
Using
As in, have I just used
I'm a university student using the HPC system provided, so I don't really have the financial means to use a commercial setup to conduct the experiments I'm trying to run, unfortunately.
Your change fails due to
By the way, can you also show `ray status`?
For the second item, I mean
So would this work?

```python
if len(cuda_visible_list) == 1:
    return torch.device('cuda:0')
```
I can confirm the above code has fixed this issue. I guess a question is then, given the code below: if each trial should only see CUDA_VISIBLE_DEVICES=0 (because Ray sets this for each trial), then how does this ever not produce an error for trials in parallel where NGPUs > 1? Because gpu_id and cuda_visible_list are both not 0 when the desired device is torch.device('cuda', 0).

```python
gpu_id = ray.get_gpu_ids()[0]
cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
cuda_visible_list = [
    int(dev) for dev in cuda_visible_str.split(",")
]
device_id = cuda_visible_list.index(gpu_id)
return torch.device("cuda", device_id)
```
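To make the intended behaviour concrete, here is a small worked example of that lookup with hypothetical values (the concrete IDs are illustrative, not taken from the logs above):

```python
# Intended case: Ray assigned this trial physical GPU 2 and restricted
# CUDA_VISIBLE_DEVICES to "2" for the trial process, so the parsed list
# contains only that one ID and the lookup maps it to local index 0.
gpu_id = 2
cuda_visible_list = [2]
print(cuda_visible_list.index(gpu_id))   # 0 -> torch.device("cuda", 0)

# If the assigned ID is not present in the parsed list (wrong value, or even
# just a different type such as "2" vs 2), .index() raises ValueError instead.
try:
    [0].index(gpu_id)
except ValueError as err:
    print(err)                           # "2 is not in list"
```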
Can you also print `nvidia-smi` in the command line during the training, and can you find whether all the GPUs are used?
Hi @m-lyon, I wonder whether your issue is solved?
As far as I can tell, setting each device to
Specifically, as it is a batched submission system, I'm unable to open an interactive terminal on the same processing node as the one that is running the raytune job. Having said that though, training times are indicative of GPU use, so it seems that this is the case (CPU training would obviously be much, much slower).
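Since an interactive terminal is not available on the compute node, one workaround is to dump GPU usage from inside the training script itself. A small sketch, assuming `nvidia-smi` is on the PATH of the worker (the helper name and log path are illustrative):

```python
import subprocess
from pathlib import Path


def log_nvidia_smi(log_path: str = "gpu_usage.log") -> None:
    """Append the current nvidia-smi output to a log file on the worker."""
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    with Path(log_path).open("a") as f:
        f.write(result.stdout + "\n")
```

Calling this from, say, a validation-epoch hook would leave a record of per-GPU utilisation that can be inspected after the batch job finishes.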
After some time training with the aforementioned code fix, I ran into this error for some of the trials:
This was with the additional configuration of using
Afterwards I thought I would try DDP across the 4 GPUs available on the node, but in this configuration the original error was raised:
To me this seems very odd, as the visible list clearly contains GPU ID 0. In any case,
So the above error hinted at what the problem was here. From the raytune GPU documentation:
Crucially, it is a list of strings, whereas in the `root_device` code the entries are parsed as ints:

```python
cuda_visible_list = [
    int(dev) for dev in cuda_visible_str.split(",")
]
```

So it's a type mismatch. I wonder how this does not cause an error every time, though? Possible fix: changing
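A minimal sketch of what such a fix could look like, keeping both sides of the comparison as strings so the `.index()` lookup cannot hit an int/str mismatch (illustrative only, not the library's actual patch):

```python
import os

import ray
import torch


def root_device_string_safe() -> torch.device:
    gpu_ids = ray.get_gpu_ids()  # a list of strings on recent Ray versions
    if not gpu_ids:
        return torch.device("cpu")
    # Keep the visible-device entries as strings and compare string to string.
    cuda_visible_list = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")
    device_id = cuda_visible_list.index(str(gpu_ids[0]))
    return torch.device("cuda", device_id)
```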
Update: after implementing the above fix I immediately ran into the CUDA error:
For reference, here is the tune.run setup:

```python
algo = TuneBOHB(metric="loss", mode="min")
bohb = HyperBandForBOHB(
    time_attr="training_iteration",
    metric="loss",
    mode="min",
    max_t=30
)
analysis = tune.run(
    train_model_raytune,
    config=configuration,
    scheduler=bohb,
    search_alg=algo,
    num_samples=30,
    resources_per_trial=get_tune_resources(num_workers=4, use_gpu=True),
    local_dir='~/scratch/raytune',
    name='initial_raytune_test3',
)
```

and the Trainer setup:

```python
callbacks = [TuneReportCallback({'loss': 'val_loss'}, on='validation_end')]
trainer = pl.Trainer(
    max_epochs=MAX_EPOCHS,
    callbacks=callbacks,
    strategy=RayStrategy(num_workers=4, use_gpu=True),
)
```
Problem statement
When starting a hyperparameter search on a multi GPU node (4 GPUs) I run into a mismatch of visible CUDA devices. Below is the full code to recreate the error (it is the same as found here, with a modification to use GPU as well as a change in `local_dir`).
Code to recreate
Modified ray_lightning code
To highlight the error I am experiencing, I have modified the code within `ray_lightning.ray_dpp.RayStrategy.root_device` from
to
Error log output
Below is the output from an `error.txt` log in one of the trials:
Other trials fail with a similar error, differing only with the `gpu_id` present, e.g.
Extra information
This code is submitted as a job on an HPC (SGE) server, and I can confirm that the main script has the environment variable `${CUDA_VISIBLE_DEVICES}` equal to "0,1,2,3".
It seems that for some reason each trial process only has access to one CUDA device, therefore indexing the `cuda_visible_list` fails.
To try and find a solution I tried the following change to the code:
Change from:
to:
Which resulted in the following error: