Multi-GPU training fails when using GCP Deep Learning image #6812

Closed · minimaxir opened this issue Apr 3, 2021 · 7 comments

@minimaxir
🐛 Bug

Multi-GPU training fails when using the GCP Deep Learning image. The failure occurs when launching from a terminal, with the dp and ddp_spawn accelerators; it does not occur with the ddp accelerator, and it does not occur when using the same system for single-GPU training.
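
For reference, a minimal sketch (not from the original report) of the Trainer configurations being compared, using the BoringModel script from the To Reproduce section:

import pytorch_lightning as pl

# Fails: no accelerator specified, so Lightning falls back to ddp_spawn on this 4-GPU VM.
trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4)

# Also reported to fail:
trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4, accelerator="dp")

# Reported to not hit this error:
trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4, accelerator="ddp")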

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 114, in _main
    prepare(preparation_data)
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "/opt/conda/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/opt/conda/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/max/boring_model_multigpu_(2).py", line 88, in <module>
    trainer.fit(model, train, val)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in f
it
    self.dispatch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in d
ispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 
73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py"
, line 108, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 179, in start_p
rocesses
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.
        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:
            if __name__ == '__main__':
                freeze_support()
                ...
        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

To Reproduce

Create a GCP VM with the following properties:

  1. n1-highmem-2, 4 T4 GPUs
  2. Deep Learning on Linux OS, PyTorch 1.8 m110 m66 version
  3. Allow HTTP/HTTPS Traffic, Preemptible On

After SSHing into the VM and installing the CUDA drivers (you may need to run sudo /opt/deeplearning/install-driver.sh), install pytorch-lightning via pip3 install pytorch-lightning.
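
As a quick sanity check (a minimal sketch, not part of the original report), you can confirm the driver install worked and all four T4s are visible before launching training:

import torch

# Both checks should pass once the driver install has completed.
print(torch.cuda.is_available())   # expected: True
print(torch.cuda.device_count())   # expected: 4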

Then run:

import os

import torch

from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning import LightningModule

tmpdir = os.getcwd()


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x["x"] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log("fake_test_acc", loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


num_samples = 10000

train = RandomDataset(32, num_samples)
train = DataLoader(train, batch_size=32)

val = RandomDataset(32, num_samples)
val = DataLoader(val, batch_size=32)

test = RandomDataset(32, num_samples)
test = DataLoader(test, batch_size=32)


model = BoringModel()

# Initialize a trainer
trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4)

# Train the model ⚡
trainer.fit(model, train, val)

Expected behavior

Environment

  • CUDA:
    - GPU:
    - Tesla T4
    - Tesla T4
    - Tesla T4
    - Tesla T4
    - available: True
    - version: 11.1
  • Packages:
    - numpy: 1.19.5
    - pyTorch_debug: False
    - pyTorch_version: 1.8.0
    - pytorch-lightning: 1.2.6
    - tqdm: 4.59.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    -
    - processor:
    - python: 3.7.10
    - version: #1 SMP Debian 4.19.181-1 (2021-03-19)

Additional context

This was the issue I hit when debugging minimaxir/aitextgen#103; since it occurs with the BoringModel, it may not be aitextgen's fault (maybe). cc @SeanNaren

minimaxir added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 3, 2021
kaushikb11 self-assigned this on Apr 5, 2021
@kaushikb11 (Contributor)

Hi @minimaxir!

The error is not specific to the GCP Deep Learning image. As the traceback notes, you need an if __name__ == '__main__': guard in your script for the ddp_spawn accelerator to work.
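
In short, the entry point should look like the following (a minimal sketch; test_run here stands in for the code that builds the DataLoaders, model, and Trainer, as in the full script below):

def test_run():
    ...  # build the DataLoaders, the BoringModel, and the Trainer, then call trainer.fit(...)


if __name__ == '__main__':
    # ddp_spawn re-imports this module in each spawned worker process,
    # so the training entry point must only run in the main process.
    test_run()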

Required changes to your script:

import os

import torch

from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning import LightningModule

tmpdir = os.getcwd()


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x["x"] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log("fake_test_acc", loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

def test_run():
    num_samples = 10000

    train = RandomDataset(32, num_samples)
    train = DataLoader(train, batch_size=32)

    val = RandomDataset(32, num_samples)
    val = DataLoader(val, batch_size=32)

    test = RandomDataset(32, num_samples)
    test = DataLoader(test, batch_size=32)


    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4, accelerator="ddp_spawn")

    # Train the model ⚡
    trainer.fit(model, train, val)
    trainer.test(test_dataloaders=test)
    
if __name__ == '__main__':
    test_run()

@minimaxir (Author)

Thanks for the clarification! I can verify that script works with the aforementioned system config.

However, when switching to accelerator="ddp" in your demo script, I get the original entire-system-hang issue I saw in minimaxir/aitextgen#103. Is that expected?

@kaushikb11 (Contributor)

@minimaxir Are you trying the ddp accelerator on a Jupyter notebook?

@minimaxir (Author)

Sorry, that freeze was not in a notebook; it happened when simply running python3 script.py from a terminal in an SSH session.

@kaushikb11 (Contributor)

That's strange! You shouldn't be seeing that; I remember testing it out before sharing the script.
Let me try it again now!

@kaushikb11 (Contributor) commented Apr 10, 2021

@minimaxir It worked on my end. I have a hunch it must be something to do with the can't find '__main__' module in '/home/jupyter' error in minimaxir/aitextgen#103 (comment) on your side.

[Screenshot: Screen Shot 2021-04-11 at 3 56 53 AM]

edenlightning added this to the v1.3 milestone on Apr 27, 2021
@kaushikb11 (Contributor)

Feel free to reopen the issue if you have any related queries.
