Multi-GPU training fails when using GCP Deep Learning image #6812

Closed · minimaxir opened this issue Apr 3, 2021 · 7 comments

@minimaxir
🐛 Bug

Multi-GPU training fails when using the GCP Deep Learning image. The failure occurs when launching from a terminal, with the dp and ddp_spawn accelerators; it does not occur with the ddp accelerator, and it does not occur when using the same system for single-GPU training.
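
For reference, a minimal sketch (not from the original report) of the Trainer configurations being compared, using the BoringModel script from the To Reproduce section:

import pytorch_lightning as pl

# Fails: no accelerator specified, so Lightning falls back to ddp_spawn on this 4-GPU VM.
trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4)

# Also reported to fail:
trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4, accelerator="dp")

# Reported to not hit this error:
trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4, accelerator="ddp")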

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 114, in _main
    prepare(preparation_data)
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "/opt/conda/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/opt/conda/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/max/boring_model_multigpu_(2).py", line 88, in <module>
    trainer.fit(model, train, val)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in f
it
    self.dispatch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in d
ispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 
73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py"
, line 108, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 179, in start_p
rocesses
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.
        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:
            if __name__ == '__main__':
                freeze_support()
                ...
        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

To Reproduce

Create a GCP VM with the following properties:

  1. n1-highmem-2, 4 T4 GPUs
  2. Deep Learning on Linux OS, PyTorch 1.8 m110 m66 version
  3. Allow HTTP/HTTPS Traffic, Preemptible On

After SSHing into the VM and installing the CUDA drivers (you may need to run sudo /opt/deeplearning/install-driver.sh), install pytorch-lightning via pip3 install pytorch-lightning.
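
As a quick sanity check (a minimal sketch, not part of the original report), you can confirm the driver install worked and all four T4s are visible before launching training:

import torch

# Both checks should pass once the driver install has completed.
print(torch.cuda.is_available())   # expected: True
print(torch.cuda.device_count())   # expected: 4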

Then run:

import os

import torch

from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning import LightningModule

tmpdir = os.getcwd()


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x["x"] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log("fake_test_acc", loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


num_samples = 10000

train = RandomDataset(32, num_samples)
train = DataLoader(train, batch_size=32)

val = RandomDataset(32, num_samples)
val = DataLoader(val, batch_size=32)

test = RandomDataset(32, num_samples)
test = DataLoader(test, batch_size=32)


model = BoringModel()

# Initialize a trainer
trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4)

# Train the model ⚡
trainer.fit(model, train, val)

Expected behavior

Environment

  • CUDA:
    - GPU:
    - Tesla T4
    - Tesla T4
    - Tesla T4
    - Tesla T4
    - available: True
    - version: 11.1
  • Packages:
    - numpy: 1.19.5
    - pyTorch_debug: False
    - pyTorch_version: 1.8.0
    - pytorch-lightning: 1.2.6
    - tqdm: 4.59.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    -
    - processor:
    - python: 3.7.10
    - version: #1 SMP Debian 4.19.181-1 (2021-03-19)

Additional context

This was the issue I hit when debugging minimaxir/aitextgen#103; since it occurs with the BoringModel, it may not be aitextgen's fault (maybe). cc @SeanNaren

minimaxir added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 3, 2021
kaushikb11 self-assigned this on Apr 5, 2021
@kaushikb11 (Contributor)

Hi @minimaxir!

The error is not specific to the GCP Deep Learning image. As the traceback notes, you need an if __name__ == '__main__': guard in your script for the ddp_spawn accelerator to work.
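
In short, the entry point should look like the following (a minimal sketch; test_run here stands in for the code that builds the DataLoaders, model, and Trainer, as in the full script below):

def test_run():
    ...  # build the DataLoaders, the BoringModel, and the Trainer, then call trainer.fit(...)


if __name__ == '__main__':
    # ddp_spawn re-imports this module in each spawned worker process,
    # so the training entry point must only run in the main process.
    test_run()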

Required changes to your script:

import os

import torch

from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning import LightningModule

tmpdir = os.getcwd()


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x["x"] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log("fake_test_acc", loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

def test_run():
    num_samples = 10000

    train = RandomDataset(32, num_samples)
    train = DataLoader(train, batch_size=32)

    val = RandomDataset(32, num_samples)
    val = DataLoader(val, batch_size=32)

    test = RandomDataset(32, num_samples)
    test = DataLoader(test, batch_size=32)


    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4, accelerator="ddp_spawn")

    # Train the model ⚡
    trainer.fit(model, train, val)
    trainer.test(test_dataloaders=test)
    
if __name__ == '__main__':
    test_run()

@minimaxir (Author)

Thanks for the clarification! I can verify that script works with the aforementioned system config.

However, when switching to accelerator="ddp" in your demo script, I get the original entire-system-hang issue I saw in minimaxir/aitextgen#103. Is that expected?

@kaushikb11 (Contributor)

@minimaxir Are you trying the ddp accelerator on a Jupyter notebook?

@minimaxir (Author)

Sorry, that freeze was not in a notebook; it happened when simply running python3 script.py from a terminal in an SSH session.

@kaushikb11 (Contributor)

That's strange! You shouldn't be seeing that; I remember testing it out before sharing the script.
Let me try it again now!

@kaushikb11 (Contributor) commented Apr 10, 2021

@minimaxir It worked on my end. I have a hunch it must be something to do with the can't find '__main__' module in '/home/jupyter' error in minimaxir/aitextgen#103 (comment) on your side.

[Screenshot: Screen Shot 2021-04-11 at 3 56 53 AM]

edenlightning added this to the v1.3 milestone on Apr 27, 2021
@kaushikb11 (Contributor)

Feel free to reopen the issue if you have any related queries.
