Multi-GPU training fails when using GCP Deep Learning image #6812
Comments
Hi @minimaxir! The error is not specific to the GCP Deep Learning Image. As mentioned in the trace, some changes to your script are required:
import os
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning import LightningModule

tmpdir = os.getcwd()


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x["x"] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log("fake_test_acc", loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def test_run():
    num_samples = 10000

    train = RandomDataset(32, num_samples)
    train = DataLoader(train, batch_size=32)
    val = RandomDataset(32, num_samples)
    val = DataLoader(val, batch_size=32)
    test = RandomDataset(32, num_samples)
    test = DataLoader(test, batch_size=32)

    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(max_epochs=1, progress_bar_refresh_rate=20, gpus=4, accelerator="ddp_spawn")

    # Train the model ⚡
    trainer.fit(model, train, val)
    trainer.test(test_dataloaders=test)


if __name__ == '__main__':
    test_run()
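For context on why the script ends with an entry-point guard (my reading of the fix being pointed at, not stated verbatim in this copy of the thread): with `accelerator="ddp_spawn"`, Lightning launches one worker process per GPU via `torch.multiprocessing`, and the spawn start method re-imports the launching module in each child, so any unguarded module-level call to `test_run()` would run again in every worker. A minimal sketch of the spawn-safe pattern, under that assumption:

```python
# Sketch of the spawn-safe pattern (assumption: this is the change the trace asks for).

# Problematic with ddp_spawn: a bare module-level call runs again in every
# spawned worker when the module is re-imported.
# test_run()

# Spawn-safe: only the launching process enters test_run(); the workers that
# torch.multiprocessing spawns import this module without re-running training.
if __name__ == "__main__":
    test_run()
```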
Thanks for the clarification! I can verify that script works with the aforementioned system config. However, when switching to […]
@minimaxir Are you trying the […]
Sorry, that freeze was not in a notebook, just by running […]
That's strange! You shouldn't be; I remember testing it out before sharing the script.
@minimaxir It worked on my end; I have a hunch it must be something to do with […]
Feel free to reopen the issue if you have any related queries.
🐛 Bug
Multi-GPU training fails when using the GCP Deep Learning image. Occurs when using the terminal. Occurs with the `dp` and `ddp_spawn` accelerators; does not occur with the `ddp` accelerator. Does not occur when using the same system for single-GPU training.
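For reference, a minimal sketch of the three accelerator settings being compared (these lines only illustrate the Lightning 1.2-era flags named above, not the reporter's actual scripts):

```python
import pytorch_lightning as pl

# Reported to fail on this image: single-process DataParallel and spawn-based DDP.
trainer_dp = pl.Trainer(max_epochs=1, gpus=4, accelerator="dp")
trainer_ddp_spawn = pl.Trainer(max_epochs=1, gpus=4, accelerator="ddp_spawn")

# Reported to work: script-launched DDP (one subprocess per GPU).
trainer_ddp = pl.Trainer(max_epochs=1, gpus=4, accelerator="ddp")
```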
To Reproduce
Create a GCP VM w/ the following properties:
After SSHing into the system and installing CUDA drivers (may need to run `sudo /opt/deeplearning/install-driver.sh`), install pytorch-lightning via `pip3 install pytorch-lightning`. Then run:
Expected behavior
Environment
- GPU:
  - Tesla T4
  - Tesla T4
  - Tesla T4
  - Tesla T4
- available: True
- version: 11.1
- numpy: 1.19.5
- pyTorch_debug: False
- pyTorch_version: 1.8.0
- pytorch-lightning: 1.2.6
- tqdm: 4.59.0
- OS: Linux
- architecture:
  - 64bit
  -
- processor:
- python: 3.7.10
- version: #1 SMP Debian 4.19.181-1 (2021-03-19)
Additional context
This was the issue I hit when debugging minimaxir/aitextgen#103; since it occurs with the BoringModel, it may not be aitextgen's fault (maybe). cc @SeanNaren