
Train on multiple GPUs with nn.DataParallel() #99

Open
delitroz opened this issue Oct 27, 2021 · 1 comment


delitroz commented Oct 27, 2021

Hello,

I have an issue trying to train an INN on multiple GPUs using nn.DataParallel().
I recreated a minimal example of the problem from the example at the beginning of the README. When the INN is built with Ff.SequenceINN() (commented out in the code below), everything works as intended and I can see all of my GPUs being loaded. But when it is built with Ff.ReversibleGraphNet(), I get an error saying that there is a mismatch between several tensors' devices (note: this construction works fine on a single GPU).
Since I am planning to implement a convolutional conditional INN, I will need something more expressive than SequenceINN(), i.e. a graph that supports splitting.

Has anyone had success training such a model on multiple GPUs?
Thanks.

import torch
import torch.nn as nn
from sklearn.datasets import make_moons

import FrEIA.framework as Ff
import FrEIA.modules as Fm

from tqdm import trange

BATCHSIZE = 1000
N_DIM = 2

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def subnet_fc(dims_in, dims_out):
    return nn.Sequential(nn.Linear(dims_in, 512), nn.ReLU(),
                         nn.Linear(512,  dims_out))

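# The SequenceINN construction below (commented out) works as intended with nn.DataParallel: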
# inn = Ff.SequenceINN(N_DIM)
# for k in range(8):
#     inn.append(Fm.AllInOneBlock, subnet_constructor=subnet_fc, permute_soft=True)

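# The equivalent ReversibleGraphNet construction raises a tensor device mismatch when wrapped in nn.DataParallel: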
nodes = []
nodes.append(Ff.InputNode(2))
for k in range(8):
    nodes.append(Ff.Node(nodes[-1],
                Fm.AllInOneBlock,
                {'subnet_constructor':subnet_fc, 'permute_soft':True}))
nodes.append(Ff.OutputNode(nodes[-1]))
inn = Ff.ReversibleGraphNet(nodes)

inn = nn.DataParallel(inn)
inn = inn.to(device)

optimizer = torch.optim.Adam(inn.parameters(), lr=0.001)

for i in trange(1000):
    optimizer.zero_grad()
    data, label = make_moons(n_samples=BATCHSIZE, noise=0.05)
    x = torch.Tensor(data).to(device)
    z, log_jac_det = inn(x)
    loss = 0.5*torch.sum(z**2, 1) - log_jac_det
    loss = loss.mean() / N_DIM
    loss.backward()
    optimizer.step()

z = torch.randn(100, N_DIM).to(device)
samples, _ = inn(z, rev=True)
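
One alternative that might sidestep the replication issue (untested here) is torch.nn.parallel.DistributedDataParallel, which runs one process per GPU instead of replicating the module inside every forward() call the way nn.DataParallel does. Below is a minimal sketch under that assumption; it reuses subnet_fc, the Ff/Fm imports, make_moons, BATCHSIZE and N_DIM from the script above, and build_inn just wraps the same graph construction in a helper.

import os
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def build_inn():
    # same ReversibleGraphNet construction as above
    nodes = [Ff.InputNode(2)]
    for k in range(8):
        nodes.append(Ff.Node(nodes[-1],
                    Fm.AllInOneBlock,
                    {'subnet_constructor': subnet_fc, 'permute_soft': True}))
    nodes.append(Ff.OutputNode(nodes[-1]))
    return Ff.ReversibleGraphNet(nodes)

def train(rank, world_size):
    # one process per GPU; NCCL backend for GPU collectives
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{rank}")
    inn = build_inn().to(device)
    inn = DDP(inn, device_ids=[rank])

    optimizer = torch.optim.Adam(inn.parameters(), lr=0.001)
    for i in range(1000):
        optimizer.zero_grad()
        # each worker draws its own share of the batch
        data, label = make_moons(n_samples=BATCHSIZE // world_size, noise=0.05)
        x = torch.Tensor(data).to(device)
        z, log_jac_det = inn(x)
        loss = (0.5 * torch.sum(z**2, 1) - log_jac_det).mean() / N_DIM
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)

Note that with mp.spawn the top-level training loop above would have to move inside the __main__ guard (or into functions), since each worker process re-imports the script.
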
@shakediel

Did you manage to find a solution?
