Replies: 2 comments 1 reply
Thanks for your question @erayinanc, pinging @coquelin77 as the most qualified to answer here.
NaNs in training can come from any number of things. If it were related to the data, it would most likely surface as an error rather than as NaNs. The most common culprit is a learning rate that is too large, which can cause the loss value to explode towards NaN.
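To illustrate the learning-rate point, here is a tiny stdlib-only toy example (not the actual training setup from this thread): plain gradient descent on f(x) = x² converges for a small step size but diverges once the step size is too large, which is exactly the mechanism that drives a loss towards inf/NaN.

```python
import math

def gradient_descent(lr, steps=60, x0=1.0):
    """Minimize f(x) = x**2 with fixed-step gradient descent.

    The gradient is f'(x) = 2x, so each update is x <- x * (1 - 2*lr):
    the iterates shrink when 0 < lr < 1 and blow up otherwise.
    """
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
        if math.isinf(x) or math.isnan(x):
            break
    return x

print(gradient_descent(0.3))  # small step: converges towards 0
print(gradient_descent(1.5))  # oversized step: |x| doubles every iteration
```

The same divergence in a real network shows up as a loss that grows over a few steps and then becomes NaN, rather than as an immediate data-loading error.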
For training on an HPC system with multiple GPUs per node, I am trying to load a folder of .hdf5 files using torchvision. In plain PyTorch I can use `torchvision.datasets.DatasetFolder`, then define a sampler with `torch.utils.data.distributed.DistributedSampler` for the data loader. In HeAT, however, I was not able to find a DatasetFolder class or anything similar, so I tried to combine torchvision with HeAT directly. This approach works if each node has 1 GPU, but fails with more than that -- probably due to the torchvision part -- and I get NaNs in training.
So my question is: how should one do this properly in HeAT?
All the best,
Eray
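For context, `torch.utils.data.distributed.DistributedSampler` essentially pads the index list so every rank gets the same number of samples and then deals indices out round-robin by rank. A stdlib-only sketch of that partitioning logic (a simplified illustration, omitting the real class's shuffling and `drop_last` handling):

```python
import math

def partition_indices(dataset_len, num_replicas, rank):
    """Mimic the per-rank index partitioning of a distributed sampler.

    Indices are padded by repetition so every rank receives the same
    number of samples, then rank r takes indices r, r + R, r + 2R, ...
    where R is the number of replicas (processes/GPUs).
    """
    indices = list(range(dataset_len))
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices += indices[: total_size - len(indices)]  # pad by repeating
    return indices[rank:total_size:num_replicas]

# 10 samples over 4 ranks: each rank gets 3 indices, and together the
# ranks cover every sample (two samples appear twice due to padding).
parts = [partition_indices(10, 4, r) for r in range(4)]
print(parts)
```

Each training process constructs this with its own `rank`, so every GPU sees a disjoint (up to padding) slice of the dataset per epoch.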