
GPU memory increases until overflow when using PSNR and SSIM #2597

Closed
@ouioui199

Description

🐛 Bug

Hello all,

I'm implementing CycleGAN with Lightning and using PSNR and SSIM from TorchMetrics for evaluation.
During training, my GPU memory grows continuously until it overflows and the whole training run shuts down.
This might be similar to #2481.

To Reproduce

Add this to the __init__ method of the model class (imports included for completeness):

from torchmetrics import MetricCollection
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure

self.train_metrics = MetricCollection({"PSNR": PeakSignalNoiseRatio(), "SSIM": StructuralSimilarityIndexMeasure()})
self.valid_metrics = self.train_metrics.clone(prefix='val_')

In the training_step method:

train_metrics = self.train_metrics(fake, real)

In the validation_step method:

valid_metrics = self.valid_metrics(fake, real)
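For context, a minimal sketch of how these pieces fit together in a LightningModule (the CycleGANModel class name, the single generator attribute, and the batch layout are placeholder assumptions, not the actual code):

import torch
import pytorch_lightning as pl
from torchmetrics import MetricCollection
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure

class CycleGANModel(pl.LightningModule):  # hypothetical minimal repro
    def __init__(self, generator: torch.nn.Module):
        super().__init__()
        self.generator = generator
        self.train_metrics = MetricCollection(
            {"PSNR": PeakSignalNoiseRatio(), "SSIM": StructuralSimilarityIndexMeasure()}
        )
        self.valid_metrics = self.train_metrics.clone(prefix="val_")

    def training_step(self, batch, batch_idx):
        source, real = batch
        fake = self.generator(source)  # autograd is enabled here, so `fake` has a grad_fn
        train_metrics = self.train_metrics(fake, real)  # returned values keep that grad_fn
        self.log_dict(train_metrics)
        # ... compute and return the GAN losses here

    def validation_step(self, batch, batch_idx):
        source, real = batch
        fake = self.generator(source)  # gradients are disabled here, so no grad_fn
        valid_metrics = self.valid_metrics(fake, real)
        self.log_dict(valid_metrics)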

Environment

  • TorchMetrics version: 1.3.0 installed via pip
  • Python: 3.11.7
  • PyTorch: 2.1.2
  • Issue encountered when training on Windows 10

Easy fix proposal

I tried debugging the code.
When inspecting train_metrics, I get this:

"{'PSNR': tensor(10.5713, device='cuda:0', grad_fn=<SqueezeBackward0>), 'SSIM': tensor(0.0373, device='cuda:0', grad_fn=<SqueezeBackward0>)}"

which is odd, because metric values are not supposed to be attached to the computational graph.
When inspecting valid_metrics, I don't see any grad_fn (gradients are disabled during validation).
Guessing that this was the issue, I tried calling fake.detach() when computing train_metrics.
Now training is stable and the GPU memory no longer grows.
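
For reference, a minimal sketch of the workaround applied to training_step (same placeholder names as in the sketch above; the GAN losses would still be computed from the attached fake, so backpropagation is unaffected):

def training_step(self, batch, batch_idx):
    source, real = batch
    fake = self.generator(source)
    # Detach before updating the metrics so the returned values carry no
    # grad_fn and the autograd graph can be freed after each step.
    train_metrics = self.train_metrics(fake.detach(), real)
    self.log_dict(train_metrics)
    # ... compute and return the GAN losses from the attached `fake` here

Detaching only for the metric update leaves the loss path untouched, since metrics are for monitoring and never need gradients.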

Labels

bug / fix (Something isn't working) · question (Further information is requested) · v1.3.x
