Bug description
I'm using PyTorch Lightning DDP training with batch size = 16 on 8 GPUs per node * 2 nodes = 16 GPUs in total. However, I get the following error, which happens in the ModelCheckpoint callback. There seems to be an error during synchronization between nodes when saving the model checkpoint. When I decreased the batch size to 4, the error disappeared. Can anyone help me?
Stack:
[rank2]: Traceback (most recent call last):
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/projects/e2e_aeb/main.py", line 130, in <module>
[rank2]: main()
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/projects/e2e_aeb/main.py", line 121, in main
[rank2]: runner.train(resume_from=ckpt_path)
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/projects/e2e_aeb/flow/runner/xflow_runner.py", line 38, in train
[rank2]: self.trainer.fit(
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/xflow/xflow/lightning/trainer/xflow_trainer.py", line 356, in fit
[rank2]: super().fit(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
[rank2]: call._call_and_handle_interrupt(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank2]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank2]: return function(*args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
[rank2]: self._run(model, ckpt_path=ckpt_path)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
[rank2]: results = self._run_stage()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
[rank2]: self.fit_loop.run()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 206, in run
[rank2]: self.on_advance_end()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 378, in on_advance_end
[rank2]: call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 210, in _call_callback_hooks
[rank2]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 323, in on_train_epoch_end
[rank2]: self._save_topk_checkpoint(trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 383, in _save_topk_checkpoint
[rank2]: self._save_monitor_checkpoint(trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 703, in _save_monitor_checkpoint
[rank2]: self._update_best_and_save(current, trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 732, in _update_best_and_save
[rank2]: filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer, del_filepath)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 661, in _get_metric_interpolated_filepath_name
[rank2]: while self.file_exists(filepath, trainer) and filepath != del_filepath:
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 774, in file_exists
[rank2]: return trainer.strategy.broadcast(exists)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
[rank2]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2636, in broadcast_object_list
[rank2]: object_tensor = torch.empty( # type: ignore[call-overload]
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.
What version are you seeing the problem on?
v2.3
How to reproduce the bug
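The exact model, data, and launch setup are not in the report, so the sketch below is a hypothetical stand-in built from Lightning's BoringModel/RandomDataset demo classes; every name in it (the `val_loss` metric, dataset sizes, checkpoint settings) is a placeholder. The parts that matter for this bug are the 2-node x 8-GPU DDP configuration and a ModelCheckpoint that monitors a logged metric, since that is what reaches the `file_exists` broadcast in the traceback above:

```python
# Hypothetical reproduction sketch (placeholder model/data; the report does not
# include the real ones). Run the script on each of the two nodes with the usual
# multi-node environment (MASTER_ADDR/MASTER_PORT/NODE_RANK set, or a cluster
# launcher such as SLURM or torchrun); Lightning spawns the per-GPU processes.
import torch
from torch.utils.data import DataLoader

import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.demos.boring_classes import BoringModel, RandomDataset


class DemoModel(BoringModel):
    def validation_step(self, batch, batch_idx):
        # Log a metric for ModelCheckpoint to monitor; "val_loss" is a
        # placeholder and must match whatever the real module logs.
        preds = self(batch)
        loss = torch.nn.functional.mse_loss(preds, torch.zeros_like(preds))
        self.log("val_loss", loss, sync_dist=True)
        return loss


def main() -> None:
    train_data = DataLoader(RandomDataset(32, 6400), batch_size=16)  # batch size from the report
    val_data = DataLoader(RandomDataset(32, 640), batch_size=16)

    checkpoint_cb = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=3)

    trainer = L.Trainer(
        accelerator="gpu",
        devices=8,        # 8 GPUs per node
        num_nodes=2,      # 2 nodes -> 16 ranks in total
        strategy="ddp",
        callbacks=[checkpoint_cb],
        max_epochs=10,
    )
    trainer.fit(DemoModel(), train_dataloaders=train_data, val_dataloaders=val_data)


if __name__ == "__main__":
    main()
```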
Error messages and logs
See the traceback above.
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response