Why is axolotl reporting near-zero GPU memory usage while training? (FSDP Llama 3.1 8B Liger example) #2284
Unanswered
mashdragon asked this question in Q&A
Replies: 1 comment · 2 replies
-
Hey, the number does seem very unusual, even though it's at the start of training. The cached memory does show an increase, though. We pull data from …
High RAM use could be due to FSDP offloading + Liger offload.
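For what it's worth, a quick way to sanity-check a near-zero report is to compare PyTorch's "allocated" counter against the "reserved" (cached) counter yourself. This is a generic PyTorch sketch, not axolotl's own logging code, and it assumes the logged figure tracks the allocated counter:

```python
import torch

# Generic PyTorch sketch (not axolotl's logging code): compare the
# "allocated" counter with the "reserved" (cached) counter per GPU.
# With FSDP CPU offload, allocated memory can look near zero between
# steps even though the allocator cache stays large.
gib = 2**30
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / gib
    reserved = torch.cuda.memory_reserved(i) / gib
    peak = torch.cuda.max_memory_allocated(i) / gib
    print(f"cuda:{i}: allocated={allocated:.2f} GiB, "
          f"reserved={reserved:.2f} GiB, peak allocated={peak:.2f} GiB")
```

If reserved (and nvidia-smi) look reasonable while allocated is tiny, the log line is most likely sampling the allocated counter at a quiet moment rather than showing the GPUs being underused.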
-
This is what I see in the logs. Is it accurate? I'm curious why so little memory is being used for training.
My swap space is getting hammered (2× 24 GB GPUs, 96 GB RAM, 60 GB swap).
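For a rough sense of why swap gets hit, here is a back-of-envelope estimate assuming FSDP offloads the bf16 weights plus fp32 master weights and Adam moments to host RAM; whether all of that is actually offloaded depends on the exact config, so treat it as an upper bound:

```python
# Back-of-envelope host-RAM estimate for CPU-offloaded FSDP training of an
# 8B-parameter model; assumes bf16 weights plus fp32 master weights and
# Adam moments all live in host memory (the exact split depends on config).
params = 8e9                  # Llama 3.1 8B, approximate parameter count
bf16_weights = params * 2     # offloaded bf16 parameter copy
fp32_master = params * 4      # fp32 master weights held by the optimizer
adam_moments = params * 8     # exp_avg + exp_avg_sq in fp32
total_gib = (bf16_weights + fp32_master + adam_moments) / 2**30
print(f"~{total_gib:.0f} GiB of host memory")  # roughly 104 GiB
# That already exceeds 96 GB of RAM before dataloader workers, tokenized
# dataset caches, and any activation offload are counted, so heavy use of
# a 60 GB swap partition is plausible.
```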