GPU memory usage for training #24

Vincent-luo · 2024-07-18T09:18:51Z

I have a question about the GPU memory usage for model training. I'm using a V100 32GB GPU, but I'm encountering "CUDA out of memory" errors when training for the first stage with default setting. This happens even when I set the gradient_accumulation_steps to 1. I would like to know how much VRAM is really needed for model training. I'm not sure if there's something wrong in my setup because your paper mentions that you also used V100 GPUs for training.

lhd777 · 2024-07-19T08:47:31Z

Megactor requires just 32GB of VRAM for training. In fact, our experimental setup during training consisted of 8 V100 GPUs. If you encounter a GPU Out of Memory situation, there could be several reasons for this, such as other processes occupying the GPU.
The simplest way to train with fewer memory issues is to reduce the video length:

https://github.com/megvii-research/megactor/blob/main/configs/train/train_stage1.yaml#L61

Or you can turn off the motion layer in your 2D traing stage, and then turn on the motion layer in 3D training stage (The open-source version is a little bit differnce from our paper, because we find it's also ok for training 2D & 3D at the same time. Your can train megactor on your favorite.):

https://github.com/megvii-research/megactor/blob/main/configs/train/train_stage1.yaml#L22

Vincent-luo · 2024-07-19T15:02:31Z

Thanks for your quick reply! I'm not very familiar with the deepspeed setting, should I uncomment these lines, it seems the training doesn't use mixed_precision: fp16

megactor/configs/accelerate_deepspeed.yaml

Lines 23 to 35 in 16e7cdf

    
           # downcast_bf16: 'no' 
        
           # fsdp_config: {} 
        
           # gpu_ids: 0,1,2,3 
        
           # machine_rank: 0 
        
           # main_process_ip: null 
        
           # main_process_port: null 
        
           # main_training_function: main 
        
           # mixed_precision: fp16 
        
           # num_machines: 1 
        
           num_processes: 4 
        
           # rdzv_backend: static 
        
           # same_network: true 
        
           # use_cpu: false

YZX-codesky · 2024-09-30T01:25:12Z

Hello, have you succeeded in replicating? When I was processing the data set, there was a problem in the fourth part, the size of the generated swapped.mp4 videos are all 0, can you share the videos you generated?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU memory usage for training #24

GPU memory usage for training #24

Vincent-luo commented Jul 18, 2024

lhd777 commented Jul 19, 2024 •

edited

Loading

Vincent-luo commented Jul 19, 2024

YZX-codesky commented Sep 30, 2024

GPU memory usage for training #24

GPU memory usage for training #24

Comments

Vincent-luo commented Jul 18, 2024

lhd777 commented Jul 19, 2024 • edited Loading

Vincent-luo commented Jul 19, 2024

YZX-codesky commented Sep 30, 2024

lhd777 commented Jul 19, 2024 •

edited

Loading