
Distributed training performance slowdown when resuming from a checkpoint. #184

Open · subhashbylaiah opened this issue Jul 22, 2022 · 5 comments

subhashbylaiah commented Jul 22, 2022

I am using ray_lightning to distribute training across an 8-node Ray cluster with GPUs. I am seeing training performance slow down significantly (by a factor of 2-3) when resuming from a checkpoint. A fresh training run takes an average of 35 minutes per epoch, but restarting from a previous checkpoint takes over 90 minutes per epoch. This behavior is consistent across runs. I am using the CLI to submit the job to the remote Ray cluster.
To isolate the problem, I also ran multi-node distributed training using plain PyTorch Lightning, and the slowdown did not occur there: resuming from a checkpoint took about the same time as a fresh training run.
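That control run used the same model and data; the trainer was configured roughly like this (a minimal sketch with illustrative flag values, not the exact script):

```python
import pytorch_lightning as pl

# Control experiment: vanilla PyTorch Lightning DDP across nodes, no ray_lightning.
# Flag values are illustrative; the actual configuration is in the linked gist.
trainer = pl.Trainer(
    strategy="ddp",
    num_nodes=8,
    gpus=1,        # GPUs per node
    precision=16,
    max_epochs=10,
)
```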

I have the code here to reproduce the example, along with instructions to run it.

https://gist.github.com/subhashbylaiah/6b403339cfaf619c59af403f9740bf29

From my analysis, which I have also shared in the reproduction notes, the cause of the issue appears to be associated with the precision of the input images:

  • When the input tensors are float16, performance slows down when resuming from a prior checkpoint (no issue when training from scratch).
  • When the input tensors are float32, performance is good whether resuming from a checkpoint or training from scratch.

BTW, the trainer precision is still fp16 in both cases.
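For context, the ray_lightning run is configured roughly like this (a sketch with illustrative worker counts; `build_trainer` is just a placeholder helper, and the full script is in the gist above):

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin

def build_trainer(resume_ckpt=None):
    # Roughly how the distributed trainer is set up in the repro.
    # Precision stays at 16 for both fresh and resumed runs.
    plugin = RayPlugin(num_workers=8, num_cpus_per_worker=4, use_gpu=True)
    return pl.Trainer(
        max_epochs=10,
        precision=16,
        plugins=[plugin],
        resume_from_checkpoint=resume_ckpt,  # None for a fresh run
    )
```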

Library versions

ray==1.12.1
ray_lightning==0.2.0
pytorch_lightning==1.5.10
pytorch==1.11.0
subhashbylaiah changed the title from "Training performance slowdown when resuming from a checkpoint and distributing training over a multi-node GPU ray cluster." to "Distributed training performance slowdown when resuming from a checkpoint." on Jul 22, 2022
@JiahaoYao (Contributor)

Hi @subhashbylaiah, I see the assumption here is that the model uses float16 as the model datatype.

In the DDP source code, it uses torch.float (see https://pytorch.org/docs/stable/tensors.html#data-types). You can find it here:

# From DDPSpawn.get_queue
self.lightning_module.trainer.callback_metrics.update(
    apply_to_collection(callback_metrics, np.ndarray, lambda x: torch.tensor(x))
)
# Same for logged_metrics
self.lightning_module.trainer.logged_metrics.update(
    apply_to_collection(logged_metrics, np.ndarray, lambda x: torch.tensor(x))
)
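Note that torch.tensor infers the dtype from its input, so what dtype these metric tensors end up with depends on what is pushed into callback_metrics. A quick illustration (not from the repro itself):

```python
import numpy as np
import torch

# torch.tensor() keeps the NumPy dtype: a float16 ndarray stays float16,
# while a plain Python float becomes torch.float32 (i.e. torch.float).
print(torch.tensor(np.array(1.0, dtype=np.float16)).dtype)  # torch.float16
print(torch.tensor(1.0).dtype)                              # torch.float32
```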

I am going to test this assumption and will keep you posted here.

@JiahaoYao (Contributor)

I do see this:
[Screenshot: Screen Shot 2022-07-26 at 11 43 55 AM]

@JiahaoYao (Contributor)

On the other hand, using plain DDP, there is no extra memory:

[Screenshot]

@JiahaoYao (Contributor)

@amogkam, my current guess for this issue is as follows.

The trainer is using the delayed GPU accelerator, and the checkpoint is a GPU checkpoint.

When resuming, it loads the GPU checkpoint, and the slowdown might also be due to loading the GPU checkpoint onto the CPU and then moving it to the GPU.
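One way to check this (just a sketch; the checkpoint path is a placeholder) is to load the checkpoint onto the CPU and look at what it contains:

```python
import torch

# map_location="cpu" avoids materializing the tensors on the GPU they were
# saved from; then inspect which dtypes are stored in the checkpoint.
ckpt = torch.load("path/to/last.ckpt", map_location="cpu")
print({v.dtype for v in ckpt["state_dict"].values()})
```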

@subhashbylaiah (Author)

> @amogkam, my current guess for this issue is as follows.
>
> The trainer is using the delayed GPU accelerator, and the checkpoint is a GPU checkpoint.
>
> When resuming, it loads the GPU checkpoint, and the slowdown might also be due to loading the GPU checkpoint onto the CPU and then moving it to the GPU.

Thanks @JiahaoYao for checking on this issue. Can you please confirm if you are able to reproduce this issue with the example code?
