CUDA Memory Issues #227
Replies: 2 comments
-
I've experimented with using ESM-1b for classification by adding an additional output layer and fine-tuning all the weights, and I found that a batch size above 2 or 3 sequences would use up all the memory in Google Colab. My eventual solution was to use a server with more GPU memory, but depending on what you're doing, could you freeze some of the model's layers to reduce the number of trainable parameters? Some other alternatives could be model parallelism (which you may have tried already via FairScale) or gradient checkpointing.
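A minimal sketch of the layer-freezing idea, assuming the facebookresearch/esm package. The attribute name `model.layers`, the number of unfrozen blocks, and the classification head are illustrative and may need adjusting to your setup:

```python
import torch
import torch.nn as nn
import esm

# Load ESM-1b (33 layers, 1280-dim representations).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()

# Freeze everything first, then unfreeze only the top N transformer blocks.
for param in model.parameters():
    param.requires_grad = False

N_TRAINABLE = 4  # hypothetical: tune to fit your GPU memory budget
for layer in model.layers[-N_TRAINABLE:]:
    for param in layer.parameters():
        param.requires_grad = True

# Hypothetical classification head on top of the 1280-dim ESM-1b representations.
num_classes = 2
classifier = nn.Linear(1280, num_classes)

# Only trainable parameters go to the optimizer, so gradients (and optimizer
# state) are kept for a small fraction of the 650M weights.
optimizer = torch.optim.Adam(
    [p for p in list(model.parameters()) + list(classifier.parameters()) if p.requires_grad],
    lr=1e-4,
)
```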
-
Your model is too big to fit into GPU memory. One solution I usually use is to reduce the batch size, or the max token length when working with language models. If the problem still persists, try distributed training with model or data parallelism; it can improve your training time dramatically. I hope I was helpful :)
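A minimal sketch of the batch-size / max-token-length reduction suggested above, assuming sequences are tokenized with the batch converter from facebookresearch/esm; the cap values and the example sequences are illustrative:

```python
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()

MAX_LEN = 512     # hypothetical cap; ESM-1b accepts up to 1022 residues
BATCH_SIZE = 2    # small batches keep peak activation memory low

data = [("seq1", "MKTAYIAKQR" * 80), ("seq2", "GAVLIPFMWST" * 60)]
# Truncate long sequences before tokenization so the attention matrices stay small.
data = [(name, seq[:MAX_LEN]) for name, seq in data]

for i in range(0, len(data), BATCH_SIZE):
    labels, strs, tokens = batch_converter(data[i:i + BATCH_SIZE])
    out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]  # (batch, seq_len, 1280)
    # In a fine-tuning loop you would compute a loss on these representations
    # and call backward() here.
```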
-
I'm running into OOM errors whenever I try to fine-tune ESM-1 with PyTorch DistributedDataParallel. I'm currently trying to use FairScale to help with the memory issues, but it doesn't seem to be enough. Does anyone have any other solutions?