
Gradient checkpointing #78

Merged
merged 7 commits into ai-forever:master on Dec 17, 2021

Conversation

@neverix (Contributor) commented Dec 5, 2021

This patch enables gradient checkpointing for ruDALLE.

It makes it possible to use up to 3x larger batch sizes in memory-limited environments during training.

Passing the gradient_checkpointing argument to model.forward makes a checkpoint every gradient_checkpointing layers; 6 is a good starting value.
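
A minimal sketch of the idea, not the actual ruDALLE implementation: the function and variable names (`run_layers`, `make_chunk`, `hidden_states`) are illustrative, and only `torch.utils.checkpoint` is assumed. Activations inside each chunk of `gradient_checkpointing` layers are recomputed during the backward pass instead of being stored.

```python
# Illustrative sketch only: checkpoint every `gradient_checkpointing`
# transformer layers with torch.utils.checkpoint.
from torch.utils.checkpoint import checkpoint


def run_layers(layers, hidden_states, gradient_checkpointing=6):
    """Apply `layers` in order, recomputing activations inside each chunk of
    `gradient_checkpointing` layers during backward instead of storing them."""
    if not gradient_checkpointing:
        # No checkpointing: every layer's activations stay in memory.
        for layer in layers:
            hidden_states = layer(hidden_states)
        return hidden_states

    def make_chunk(chunk):
        # Wrap a slice of layers into a single callable for checkpoint().
        def run_chunk(x):
            for layer in chunk:
                x = layer(x)
            return x
        return run_chunk

    for start in range(0, len(layers), gradient_checkpointing):
        chunk = layers[start:start + gradient_checkpointing]
        # Only the chunk's input is saved; its internal activations are
        # recomputed in the backward pass, which lowers peak memory.
        hidden_states = checkpoint(make_chunk(chunk), hidden_states)
    return hidden_states
```

This is the usual memory/compute trade-off: checkpointing saves memory at the cost of re-running the forward pass for each chunk during backward, which is what allows the larger batch sizes mentioned above.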

@neverix (Contributor, Author) commented Dec 14, 2021

:(

@shonenkov (Collaborator) commented:

@neverix, we have gradient checkpointing in the main training pipeline, but your version doesn't support DeepSpeed, so I can't merge your code right now; I'm wary of breaking compatibility between our internal and open-source models :(

I suggest creating a new branch with your version of gradient checkpointing, what do you think?

@neverix (Contributor, Author) commented Dec 15, 2021

What changes are needed for compatibility with DeepSpeed? Having this included in the default install would be useful for notebooks; if that's not possible, we should look for other solutions.

@shonenkov merged commit a6e01de into ai-forever:master on Dec 17, 2021