Hey,

I have been using Meta's implementation of Distributed Shampoo and am seeing ~20% faster convergence of transformer-based models compared to AdamW. Simo Ryu has done some nice investigations into the advantages of Shampoo.
I am looking to use Shampoo and SOAP as optimizers in accelerate, but their current implementations introduce some breaking changes.
Focusing on Shampoo for now:
Distributed Shampoo disables `state_dict` and `load_state_dict` in favor of custom `distributed_state_dict` and `load_distributed_state_dict` methods, both of which require the model's `named_parameters()` to be passed in as an argument. More info as to why here.
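For reference, here is roughly how that checkpoint API is used on its own, outside of accelerate. This is a minimal sketch based on the upstream README; the `key_to_param` keyword and the `nn.Linear` stand-in model are assumptions on my part, so double-check against the current Distributed Shampoo source:

```python
import torch.nn as nn
from distributed_shampoo import DistributedShampoo  # Meta's optimizers repo

model = nn.Linear(16, 16)  # stand-in model for illustration
optimizer = DistributedShampoo(model.parameters(), lr=1e-3)

# Saving: the optimizer state is keyed by parameter *names*, so
# named_parameters() has to be passed in explicitly; plain state_dict()
# raises upstream.
checkpoint = {
    "model": model.state_dict(),
    "optim": optimizer.distributed_state_dict(key_to_param=model.named_parameters()),
}

# Loading mirrors this; plain load_state_dict() is likewise disabled.
optimizer.load_distributed_state_dict(
    checkpoint["optim"], key_to_param=model.named_parameters()
)
```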
I have a hacky commit here to patch accelerate/optimizers. However, I am still forced to bypass `accelerate.save()` and use `dist_checkpoint.save_state_dict()` directly, since the optimizer entry in the state dict needs access to the model's `named_parameters()`.
You can see this here in my e2-tts training code. I am able to save the model weights but am not yet able to load them again when using accelerate. This is where I am currently stuck.
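In other words, the bypass looks roughly like the following. This is a sketch only: the checkpoint path and stand-in model are placeholders, and the `key_to_param` keyword is again assumed from the Distributed Shampoo README.

```python
import torch.distributed.checkpoint as dist_checkpoint
import torch.nn as nn
from torch.distributed.checkpoint import FileSystemReader, FileSystemWriter
from distributed_shampoo import DistributedShampoo

model = nn.Linear(16, 16)  # stand-in model for illustration
optimizer = DistributedShampoo(model.parameters(), lr=1e-3)

# Save: build the state dict by hand so the optimizer entry can come from
# distributed_state_dict() with the model's named_parameters().
state = {
    "model": model.state_dict(),
    "optim": optimizer.distributed_state_dict(key_to_param=model.named_parameters()),
}
dist_checkpoint.save_state_dict(state, storage_writer=FileSystemWriter("ckpt/"))

# Load: torch.distributed.checkpoint loads *in place* into a pre-built state
# dict of the same shape, and the optimizer half then still has to go through
# load_distributed_state_dict(); this is the step I can't route through
# accelerate.load_state() yet.
dist_checkpoint.load_state_dict(state, storage_reader=FileSystemReader("ckpt/"))
model.load_state_dict(state["model"])
optimizer.load_distributed_state_dict(state["optim"], key_to_param=model.named_parameters())
```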
Also, since I don't have access to `named_parameters()` until `accelerate.prepare_model()` is called, the Shampoo optimizer needs to be defined inside the model definition, which makes it awkward to switch between optimizers, see here.
Ideally, I'd be able to do something like this, where I pass in the optimizer just as I can with AdamW.
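Concretely, the call pattern I'm after is something like the sketch below. This is desired usage, not working code today; the `partial`-based factory is just one possible way to defer constructing the optimizer until after the model has been prepared.

```python
from functools import partial

import torch.nn as nn
from accelerate import Accelerator
from distributed_shampoo import DistributedShampoo

accelerator = Accelerator()
model = nn.Linear(16, 16)  # stand-in model for illustration

# What already works with AdamW: build the optimizer up front and hand both
# to accelerator.prepare(model, optimizer).
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# What I'd like for Shampoo: the same thing, or at worst a deferred factory
# so the optimizer is only built once prepare_model() has run.
make_optimizer = partial(DistributedShampoo, lr=1e-3)

model = accelerator.prepare_model(model)
optimizer = make_optimizer(model.parameters())
optimizer = accelerator.prepare_optimizer(optimizer)
```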
Of course, when I set everything up with plain torch DDP instead of accelerate, everything works as intended :/
What would be the best approach for accelerate to support these custom optimizers (ones that are not part of torch)? My current plan is to write a `ShampooPlugin` along the lines of the `DeepSpeedPlugin`, but it would be nice if the Shampoo optimizer could be detected automatically without having to change the accelerate config. I am willing to put in the work to solve this so more projects can benefit from using these new optimizers with accelerate. A rough sketch of what I'm imagining is below.
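To make the plugin idea concrete, here is the shape I have in mind. It is entirely hypothetical: none of these names or fields exist in accelerate today, and the real `DeepSpeedPlugin` is only the inspiration.

```python
from dataclasses import dataclass, field


@dataclass
class ShampooPlugin:
    """Hypothetical plugin telling Accelerator that the wrapped optimizer uses
    distributed_state_dict()/load_distributed_state_dict() instead of the
    standard torch.optim state-dict protocol."""

    precondition_frequency: int = 100
    max_preconditioner_dim: int = 1024
    extra_kwargs: dict = field(default_factory=dict)

    def state_dict(self, optimizer, model):
        # Delegate to the Shampoo-specific API, passing named_parameters().
        return optimizer.distributed_state_dict(key_to_param=model.named_parameters())

    def load_state_dict(self, optimizer, model, state_dict):
        optimizer.load_distributed_state_dict(
            state_dict, key_to_param=model.named_parameters()
        )
```

`Accelerator.save_state()`/`load_state()` would then consult the plugin, much like they special-case DeepSpeed engines today. Whether it should be a plugin at all, or just an `isinstance`/duck-typing check on the optimizer so no config change is needed, is exactly the design question I'm asking.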
Any guidance would be much appreciated. :)
That optimizer lacks other torch-specific expectations as well, and, as with learning rate schedulers, some off-the-wall optimizers just don't work without modification. The distributed ZeRO Shampoo optimizer, for example, is a WIP technical prototype and not meant to be used in production.
Other problems with the linked optimizer implementations are that they do not work with torch.compile, and they have very slow performance (even worse in SOAP) and high memory overhead (also worse in SOAP), even with ZeRO offload.