
v0.22.0: Distributed operation framework, Gradient Accumulation enhancements, FSDP enhancements, and more!

Released by @muellerzr · 23 Aug 06:26

Experimental distributed operations checking framework

A new framework has been introduced that can catch mismatched distributed operations before they trigger timeout errors. Because it adds a small amount of overhead, it is opt-in: run your code with ACCELERATE_DEBUG_MODE="1" to enable it. Introduced via #1756; read more in the docs.
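
As a minimal sketch of opting in from Python (the flag is normally set on the command line or in the environment before launching), with illustrative training-script contents:

```python
import os

# Opt in before the Accelerator is created; equivalent to running e.g.
#   ACCELERATE_DEBUG_MODE="1" accelerate launch train.py
os.environ["ACCELERATE_DEBUG_MODE"] = "1"

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# With debug mode enabled, mismatched distributed operations (for example,
# gathering tensors whose shapes differ across processes) are flagged with a
# clear error up front instead of surfacing later as an opaque timeout.
tensor = torch.ones(4, device=accelerator.device)
gathered = accelerator.gather(tensor)
```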

Accelerator.load_state can now load the most recent checkpoint automatically

If a ProjectConfiguration has been set up, calling accelerator.load_state() with no arguments will now automatically find and load the most recent checkpoint. Introduced via #1741.
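
A minimal sketch of the new behaviour, assuming a ProjectConfiguration with automatic checkpoint naming and an illustrative project directory name:

```python
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

# "my_project" is an illustrative directory name.
project_config = ProjectConfiguration(
    project_dir="my_project",
    automatic_checkpoint_naming=True,
)
accelerator = Accelerator(project_config=project_config)

# During training, each call writes a new numbered checkpoint folder.
accelerator.save_state()

# New in v0.22.0: with no path argument, the most recent checkpoint saved
# under the project directory is found and loaded automatically.
accelerator.load_state()
```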

Multiple enhancements to gradient accumulation

This release adds several enhancements to distributed gradient accumulation.

  • accelerator.accumulate() now supports passing in multiple models, introduced via #1708 (see the sketch after this list)
  • A utility has been introduced to perform multiple forward passes, then multiple backward passes, and only sync the gradients on the last .backward(), via #1726
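
A minimal sketch of the multi-model form of accumulate(); the two tiny linear models, data, and loss below are placeholders standing in for, say, a generator/discriminator pair:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

# Tiny placeholder models and data, for illustration only.
model_a = torch.nn.Linear(8, 8)
model_b = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(
    list(model_a.parameters()) + list(model_b.parameters()), lr=1e-3
)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 8), batch_size=8)

model_a, model_b, optimizer, dataloader = accelerator.prepare(
    model_a, model_b, optimizer, dataloader
)

for batch in dataloader:
    # Passing both models keeps their gradient synchronisation in lockstep:
    # gradients are only synced across processes on the step that completes
    # the accumulation window.
    with accelerator.accumulate(model_a, model_b):
        loss = model_b(model_a(batch)).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```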

FSDP Changes

  • FSDP support has been added for NPU and XPU devices via #1803 and #1806
  • A new method for RAM-efficient loading of models with FSDP has been added via #1777

DataLoader Changes

  • Custom slice functions are now supported in the DataLoaderDispatcher, added via #1846 (see the sketch below)
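
As a rough sketch of the idea: a slice function receives the batch fetched on the main process and returns the shard destined for a given process. The keyword name and the slice-function signature below are assumptions for illustration only; consult #1846 and the prepare_data_loader documentation for the exact API.

```python
import torch
from torch.utils.data import DataLoader
from accelerate.data_loader import prepare_data_loader

def shard_rows(batch, process_index, num_processes):
    # Hypothetical signature: take the full batch fetched on the main process
    # and return the rows destined for `process_index`.
    return batch[process_index::num_processes]

# Illustrative dataset: 16 rows of 2 features each.
dataloader = DataLoader(torch.arange(32.0).reshape(16, 2), batch_size=8)

dispatched = prepare_data_loader(
    dataloader,
    dispatch_batches=True,              # route batches through DataLoaderDispatcher
    slice_fn_for_dispatch=shard_rows,   # assumed keyword name; see #1846
)
```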


Full Changelog: v0.21.0...v0.22.0