Releases: huggingface/accelerate

v0.29.0: NUMA affinity control, MLU Support, and DeepSpeed Improvements

05 Apr 14:27

Core

  • Accelerate can now optimize NUMA affinity, which can help increase throughput on NVIDIA multi-GPU systems. To enable it, either follow the prompt during accelerate config, set the ACCELERATE_CPU_AFFINITY=1 environment variable, or set it manually in code as follows:
from accelerate.utils import set_numa_affinity

# For GPU 0
set_numa_affinity(0)

Big thanks to @stas00 for the recommendation, request, and feedback during development.

  • Allow for setting deterministic algorithms in set_seed by @muellerzr in #2569 (see the sketch after this list)
  • Fixed the test script for TPU v2/v3 by @vanbasten23 in #2542
  • Cambricon MLU device support introduced by @huismiling in #2552
  • A big refactor of PartialState and AcceleratorState was performed to allow for easier future-proofing and to simplify adding new devices, by @muellerzr in #2576
  • Fixed a reproducibility issue in distributed environments with Dataloader shuffling when using BatchSamplerShard by @universuen in #2584
  • notebook_launcher can use multiple GPUs in Google Colab if using a custom instance that supports multiple GPUs by @StefanTodoran in #2561
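A minimal sketch of the deterministic set_seed change above, assuming the flag added in #2569 is named deterministic:

from accelerate.utils import set_seed

# Seed all RNGs and additionally request deterministic algorithms
# (assumption: the new flag is named `deterministic`), enabling
# torch's deterministic-algorithm mode under the hood
set_seed(42, deterministic=True)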

Big Model Inference

  • Add a log message when performing multi-GPU inference with device_map on RTX 4000 series cards, which can otherwise lead to hanging, by @SunMarc in #2557
  • Fix load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in #2588

DeepSpeed

  • Fix an issue with the mapping of main_process_ip and master_addr when not using the standard DeepSpeed launcher, by @asdfry in #2495
  • Improve DeepSpeed environment generation by checking for bad keys, by @muellerzr and @ricklamers in #2565
  • We now support custom DeepSpeed env files. As with normal DeepSpeed, set one with the DS_ENV_FILE environment variable (see the sketch after this list), by @muellerzr in #2566
  • Resolve ZeRO-3 Initialization Failure in already-started distributed environments by @sword865 in #2578
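A minimal sketch of pointing the launcher at a custom env file via DS_ENV_FILE; the wrapper-script approach, file path, and training-script name are purely illustrative:

import os
import subprocess

# Export DS_ENV_FILE so Accelerate picks up the custom env file when it
# builds the DeepSpeed command (the path below is hypothetical)
env = dict(os.environ, DS_ENV_FILE="/path/to/.deepspeed_env")
subprocess.run(["accelerate", "launch", "train.py"], env=env, check=True)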

Full Changelog: v0.28.0...v0.29.0

v0.28.0: DataLoaderConfig, XLA improvements, FSDP + QLORA foundations, Gradient Synchronization Tweaks, and Bug Fixes

12 Mar 16:58

Core

  • Introduce a DataLoaderConfiguration and begin deprecating the corresponding arguments on the Accelerator:
+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)
  • Allow gradients to be synced on each data batch while performing gradient accumulation, which is useful when training with FSDP, by @fabianlim in #2531:
from accelerate import Accelerator, GradientAccumulationPlugin

plugin = GradientAccumulationPlugin(
    num_steps=2,
    sync_each_batch=True,  # sync gradients on every batch instead of only the final one
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)

Torch XLA

  • Support for XLA on the GPU by @anw90 in #2176
  • Enable gradient accumulation on TPU in #2453

FSDP

  • Support for downstream FSDP + QLoRA fine-tuning through tweaks allowing configuration of buffer precision, by @pacman100 in #2544


Full Changelog: v0.27.2...v0.28.0

v0.27.0: PyTorch 2.2.0 Support, PyTorch-Native Pipeline Parallelism, DeepSpeed XPU support, and Bug Fixes

09 Feb 16:30

PyTorch 2.2.0 Support

With the latest release of PyTorch 2.2.0, we've ensured that there are no breaking changes when using it with Accelerate.

PyTorch-Native Pipeline Parallel Inference

With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so there is no need to use Megatron or DeepSpeed)! It supports automatic splitting of model weights across devices using an API similar to device_map="auto". This is still under heavy development; however, the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.

Requires pippy version 0.2.0 or later (pip install torchpippy -U).

Example usage (combined with accelerate launch or torchrun):

import torch
from transformers import AutoModelForSequenceClassification

from accelerate import PartialState, prepare_pippy

# `input` is assumed to be a pre-tokenized batch of inputs prepared ahead of time
model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model = prepare_pippy(model, split_points="auto", example_args=(input,))
input = input.to("cuda:0")
with torch.no_grad():
    output = model(input)
# The outputs are only on the final process by default
# You can pass in `gather_outputs=True` to prepare_pippy to
# make them available on all processes
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)

DeepSpeed

This release provides support for utilizing DeepSpeed on XPU devices, thanks to @faaany.


Full Changelog: v0.26.1...v0.27.0

v0.26.1: Patch Release

11 Jan 15:26

What's Changed

  • Raise error when using batches of different sizes with dispatch_batches=True by @SunMarc in #2325

Full Changelog: v0.26.0...v0.26.1

v0.26.0 - MS-AMP Support, Critical Regression Fixes, and More

11 Jan 14:55

Support for MS-AMP

This release adds support for MS-AMP (the Microsoft Automatic Mixed Precision library) to Accelerate as an alternative backend for FP8 training on appropriate hardware. It is now the default FP8 backend of choice. Read more in the docs here. Introduced in #2232 by @muellerzr
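A minimal sketch of what opting in to FP8 looks like, assuming compatible hardware and an MS-AMP install; explicitly selecting the backend through FP8RecipeKwargs is shown as an assumption for illustration, and the rest of the training setup is unchanged:

from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# Request FP8 mixed precision and (assumed kwarg) pick the MS-AMP backend explicitly
kwargs = [FP8RecipeKwargs(backend="msamp")]
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=kwargs)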

Core

In the prior release, a new sampler for the DataLoader was introduced. While it shows no statistical differences in results across seeds, repeating the same seed could produce a different final accuracy, which alarmed some users. We have now disabled this behavior by default, as it required some additional setup, and brought back the original implementation. To use the new sampling technique (which can provide more accurate repeated results), pass use_seedable_sampler=True to the Accelerator. We will be propagating this up to the Trainer soon.
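Opting back in is a one-line change (a minimal sketch; the rest of the setup stays the same):

from accelerate import Accelerator

# Re-enable the seedable sampler for reproducible repeated runs
accelerator = Accelerator(use_seedable_sampler=True)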

Big Model Inference

  • NPU support was added thanks to @statelesshz in #2222
  • When generating an automatic device_map we've made it possible to not return grouped key results if desired in #2233
  • We now handle corner cases better when users pass device_map="cuda" etc thanks to @younesbelkada in #2254

FSDP and DeepSpeed

  • Many improvements to the docs have been made thanks to @stas00. Along with this, we've made it easier to adjust the config for the sharding strategy and other config values, thanks to @pacman100 in #2288

  • A regression introduced in Accelerate 0.23.0 made learning much slower on multi-GPU setups compared to a single GPU. #2304 has now fixed this, thanks to @pacman100

  • The DeepSpeed integration now also handles auto values better when building a configuration, in #2313

Bits and Bytes

  • Params4bit added to bnb classes in set_module_tensor_to_device() by @poedator in #2315

Device Agnostic Testing

For developers, we've made it much easier to run the tests on different devices with no change to the code, thanks to @statelesshz in #2123 and #2235.

Bug Fixes

Major Contributors

  • @statelesshz for their work on device-agnostic testing and NPU support
  • @stas00 for many docfixes when it comes to DeepSpeed and FSDP


v0.25.0: safetensors by default, new trackers, and plenty of bug fixes

01 Dec 15:24

Safetensors default

As of this release, safetensors will be the default format saved when applicable! To read more about safetensors and why it's best to use it for safety (instead of pickle/torch.save), check it out here.

New Experiment Trackers

This release has two new experiment trackers, ClearML and DVCLive!

To use them, just pass clear_ml or dvclive to log_with in the Accelerator init. h/t to @eugen-ajechiloae-clearml and @dberenbaum
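A minimal sketch of hooking up one of the new trackers, assuming dvclive is installed; the project name and logged metric are purely illustrative:

from accelerate import Accelerator

accelerator = Accelerator(log_with="dvclive")
accelerator.init_trackers(project_name="my-project")  # hypothetical project name

accelerator.log({"train_loss": 0.42}, step=1)  # illustrative metric
accelerator.end_training()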

DeepSpeed

  • Accelerate's DeepSpeed integration now supports NPU devices, h/t to @statelesshz
  • DeepSpeed can now be launched via accelerate on single GPU setups

FSDP

FSDP underwent a huge refactor so that the interface when using FSDP is exactly the same as in every other scenario when using Accelerate. No more needing to call accelerator.prepare() twice!

Other useful enhancements

  • We now try to disable P2P communications on consumer GPUs from the 3090 series onward, since NVIDIA has dropped P2P support there and users were otherwise seeing timeout issues and the like. When using accelerate launch we disable it automatically, and if we detect that it is still enabled on a distributed setup using 3090s or newer, we raise an error.

  • When doing .gather(), we now explicitly raise an error if tensors are on different devices (for now only valid on CUDA)

Bug fixes

  • Fixed a bug that caused dataloaders to not shuffle despite shuffle=True when using multiple GPUs and the new SeedableRandomSampler.


Full Changelog: v0.24.1...v0.25.0

v0.24.1: Patch Release for Samplers

30 Oct 14:12
  • Fixes #2091 by changing how the check for custom samplers is done

v0.24.0: Improved Reproducibility, Bug fixes, and other Small Improvements

24 Oct 17:37

Improved Reproducibility

One critical issue with Accelerate was that training runs differed when using an iterable dataset, no matter what seeds were set. v0.24.0 introduces the dataloader.set_epoch() function on all Accelerate DataLoaders: if the underlying dataset (or sampler) has the ability to set the epoch for reproducibility, it will now do so. This is similar to the implementation already existing in transformers. To use:

dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)

For more information see this PR; we will update the docs in a subsequent release with more information on this API.

Documentation

  • The quick tour docs have gotten a complete makeover thanks to @MKhalusova. Take a look here
  • We also now have documentation on how to perform multinode training, see the launch docs

Internal structure

  • Shared file systems are now supported under save and save_state via the ProjectConfiguration dataclass. See #1953 for more info.
  • FSDP can now be used for bfloat16 mixed precision via torch.autocast
  • all_gather_into_tensor is now used as the main gather operation, reducing memory in the cases of big tensors
  • Specifying drop_last=True will now properly have the desired effect when performing Accelerator().gather_for_metrics() (see the sketch after this list)
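A rough sketch of the evaluation pattern the last bullet refers to; the model, dataloader, and metric handling are assumed stand-ins:

import torch

from accelerate import Accelerator

accelerator = Accelerator()
# `model` and `eval_dataloader` are assumed to exist; the dataloader is built with drop_last=True
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

all_preds = []
for batch in eval_dataloader:
    with torch.no_grad():
        logits = model(**batch).logits
    # gather_for_metrics drops samples duplicated for even sharding, keeping metrics exact
    all_preds.append(accelerator.gather_for_metrics(logits.argmax(dim=-1)))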


Full Changelog: v0.23.0...v0.24.0

v0.23.0: Model Memory Estimation tool, Breakpoint API, Multi-Node Notebook Launcher Support, and more!

14 Sep 19:23

Model Memory Estimator

A new model estimation tool to help calculate how much memory is needed for inference has been added. This does not download the pretrained weights, and utilizes init_empty_weights to stay memory efficient during the calculation.

Usage directions:

accelerate estimate-memory {model_name} --library {library_name} --dtypes fp16 int8

Or:

from accelerate.commands.estimate import estimate_command_parser, estimate_command, gather_data

parser = estimate_command_parser()
args = parser.parse_args(["bert-base-cased", "--dtypes", "float32"])
output = gather_data(args)

🤗 Hub is a first-class citizen

We've made the huggingface_hub library a first-class citizen of the framework! While this is mainly for the model estimation tool, it opens the door for further integrations should they be wanted.

Accelerator Enhancements:

  • gather_for_metrics will now also de-dupe for non-tensor objects. See #1937
  • mixed_precision="bf16" support on NPU devices. See #1949
  • New breakpoint API to help when you need to break out of a loop based on a condition hit on a single process (see the sketch below). See #1940
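A minimal sketch of how the breakpoint API fits into a loop, assuming the methods introduced in #1940 are set_trigger() and check_trigger(); the dataloader, step function, and early-stop condition are hypothetical:

from accelerate import Accelerator

accelerator = Accelerator()
# `dataloader` and `training_step` are hypothetical stand-ins for your own loop
for step, batch in enumerate(dataloader):
    loss = training_step(batch)
    # Flag the condition on whichever process happens to observe it...
    if loss > 100:  # illustrative condition
        accelerator.set_trigger()
    # ...and break on every process once any one of them has set the trigger
    if accelerator.check_trigger():
        break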

Notebook Launcher Enhancements:

  • The notebook launcher now supports launching across multiple nodes! See #1913
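A rough sketch of multi-node use from a notebook, assuming the new arguments from #1913 follow the num_nodes/node_rank/master_addr naming; the function, counts, and address are illustrative only:

from accelerate import notebook_launcher

def training_loop():
    ...  # your usual training function

# Run this cell on every node, bumping node_rank on each one (all values illustrative)
notebook_launcher(
    training_loop,
    num_processes=8,             # GPUs to launch (illustrative)
    num_nodes=2,                 # assumed name of the node-count argument
    node_rank=0,                 # assumed: 0 on the main node, 1 on the other
    master_addr="192.168.0.10",  # hypothetical address of the main node
)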

FSDP Enhancements:

  • Activation checkpointing is now natively supported in the framework. See #1891
  • torch.compile support was fixed. See #1919

DeepSpeed Enhancements:

  • XPU/ccl support (#1827)
  • Easier gradient accumulation support, simply set gradient_accumulation_steps to "auto" in your deepspeed config, and Accelerate will use the one passed to Accelerator instead (#1901)
  • Support for custom schedulers and deepspeed optimizers (#1909)


Full Changelog: v0.22.0...v0.23.0

v0.22.0: Distributed operation framework, Gradient Accumulation enhancements, FSDP enhancements, and more!

23 Aug 06:26

Experimental distributed operations checking framework

A new framework has been introduced which can help catch timeout errors caused by distributed operations failing before they occur. As this adds a tiny bit of overhead, it is an opt-in scenario. Simply run your code with ACCELERATE_DEBUG_MODE="1" to enable this. Read more in the docs, introduced via #1756

Accelerator.load_state can now load the most recent checkpoint automatically

If a ProjectConfiguration has been made, using accelerator.load_state() (without any arguments passed) can now automatically find and load the latest checkpoint used, introduced via #1741
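A minimal sketch of the pattern, with a hypothetical project directory and assuming automatic checkpoint naming is enabled on the ProjectConfiguration:

from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

# Hypothetical output directory; automatic_checkpoint_naming keeps checkpoints ordered
config = ProjectConfiguration(project_dir="outputs", automatic_checkpoint_naming=True)
accelerator = Accelerator(project_config=config)

accelerator.save_state()  # called periodically during training

# Later: with no arguments, the most recent checkpoint under project_dir is loaded
accelerator.load_state()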

Multiple enhancements to gradient accumulation

In this release multiple new enhancements to distributed gradient accumulation have been added.

  • accelerator.accumulate() now supports passing in multiple models, introduced via #1708 (see the sketch after this list)
  • A util has been introduced to perform multiple forwards, then multiple backwards, and finally sync the gradients only on the last .backward() via #1726
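A minimal sketch of accumulating over two models at once; the models, optimizer, dataloader, and loss computation are hypothetical stand-ins:

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
# `model_a`, `model_b`, `optimizer`, and `dataloader` are assumed to be prepared already
for batch in dataloader:
    # Both models are tracked, so gradients only sync on the final accumulation step
    with accelerator.accumulate(model_a, model_b):
        loss = compute_loss(model_a, model_b, batch)  # hypothetical helper
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()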

FSDP Changes

  • FSDP support has been added for NPU and XPU devices via #1803 and #1806
  • A new method for supporting RAM-efficient loading of models with FSDP has been added via #1777

DataLoader Changes

  • Custom slice functions are now supported in the DataLoaderDispatcher, added via #1846


Full Changelog: v0.21.0...v0.22.0