Releases: huggingface/accelerate

v0.29.0: NUMA affinity control, MLU Support, and DeepSpeed Improvements

05 Apr 14:27

Core

  • Accelerate can now optimize NUMA affinity, which can help increase throughput on NVIDIA multi-GPU systems. To enable it, either follow the prompt during accelerate config, set the ACCELERATE_CPU_AFFINITY=1 environment variable, or set it manually in code as follows:
from accelerate.utils import set_numa_affinity

# For GPU 0
set_numa_affinity(0)

Big thanks to @stas00 for the recommendation, request, and feedback during development.

  • Allow for setting deterministic algorithms in set_seed by @muellerzr in #2569 (see the sketch after this list)
  • Fixed the test script for TPU v2/v3 by @vanbasten23 in #2542
  • Cambricon MLU device support introduced by @huismiling in #2552
  • A big refactor of PartialState and AcceleratorState was performed to allow for easier future-proofing and to simplify adding new devices, by @muellerzr in #2576
  • Fixed a reproducibility issue in distributed environments with Dataloader shuffling when using BatchSamplerShard by @universuen in #2584
  • notebook_launcher can use multiple GPUs in Google Colab if using a custom instance that supports multiple GPUs by @StefanTodoran in #2561
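A minimal sketch of the deterministic set_seed change above, assuming the flag added in #2569 is named deterministic:

from accelerate.utils import set_seed

# Seed all RNGs and additionally request deterministic algorithms
# (assumption: the new flag is named `deterministic`), enabling
# torch's deterministic-algorithm mode under the hood
set_seed(42, deterministic=True)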

Big Model Inference

  • Add a log message when performing multi-GPU inference with device_map on RTX 4000 series cards, which can otherwise lead to hanging, by @SunMarc in #2557
  • Fix load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in #2588

DeepSpeed

  • Fix an issue with the mapping of main_process_ip and master_addr when not using the standard DeepSpeed launcher, by @asdfry in #2495
  • Improve DeepSpeed environment generation by checking for bad keys, by @muellerzr and @ricklamers in #2565
  • We now support custom DeepSpeed env files. As with normal DeepSpeed, set one with the DS_ENV_FILE environment variable (see the sketch after this list), by @muellerzr in #2566
  • Resolve ZeRO-3 Initialization Failure in already-started distributed environments by @sword865 in #2578
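A minimal sketch of pointing the launcher at a custom env file via DS_ENV_FILE; the wrapper-script approach, file path, and training-script name are purely illustrative:

import os
import subprocess

# Export DS_ENV_FILE so Accelerate picks up the custom env file when it
# builds the DeepSpeed command (the path below is hypothetical)
env = dict(os.environ, DS_ENV_FILE="/path/to/.deepspeed_env")
subprocess.run(["accelerate", "launch", "train.py"], env=env, check=True)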

Full Changelog: v0.28.0...v0.29.0

v0.28.0: DataLoaderConfig, XLA improvements, FSDP + QLORA foundations, Gradient Synchronization Tweaks, and Bug Fixes

12 Mar 16:58

Core

  • Introduce a DataLoaderConfiguration and begin deprecating the corresponding arguments on the Accelerator:
+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)
  • Allow gradients to be synced on each data batch while performing gradient accumulation, which is useful when training with FSDP, by @fabianlim in #2531:
from accelerate import Accelerator, GradientAccumulationPlugin

plugin = GradientAccumulationPlugin(
    num_steps=2,
    sync_each_batch=True,  # sync gradients on every batch instead of only the final one
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)

Torch XLA

  • Support for XLA on the GPU by @anw90 in #2176
  • Enable gradient accumulation on TPU in #2453

FSDP

  • Support for downstream FSDP + QLoRA fine-tuning through tweaks allowing configuration of buffer precision, by @pacman100 in #2544


Full Changelog: v0.27.2...v0.28.0

v0.27.0: PyTorch 2.2.0 Support, PyTorch-Native Pipeline Parallelism, DeepSpeed XPU support, and Bug Fixes

09 Feb 16:30

PyTorch 2.2.0 Support

With the latest release of PyTorch 2.2.0, we've ensured that there are no breaking changes when using it with Accelerate.

PyTorch-Native Pipeline Parallel Inference

With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so there is no need to use Megatron or DeepSpeed)! It supports automatic splitting of model weights across devices using an API similar to device_map="auto". This is still under heavy development; however, the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.

Requires pippy version 0.2.0 or later (pip install torchpippy -U).

Example usage (combined with accelerate launch or torchrun):

import torch
from transformers import AutoModelForSequenceClassification

from accelerate import PartialState, prepare_pippy

# `input` is assumed to be a pre-tokenized batch of inputs prepared ahead of time
model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model = prepare_pippy(model, split_points="auto", example_args=(input,))
input = input.to("cuda:0")
with torch.no_grad():
    output = model(input)
# The outputs are only on the final process by default
# You can pass in `gather_outputs=True` to prepare_pippy to
# make them available on all processes
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)

DeepSpeed

This release provides support for utilizing DeepSpeed on XPU devices, thanks to @faaany.


Full Changelog: v0.26.1...v0.27.0

v0.26.1: Patch Release

11 Jan 15:26

What's Changed

  • Raise error when using batches of different sizes with dispatch_batches=True by @SunMarc in #2325

Full Changelog: v0.26.0...v0.26.1

v0.26.0 - MS-AMP Support, Critical Regression Fixes, and More

11 Jan 14:55

Support for MS-AMP

This release adds support for MS-AMP (the Microsoft Automatic Mixed Precision library) to Accelerate as an alternative backend for FP8 training on appropriate hardware. It is now the default FP8 backend of choice. Read more in the docs here. Introduced in #2232 by @muellerzr
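A minimal sketch of what opting in to FP8 looks like, assuming compatible hardware and an MS-AMP install; explicitly selecting the backend through FP8RecipeKwargs is shown as an assumption for illustration, and the rest of the training setup is unchanged:

from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# Request FP8 mixed precision and (assumed kwarg) pick the MS-AMP backend explicitly
kwargs = [FP8RecipeKwargs(backend="msamp")]
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=kwargs)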

Core

In the prior release, a new sampler for the DataLoader was introduced. While it shows no statistical differences in results across seeds, repeating the same seed could produce a different final accuracy, which alarmed some users. We have now disabled this behavior by default, as it required some additional setup, and brought back the original implementation. To use the new sampling technique (which can provide more accurate repeated results), pass use_seedable_sampler=True to the Accelerator. We will be propagating this up to the Trainer soon.
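Opting back in is a one-line change (a minimal sketch; the rest of the setup stays the same):

from accelerate import Accelerator

# Re-enable the seedable sampler for reproducible repeated runs
accelerator = Accelerator(use_seedable_sampler=True)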

Big Model Inference

  • NPU support was added thanks to @statelesshz in #2222
  • When generating an automatic device_map we've made it possible to not return grouped key results if desired in #2233
  • We now handle corner cases better when users pass device_map="cuda" etc thanks to @younesbelkada in #2254

FSDP and DeepSpeed

  • Many improvements to the docs have been made thanks to @stas00. Along with this, we've made it easier to adjust the config for the sharding strategy and other config values, thanks to @pacman100 in #2288

  • A regression introduced in Accelerate 0.23.0 made learning much slower on multi-GPU setups compared to a single GPU. #2304 has now fixed this, thanks to @pacman100

  • The DeepSpeed integration now also handles auto values better when building a configuration, in #2313

Bits and Bytes

  • Params4bit added to bnb classes in set_module_tensor_to_device() by @poedator in #2315

Device Agnostic Testing

For developers, we've made it much easier to run the tests on different devices with no change to the code, thanks to @statelesshz in #2123 and #2235.

Bug Fixes

Major Contributors

  • @statelesshz for their work on device-agnostic testing and NPU support
  • @stas00 for many docfixes when it comes to DeepSpeed and FSDP


v0.25.0: safetensors by default, new trackers, and plenty of bug fixes

01 Dec 15:24

Safetensors default

As of this release, safetensors will be the default format saved when applicable! To read more about safetensors and why it's best to use it for safety (instead of pickle/torch.save), check it out here.

New Experiment Trackers

This release has two new experiment trackers, ClearML and DVCLive!

To use them, just pass clear_ml or dvclive to log_with in the Accelerator init. h/t to @eugen-ajechiloae-clearml and @dberenbaum
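A minimal sketch of hooking up one of the new trackers, assuming dvclive is installed; the project name and logged metric are purely illustrative:

from accelerate import Accelerator

accelerator = Accelerator(log_with="dvclive")
accelerator.init_trackers(project_name="my-project")  # hypothetical project name

accelerator.log({"train_loss": 0.42}, step=1)  # illustrative metric
accelerator.end_training()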

DeepSpeed

  • Accelerate's DeepSpeed integration now supports NPU devices, h/t to @statelesshz
  • DeepSpeed can now be launched via accelerate on single GPU setups

FSDP

FSDP underwent a huge refactor so that the interface when using FSDP is exactly the same as in every other scenario when using Accelerate. No more needing to call accelerator.prepare() twice!

Other useful enhancements

  • We now try to disable P2P communications on consumer GPUs from the 3090 series onward, since NVIDIA has dropped P2P support there and users were otherwise seeing timeout issues and the like. When using accelerate launch we disable it automatically, and if we detect that it is still enabled on a distributed setup using 3090s or newer, we raise an error.

  • When doing .gather(), we now explicitly raise an error if tensors are on different devices (for now only valid on CUDA)

Bug fixes

  • Fixed a bug that caused dataloaders to not shuffle despite shuffle=True when using multiple GPUs and the new SeedableRandomSampler.


Full Changelog: v0.24.1...v0.25.0

v0.24.1: Patch Release for Samplers

30 Oct 14:12
  • Fixes #2091 by changing how the check for custom samplers is done

v0.24.0: Improved Reproducibility, Bug fixes, and other Small Improvements

24 Oct 17:37

Improved Reproducibility

One critical issue with Accelerate was that training runs differed when using an iterable dataset, no matter what seeds were set. v0.24.0 introduces the dataloader.set_epoch() function on all Accelerate DataLoaders: if the underlying dataset (or sampler) has the ability to set the epoch for reproducibility, it will now do so. This is similar to the implementation already existing in transformers. To use:

dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)

For more information see this PR; we will update the docs in a subsequent release with more information on this API.

Documentation

  • The quick tour docs have gotten a complete makeover thanks to @MKhalusova. Take a look here
  • We also now have documentation on how to perform multinode training, see the launch docs

Internal structure

  • Shared file systems are now supported under save and save_state via the ProjectConfiguration dataclass. See #1953 for more info.
  • FSDP can now be used for bfloat16 mixed precision via torch.autocast
  • all_gather_into_tensor is now used as the main gather operation, reducing memory in the cases of big tensors
  • Specifying drop_last=True will now properly have the desired effect when performing Accelerator().gather_for_metrics() (see the sketch after this list)
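A rough sketch of the evaluation pattern the last bullet refers to; the model, dataloader, and metric handling are assumed stand-ins:

import torch

from accelerate import Accelerator

accelerator = Accelerator()
# `model` and `eval_dataloader` are assumed to exist; the dataloader is built with drop_last=True
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

all_preds = []
for batch in eval_dataloader:
    with torch.no_grad():
        logits = model(**batch).logits
    # gather_for_metrics drops samples duplicated for even sharding, keeping metrics exact
    all_preds.append(accelerator.gather_for_metrics(logits.argmax(dim=-1)))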


Full Changelog: v0.23.0...v0.24.0

v0.23.0: Model Memory Estimation tool, Breakpoint API, Multi-Node Notebook Launcher Support, and more!

14 Sep 19:23

Model Memory Estimator

A new model estimation tool to help calculate how much memory is needed for inference has been added. This does not download the pretrained weights, and utilizes init_empty_weights to stay memory efficient during the calculation.

Usage directions:

accelerate estimate-memory {model_name} --library {library_name} --dtypes fp16 int8

Or:

from accelerate.commands.estimate import estimate_command_parser, estimate_command, gather_data

parser = estimate_command_parser()
args = parser.parse_args(["bert-base-cased", "--dtypes", "float32"])
output = gather_data(args)

🤗 Hub is a first-class citizen

We've made the huggingface_hub library a first-class citizen of the framework! While this is mainly for the model estimation tool, it opens the door for further integrations should they be wanted.

Accelerator Enhancements:

  • gather_for_metrics will now also de-dupe for non-tensor objects. See #1937
  • mixed_precision="bf16" support on NPU devices. See #1949
  • New breakpoint API to help when you need to break out of a loop based on a condition hit on a single process (see the sketch below). See #1940
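A minimal sketch of how the breakpoint API fits into a loop, assuming the methods introduced in #1940 are set_trigger() and check_trigger(); the dataloader, step function, and early-stop condition are hypothetical:

from accelerate import Accelerator

accelerator = Accelerator()
# `dataloader` and `training_step` are hypothetical stand-ins for your own loop
for step, batch in enumerate(dataloader):
    loss = training_step(batch)
    # Flag the condition on whichever process happens to observe it...
    if loss > 100:  # illustrative condition
        accelerator.set_trigger()
    # ...and break on every process once any one of them has set the trigger
    if accelerator.check_trigger():
        break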

Notebook Launcher Enhancements:

  • The notebook launcher now supports launching across multiple nodes! See #1913
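A rough sketch of multi-node use from a notebook, assuming the new arguments from #1913 follow the num_nodes/node_rank/master_addr naming; the function, counts, and address are illustrative only:

from accelerate import notebook_launcher

def training_loop():
    ...  # your usual training function

# Run this cell on every node, bumping node_rank on each one (all values illustrative)
notebook_launcher(
    training_loop,
    num_processes=8,             # GPUs to launch (illustrative)
    num_nodes=2,                 # assumed name of the node-count argument
    node_rank=0,                 # assumed: 0 on the main node, 1 on the other
    master_addr="192.168.0.10",  # hypothetical address of the main node
)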

FSDP Enhancements:

  • Activation checkpointing is now natively supported in the framework. See #1891
  • torch.compile support was fixed. See #1919

DeepSpeed Enhancements:

  • XPU/ccl support (#1827)
  • Easier gradient accumulation support, simply set gradient_accumulation_steps to "auto" in your deepspeed config, and Accelerate will use the one passed to Accelerator instead (#1901)
  • Support for custom schedulers and deepspeed optimizers (#1909)


Full Changelog: v0.22.0...v0.23.0

v0.22.0: Distributed operation framework, Gradient Accumulation enhancements, FSDP enhancements, and more!

23 Aug 06:26

Experimental distributed operations checking framework

A new framework has been introduced which can help catch timeout errors caused by distributed operations failing before they occur. As this adds a tiny bit of overhead, it is an opt-in scenario. Simply run your code with ACCELERATE_DEBUG_MODE="1" to enable this. Read more in the docs, introduced via #1756

Accelerator.load_state can now load the most recent checkpoint automatically

If a ProjectConfiguration has been made, using accelerator.load_state() (without any arguments passed) can now automatically find and load the latest checkpoint used, introduced via #1741
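A minimal sketch of the pattern, with a hypothetical project directory and assuming automatic checkpoint naming is enabled on the ProjectConfiguration:

from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

# Hypothetical output directory; automatic_checkpoint_naming keeps checkpoints ordered
config = ProjectConfiguration(project_dir="outputs", automatic_checkpoint_naming=True)
accelerator = Accelerator(project_config=config)

accelerator.save_state()  # called periodically during training

# Later: with no arguments, the most recent checkpoint under project_dir is loaded
accelerator.load_state()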

Multiple enhancements to gradient accumulation

In this release multiple new enhancements to distributed gradient accumulation have been added.

  • accelerator.accumulate() now supports passing in multiple models, introduced via #1708 (see the sketch after this list)
  • A util has been introduced to perform multiple forwards, then multiple backwards, and finally sync the gradients only on the last .backward() via #1726
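A minimal sketch of accumulating over two models at once; the models, optimizer, dataloader, and loss computation are hypothetical stand-ins:

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
# `model_a`, `model_b`, `optimizer`, and `dataloader` are assumed to be prepared already
for batch in dataloader:
    # Both models are tracked, so gradients only sync on the final accumulation step
    with accelerator.accumulate(model_a, model_b):
        loss = compute_loss(model_a, model_b, batch)  # hypothetical helper
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()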

FSDP Changes

  • FSDP support has been added for NPU and XPU devices via #1803 and #1806
  • A new method for supporting RAM-efficient loading of models with FSDP has been added via #1777

DataLoader Changes

  • Custom slice functions are now supported in the DataLoaderDispatcher, added via #1846


Full Changelog: v0.21.0...v0.22.0