enable find_executable_batch_size on XPU (#3236)
* enable on XPU

* Update src/accelerate/utils/memory.py

Co-authored-by: Benjamin Bossan <[email protected]>

---------

Co-authored-by: Benjamin Bossan <[email protected]>
faaany and BenjaminBossan authored Nov 19, 2024
1 parent 8ade23c commit cf169a1
Showing 2 changed files with 5 additions and 4 deletions.
6 changes: 3 additions & 3 deletions docs/source/basic_tutorials/troubleshooting.md
@@ -142,9 +142,9 @@ hostnames for each of the nodes.
mpirun -f hostfile -n {number of nodes} -ppn 1 hostname
```

-## CUDA Out-of-Memory
+## Out-of-Memory

-One of the most frustrating errors when it comes to running training scripts is hitting "CUDA Out-of-Memory". The entire script needs to be restarted and any progress is lost.
+One of the most frustrating errors when it comes to running training scripts is hitting "Out-of-Memory" on devices like CUDA, XPU or CPU. The entire script needs to be restarted and any progress is lost.

To address this problem, Accelerate provides the [`find_executable_batch_size`] utility that is heavily based on [toma](https://github.com/BlackHC/toma).
This utility retries code that fails due to OOM (out-of-memory) conditions and automatically lowers batch sizes. For each OOM condition, the algorithm decreases the batch size by half and retries the code until it succeeds.
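
Conceptually, the retry logic is just a loop that halves the batch size whenever an OOM-style failure is detected. The sketch below is a hypothetical simplification for illustration only (`run_with_fallback` and `train_step` are made-up names, not part of the library); it reuses the library's real `should_reduce_batch_size` helper shown in the second file of this commit:

```python
from accelerate.utils.memory import should_reduce_batch_size

def run_with_fallback(starting_batch_size, train_step):
    # Hypothetical simplification of the halving strategy; the real
    # implementation lives in src/accelerate/utils/memory.py.
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            return train_step(batch_size)  # user-supplied training callable
        except Exception as e:
            if should_reduce_batch_size(e):
                batch_size //= 2  # OOM detected: halve and retry
            else:
                raise  # unrelated errors propagate unchanged
    raise RuntimeError("No executable batch size found.")
```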
@@ -153,7 +153,7 @@ To use [`find_executable_batch_size`], restructure your training function to inc

<Tip warning={true}>

-The inner function **must** take batch size as the first parameter, but we do not pass one to it when called. The wrapper handles this for you. Any object (models, optimizers) that consumes CUDA memory and is passed to the [`Accelerator`] also **must** be declared inside the inner function.
+The inner function **must** take batch size as the first parameter, but we do not pass one to it when called. The wrapper handles this for you. Any object (models, optimizers) that consumes device memory and is passed to the [`Accelerator`] also **must** be declared inside the inner function.

</Tip>
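
Restructured this way, a training function might look like the following minimal, self-contained sketch (the toy linear model and random data are stand-ins; only `find_executable_batch_size`, `Accelerator.free_memory`, `prepare`, and `backward` are real Accelerate APIs):

```python
import torch
from accelerate import Accelerator
from accelerate.utils import find_executable_batch_size

def training_function():
    accelerator = Accelerator()

    @find_executable_batch_size(starting_batch_size=128)
    def inner_training_loop(batch_size):
        nonlocal accelerator       # reuse the outer Accelerator across retries
        accelerator.free_memory()  # drop references left over from a failed attempt
        # Everything holding device memory is (re)built inside the inner function:
        model = torch.nn.Linear(512, 2)
        optimizer = torch.optim.AdamW(model.parameters())
        dataset = torch.utils.data.TensorDataset(
            torch.randn(4096, 512), torch.randint(0, 2, (4096,))
        )
        loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
        model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
        model.train()
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(xb), yb)
            accelerator.backward(loss)
            optimizer.step()

    inner_training_loop()  # called with no argument; the wrapper supplies batch_size

training_function()
```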

3 changes: 2 additions & 1 deletion src/accelerate/utils/memory.py
@@ -92,14 +92,15 @@ def release_memory(*objects):

def should_reduce_batch_size(exception: Exception) -> bool:
    """
-    Checks if `exception` relates to CUDA out-of-memory, CUDNN not supported, or CPU out-of-memory
+    Checks if `exception` relates to CUDA out-of-memory, XPU out-of-memory, CUDNN not supported, or CPU out-of-memory

    Args:
        exception (`Exception`):
            An exception
    """
    _statements = [
        "CUDA out of memory.",  # CUDA OOM
+        "XPU out of memory.",  # XPU OOM
        "cuDNN error: CUDNN_STATUS_NOT_SUPPORTED.",  # CUDNN SNAFU
        "DefaultCPUAllocator: can't allocate memory",  # CPU OOM
    ]
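
For reference, a quick sanity check of the matcher (the surrounding function, not shown in this hunk, compares these statements against the message of a single-argument `RuntimeError`; the allocation detail in the message below is synthetic):

```python
from accelerate.utils.memory import should_reduce_batch_size

oom = RuntimeError("XPU out of memory. Tried to allocate 2.00 GiB")  # synthetic message
other = ValueError("unrelated failure")

print(should_reduce_batch_size(oom))    # True  -> halve the batch size and retry
print(should_reduce_batch_size(other))  # False -> re-raise
```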
