For testing, this is running with two MPI-connected Docker images (based on nvcr.io/nvidia/pytorch:24.09-py3) on an Apple M3 Max running macOS 15.
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
1. On an MPI system with more than one node, run accelerate launch --config_file=config_mpi.yaml nlp_example.py --cpu
2. Wait for the run to crash.
3. Get this output:
[rank0]: Traceback (most recent call last):
[rank0]: File "/accelerate/nlp_example.py", line 209, in <module>
[rank0]: main()
[rank0]: File "/accelerate/nlp_example.py", line 205, in main
[rank0]: training_function(config, args)
[rank0]: File "/accelerate/nlp_example.py", line 179, in training_function
[rank0]: predictions, references = accelerator.gather_for_metrics((predictions, batch["labels"]))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2500, in gather_for_metrics
[rank0]: data = self.gather(input_data)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2456, in gather
[rank0]: return gather(tensor)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 398, in wrapper
[rank0]: return function(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 437, in gather
[rank0]: return _gpu_gather(tensor)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 356, in _gpu_gather
[rank0]: return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 108, in recursively_apply
[rank0]: return honor_type(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 82, in honor_type
[rank0]: return type(obj)(generator)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 111, in <genexpr>
[rank0]: recursively_apply(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 127, in recursively_apply
[rank0]: return func(data, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 346, in _gpu_gather_one
[rank0]: gather_op(output_tensors, tensor)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3410, in all_gather_into_tensor
[rank0]: work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]: RuntimeError: no support for _allgather_base in MPI process group
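For reference, the same error can be triggered outside Accelerate with a minimal sketch along these lines (assuming an Open MPI launch such as mpirun -np 2 python repro.py, where repro.py is just an illustrative filename, and a PyTorch build with MPI support): all_gather_into_tensor maps onto _allgather_base, which the MPI process group does not implement.

import torch
import torch.distributed as dist

# With the MPI backend, rank and world size are taken from the MPI launcher,
# so no explicit rank/world_size arguments are needed here.
dist.init_process_group(backend="mpi")
world_size = dist.get_world_size()

local = torch.tensor([dist.get_rank()], dtype=torch.int64)
output = torch.empty(world_size, dtype=torch.int64)

# all_gather_into_tensor calls group._allgather_base under the hood and
# raises "no support for _allgather_base in MPI process group" here.
dist.all_gather_into_tensor(output, local)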
If I switch MPI out from under Accelerate, the example runs without any error message.
I do this by adding the following:
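A minimal sketch of one way to do this (assuming the approach is to initialize torch.distributed with the gloo backend before the Accelerator is constructed, so that Accelerate reuses the already-initialized non-MPI process group; the MASTER_ADDR/MASTER_PORT values and the OMPI_* variables are Open MPI-specific assumptions):

import os
import torch.distributed as dist

# Assumed rendezvous settings; "node0" stands in for the hostname of rank 0.
os.environ.setdefault("MASTER_ADDR", "node0")
os.environ.setdefault("MASTER_PORT", "29500")

# Open MPI exposes rank and world size through these variables; other MPI
# implementations use different names.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

# Initialize gloo instead of MPI before accelerate.Accelerator() is created.
dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)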
System Info
This is my config file:
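It follows the usual multi-CPU layout produced by accelerate config; an illustrative sketch for two machines looks like the following (the IP address and port are placeholders, and any MPI-specific launcher options are omitted):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_CPU
machine_rank: 0
main_process_ip: 192.168.0.1   # placeholder
main_process_port: 29500
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: true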
Expected behavior
I expect it not to crash.
It works fine if the number of machines is 1.