pipeline arguments are not matched #1130

Open
rednoah91 opened this issue Jul 12, 2024 · 8 comments


@rednoah91

Hi,

I followed the instructions to install PyTorch for pipelining:

pip install -r requirements.txt --find-links https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

I have version 2.4.0.dev20240605+cpu installed. When I run torchrun --nproc-per-node 4 pippy_gpt2.py, I get the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_gpt2.py", line 118, in <module>
[rank0]:     run(args)
[rank0]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_gpt2.py", line 50, in run
[rank0]:     pipe = pipeline(
[rank0]: TypeError: pipeline() got an unexpected keyword argument 'mb_args'

It seems the arguments of pipeline() are not matched. Have you encountered this before? Which PyTorch version are you using?

Thanks
Hong-Rong

@kwen2501
Contributor

Hi, thanks for trying it out.
It seems the PyTorch install link you used is outdated, which is why you only got the dev20240605 build. There are some API changes between that nightly and our release version in torch 2.4.
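If it helps, a quick way to check which build you have and which keyword arguments your local pipeline() accepts (just a diagnostic sketch, assuming the example imports pipeline from torch.distributed.pipelining):

import inspect
import torch
from torch.distributed.pipelining import pipeline

# Print the installed build and the signature of pipeline() in that build,
# so you can see whether the mb_args keyword is accepted.
print(torch.__version__)
print(inspect.signature(pipeline))
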
If you want to use the nightly version, could you please try this link instead?

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

Or, if you want it to be stable, install the release version (remove the --pre flag from the command above).

Note:
Our library currently only supports CUDA devices, hence the cu121 in the link above. CPU support is not stable due to Gloo's weak P2P support.

cc: @wconstab @H-Huang

@rednoah91
Author

Thanks for the information.
I tried pip install https://download.pytorch.org/whl/nightly/cpu/torch-2.4.0.dev20240612%2Bcpu-cp310-cp310-linux_x86_64.whl on my Ubuntu machine and it works.

@rednoah91
Author

Note:
Our library currently only supports CUDA devices, hence the cu121 in the link above. CPU support is not stable due to Gloo's weak P2P support.

I ran pippy_gpt2.py and it passed, but pippy_bert.py hit the following error (it looks like something is stuck until the timeout):
Maybe this echoes what you said about Gloo's support being weak(?). Do you have plans to improve the CPU support?

Pipeline stage 1 21M params
Pipeline stage 3 21M params
Pipeline stage 2 21M params
Pipeline stage 0 45M params
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank0]:     run(args)
[rank0]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 72, in run
[rank0]:     schedule.step(**inputs)
[rank0]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank0]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank0]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 384, in _step_microbatches
[rank0]:     work.wait()
[rank0]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [127.0.1.1]:30124: Connection reset by peer
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank2]:     run(args)
[rank2]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 74, in run
[rank2]:     out = schedule.step()
[rank2]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank2]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank2]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 366, in _step_microbatches
[rank2]:     work.wait()
[rank2]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:30124
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank1]:     run(args)
[rank1]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 74, in run
[rank1]:     out = schedule.step()
[rank1]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank1]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank1]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 366, in _step_microbatches
[rank1]:     work.wait()
[rank1]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank3]:     run(args)
[rank3]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 74, in run
[rank3]:     out = schedule.step()
[rank3]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank3]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank3]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 366, in _step_microbatches
[rank3]:     work.wait()
[rank3]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:29850

@kwen2501
Contributor

Yeah, improving CPU support makes sense, though I believe there would be quite a bit of underlying work involved.
If you could share your use case, maybe we can prioritize it better.
Cc: @H-Huang @wconstab

@rednoah91
Author

Hi @kwen2501, sorry for the late reply. Our use case is running LLM inference across multiple CPU-based clusters. Could you tell me what is missing in Gloo? What about MPI support for CPU?

@kwen2501
Contributor

Got it, thanks.
PyTorch c10d has a Python-level API, batch_isend_irecv, which executes multiple sends and recvs concurrently. This API currently does not have a stable implementation with Gloo (it can hang). The pipelining library uses batch_isend_irecv as its underlying communication method, as in the sketch below.
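
For context, a minimal sketch of what a batch_isend_irecv exchange looks like (a simple ring exchange, not the pipelining library's actual schedule; launch it with torchrun --nproc-per-node N):

import torch
import torch.distributed as dist

# Each rank posts one send (to the next rank) and one recv (from the previous
# rank) together via batch_isend_irecv, then waits on the returned work handles.
dist.init_process_group(backend="gloo")
rank, world = dist.get_rank(), dist.get_world_size()

send_buf = torch.full((4,), float(rank))
recv_buf = torch.empty(4)

ops = [
    dist.P2POp(dist.isend, send_buf, (rank + 1) % world),
    dist.P2POp(dist.irecv, recv_buf, (rank - 1) % world),
]
for work in dist.batch_isend_irecv(ops):
    work.wait()

print(f"rank {rank} received from rank {(rank - 1) % world}: {recv_buf.tolist()}")
dist.destroy_process_group()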

@rednoah91
Author

rednoah91 commented Jul 31, 2024

@kwen2501 Can I open another issue on the PyTorch GitHub to track the CPU Gloo hang?

@wconstab
Contributor

If you can put together a repro, go ahead and open the issue.
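
For what it's worth, a standalone repro candidate might look something like this (hypothetical and untested; it mimics the forward stage-to-stage traffic of a pipeline schedule with plain batch_isend_irecv over Gloo, and may or may not trigger the hang; launch with torchrun --nproc-per-node 4):

import torch
import torch.distributed as dist

# Rank i repeatedly sends "activations" to rank i+1 and receives from rank i-1,
# batching the sends/recvs via batch_isend_irecv, which is what the pipelining
# schedules use under the hood.
dist.init_process_group(backend="gloo")
rank, world = dist.get_rank(), dist.get_world_size()

x = torch.randn(4, 1024)
for step in range(100):
    ops = []
    if rank < world - 1:
        ops.append(dist.P2POp(dist.isend, x, rank + 1))
    if rank > 0:
        recv = torch.empty(4, 1024)
        ops.append(dist.P2POp(dist.irecv, recv, rank - 1))
    if ops:
        for work in dist.batch_isend_irecv(ops):
            work.wait()
    if rank == 0:
        print(f"step {step} done")

dist.destroy_process_group()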
