
Error when fine-tuning KoAlpaca polyglot 12.8b #107

Open
puritysarah opened this issue Nov 6, 2023 · 2 comments

@puritysarah

Hello,

While fine-tuning the 12.8b model on 8x A100 40GB GPUs with the code at https://github.com/Beomi/KoAlpaca/blob/main/train_v1.1b/run_clm.py, I get the following error. (I used the training script at https://github.com/Beomi/KoAlpaca/blob/main/train_v1.1b/train.sh.)

Traceback (most recent call last):
  File "run_clm_2.py", line 636, in <module>
    main()
  File "run_clm_2.py", line 412, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2172, in from_pretrained
    raise ValueError("Passing along a device_map requires low_cpu_mem_usage=True")
ValueError: Passing along a device_map requires low_cpu_mem_usage=True

So I added the low_cpu_mem_usage=True option when loading the model, and then I get the error below.

Traceback (most recent call last):
  File "run_clm_2.py", line 636, in <module>
    main()
  File "run_clm_2.py", line 412, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2180, in from_pretrained
    raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

I ran the code exactly as shared on GitHub, changing only the number of GPUs, and it still fails. Could you help me with this part?
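
Both errors come from the same from_pretrained call in run_clm.py: transformers rejects a device_map unless low_cpu_mem_usage=True, and under DeepSpeed ZeRO-3 it rejects both, because ZeRO-3 shards the parameters itself at load time. A minimal sketch of that gating logic, assuming the usual HF Trainer + DeepSpeed setup (the checkpoint name and flags below are illustrative, not the repo's actual settings):

# Sketch only: pass device_map / low_cpu_mem_usage only when DeepSpeed ZeRO-3
# is not active, since ZeRO-3 handles parameter sharding on its own.
from transformers import AutoModelForCausalLM
from transformers.deepspeed import is_deepspeed_zero3_enabled  # newer releases: transformers.integrations.deepspeed

model_kwargs = {}
if not is_deepspeed_zero3_enabled():
    # Outside ZeRO-3, device_map="auto" is allowed but requires low_cpu_mem_usage=True.
    model_kwargs.update(device_map="auto", low_cpu_mem_usage=True)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/polyglot-ko-12.8b",  # illustrative base checkpoint
    **model_kwargs,
)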

@Beomi
Owner

Beomi commented Nov 6, 2023

Could you try updating both packages to their latest versions with

pip install -U transformers accelerate

and then running it again to check whether you get the same error?
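
In case the upgrade lands in a different environment than the one torchrun uses, a quick sanity check is to print the versions from that same interpreter (plain Python, no extra assumptions):

# Print the versions seen by the interpreter that actually runs run_clm.py.
import accelerate
import transformers

print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)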

@puritysarah
Author

puritysarah commented Nov 6, 2023

Thank you for the quick reply.

I updated both packages and ran it again, but I still get the error... I also tried on other servers (with 16, 8, and 4 GPUs) and get the same error.

Traceback (most recent call last):
  File "/workspace/train_v1.1b/run_clm.py", line 637, in <module>
    main()
  File "/workspace/train_v1.1b/run_clm.py", line 413, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2662, in from_pretrained
    raise ValueError("Passing along a device_map requires low_cpu_mem_usage=True")
ValueError: Passing along a device_map requires low_cpu_mem_usage=True
[2023-11-06 14:22:09,790] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191726 closing signal SIGTERM
[2023-11-06 14:22:09,790] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191727 closing signal SIGTERM
[2023-11-06 14:22:09,791] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191728 closing signal SIGTERM
[2023-11-06 14:22:10,456] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 191725) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_clm.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-11-06_14:22:09
host : gpu-a100x8-1.us-central1-c.c.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 191725)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
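
The error_file: <N/A> entry and the link in the last line refer to PyTorch's elastic error propagation: wrapping the script's entrypoint with the record decorator makes torchrun capture each rank's traceback instead of only reporting the exit code. A minimal sketch (record is the documented torch API; applying it to run_clm.py's main is only an illustration):

# Sketch: capture per-rank tracebacks when the script is launched with torchrun.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training entrypoint from run_clm.py

if __name__ == "__main__":
    main()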
