configuration error #8

Open
WalkerRusher opened this issue Dec 25, 2024 · 2 comments

Comments

@WalkerRusher

I tried to run the following command:
accelerate launch train_tokenizer.py \
    --exp_name bair_tokenizer_ft --output_dir log_vqgan --seed 0 --mixed_precision bf16 \
    --model_type ctx_vqgan \
    --train_batch_size 16 --gradient_accumulation_steps 1 --disc_start 1000005 \
    --oxe_data_mixes_type bair --resolution 64 --dataloader_num_workers 16 \
    --rand_select --video_stepsize 1 --segment_horizon 16 --segment_length 8 --context_length 1 \
    --pretrained_model_name_or_path pretrained_models/ivideogpt-oxe-64-act-free/tokenizer

However, an error occurred:

  File "/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 769, in multi_gpu_launcher
    import torch.distributed.run as distrib_run
  File "/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 383, in <module>
    from torch.distributed.elastic.multiprocessing import Std
  File "/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/__init__.py", line 68, in <module>
    from torch.distributed.elastic.multiprocessing.api import ( # noqa: F401
  File "/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 26, in <module>
    from torch.distributed.elastic.multiprocessing.redirects import (
  File "/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/redirects.py", line 35, in <module>
    libc = get_libc()
  File "/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/redirects.py", line 32, in get_libc
    return ctypes.CDLL("libc.so.6")
  File "/usr/local/conda/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/conda/lib/python3.9/site-packages/amp_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE

Before running the script, I installed all the requirements as described in the repo (pip install -r requirements.txt).

I don't know why this happened. Could you please tell me your exact Python version (3.9.x?)? Any other suggestions would also be deeply appreciated.
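
A note for anyone reading along: the last line of the traceback names amp_C.cpython-39-x86_64-linux-gnu.so as the library with the missing symbol, not libc.so.6, and the symbol's prefix (_ZN2at4_ops...) places it in the at::_ops namespace. That pattern usually points to a compiled extension built against a different PyTorch build than the one currently installed. The sketch below only pulls the library path and symbol out of such a message so the offending module is easy to spot; the helper names are hypothetical, and the guess that amp_C comes from NVIDIA Apex is an assumption, not something confirmed in this thread.

```python
import os
import re

# Matches the dynamic loader's message, e.g.
#   OSError: /path/to/ext.so: undefined symbol: _ZN...
_UNDEFINED_SYMBOL = re.compile(
    r"(?P<library>\S+\.so[\w.]*): undefined symbol: (?P<symbol>\S+)"
)


def parse_undefined_symbol(message):
    """Return (library_path, symbol) from a loader error message, or None."""
    match = _UNDEFINED_SYMBOL.search(message)
    if match is None:
        return None
    return match.group("library"), match.group("symbol")


def extension_module_name(library_path):
    """Return the importable module name of a compiled extension file."""
    # "amp_C.cpython-39-x86_64-linux-gnu.so" -> "amp_C"
    return os.path.basename(library_path).split(".")[0]


if __name__ == "__main__":
    # Shortened copy of the error above (symbol truncated for readability).
    error = (
        "OSError: /usr/local/conda/lib/python3.9/site-packages/"
        "amp_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: "
        "_ZN2at4_ops5zeros4callE"
    )
    library, symbol = parse_undefined_symbol(error)
    print("extension:", extension_module_name(library))
    print("symbol:   ", symbol)
```

If the extension it reports is not something the training script needs, rebuilding it against the installed torch 2.2.1+cu121 or removing it would be the first thing to try.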

@WalkerRusher
Author

My environment:
python - 3.9.13
torch - 2.2.1+cu121
nvidia driver version: 470.103.01
cuda version: 12.2
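
Since the failing library is a separate compiled extension, listing its version next to torch may make the comparison more useful. The following is a minimal sketch using only the standard library; the default package names ("torch", "accelerate", "apex") are assumptions about what is relevant here, and `collect_env` is a hypothetical helper, not part of this repo. PyTorch also ships its own, more complete report via `python -m torch.utils.collect_env`.

```python
import platform
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def collect_env(packages=("torch", "accelerate", "apex")):
    """Collect the Python version and the versions of the given packages."""
    info = {"python": platform.python_version()}
    for name in packages:
        info[name] = installed_version(name)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value or 'not installed'}")
```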

@Manchery
Collaborator

> My environment:
> python - 3.9.13
> torch - 2.2.1+cu121
> nvidia driver version: 470.103.01
> cuda version: 12.2

Everything is the same as yours, except for the driver version: 535.104.05.
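
To make the comparison explicit, here is a small sketch that diffs the two environments as reported in this thread. The dictionary keys and the `env_diff` helper are illustrative only. The four values listed do not cover other compiled extensions in site-packages, so a difference there would not show up in this comparison.

```python
def env_diff(mine, theirs):
    """Return {key: (mine, theirs)} for every key whose values differ."""
    keys = sorted(set(mine) | set(theirs))
    return {
        key: (mine.get(key), theirs.get(key))
        for key in keys
        if mine.get(key) != theirs.get(key)
    }


# Values copied from the two comments above.
reporter = {
    "python": "3.9.13",
    "torch": "2.2.1+cu121",
    "nvidia driver": "470.103.01",
    "cuda": "12.2",
}
maintainer = dict(reporter, **{"nvidia driver": "535.104.05"})

if __name__ == "__main__":
    print(env_diff(reporter, maintainer))
    # {'nvidia driver': ('470.103.01', '535.104.05')}
```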
