Training error: INTERNAL ASSERT FAILED #126
Comments
We need time to research and solve this problem. If you need to use DeepSpeed, please use a previous version such as fate_llm v2.1.0.
Thanks for your reply, but I still encountered the same issue after using fate_llm v2.1.0. 😅
The requirements.txt in v2.1.0 did not include torch or transformers. The error:
[ERROR][2024-11-14 11:31:02,832][855754][_wraps.run][line:92]: {'status': {'code': -1, 'exceptions': 'Traceback (most recent call last):
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 151, in execute_component_from_config
    component.execute(ctx, role, **execution_io.get_kwargs())
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/core/component_desc/_component.py", line 101, in execute
    return self.callback(ctx, role, **kwargs)
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/homo_nn.py", line 63, in train
    train_procedure(
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 159, in train_procedure
    runner.train(train_data_, validate_data_, output_dir, saved_model_path)
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 272, in train
    trainer.train()
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train
    return inner_training_loop(
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1691, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1220, in prepare
    result = self._prepare_deepspeed(*args)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1606, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
    self._broadcast_model()
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2156, in broadcast
    raise ex
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2152, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: fn INTERNAL ASSERT FAILED at "../torch/csrc/distributed/c10d/init.cpp":169, please report a bug to PyTorch. Not implemented.
'}, 'io_meta': None}
Is the Python version 3.8, and were the deployment packages downloaded from here?
I installed the cluster with AnsibleFATE_2.2.0_release_offline.tar.gz, so it should be from the link above. However, the Python version is not 3.8 but 3.10.13. The cluster installation came with its own Python environment, so I used that environment directly:
python 3.10.13 h955ad1f_0
As mentioned above, please reinstall v2.1.0, not v2.2.0, to avoid this problem.
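For reference, a minimal sketch of how to confirm which FATE-LLM version is actually active inside the deployed virtualenv; the distribution name fate_llm is an assumption, so adjust it if your install uses a different name:

    # Minimal sketch: check the installed FATE-LLM version inside the FATE virtualenv.
    # Assumption: the package is distributed as "fate_llm"; adjust the name if needed.
    from importlib.metadata import PackageNotFoundError, version

    try:
        print("fate_llm version:", version("fate_llm"))  # expect 2.1.0 after reinstalling
    except PackageNotFoundError:
        print("fate_llm is not installed in this environment")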
OK, I will try it again. Thanks.
You're welcome. We'll keep investigating the issue you mentioned above with DeepSpeed training in v2.2.0.
I still had the same errors. I used this release: AnsibleFATE_2.1.0_LLM_2.0.0_release_offline.tar.gz. The versions of Python and the other packages are as follows:
Perhaps it is because the GPU I am using is an RTX 4090; this series runs into errors during execution and requires setting environment variables to proceed, as sketched below.
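The comment does not name the variables, so the following is only a guess based on the NCCL limitations commonly reported for RTX 40-series cards (no GPU peer-to-peer); it disables the NCCL P2P and InfiniBand transports before distributed initialization.

    # Hypothetical workaround, assuming the unnamed environment variables are the
    # NCCL transport switches that RTX 40-series cards commonly need. Set them
    # before torch.distributed / deepspeed.initialize() runs, e.g. at the top of
    # the training entrypoint.
    import os

    os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # consumer 40-series cards lack GPU P2P
    os.environ.setdefault("NCCL_IB_DISABLE", "1")   # skip InfiniBand if the host has none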
After replacing the graphics card with an H800, training can continue to run.
When training the LLM model according to the example, the following error occurred.
Qwen1.5-0.5B-Chat and chatglm3-6b produced the same error.
Please help me check where the problem is.
Thanks!
The system configuration and environment are as follows:
The error:
[ERROR][2024-11-11 14:16:22,115][403433][_wraps.run][line:92]: {'status': {'code': -1, 'exceptions': 'Traceback (most recent call last):
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 151, in execute_component_from_config
    component.execute(ctx, role, **execution_io.get_kwargs())
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/core/component_desc/_component.py", line 101, in execute
    return self.callback(ctx, role, **kwargs)
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/homo_nn.py", line 63, in train
    train_procedure(
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 159, in train_procedure
    runner.train(train_data_, validate_data_, output_dir, saved_model_path)
  File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 272, in train
    trainer.train()
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train
    return inner_training_loop(
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1691, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1220, in prepare
    result = self._prepare_deepspeed(*args)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1606, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
    self._broadcast_model()
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2156, in broadcast
    raise ex
  File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2152, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: fn INTERNAL ASSERT FAILED at "../torch/csrc/distributed/c10d/init.cpp":169, please report a bug to PyTorch. Not implemented.
'}, 'io_meta': None}
The code: