
Training error: INTERNAL ASSERT FAILED #126

Closed
hejxiang opened this issue Nov 12, 2024 · 9 comments

Comments

@hejxiang

hejxiang commented Nov 12, 2024

When training the LLM model according to the example, the following error occurred; Qwen1.5-0.5B-Chat and chatglm3-6b hit the same error.

Please help me check where the problem is.

Thanks !!!

The system configuration and environment are as follows:

FATE-LLM v2.2.0 cluster, 3 machines. The toy example can be run successfully across multiple devices.

accelerate                               0.27.2
deepspeed                                0.13.3
peft                                     0.8.2
torch                                    2.3.1
transformers                             4.37.2
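
For completeness, a small snippet to capture these versions from inside the FATE venv (my addition for reproducibility; the package names are the ones listed above):

from importlib.metadata import version, PackageNotFoundError

# Print the versions of the packages involved in the failing training run.
for pkg in ["accelerate", "deepspeed", "peft", "torch", "transformers"]:
    try:
        print(f"{pkg:<14} {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg:<14} not installed")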

The error:

[ERROR][2024-11-11 14:16:22,115][403433][_wraps.run][line:92]: {'status': {'code': -1, 'exceptions': 'Traceback (most recent call last):\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 151, in execute_component_from_config\n component.execute(ctx, role, **execution_io.get_kwargs())\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/core/component_desc/_component.py", line 101, in execute\n return self.callback(ctx, role, **kwargs)\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/homo_nn.py", line 63, in train\n train_procedure(\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 159, in train_procedure\n runner.train(train_data_, validate_data_, output_dir, saved_model_path)\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 272, in train\n trainer.train()\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train\n return inner_training_loop(\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1691, in _inner_training_loop\n model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1220, in prepare\n result = self._prepare_deepspeed(*args)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1606, in _prepare_deepspeed\n engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize\n engine = DeepSpeedEngine(args=args,\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__\n self._configure_distributed_model(model)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model\n self._broadcast_model()\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model\n dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper\n return func(*args, **kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast\n return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn\n return fn(*args, **kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 205, in broadcast\n return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper\n return func(*args, **kwargs)\n File 
"/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2156, in broadcast\n raise ex\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2152, in broadcast\n work = group.broadcast([tensor], opts)\nRuntimeError: fn INTERNAL ASSERT FAILED at "../torch/csrc/distributed/c10d/init.cpp":169, please report a bug to PyTorch. Not implemented.\n'}, 'io_meta': None}

The code:

import time
from fate_client.pipeline.components.fate.reader import Reader
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_config_of_seq2seq_runner
from fate_client.pipeline.components.fate.nn.algo_params import Seq2SeqTrainingArguments, FedAVGArguments
from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader
from peft import LoraConfig, TaskType


guest = '10000'
host = '10000'
arbiter = '10000'

epochs = 1
batch_size = 1
lr = 5e-4

ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": lr,
            "torch_adam": True,
            "adam_w_mode": False
        }
    },
    "fp16": {
        "enabled": True
    },
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "allgather_bucket_size": 1e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": True,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    }
}

pipeline = FateFlowPipeline().set_parties(guest=guest, host=host, arbiter=arbiter)
pipeline.bind_local_path(path="/ws/data/test/fate/FATE-LLM/examples/data/AdvertiseGen/train.json", namespace="experiment", name="ad")
time.sleep(5)



reader_0 = Reader("reader_0", runtime_parties=dict(guest=guest, host=host))
reader_0.guest.task_parameters(
    namespace="experiment",
    name="ad"
)
reader_0.hosts[0].task_parameters(
    namespace="experiment",
    name="ad"
)

# define lora config
# lora_config = LoraConfig(
#     task_type=TaskType.CAUSAL_LM,
#     inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1,
#     target_modules=['query_key_value'],
# )

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1,
    target_modules=['q_proj'],
)

lora_config.target_modules = list(lora_config.target_modules)

# pretrained_model_path = "/ws/data/test/models/chatglm3-6b"

# model = LLMModelLoader(
#     "pellm.chatglm",
#     "ChatGLM",
#     pretrained_path=pretrained_model_path,
#     peft_type="LoraConfig",
#     peft_config=lora_config.to_dict(),
#     trust_remote_code=True
# )

pretrained_model_path = "/ws/data/test/models/Qwen1.5-0.5B-Chat"

model = LLMModelLoader(
    "pellm.qwen",
    "Qwen",
    pretrained_path=pretrained_model_path,
    peft_type="LoraConfig",
    peft_config=lora_config.to_dict(),
    trust_remote_code=True
)


tokenizer_params = dict(
    tokenizer_name_or_path=pretrained_model_path,
    trust_remote_code=True,
)

dataset = LLMDatasetLoader(
    "prompt_dataset",
    "PromptDataset",
    **tokenizer_params,
)

data_collator = LLMDataFuncLoader(
    "data_collator.cust_data_collator",
    "get_seq2seq_data_collator",
    **tokenizer_params,
)

conf = get_config_of_seq2seq_runner(
    algo='fedavg',
    model=model,
    dataset=dataset,
    data_collator=data_collator,
    training_args=Seq2SeqTrainingArguments(
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        remove_unused_columns=False, 
        predict_with_generate=False,
        deepspeed=ds_config,
        learning_rate=lr,
        use_cpu=False, # this must be set as we will use GPU
        fp16=True,
    ),
    fed_args=FedAVGArguments(),
    task_type='causal_lm',
    save_trainable_weights_only=True # only save trainable weights
)

homo_nn_0 = HomoNN(
    'nn_0',
    runner_conf=conf,
    train_data=reader_0.outputs["output_data"],
    runner_module="homo_seq2seq_runner",
    runner_class="Seq2SeqRunner",
)

homo_nn_0.guest.conf.set("launcher_name", "deepspeed") # tell schedule engine to run task with deepspeed
homo_nn_0.hosts[0].conf.set("launcher_name", "deepspeed") # tell schedule engine to run task with deepspeed

pipeline.add_tasks([reader_0, homo_nn_0])
pipeline.conf.set("task", dict(engine_run={"cores": 1})) # the number of gpus of each party

pipeline.compile()
pipeline.fit()
@mgqa34
Contributor

mgqa34 commented Nov 13, 2024

We need time to research and solve this problem. If you need to use deepspeed, please use the previous version, like fate_llm v2.1.0

@hejxiang
Author

hejxiang commented Nov 14, 2024

Thanks for your reply, but I still encountered the same issue after using the fate_llm version 2.1.0. 😅

git checkout tags/v2.1.0
source /ws/data/test/fate/projects/fate/bin/init_env.sh
pip install -r requirements.txt
pip install -e .  
# also try install with --use-pep517

The requirements.txt in v2.1.0 didn't have torch or transformers

The error:

[ERROR][2024-11-14 11:31:02,832][855754][wraps.run][line:92]: {'status': {'code': -1, 'exceptions': 'Traceback (most recent call last):\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 151, in execute_component_from_config\n component.execute(ctx, role, **execution_io.get_kwargs())\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/core/component_desc/component.py", line 101, in execute\n return self.callback(ctx, role, **kwargs)\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/homo_nn.py", line 63, in train\n train_procedure(\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 159, in train_procedure\n runner.train(train_data, validate_data, output_dir, saved_model_path)\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 272, in train\n trainer.train()\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train\n return inner_training_loop(\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1691, in _inner_training_loop\n model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1220, in prepare\n result = self._prepare_deepspeed(*args)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1606, in _prepare_deepspeed\n engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/init.py", line 176, in initialize\n engine = DeepSpeedEngine(args=args,\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in init\n self._configure_distributed_model(model)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model\n self._broadcast_model()\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model\n dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper\n return func(*args, **kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast\n return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn\n return fn(*args, **kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 205, in broadcast\n return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper\n return func(*args, **kwargs)\n File 
"/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2156, in broadcast\n raise ex\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2152, in broadcast\n work = group.broadcast([tensor], opts)\nRuntimeError: fn INTERNAL ASSERT FAILED at "../torch/csrc/distributed/c10d/init.cpp":169, please report a bug to PyTorch. Not implemented.\n'}, 'io_meta': None}


@mgqa34
Contributor

mgqa34 commented Nov 19, 2024

Thanks for your reply, but I still encountered the same issue after using the fate_llm version 2.1.0. 😅

git checkout tags/v2.1.0
source /ws/data/test/fate/projects/fate/bin/init_env.sh
pip install -r requirements.txt
pip install -e .  
# also try install with --use-pep517

The requirements.txt in v2.1.0 didn't have torch or transformers


Is the Python version 3.8, and were the deployment packages downloaded from here?

@hejxiang
Author

I installed the cluster with AnsibleFATE_2.2.0_release_offline.tar.gz, so it should be from the link above.

But the Python version is not 3.8, it is 3.10.13. After installing the cluster, it came with a Python env, so I used that env directly:

source /ws/data/test/fate/projects/fate/bin/init_env.sh
conda list python

python 3.10.13 h955ad1f_0

@mgqa34
Contributor

mgqa34 commented Nov 20, 2024

I installed the cluster with AnsibleFATE_2.2.0_release_offline.tar.gz, so it should be from the link above.

But the Python version is not 3.8, it is 3.10.13. After installing the cluster, it came with a Python env, so I used that env directly:

source /ws/data/test/fate/projects/fate/bin/init_env.sh
conda list python

python 3.10.13 h955ad1f_0

As mentioned above, please reinstall v2.1.0, not v2.2.0, to avoid this problem.

@hejxiang
Author

I installed the cluster with AnsibleFATE_2.2.0_release_offline.tar.gz, so it should be from the link above.
But the Python version is not 3.8, it is 3.10.13. After installing the cluster, it came with a Python env, so I used that env directly:

source /ws/data/test/fate/projects/fate/bin/init_env.sh
conda list python

python 3.10.13 h955ad1f_0

As mentioned above, please reinstall v2.1.0, not v2.2.0, to avoid this problem.

OK, I will try it again. Thanks.

@mgqa34
Contributor

mgqa34 commented Nov 20, 2024

I installed the cluster with AnsibleFATE_2.2.0_release_offline.tar.gz, so it should be from the link above.
But the Python version is not 3.8, it is 3.10.13. After installing the cluster, it came with a Python env, so I used that env directly:

source /ws/data/test/fate/projects/fate/bin/init_env.sh
conda list python

python 3.10.13 h955ad1f_0

As mentioned above, please reinstall v2.1.0, not v2.2.0, to avoid this problem.

OK, I will try it again. Thanks.

You're welcome, we'll keep investigating the issue you've mentioned above with training using deepspeed in v2.2.0.

@hejxiang
Author

I still had the same errors.

I used this release: AnsibleFATE_2.1.0_LLM_2.0.0_release_offline.tar.gz

The versions of Python and the other packages are as follows:

python: 3.8.13
deepspeed: 0.13.3
torch: 2.3.1+cu121

Perhaps it is because the GPU I am using is an RTX 4090; this series encounters errors during execution and requires setting environment variables to proceed:

os.environ["NCCL_P2P_DISABLE"]="1"

os.environ["NCCL_IB_DISABLE"]="1"
/accelerate/state.py", line 226, in init
raise NotImplementedError(
NotImplementedError: Using RTX 4000 series doesn’t support faster communication broadband via P2P or IB. 
Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" 
or use accelerate launch` which will do this automatically.

During handling of the above exception, another exception
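
A minimal sketch of where these variables would go (my addition, assuming the entry script is where training is launched; NCCL and accelerate only honor them if they are set before the process group / Accelerator is created):

import os

# Hedged sketch, not from the original report: disable P2P/IB before any
# torch.distributed / accelerate initialization, since NCCL reads these
# environment variables when its communicators are created.
os.environ["NCCL_P2P_DISABLE"] = "1"   # work around the RTX 4000-series limitation
os.environ["NCCL_IB_DISABLE"] = "1"    # disable the InfiniBand transport

# ... only afterwards import/launch the training stack (torch, accelerate, deepspeed).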

@hejxiang
Author

After replacing the graphics card with an H800, it can continue to run.
