
[🐛BUG] Data loading problem after distributed training #2063

Open
zw81929 opened this issue Jul 8, 2024 · 2 comments

zw81929 commented Jul 8, 2024

The error message is as follows:

Traceback (most recent call last):
  File "/data1/bert4rec/bert4rec-main/scripts/bole/loaddata_run_product.py", line 5, in <module>
    config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
                                                                ^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/quick_start/quick_start.py", line 259, in load_data_and_model
    train_data, valid_data, test_data = data_preparation(config, dataset)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/data/utils.py", line 174, in data_preparation
    train_data = get_dataloader(config, "train")(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/data/dataloader/general_dataloader.py", line 45, in __init__
    super().__init__(config, dataset, sampler, shuffle=shuffle)
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/data/dataloader/abstract_dataloader.py", line 130, in __init__
    super().__init__(config, dataset, sampler, shuffle=shuffle)
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/data/dataloader/abstract_dataloader.py", line 60, in __init__
    index_sampler = torch.utils.data.distributed.DistributedSampler(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/venv/lib/python3.11/site-packages/torch/utils/data/distributed.py", line 68, in __init__
    num_replicas = dist.get_world_size()
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1769, in get_world_size
    return _get_group_size(group)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 841, in _get_group_size
    default_pg = _get_default_group()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

This is where it fails.
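
For reference, a minimal workaround sketch (not part of the original report): initialize a one-process default group before calling load_data_and_model, so that DistributedSampler can query a world size. The gloo backend, the MASTER_ADDR/MASTER_PORT values, and the checkpoint path are assumptions for illustration only.

import os

import torch.distributed as dist
from recbole.quick_start import load_data_and_model

# Bring up a single-process default group so DistributedSampler can call
# dist.get_world_size() without raising "Default process group has not been
# initialized".
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file="saved/BERT4Rec.pth",  # hypothetical path to the saved checkpoint
)

dist.destroy_process_group()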

@zw81929 zw81929 added the bug Something isn't working label Jul 8, 2024

zw81929 commented Jul 8, 2024

The following part of abstract_dataloader.py may be the problem: self.sample_size is never initialized.

    def __init__(self, config, dataset, sampler, shuffle=False):
        self.shuffle = shuffle
        self.config = config
        self._dataset = dataset
        self._sampler = sampler
        self._batch_size = self.step = self.model = None
        self._init_batch_size_and_step()
        index_sampler = None
        self.generator = torch.Generator()
        self.generator.manual_seed(config["seed"])
        self.transform = construct_transform(config)
        self.is_sequential = config["MODEL_TYPE"] == ModelType.SEQUENTIAL
        
        if not config["single_spec"]:
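            # NOTE: self.sample_size is referenced below but is never assigned
            # anywhere in this __init__; it has to be set by a subclass first.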
            index_sampler = torch.utils.data.distributed.DistributedSampler(
                list(range(self.sample_size)), shuffle=shuffle, drop_last=False
            )
            self.step = max(1, self.step // config["world_size"])
            shuffle = False
        super().__init__(
            dataset=list(range(self.sample_size)),
            batch_size=self.step,
            collate_fn=self.collate_fn,
            num_workers=config["worker"],
            shuffle=shuffle,
            sampler=index_sampler,
            generator=self.generator,
        )

Fotiligner (Collaborator) commented:

This error occurs during multi-GPU distributed training. If you are running the program on a single GPU, you can comment out torch.distributed.barrier(); if you still get this error when running on multiple GPUs, try setting multiple gpu_id values in the config file.
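
A hedged sketch of that second suggestion, written as a config_dict rather than a YAML file; the gpu_id / world_size / nproc keys are assumptions based on RecBole's distributed-training settings and may differ across versions.

from recbole.quick_start import run_recbole

# Assumed multi-GPU settings; adjust to the devices actually available.
config_dict = {
    "gpu_id": "0,1",   # assumed: comma-separated ids of the GPUs to use
    "nproc": 2,        # assumed: one process per GPU
    "world_size": 2,   # assumed: total number of distributed processes
}

run_recbole(model="BERT4Rec", dataset="ml-100k", config_dict=config_dict)

Whether this call spawns the worker processes itself or has to be launched once per process depends on the RecBole version, so treat it only as an illustration of the config keys.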
