
Custom dataset evaluate OOM #15

Open
GondorFu opened this issue Feb 15, 2023 · 7 comments

@GondorFu

When I use this model on a custom dataset, the training phase runs normally, but the evaluation phase always runs out of GPU memory.

What's the possible reason for this?

@Gofinge
Member

Gofinge commented Feb 15, 2023

Hi, could you show me your custom dataset config (the whole exported config in the exp folder would be better) and the OOM output?

@GondorFu
Author

> Hi, could you show me your custom dataset config (the whole exported config in the exp folder would be better) and the OOM output?

Here are the config and error log:

Config:
weight = None
resume = False
evaluate = True
test_only = False
seed = 318105
save_path =
num_worker = 16
batch_size = 8
batch_size_val = None
batch_size_test = 1
epoch = 100
eval_epoch = 100
save_freq = None
eval_metric = 'mIoU'
sync_bn = False
enable_amp = True
empty_cache = False
find_unused_parameters = False
max_batch_points = 100000000.0
mix_prob = 0.8
param_dicts = None
test = dict(type='SegmentationTest')
model = dict(
type='ptv2m2',
in_channels=6,
num_classes=
patch_embed_depth=2,
patch_embed_channels=48,
patch_embed_groups=6,
patch_embed_neighbours=16,
enc_depths=(2, 6, 2),
enc_channels=(96, 192, 384),
enc_groups=(12, 24, 48),
enc_neighbours=(16, 16, 16),
dec_depths=(1, 1, 1),
dec_channels=(48, 96, 192),
dec_groups=(6, 12, 24),
dec_neighbours=(16, 16, 16),
grid_sizes=(0.1, 0.2, 0.4),
attn_qkv_bias=True,
pe_multiplier=False,
pe_bias=True,
attn_drop_rate=0.0,
drop_path_rate=0.3,
enable_checkpoint=False,
unpool_backend='interp')
optimizer = dict(type='AdamW', lr=0.006, weight_decay=0.05)
scheduler = dict(type='MultiStepLR', milestones=[0.6, 0.8], gamma=0.1)
dataset_type = 'AutoScenesDataset'
data_root = '
data = dict(
num_classes=
ignore_label=255,
names=['
train=dict(
type=
split='train',
data_root='
transform=[
dict(type='CenterShift', apply_z=True),
dict(type='RandomScale', scale=[0.9, 1.1]),
dict(type='RandomFlip', p=0.5),
dict(type='RandomJitter', sigma=0.005, clip=0.02),
dict(type='ChromaticAutoContrast', p=0.2, blend_factor=None),
dict(type='ChromaticTranslation', p=0.95, ratio=0.05),
dict(type='ChromaticJitter', p=0.95, std=0.05),
dict(
type='Voxelize',
voxel_size=0.04,
hash_type='fnv',
mode='train',
keys=('coord', 'color', 'label'),
return_discrete_coord=True),
dict(type='SphereCrop', point_max=100000, mode='random'),
dict(type='CenterShift', apply_z=False),
dict(type='NormalizeColor'),
dict(type='ToTensor'),
dict(
type='Collect',
keys=('coord', 'discrete_coord', 'label'),
feat_keys=['coord', 'color'])
],
test_mode=False,
loop=1),
val=dict(
type='
split='val',
data_root='
transform=[
dict(type='CenterShift', apply_z=True),
dict(
type='Copy',
keys_dict=dict(coord='origin_coord', label='origin_label')),
dict(
type='Voxelize',
voxel_size=0.04,
hash_type='fnv',
mode='train',
keys=('coord', 'color', 'label'),
return_discrete_coord=True),
dict(type='CenterShift', apply_z=False),
dict(type='NormalizeColor'),
dict(type='ToTensor'),
dict(
type='Collect',
keys=('coord', 'discrete_coord', 'label'),
offset_keys_dict=dict(offset='coord'),
feat_keys=['coord', 'color'])
],
test_mode=False),
test=dict(
type='
split='test',
data_root='
transform=[
dict(type='CenterShift', apply_z=True),
dict(type='NormalizeColor')
],
test_mode=True,
test_cfg=dict(
voxelize=dict(
type='Voxelize',
voxel_size=0.04,
hash_type='fnv',
mode='test',
keys=('coord', 'color'),
return_discrete_coord=True),
crop=None,
post_transform=[
dict(type='CenterShift', apply_z=False),
dict(type='ToTensor'),
dict(
type='Collect',
keys=('coord', 'discrete_coord', 'index'),
feat_keys=('coord', 'color'))
],
aug_transform=[[{
'type': 'RandomScale',
'scale': [0.9, 0.9]
}], [{
'type': 'RandomScale',
'scale': [0.95, 0.95]
}], [{
'type': 'RandomScale',
'scale': [1, 1]
}], [{
'type': 'RandomScale',
'scale': [1.05, 1.05]
}], [{
'type': 'RandomScale',
'scale': [1.1, 1.1]
}],
[{
'type': 'RandomScale',
'scale': [0.9, 0.9]
}, {
'type': 'RandomFlip',
'p': 1
}],
[{
'type': 'RandomScale',
'scale': [0.95, 0.95]
}, {
'type': 'RandomFlip',
'p': 1
}],
[{
'type': 'RandomScale',
'scale': [1, 1]
}, {
'type': 'RandomFlip',
'p': 1
}],
[{
'type': 'RandomScale',
'scale': [1.05, 1.05]
}, {
'type': 'RandomFlip',
'p': 1
}],
[{
'type': 'RandomScale',
'scale': [1.1, 1.1]
}, {
'type': 'RandomFlip',
'p': 1
}]])))
criteria = [dict(type='CrossEntropyLoss', loss_weight=1.0, ignore_index=255)]
num_worker_per_gpu = 2
batch_size_per_gpu = 1
batch_size_val_per_gpu = 1

Start Evaluation >>>>>>>>>>>>>>>>
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
len(cache))
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
len(cache))
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
File "tools/train.py", line 34, in
main()
File "tools/train.py", line 29, in main
cfg=(cfg,),
File "PointTransformerV2/pcr/engines/launch.py", line 84, in launch
daemon=False,
File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "PointTransformerV2/pcr/engines/launch.py", line 183, in _distributed_worker
main_func(*cfg)
File "PointTransformerV2/tools/train.py", line 16, in main_worker
trainer.train()
File "PointTransformerV2/pcr/engines/defaults.py", line 216, in train
self.after_epoch()
File "PointTransformerV2/pcr/engines/defaults.py", line 321, in after_epoch
self.eval()
File "PointTransformerV2/pcr/engines/defaults.py", line 334, in eval
output = self.model(input_dict)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 517, in forward
points = self.patch_embed(points)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 411, in forward
return self.blocks([coord, feat, offset])
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 209, in forward
points = block(points, reference_index)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 157, in forward
if not self.enable_checkpoint else checkpoint(self.attn, feat, coord, reference_index)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 108, in forward
peb = self.linear_p_bias(pos)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 37, in forward
return self.norm(input.transpose(1, 2).contiguous()).transpose(1, 2).contiguous()
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "python3.7/site-packages/torch/nn/modules/batchnorm.py", line 179, in forward
self.eps,
File "python3.7/site-packages/torch/nn/functional.py", line 2283, in batch_norm
input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 5.38 GiB (GPU 3; 31.75 GiB total capacity; 28.67 GiB already allocated; 1.47 GiB free; 28.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@Gofinge
Member

Gofinge commented Feb 15, 2023

The config looks good. Is there any chance that the validation point clouds are significantly larger than the training point clouds? What about data_dict["coord"].shape? (It would be helpful if you could log it before the OOM.)
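
A minimal sketch of that logging, assuming the names from the traceback (input_dict and the eval loop in pcr/engines/defaults.py); the exact placement and variable names may differ in your checkout:

import torch

# Hypothetical snippet: place it just before `output = self.model(input_dict)`
# in the eval loop of pcr/engines/defaults.py (see the traceback above).
coord = input_dict["coord"]
print(
    f"[eval] coord shape: {tuple(coord.shape)}, "
    f"allocated: {torch.cuda.memory_allocated() / 1024 ** 3:.2f} GiB, "
    f"reserved: {torch.cuda.memory_reserved() / 1024 ** 3:.2f} GiB"
)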

@GondorFu
Author

I randomly split the train and val sets, so there shouldn't be much difference between them.
You can see that before the OOM, the code tries to allocate 5.38 GiB. In my custom dataset each point cloud is only about 100 MB, and there is just one per GPU. What do you think is the reason it needs to allocate such a large amount of memory?

@Gofinge
Member

Gofinge commented Feb 15, 2023

I am sorry about that issue; I have never encountered a similar problem. I notice that the validation batch size per GPU is identical to the training batch size per GPU (both 1), and the memory consumption of the evaluation process should be much lower than that of the training process.

For debugging this issue, my suggestion is to print out the input shape before feeding it into the model.
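
One related observation from the config above (not part of the reply): the training pipeline caps every scene at 100,000 points with SphereCrop, while the validation pipeline voxelizes but never crops, so a single validation scene can contain far more points than any training sample. As a debugging-only experiment (it drops points and therefore changes the reported metric), the same crop could be added to the val transform to confirm that point count is what triggers the OOM; a sketch, with all keys not shown kept exactly as in the config above:

val=dict(
    # type, split, data_root, test_mode: unchanged from the config above
    transform=[
        dict(type='CenterShift', apply_z=True),
        dict(
            type='Copy',
            keys_dict=dict(coord='origin_coord', label='origin_label')),
        dict(
            type='Voxelize',
            voxel_size=0.04,
            hash_type='fnv',
            mode='train',
            keys=('coord', 'color', 'label'),
            return_discrete_coord=True),
        # Hypothetical, debugging only: cap the point count the same way the
        # train pipeline does, to check whether scene size drives the OOM.
        dict(type='SphereCrop', point_max=100000, mode='random'),
        dict(type='CenterShift', apply_z=False),
        dict(type='NormalizeColor'),
        dict(type='ToTensor'),
        dict(
            type='Collect',
            keys=('coord', 'discrete_coord', 'label'),
            offset_keys_dict=dict(offset='coord'),
            feat_keys=['coord', 'color'])
    ])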

@GondorFu
Author

> I am sorry about that issue; I have never encountered a similar problem. I notice that the validation batch size per GPU is identical to the training batch size per GPU (both 1), and the memory consumption of the evaluation process should be much lower than that of the training process.
>
> For debugging this issue, my suggestion is to print out the input shape before feeding it into the model.

These are the eval sizes:

Start Evaluation >>>>>>>>>>>>>>>>
val size >>>>>>>>>>>>>>>: 655712
val size >>>>>>>>>>>>>>>: 872887
val size >>>>>>>>>>>>>>>: 871667
val size >>>>>>>>>>>>>>>: 1273970
val size >>>>>>>>>>>>>>>: 1541887
val size >>>>>>>>>>>>>>>: 1918415
val size >>>>>>>>>>>>>>>: 1695826
val size >>>>>>>>>>>>>>>: 2831842
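
For rough scale (an estimate, not from the thread): the allocation that fails in the traceback is in the positional-bias branch (linear_p_bias) of the patch-embedding attention. If that activation has shape (N, K, C), with K = patch_embed_neighbours = 16 and C = patch_embed_channels = 48 from the config, it grows linearly with the point count N. For the largest scene logged above:

# Back-of-envelope estimate; assumes the failing tensor is the (N, K, C)
# positional-bias activation in the patch-embedding attention block.
N = 2_831_842   # largest validation scene logged above (points after voxelization)
K = 16          # patch_embed_neighbours in the config
C = 48          # patch_embed_channels in the config
print(f"fp32: {N * K * C * 4 / 1024 ** 3:.1f} GiB, fp16: {N * K * C * 2 / 1024 ** 3:.1f} GiB")
# ~8.1 GiB in fp32, ~4.1 GiB in fp16 -- the same order of magnitude as the
# 5.38 GiB allocation in the error, so a single multi-million-point scene can
# plausibly push a 32 GiB GPU over the edge on its own.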

@Gofinge
Member

Gofinge commented Feb 17, 2023

Hi, those are quite huge numbers for point clouds after voxelization. Maybe you can further check whether the validation point clouds are actually being voxelized successfully.
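
A minimal sketch of such a check, assuming the raw scene can be loaded as an (N, 3) NumPy array of xyz coordinates (the path and loading call are placeholders; the 0.04 voxel size comes from the config):

import numpy as np

# Hypothetical check: how many occupied 0.04-unit voxels does one val scene have?
coord = np.load("path/to/one_val_scene_coord.npy")   # placeholder path
voxel_size = 0.04                                     # from the config
grid = np.floor((coord - coord.min(0)) / voxel_size).astype(np.int64)
occupied = np.unique(grid, axis=0).shape[0]
print(f"raw points: {coord.shape[0]}, occupied voxels: {occupied}")
# If `occupied` is still in the millions, the scenes really are that large after
# voxelization (or the coordinates are not in the units the 0.04 grid assumes),
# and the val sizes logged above are expected.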

Gofinge changed the title from "evaluate OOM" to "Custom dataset evaluate OOM" on Feb 20, 2023