
Custom dataset evaluate OOM #15

Open
GondorFu opened this issue Feb 15, 2023 · 7 comments

@GondorFu

When I use this model on a custom dataset, the training phase runs normally, but the evaluation phase always runs out of GPU memory.

What's the possible reason for this?

@Gofinge
Member

Gofinge commented Feb 15, 2023

Hi, could you show me your custom dataset config (the whole exported config in the exp folder would be better) and the OOM output?

@GondorFu
Author

> Hi, could you show me your custom dataset config (the whole exported config in the exp folder would be better) and the OOM output?

Here are the config and error log:

Config:
weight = None
resume = False
evaluate = True
test_only = False
seed = 318105
save_path =
num_worker = 16
batch_size = 8
batch_size_val = None
batch_size_test = 1
epoch = 100
eval_epoch = 100
save_freq = None
eval_metric = 'mIoU'
sync_bn = False
enable_amp = True
empty_cache = False
find_unused_parameters = False
max_batch_points = 100000000.0
mix_prob = 0.8
param_dicts = None
test = dict(type='SegmentationTest')
model = dict(
type='ptv2m2',
in_channels=6,
num_classes=
patch_embed_depth=2,
patch_embed_channels=48,
patch_embed_groups=6,
patch_embed_neighbours=16,
enc_depths=(2, 6, 2),
enc_channels=(96, 192, 384),
enc_groups=(12, 24, 48),
enc_neighbours=(16, 16, 16),
dec_depths=(1, 1, 1),
dec_channels=(48, 96, 192),
dec_groups=(6, 12, 24),
dec_neighbours=(16, 16, 16),
grid_sizes=(0.1, 0.2, 0.4),
attn_qkv_bias=True,
pe_multiplier=False,
pe_bias=True,
attn_drop_rate=0.0,
drop_path_rate=0.3,
enable_checkpoint=False,
unpool_backend='interp')
optimizer = dict(type='AdamW', lr=0.006, weight_decay=0.05)
scheduler = dict(type='MultiStepLR', milestones=[0.6, 0.8], gamma=0.1)
dataset_type = 'AutoScenesDataset'
data_root = '
data = dict(
num_classes=
ignore_label=255,
names=['
train=dict(
type=
split='train',
data_root='
transform=[
dict(type='CenterShift', apply_z=True),
dict(type='RandomScale', scale=[0.9, 1.1]),
dict(type='RandomFlip', p=0.5),
dict(type='RandomJitter', sigma=0.005, clip=0.02),
dict(type='ChromaticAutoContrast', p=0.2, blend_factor=None),
dict(type='ChromaticTranslation', p=0.95, ratio=0.05),
dict(type='ChromaticJitter', p=0.95, std=0.05),
dict(
type='Voxelize',
voxel_size=0.04,
hash_type='fnv',
mode='train',
keys=('coord', 'color', 'label'),
return_discrete_coord=True),
dict(type='SphereCrop', point_max=100000, mode='random'),
dict(type='CenterShift', apply_z=False),
dict(type='NormalizeColor'),
dict(type='ToTensor'),
dict(
type='Collect',
keys=('coord', 'discrete_coord', 'label'),
feat_keys=['coord', 'color'])
],
test_mode=False,
loop=1),
val=dict(
type='
split='val',
data_root='
transform=[
dict(type='CenterShift', apply_z=True),
dict(
type='Copy',
keys_dict=dict(coord='origin_coord', label='origin_label')),
dict(
type='Voxelize',
voxel_size=0.04,
hash_type='fnv',
mode='train',
keys=('coord', 'color', 'label'),
return_discrete_coord=True),
dict(type='CenterShift', apply_z=False),
dict(type='NormalizeColor'),
dict(type='ToTensor'),
dict(
type='Collect',
keys=('coord', 'discrete_coord', 'label'),
offset_keys_dict=dict(offset='coord'),
feat_keys=['coord', 'color'])
],
test_mode=False),
test=dict(
type='
split='test',
data_root='
transform=[
dict(type='CenterShift', apply_z=True),
dict(type='NormalizeColor')
],
test_mode=True,
test_cfg=dict(
voxelize=dict(
type='Voxelize',
voxel_size=0.04,
hash_type='fnv',
mode='test',
keys=('coord', 'color'),
return_discrete_coord=True),
crop=None,
post_transform=[
dict(type='CenterShift', apply_z=False),
dict(type='ToTensor'),
dict(
type='Collect',
keys=('coord', 'discrete_coord', 'index'),
feat_keys=('coord', 'color'))
],
aug_transform=[[{
'type': 'RandomScale',
'scale': [0.9, 0.9]
}], [{
'type': 'RandomScale',
'scale': [0.95, 0.95]
}], [{
'type': 'RandomScale',
'scale': [1, 1]
}], [{
'type': 'RandomScale',
'scale': [1.05, 1.05]
}], [{
'type': 'RandomScale',
'scale': [1.1, 1.1]
}],
[{
'type': 'RandomScale',
'scale': [0.9, 0.9]
}, {
'type': 'RandomFlip',
'p': 1
}],
[{
'type': 'RandomScale',
'scale': [0.95, 0.95]
}, {
'type': 'RandomFlip',
'p': 1
}],
[{
'type': 'RandomScale',
'scale': [1, 1]
}, {
'type': 'RandomFlip',
'p': 1
}],
[{
'type': 'RandomScale',
'scale': [1.05, 1.05]
}, {
'type': 'RandomFlip',
'p': 1
}],
[{
'type': 'RandomScale',
'scale': [1.1, 1.1]
}, {
'type': 'RandomFlip',
'p': 1
}]])))
criteria = [dict(type='CrossEntropyLoss', loss_weight=1.0, ignore_index=255)]
num_worker_per_gpu = 2
batch_size_per_gpu = 1
batch_size_val_per_gpu = 1

Start Evaluation >>>>>>>>>>>>>>>>
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
len(cache))
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
len(cache))
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
File "tools/train.py", line 34, in
main()
File "tools/train.py", line 29, in main
cfg=(cfg,),
File "PointTransformerV2/pcr/engines/launch.py", line 84, in launch
daemon=False,
File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "PointTransformerV2/pcr/engines/launch.py", line 183, in _distributed_worker
main_func(*cfg)
File "PointTransformerV2/tools/train.py", line 16, in main_worker
trainer.train()
File "PointTransformerV2/pcr/engines/defaults.py", line 216, in train
self.after_epoch()
File "PointTransformerV2/pcr/engines/defaults.py", line 321, in after_epoch
self.eval()
File "PointTransformerV2/pcr/engines/defaults.py", line 334, in eval
output = self.model(input_dict)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 517, in forward
points = self.patch_embed(points)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 411, in forward
return self.blocks([coord, feat, offset])
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 209, in forward
points = block(points, reference_index)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 157, in forward
if not self.enable_checkpoint else checkpoint(self.attn, feat, coord, reference_index)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 108, in forward
peb = self.linear_p_bias(pos)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 37, in forward
return self.norm(input.transpose(1, 2).contiguous()).transpose(1, 2).contiguous()
File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
res = func(*args, **kwargs)
File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "python3.7/site-packages/torch/nn/modules/batchnorm.py", line 179, in forward
self.eps,
File "python3.7/site-packages/torch/nn/functional.py", line 2283, in batch_norm
input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 5.38 GiB (GPU 3; 31.75 GiB total capacity; 28.67 GiB already allocated; 1.47 GiB free; 28.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@Gofinge
Member

Gofinge commented Feb 15, 2023

The config looks good. Is there any chance that the validation point clouds are significantly larger than the training point clouds? What about data_dict["coord"].shape? (It would be helpful if you could log it before the OOM.)
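
A minimal sketch of that logging, assuming the names from the traceback (input_dict and the eval loop in pcr/engines/defaults.py); the exact placement and variable names may differ in your checkout:

import torch

# Hypothetical snippet: place it just before `output = self.model(input_dict)`
# in the eval loop of pcr/engines/defaults.py (see the traceback above).
coord = input_dict["coord"]
print(
    f"[eval] coord shape: {tuple(coord.shape)}, "
    f"allocated: {torch.cuda.memory_allocated() / 1024 ** 3:.2f} GiB, "
    f"reserved: {torch.cuda.memory_reserved() / 1024 ** 3:.2f} GiB"
)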

@GondorFu
Author

I randomly split the train and val sets, so there shouldn't be much difference between them.
You can see that before the OOM, the code tries to allocate 5.38 GiB. In my custom dataset each point cloud is only about 100 MB, and there is just one per GPU. What do you think is the reason it needs to allocate such a large amount of memory?

@Gofinge
Member

Gofinge commented Feb 15, 2023

I am sorry about that issue; I have never encountered a similar problem. I notice that the validation batch size per GPU is identical to the training batch size per GPU (both 1), and the memory consumption of the evaluation process should be much lower than that of the training process.

For debugging this issue, my suggestion is to print out the input shape before feeding it into the model.
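
One related observation from the config above (not part of the reply): the training pipeline caps every scene at 100,000 points with SphereCrop, while the validation pipeline voxelizes but never crops, so a single validation scene can contain far more points than any training sample. As a debugging-only experiment (it drops points and therefore changes the reported metric), the same crop could be added to the val transform to confirm that point count is what triggers the OOM; a sketch, with all keys not shown kept exactly as in the config above:

val=dict(
    # type, split, data_root, test_mode: unchanged from the config above
    transform=[
        dict(type='CenterShift', apply_z=True),
        dict(
            type='Copy',
            keys_dict=dict(coord='origin_coord', label='origin_label')),
        dict(
            type='Voxelize',
            voxel_size=0.04,
            hash_type='fnv',
            mode='train',
            keys=('coord', 'color', 'label'),
            return_discrete_coord=True),
        # Hypothetical, debugging only: cap the point count the same way the
        # train pipeline does, to check whether scene size drives the OOM.
        dict(type='SphereCrop', point_max=100000, mode='random'),
        dict(type='CenterShift', apply_z=False),
        dict(type='NormalizeColor'),
        dict(type='ToTensor'),
        dict(
            type='Collect',
            keys=('coord', 'discrete_coord', 'label'),
            offset_keys_dict=dict(offset='coord'),
            feat_keys=['coord', 'color'])
    ])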

@GondorFu
Author

> I am sorry about that issue; I have never encountered a similar problem. I notice that the validation batch size per GPU is identical to the training batch size per GPU (both 1), and the memory consumption of the evaluation process should be much lower than that of the training process.
>
> For debugging this issue, my suggestion is to print out the input shape before feeding it into the model.

These are the eval sizes:

Start Evaluation >>>>>>>>>>>>>>>>
val size >>>>>>>>>>>>>>>: 655712
val size >>>>>>>>>>>>>>>: 872887
val size >>>>>>>>>>>>>>>: 871667
val size >>>>>>>>>>>>>>>: 1273970
val size >>>>>>>>>>>>>>>: 1541887
val size >>>>>>>>>>>>>>>: 1918415
val size >>>>>>>>>>>>>>>: 1695826
val size >>>>>>>>>>>>>>>: 2831842
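
For rough scale (an estimate, not from the thread): the allocation that fails in the traceback is in the positional-bias branch (linear_p_bias) of the patch-embedding attention. If that activation has shape (N, K, C), with K = patch_embed_neighbours = 16 and C = patch_embed_channels = 48 from the config, it grows linearly with the point count N. For the largest scene logged above:

# Back-of-envelope estimate; assumes the failing tensor is the (N, K, C)
# positional-bias activation in the patch-embedding attention block.
N = 2_831_842   # largest validation scene logged above (points after voxelization)
K = 16          # patch_embed_neighbours in the config
C = 48          # patch_embed_channels in the config
print(f"fp32: {N * K * C * 4 / 1024 ** 3:.1f} GiB, fp16: {N * K * C * 2 / 1024 ** 3:.1f} GiB")
# ~8.1 GiB in fp32, ~4.1 GiB in fp16 -- the same order of magnitude as the
# 5.38 GiB allocation in the error, so a single multi-million-point scene can
# plausibly push a 32 GiB GPU over the edge on its own.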

@Gofinge
Member

Gofinge commented Feb 17, 2023

Hi, those are quite huge numbers for point clouds after voxelization. Maybe you can further check whether the validation point clouds are actually being voxelized successfully.
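
A minimal sketch of such a check, assuming the raw scene can be loaded as an (N, 3) NumPy array of xyz coordinates (the path and loading call are placeholders; the 0.04 voxel size comes from the config):

import numpy as np

# Hypothetical check: how many occupied 0.04-unit voxels does one val scene have?
coord = np.load("path/to/one_val_scene_coord.npy")   # placeholder path
voxel_size = 0.04                                     # from the config
grid = np.floor((coord - coord.min(0)) / voxel_size).astype(np.int64)
occupied = np.unique(grid, axis=0).shape[0]
print(f"raw points: {coord.shape[0]}, occupied voxels: {occupied}")
# If `occupied` is still in the millions, the scenes really are that large after
# voxelization (or the coordinates are not in the units the 0.04 grid assumes),
# and the val sizes logged above are expected.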

Gofinge changed the title from "evaluate OOM" to "Custom dataset evaluate OOM" on Feb 20, 2023