
Fine-tuning ch_PP-OCRv4_det_server_train: evaluation during training reports out of memory #13759

Open
ly03240921 opened this issue Aug 27, 2024 · 7 comments

@ly03240921
🔎 Search before asking

  • I have searched the PaddleOCR Docs and found no similar bug report.
  • I have searched the PaddleOCR Issues and found no similar bug report.
  • I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (Description)

[2024/08/27 19:14:23] ppocr INFO: epoch: [5/500], global_step: 10, lr: 0.001000, loss: 2.168079, loss_shrink_maps: 1.022120, loss_threshold_maps: 0.760488, loss_binary_maps: 0.204714, loss_cbn: 0.204714, avg_reader_cost: 0.03694 s, avg_batch_cost: 0.04500 s, avg_samples: 0.12, ips: 2.66682 samples/s, eta: 0:41:51, max_mem_reserved: 13909 MB, max_mem_allocated: 11894 MB
eval model::   0%|          | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 257, in
main(config, device, logger, vdl_writer, seed)
File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 209, in main
program.train(
File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 452, in train
cur_metric = eval(
File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 622, in eval
preds = model(images)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/architectures/base_model.py", line 99, in forward
x = self.head(x, targets=data)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 145, in forward
cbn_maps = self.cbn_layer(self.up_conv(f), shrink_maps, None)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 127, in forward
out = self.last_1(self.last_3(outf))
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/backbones/det_mobilenet_v3.py", line 186, in forward
x = self.conv(x)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/conv.py", line 715, in forward
out = F.conv._conv_nd(
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/functional/conv.py", line 128, in _conv_nd
pre_bias = _C_ops.conv2d(
MemoryError:


C++ Traceback (most recent call last):

0 paddle::pybind::eager_api_conv2d(_object*, _object*, _object*)
1 conv2d_ad_func(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int>>, std::vector<int, std::allocator<int>>, std::string, std::vector<int, std::allocator<int>>, int, std::string)
2 paddle::experimental::conv2d(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std::string const&, std::vector<int, std::allocator<int>> const&, int, std::string const&)
3 void phi::ConvCudnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std::string const&, std::vector<int, std::allocator<int>> const&, int, std::string const&, phi::DenseTensor*)
4 float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5 phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
6 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7 paddle::memory::allocation::Allocator::Allocate(unsigned long)
8 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::Allocator::Allocate(unsigned long)
12 paddle::memory::allocation::Allocator::Allocate(unsigned long)
13 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
14 std::string phi::enforce::GetCompleteTraceBackString<std::string>(std::string&&, char const*, int)
15 phi::enforce::GetCurrentTraceBackString[abi:cxx11]


Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 1. Cannot allocate 3.158203GB memory on GPU 1, 13.315369GB memory has been allocated and available memory is only 2.386902GB.

Please check whether there is any other process using GPU 1.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.
    (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:86)
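
For reference, both batch sizes discussed in this thread are set in the training YAML; a minimal sketch of the relevant keys, assuming the layout of the shipped ch_PP-OCRv4 det configs:

```yaml
Train:
  loader:
    batch_size_per_card: 8   # training batch size used in this report
Eval:
  loader:
    batch_size_per_card: 1   # eval batch size; already 1 per the discussion below
```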

🏃‍♂️ Environment

PaddlePaddle-gpu: 2.6, PaddleOCR: 2.8, RAM: 16 GB

🌰 Minimal Reproducible Example

python tools/train.py -c configs/det/ch_PP-OCRv4/ch_PP-OCRv4_det_teacher.yml

@alanxinn

Not enough GPU memory; reduce the batch size.

@ly03240921 (Author)

> Not enough GPU memory; reduce the batch size.

The training batch size is 8, and training runs fine. The eval batch_size is 1, yet it still fails. Training evaluates once every 1000 steps, and that's when it throws "out of memory"; the first 1000 training steps all run normally.

@alanxinn

Have you tried changing the step interval between evaluations? Try making it smaller.
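
For reference, that interval is controlled in the config's Global section; a minimal sketch, assuming the standard PaddleOCR layout where eval_batch_step takes [start_step, interval]:

```yaml
Global:
  # start evaluating at step 0, then every 100 steps (values are examples)
  eval_batch_step: [0, 100]
```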

@ly03240921 (Author)

> Have you tried changing the step interval between evaluations? Try making it smaller.

I've already changed it to 10 and it still doesn't help; it now evaluates every 10 steps. (screenshot: QQ图片20240828171714)
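
One thing possibly worth checking (a guess, not confirmed in this thread): training usually feeds fixed-size random crops, while det evaluation resizes each image close to full resolution via DetResizeForTest, so a single large eval image can need far more GPU memory than a training batch. A sketch of capping the eval-time resize, assuming the usual transform keys; 960 is an arbitrary example value:

```yaml
Eval:
  dataset:
    transforms:
      # ... other transforms unchanged ...
      - DetResizeForTest:
          limit_type: max        # cap the longer image side
          limit_side_len: 960    # smaller cap => lower peak GPU memory
```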

@alanxinn

Check whether it's actually host RAM or GPU memory being exhausted. Try setting the batch size to 4 and see; I'm not sure it will help, though, as I haven't run into this problem before.

@ly03240921 (Author)

> Check whether it's actually host RAM or GPU memory being exhausted. Try setting the batch size to 4 and see; I'm not sure it will help, though, as I haven't run into this problem before.

It's GPU memory that's exhausted. Lowering the training batch size didn't help either. After training, running tools/infer_det.py on an image also reports out of GPU memory, which I really don't understand...

@alanxinn

> Check whether it's actually host RAM or GPU memory being exhausted. Try setting the batch size to 4 and see; I'm not sure it will help, though, as I haven't run into this problem before.
>
> It's GPU memory that's exhausted. Lowering the training batch size didn't help either. After training, running tools/infer_det.py on an image also reports out of GPU memory, which I really don't understand...

Paddle sometimes has some really odd bugs; maybe try reinstalling the training environment and see (doge).
