
Fine-tuning ch_PP-OCRv4_det_server_train: evaluation during training reports out of memory #13759

Open
ly03240921 opened this issue Aug 27, 2024 · 7 comments

@ly03240921
🔎 Search before asking

  • I have searched the PaddleOCR Docs and found no similar bug report.
  • I have searched the PaddleOCR Issues and found no similar bug report.
  • I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (Description)

[2024/08/27 19:14:23] ppocr INFO: epoch: [5/500], global_step: 10, lr: 0.001000, loss: 2.168079, loss_shrink_maps: 1.022120, loss_threshold_maps: 0.760488, loss_binary_maps: 0.204714, loss_cbn: 0.204714, avg_reader_cost: 0.03694 s, avg_batch_cost: 0.04500 s, avg_samples: 0.12, ips: 2.66682 samples/s, eta: 0:41:51, max_mem_reserved: 13909 MB, max_mem_allocated: 11894 MB
eval model::   0%|          | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 257, in
main(config, device, logger, vdl_writer, seed)
File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 209, in main
program.train(
File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 452, in train
cur_metric = eval(
File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 622, in eval
preds = model(images)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/architectures/base_model.py", line 99, in forward
x = self.head(x, targets=data)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 145, in forward
cbn_maps = self.cbn_layer(self.up_conv(f), shrink_maps, None)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 127, in forward
out = self.last_1(self.last_3(outf))
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/backbones/det_mobilenet_v3.py", line 186, in forward
x = self.conv(x)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in call
return self.forward(*inputs, **kwargs)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/conv.py", line 715, in forward
out = F.conv._conv_nd(
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/functional/conv.py", line 128, in _conv_nd
pre_bias = _C_ops.conv2d(
MemoryError:


C++ Traceback (most recent call last):

0 paddle::pybind::eager_api_conv2d(_object*, _object*, _object*)
1 conv2d_ad_func(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int>>, std::vector<int, std::allocator<int>>, std::string, std::vector<int, std::allocator<int>>, int, std::string)
2 paddle::experimental::conv2d(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std::string const&, std::vector<int, std::allocator<int>> const&, int, std::string const&)
3 void phi::ConvCudnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std::string const&, std::vector<int, std::allocator<int>> const&, int, std::string const&, phi::DenseTensor*)
4 float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5 phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
6 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7 paddle::memory::allocation::Allocator::Allocate(unsigned long)
8 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::Allocator::Allocate(unsigned long)
12 paddle::memory::allocation::Allocator::Allocate(unsigned long)
13 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
14 std::string phi::enforce::GetCompleteTraceBackString<std::string>(std::string&&, char const*, int)
15 phi::enforce::GetCurrentTraceBackString[abi:cxx11]


Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 1. Cannot allocate 3.158203GB memory on GPU 1, 13.315369GB memory has been allocated and available memory is only 2.386902GB.

Please check whether there is any other process using GPU 1.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.
    (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:86)
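
For reference, both batch sizes discussed in this thread are set in the training YAML; a minimal sketch of the relevant keys, assuming the layout of the shipped ch_PP-OCRv4 det configs:

```yaml
Train:
  loader:
    batch_size_per_card: 8   # training batch size used in this report
Eval:
  loader:
    batch_size_per_card: 1   # eval batch size; already 1 per the discussion below
```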

🏃‍♂️ Environment

PaddlePaddle-gpu: 2.6, PaddleOCR: 2.8, RAM: 16 GB

🌰 Minimal Reproducible Example

python tools/train.py -c configs/det/ch_PP-OCRv4/ch_PP-OCRv4_det_teacher.yml

@alanxinn

Not enough GPU memory; reduce the batch size.

@ly03240921 (Author)

> Not enough GPU memory; reduce the batch size.

The training batch size is 8, and training runs fine. The eval batch_size is 1, yet it still fails. Training evaluates once every 1000 steps, and that's when it throws "out of memory"; the first 1000 training steps all run normally.

@alanxinn

Have you tried changing the step interval between evaluations? Try making it smaller.
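
For reference, that interval is controlled in the config's Global section; a minimal sketch, assuming the standard PaddleOCR layout where eval_batch_step takes [start_step, interval]:

```yaml
Global:
  # start evaluating at step 0, then every 100 steps (values are examples)
  eval_batch_step: [0, 100]
```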

@ly03240921 (Author)

> Have you tried changing the step interval between evaluations? Try making it smaller.

I've already changed it to 10 and it still doesn't help; it now evaluates every 10 steps. (screenshot: QQ图片20240828171714)
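
One thing possibly worth checking (a guess, not confirmed in this thread): training usually feeds fixed-size random crops, while det evaluation resizes each image close to full resolution via DetResizeForTest, so a single large eval image can need far more GPU memory than a training batch. A sketch of capping the eval-time resize, assuming the usual transform keys; 960 is an arbitrary example value:

```yaml
Eval:
  dataset:
    transforms:
      # ... other transforms unchanged ...
      - DetResizeForTest:
          limit_type: max        # cap the longer image side
          limit_side_len: 960    # smaller cap => lower peak GPU memory
```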

@alanxinn

Check whether it's actually host RAM or GPU memory being exhausted. Try setting the batch size to 4 and see; I'm not sure it will help, though, as I haven't run into this problem before.

@ly03240921 (Author)

> Check whether it's actually host RAM or GPU memory being exhausted. Try setting the batch size to 4 and see; I'm not sure it will help, though, as I haven't run into this problem before.

It's GPU memory that's exhausted. Lowering the training batch size didn't help either. After training, running tools/infer_det.py on an image also reports out of GPU memory, which I really don't understand...

@alanxinn

> Check whether it's actually host RAM or GPU memory being exhausted. Try setting the batch size to 4 and see; I'm not sure it will help, though, as I haven't run into this problem before.
>
> It's GPU memory that's exhausted. Lowering the training batch size didn't help either. After training, running tools/infer_det.py on an image also reports out of GPU memory, which I really don't understand...

Paddle sometimes has some really odd bugs; maybe try reinstalling the training environment and see (doge).
