Training on the IRS dataset runs normally at first, but the output becomes NaN at epoch 9 and I don't know how to troubleshoot it #141

Open
yangning6103 opened this issue Oct 3, 2024 · 5 comments


@yangning6103

I printed logs during the forward pass:
print(data["name"])
print(data["left"])
print(data["right"])
with torch.cuda.amp.autocast(enabled=self.cfgs.OPTIMIZATION.AMP):
    model_pred = self.model(data)
    infer_timer = time.time()
    loss, tb_info = loss_func(model_pred, data)
disp_pred = model_pred['disp_pred']
print(disp_pred)
print("loss", loss)
The input data looks fine, but the forward-pass output is NaN, so the computed loss is NaN as well:
['/IRSDataset/Store/ConvenienceStore_Day/l_566.png']
tensor([[[[ 0.6392, 0.4166, 0.2624, ..., 1.1700, 1.1872, 1.2214],
[ 1.7865, 1.4098, 0.7762, ..., 1.1700, 1.1700, 1.2214],
[ 2.0092, 2.0263, 1.8893, ..., 1.1529, 1.1529, 1.2214],
...,
[-2.1179, -2.1179, -2.1179, ..., 2.2489, 2.2489, 2.2489],
[-2.1179, -2.1179, -2.1179, ..., 2.2489, 2.2489, 2.2489],
[-2.0665, -2.1179, -2.0665, ..., 2.2489, 2.2489, 2.2489]],

     [[-0.0049, -0.2325, -0.3725,  ..., -2.0357, -1.9832, -1.8081],
      [ 1.2206,  0.8004,  0.1352,  ..., -2.0357, -1.9132, -1.7731],
      [ 1.4657,  1.4832,  1.3081,  ..., -2.0357, -1.9132, -1.6856],
      ...,
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-1.9832, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111]],

     [[-0.8981, -1.1247, -1.2641,  ..., -1.8044, -1.6302, -1.4210],
      [ 0.2696, -0.1138, -0.7587,  ..., -1.7522, -1.6302, -1.4036],
      [ 0.4962,  0.5136,  0.3568,  ..., -1.6824, -1.6302, -1.4036],
      ...,
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.7522, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877]]]],
   device='cuda:0')

tensor([[[[ 0.2282, 0.2453, 0.2453, ..., 1.3070, 1.3242, 1.3242],
[ 0.2453, 0.2453, 0.2453, ..., 1.3070, 1.3070, 1.3584],
[ 0.2282, 0.2453, 0.2453, ..., 1.3242, 1.3242, 1.3584],
...,
[-2.1179, -2.1179, -2.1179, ..., 2.2489, 2.2489, 2.2489],
[-2.1179, -2.1179, -2.1179, ..., 2.2489, 2.2489, 2.2489],
[-2.0665, -2.1179, -2.0665, ..., 2.2489, 2.2489, 2.2489]],

     [[-0.3375, -0.3375, -0.3200,  ..., -2.0357, -2.0357, -2.0357],
      [-0.3375, -0.3375, -0.3375,  ..., -2.0357, -2.0357, -2.0357],
      [-0.3725, -0.3375, -0.3375,  ..., -2.0357, -2.0357, -2.0357],
      ...,
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111]],

     [[-0.9678, -0.9330, -0.9330,  ..., -1.8044, -1.7522, -1.8044],
      [-0.9504, -0.9504, -0.9504,  ..., -1.8044, -1.7522, -1.7522],
      [-1.0027, -0.9678, -0.9504,  ..., -1.7522, -1.7522, -1.6824],
      ...,
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.8044, -1.8044, -1.8044,  ...,  2.6051,  2.5877,  2.5877]]]],
   device='cuda:0')

tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0',
grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
2024-10-03 11:53:39,003 INFO Training Epoch: 9/50 Iter: 947/5661 Loss:nan(nan) LR:6.7625e-04 DataTime:0.12 InferTime:43.17ms Time cost: 06:46/33:37:36
l/OpenStereo/./stereo/utils/common_utils.py:198: RuntimeWarning: invalid value encountered in cast
pred_tmp = cm(pred_tmp.astype('uint8'))
//OpenStereo/./stereo/utils/common_utils.py:199: RuntimeWarning: invalid value encountered in cast
error_map_tmp = cm(error_map_tmp.astype('uint8'))
Is this a problem with the data? I don't yet know how to check the data. Could the left and right images be misaligned?
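
One way to narrow this down, as a rough sketch rather than anything taken from OpenStereo itself: check the batch for non-finite values, then attach forward hooks to find the first submodule whose output contains NaN or Inf. The helper names `report_nonfinite_inputs` and `first_nonfinite_module` are made up for illustration, and the sketch assumes each submodule returns a tensor or a tuple/list of tensors (dict outputs are not inspected).

```python
import torch
import torch.nn as nn


def report_nonfinite_inputs(batch: dict) -> list:
    """Return the keys of tensor entries in `batch` that contain NaN or Inf."""
    return [
        key for key, value in batch.items()
        if torch.is_tensor(value) and not torch.isfinite(value).all()
    ]


def first_nonfinite_module(model: nn.Module, *args, **kwargs):
    """Run one forward pass and return the name of the first submodule
    (in completion order) whose output contains NaN or Inf, else None."""
    bad = []

    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else [output]
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    bad.append(name)
                    break
        return hook

    # Hook every named submodule; the root module (empty name) is skipped.
    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules() if name
    ]
    try:
        # no_grad keeps the probe cheap; note that layers such as BatchNorm
        # still update running statistics if the model is in train mode.
        with torch.no_grad():
            model(*args, **kwargs)
    finally:
        for handle in handles:
            handle.remove()
    return bad[0] if bad else None
```

In the training loop above this could be called as `report_nonfinite_inputs(data)` and `first_nonfinite_module(self.model, data)` on the batch that produced the NaN. If the inputs are finite and even the earliest layers report non-finite outputs, the weights themselves may already be NaN from an earlier optimizer step, in which case the batch being printed is not necessarily the culprit.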

@t973288913

Same situation here: the model output is all zeros. How should I handle this?

@XiandaGuo
Owner

The cfg will be released soon.

@Dongxin000

I hit the same problem on my own dataset: the loss became NaN at epoch 9.

@yangning6103
Author

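> Same situation here: the model output is all zeros. How should I handle this?

As a temporary workaround, you can set LEFT_ATT to false in the config file.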
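
The thread does not say where `LEFT_ATT` sits in the config, so here is a minimal sketch that flips the key wherever it already exists in a config loaded into nested dicts and lists. The helper name `set_key_everywhere` and the example layout are hypothetical, not the real OpenStereo config structure.

```python
def set_key_everywhere(cfg, key, value):
    """Set `key` to `value` wherever it already exists in a nested
    dict/list config. Returns the number of places that were changed."""
    changed = 0
    if isinstance(cfg, dict):
        for k, v in cfg.items():
            if k == key:
                cfg[k] = value  # overwrite in place; keys are not added or removed
                changed += 1
            else:
                changed += set_key_everywhere(v, key, value)
    elif isinstance(cfg, list):
        for item in cfg:
            changed += set_key_everywhere(item, key, value)
    return changed


# Hypothetical layout, for illustration only.
example_cfg = {"MODEL": {"NAME": "SomeModel", "LEFT_ATT": True}}
num_changed = set_key_everywhere(example_cfg, "LEFT_ATT", False)
```

A return value of 0 means the key was not found, which is a useful signal that the config being edited is not the one that defines `LEFT_ATT`.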

@zjuPeco

zjuPeco commented Nov 11, 2024

Same situation with the IGEV config file, on the SceneFlow + Fat datasets.
