Training on IRS data: training is normal at first, but the output becomes NaN at epoch 9 and I don't know how to debug it #141

yangning6103 · 2024-10-03T04:26:01Z

I printed some logs during the forward pass:
print(data["name"])
print(data["left"])
print(data["right"])
with torch.cuda.amp.autocast(enabled=self.cfgs.OPTIMIZATION.AMP):
    model_pred = self.model(data)
    infer_timer = time.time()
    loss, tb_info = loss_func(model_pred, data)
disp_pred = model_pred['disp_pred']
print(disp_pred)
print("loss",loss)
The input data looks fine, but the forward pass outputs NaN, so the computed loss is NaN as well:
['/IRSDataset/Store/ConvenienceStore_Day/l_566.png']
tensor([[[[ 0.6392,  0.4166,  0.2624,  ...,  1.1700,  1.1872,  1.2214],
      [ 1.7865,  1.4098,  0.7762,  ...,  1.1700,  1.1700,  1.2214],
      [ 2.0092,  2.0263,  1.8893,  ...,  1.1529,  1.1529,  1.2214],
      ...,
      [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
      [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
      [-2.0665, -2.1179, -2.0665,  ...,  2.2489,  2.2489,  2.2489]],

     [[-0.0049, -0.2325, -0.3725,  ..., -2.0357, -1.9832, -1.8081],
      [ 1.2206,  0.8004,  0.1352,  ..., -2.0357, -1.9132, -1.7731],
      [ 1.4657,  1.4832,  1.3081,  ..., -2.0357, -1.9132, -1.6856],
      ...,
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-1.9832, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111]],

     [[-0.8981, -1.1247, -1.2641,  ..., -1.8044, -1.6302, -1.4210],
      [ 0.2696, -0.1138, -0.7587,  ..., -1.7522, -1.6302, -1.4036],
      [ 0.4962,  0.5136,  0.3568,  ..., -1.6824, -1.6302, -1.4036],
      ...,
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.7522, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877]]]],
   device='cuda:0')

tensor([[[[ 0.2282,  0.2453,  0.2453,  ...,  1.3070,  1.3242,  1.3242],
      [ 0.2453,  0.2453,  0.2453,  ...,  1.3070,  1.3070,  1.3584],
      [ 0.2282,  0.2453,  0.2453,  ...,  1.3242,  1.3242,  1.3584],
      ...,
      [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
      [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
      [-2.0665, -2.1179, -2.0665,  ...,  2.2489,  2.2489,  2.2489]],

     [[-0.3375, -0.3375, -0.3200,  ..., -2.0357, -2.0357, -2.0357],
      [-0.3375, -0.3375, -0.3375,  ..., -2.0357, -2.0357, -2.0357],
      [-0.3725, -0.3375, -0.3375,  ..., -2.0357, -2.0357, -2.0357],
      ...,
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111]],

     [[-0.9678, -0.9330, -0.9330,  ..., -1.8044, -1.7522, -1.8044],
      [-0.9504, -0.9504, -0.9504,  ..., -1.8044, -1.7522, -1.7522],
      [-1.0027, -0.9678, -0.9504,  ..., -1.7522, -1.7522, -1.6824],
      ...,
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.8044, -1.8044, -1.8044,  ...,  2.6051,  2.5877,  2.5877]]]],
   device='cuda:0')

tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0',
grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
2024-10-03 11:53:39,003 INFO Training Epoch: 9/50 Iter: 947/5661 Loss:nan(nan) LR:6.7625e-04 DataTime:0.12 InferTime:43.17ms Time cost: 06:46/33:37:36
l/OpenStereo/./stereo/utils/common_utils.py:198: RuntimeWarning: invalid value encountered in cast
pred_tmp = cm(pred_tmp.astype('uint8'))
//OpenStereo/./stereo/utils/common_utils.py:199: RuntimeWarning: invalid value encountered in cast
error_map_tmp = cm(error_map_tmp.astype('uint8'))
Is this a problem with the data? I don't yet know how to check the data. Could the left and right images be misaligned?
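
One possible way to narrow this down, as a minimal sketch rather than anything taken from OpenStereo: first confirm that every floating-point tensor in the batch is finite, then use forward hooks to record which submodule is the first to emit NaN or Inf. The helper names `check_batch_finite` and `attach_nonfinite_hooks` are hypothetical, and the sketch assumes `data` is a dict like the one printed above.

```python
import torch
import torch.nn as nn


def check_batch_finite(batch):
    """Return the keys of floating-point tensors in `batch` that contain NaN or Inf."""
    bad = []
    for key, value in batch.items():
        if torch.is_tensor(value) and value.is_floating_point():
            if not torch.isfinite(value).all():
                bad.append(key)
    return bad


def attach_nonfinite_hooks(model):
    """Register forward hooks that record submodules whose output is not finite.

    Returns (records, handles). Hooks fire when a module finishes its forward
    call, so records[0] is the first submodule that produced NaN/Inf.
    Only tensor outputs (or tuples/lists of tensors) are inspected.
    """
    records = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outputs:
                if torch.is_tensor(out) and out.is_floating_point():
                    if not torch.isfinite(out).all():
                        records.append(name)
                        break
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is the empty string
            handles.append(module.register_forward_hook(make_hook(name)))
    return records, handles


if __name__ == "__main__":
    # Toy model (hypothetical): the third layer takes log of negative numbers.
    class BadLayer(nn.Module):
        def forward(self, x):
            return torch.log(x - 10.0)

    toy = nn.Sequential(nn.Linear(4, 4), nn.Sigmoid(), BadLayer(), nn.Linear(4, 2))
    records, handles = attach_nonfinite_hooks(toy)
    toy(torch.randn(3, 4))
    print("first non-finite module:", records[0] if records else None)
    for handle in handles:
        handle.remove()  # detach the hooks once debugging is done
```

In the training step above, `check_batch_finite(data)` could be called right before `self.model(data)`, with the hooks attached once before training and `records.clear()` called at the start of each iteration. If the forward pass stays finite and NaN only appears after the backward pass, `torch.autograd.set_detect_anomaly(True)` may help locate the offending operation, at a noticeable cost in speed.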


t973288913 · 2024-10-26T01:45:22Z

Same situation here: the model output is all zeros. How should I handle this?

XiandaGuo · 2024-10-26T02:36:33Z

The cfg will be released soon.

Dongxin000 · 2024-11-07T05:57:47Z

I ran into the same problem on my own dataset: the loss became NaN at epoch 9.

yangning6103 · 2024-11-07T09:52:24Z

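> Same situation here: the model output is all zeros. How should I handle this?

As a temporary workaround, you can set LEFT_ATT to false in the config file.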
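
The thread does not say where `LEFT_ATT` sits in the config, so here is a minimal sketch that flips the key wherever it appears in an already-loaded nested config. The helper `set_key_everywhere` and the `MODEL`/`BACKBONE` layout in the example are hypothetical; only `LEFT_ATT` and `OPTIMIZATION.AMP` come from this issue.

```python
def set_key_everywhere(cfg, key, value):
    """Set `key` to `value` at every depth of a nested dict/list config.

    Returns the number of entries that were set.
    """
    hits = 0
    if isinstance(cfg, dict):
        for k, v in cfg.items():
            if k == key:
                cfg[k] = value  # overwriting an existing key is safe while iterating
                hits += 1
            else:
                hits += set_key_everywhere(v, key, value)
    elif isinstance(cfg, list):
        for item in cfg:
            hits += set_key_everywhere(item, key, value)
    return hits


# Hypothetical config layout, for illustration only.
example_cfg = {
    "MODEL": {"BACKBONE": {"LEFT_ATT": True}},
    "OPTIMIZATION": {"AMP": True},
}
changed = set_key_everywhere(example_cfg, "LEFT_ATT", False)
print(changed, example_cfg["MODEL"]["BACKBONE"]["LEFT_ATT"])
```

Editing the value directly in the config file is equivalent. The helper is only convenient when the same key shows up in several places or the override happens at launch time.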

zjuPeco · 2024-11-11T05:59:14Z

Same situation here, using the IGEV config file with the SceneFlow + Fat datasets.
