
Bad training result after 90 epochs #15

Open · rginpan opened this issue Aug 4, 2021 · 15 comments

rginpan commented Aug 4, 2021

Due to my GPU limitation, the parameters are:
sequence-size: 3 # must be >= 1
combinations: [ [ 0, 1 ], [ 0, 2 ], [ 1, 2 ] ]
batch-size: 2
epochs: 90

I trained the lidar+IMU network using the full training dataset from the original config. I have also plotted the loss curve and the test result on 2011-10-03: [27] for you to check.

What is wrong with my implementation?

[image: loss curve and test result on sequence 2011-10-03: 27]

Looking forward to your reply, thanks in advance!

rginpan commented Aug 5, 2021

I used only one sequence to train and validate (I know it will overfit). According to the loss curve it has already converged, but the trajectory is still very bad...
Could you give me some instructions on how to train and how to interpret the loss curve?
Thanks in advance!

[images: one-sequence loss curve and trajectory (screenshot from 2021-08-05)]

rginpan commented Aug 5, 2021

@ArashJavan Sorry to bother you. If the cause is unclear, could you please share your training hyperparameters?

ArashJavan (Owner) commented Aug 6, 2021

@rginjapan your combinations are not correct for an RNN-based architecture. They should be consecutive:

combinations: [ [ 0, 1 ], [ 1, 2 ], [ 2, 3 ] ]

Also, I would first train with only one network, for example only the IMU network, to see if everything works fine. Then switch to the more complex setup with both lidar and IMU fused!

Another hint concerns your grad-norm: if the above points do not help, try gradient clipping.
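A minimal sketch of what gradient clipping looks like in a PyTorch training step (the model, data, and max_norm value below are illustrative placeholders, not code from this repo):

```python
import torch
import torch.nn as nn

# Stand-in model: a small LSTM like the imu-feat-rnn config.
model = nn.LSTM(input_size=6, hidden_size=128, num_layers=2, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(2, 3, 6)      # dummy batch: (batch, seq, imu-channels)
out, _ = model(x)
loss = out.pow(2).mean()      # placeholder loss

optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm before the optimizer step so RNN
# gradients cannot explode; max_norm=1.0 is an illustrative value.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```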

rginpan commented Aug 6, 2021

@ArashJavan Thanks for your reply. I think "sequence-size" refers to the number of images; for example, sequence-size: 3 represents 3 consecutive frames I0, I1, and I2. I remember you explained in another closed issue that the combination [0, 2] increases the relative motion compared with [0, 1], which is good for training. Am I correct?

"Another hint concerns your grad-norm: if the above points do not help, try gradient clipping." Sorry, can you explain this in more detail? I did not understand.

ArashJavan (Owner) commented:

@rginjapan yes and no ;) For an RNN-based network it is better to have consecutive frame combinations, e.g. [[0, 1], [1, 2], [2, 3]].
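For a given sequence-size, those consecutive pairs can be generated mechanically (a sketch, not code from this repo):

```python
def consecutive_combinations(sequence_size):
    """Consecutive frame-index pairs, e.g. 3 -> [[0, 1], [1, 2], [2, 3]]."""
    return [[i, i + 1] for i in range(sequence_size)]

print(consecutive_combinations(3))  # [[0, 1], [1, 2], [2, 3]]
```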

Regarding gradient clipping: see here, and here.

rginpan commented Aug 6, 2021

@ArashJavan Thanks for your quick reply. Now I understand. I had already used [0, 1], [0, 2], [1, 2] to get a good result with the lidar-only network, so I thought it would also work for the lidar+IMU network, but it does not... Have you evaluated the difference between a large sequence-size like 5, as in your config, and a small one like 2? Since [0, 1] and [4, 5] should be similar during training, right?

ArashJavan (Owner) commented:

@rginjapan Please look at the loss function here. As you can see, the loss already takes care of global motion, since it is calculated for both local pairs, e.g. [x_i, x_{i+1}], and global pairs, e.g. [x_0, x_1], ..., [x_0, x_seq].
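For the translation part, the idea looks roughly like this (a sketch of the principle only; the actual loss also handles rotation and the hwsloss weighting):

```python
import torch

def local_global_loss(pred, gt):
    """Sketch of a local + global translation loss; pred and gt hold
    absolute positions x_0 .. x_seq with shape (seq + 1, 3)."""
    # Local terms: relative motion between consecutive frames [x_i, x_{i+1}].
    local = ((pred[1:] - pred[:-1]) - (gt[1:] - gt[:-1])).norm(dim=-1).mean()
    # Global terms: motion relative to the first frame [x_0, x_i].
    glob = ((pred[1:] - pred[:1]) - (gt[1:] - gt[:1])).norm(dim=-1).mean()
    return local + glob

pred = torch.randn(4, 3, requires_grad=True)  # dummy predictions, seq = 3
gt = torch.randn(4, 3)                        # dummy ground truth
print(local_global_loss(pred, gt))
```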

rginpan commented Aug 10, 2021

After changing the combinations and training for 100 epochs, the result is still not good. What should I do, increase the number of epochs? What is the ideal converged loss? Right now the loss is around 16 and does not decrease.

[images: loss curve (screenshot from 2021-08-10) and trajectory]

ArashJavan (Owner) commented:

@rginjapan Can you share your config file and your training settings (lr, wd, and so on)?

rginpan commented Aug 10, 2021

config.yaml:

```yaml
datasets:
  sequence-size: 3 # must be >= 1
  combinations: [ [ 0, 1 ], [ 1, 2 ], [ 2, 3 ] ]

  kitti:
    root-path-sync: "datasets/KITTI/sync"
    root-path-unsync: "datasets/KITTI/extract"
    image-width: 720
    image-height: 57
    crop-factors: [ 0, 0 ] # [0, 4] # top, left
    fov-up: 3.
    fov-down: -25.
    max-depth: 80.
    min-depth: 1.
    inverse-depth: true
    train:
      2011-10-03: [ 27, 42, 34 ]
      2011-09-30: [ 16, 18, 20, 27, 28 ]
    test:
      2011-10-03: [ 27, 42, 34 ]
      2011-09-30: [ 16, 18, 20, 27, 28, 33, 34 ]
    validation:
      #2011-09-26: [23, 39]
      2011-09-30: [ 33, 34 ]

    # channels: x, y, z, remission, nx, ny, nz, range
    mean-image: [ -0.0014, 0.0043, -0.011, 0.2258, -0.0024, 0.0037, 0.3793, 0.1115 ]
    std-image: [ 0.1269, 0.0951, 0.0108, 0.1758, 0.3436, 0.4445, 0.5664, 0.0884 ]

    mean-imu: [ -0.0685, 0.1672, 9.7967, -0., 0.0006, 0.0059 ]
    std-imu: [ 0.8766, 0.9528, 0.3471, 0.0204, 0.0227, 0.1412 ]

# DeepLIO network
deeplio:
  dropout: 0.25
  pretrained: true
  model-path: "/home/deeplio/outputs/train_deeplio_20210806_174151/cpkt_deeplio_best.tar"
  lidar-feat-net:
    name: "lidar-feat-pointseg"
    pretrained: true
    model-path: "/home/deeplio/outputs/train_deeplio_20210806_174151/cpkt_lidarpointsegfeat_best.tar"
    requires-grad: true
  imu-feat-net:
    name: "imu-feat-rnn"
    pretrained: true
    model-path: "/home/deeplio/outputs/train_deeplio_20210806_174151/cpkt_imufeatrnn0_best.tar"
    requires-grad: true
  odom-feat-net:
    name: "odom-feat-rnn"
    pretrained: true
    model-path: "/home/deeplio/outputs/train_deeplio_20210806_174151/cpkt_odomfeatrnn_best.tar"
    requires-grad: true
  fusion-net:
    name: "fusion-layer-soft"
    pretrained: true
    model-path: "/home/deeplio/outputs/train_deeplio_20210806_174151/cpkt_deepliofusionsoft_best.tar"
    requires-grad: true # only soft-fusion has trainable params

# Lidar feature networks
lidar-feat-pointseg: # pointseg feature
  dropout: 0.1
  classes: ['unknown', 'object']
  bypass: "simple"
  fusion: add # [cat, sub, add]
  part: "encoder" # [encoder]

lidar-feat-flownet:
  dropout: 0.
  fusion: add # [cat, sub, add]

lidar-feat-resnet:
  dropout: 0.25
  fusion: add # [cat, sub, add]

lidar-feat-simple-1:
  dropout: 0.25
  fusion: add # [cat, sub, add]
  bypass: false

imu-feat-fc: # FC
  input-size: 6 # !fixed! do not change
  hidden-size: [128, 256, 512, 512, 256, 128]
  dropout: 0.

imu-feat-rnn: # RNN
  type: "lstm"
  input-size: 6 # !fixed! do not change
  hidden-size: 128
  num-layers: 2
  bidirectional: true
  dropout: 0.1

fusion-layer-soft:
  type: "soft"

# Odometry feature networks

# odometry feature network with fully connected layers
odom-feat-fc:
  size: [1024, 512, 256]
  dropout: 0.

# odometry feature network with rnn-layers
odom-feat-rnn:
  type: "lstm"
  hidden-size: 1024
  num-layers: 2
  bidirectional: true
  dropout: 0.

# Loss configurations
losses:
  active: 'hwsloss'
  hwsloss:
    params:
      learn: true
      sx: 0.
      sq: -3.
  lwsloss:
    params:
      beta: 1125.
  loss-type: "local+global" # ["local", "global", "local+global"]

current-dataset: 'kitti'
channels: [0, 1, 2, 4, 5, 6] # channels: x, y, z, remission, nx, ny, nz, range
optimizer: 'adam'
```

Hyperparameters in train.py:

```python
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')
parser.add_argument('--epochs', default=100, type=int, metavar='N',
                    help='number of total epochs to run')
parser.add_argument('-b', '--batch-size', default=2, type=int, metavar='N',
                    help='mini-batch size (default: 1), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                    help='manual epoch number (useful on restarts)')
parser.add_argument('--lr', '--learning-rate', default=1e-3, type=float,
                    metavar='LR', help='initial learning rate', dest='lr')
parser.add_argument('--lr-decay', '--learning-rate-decay', default=30, type=int,
                    metavar='LR-DECAY-STEP', help='learning rate decay step', dest='lr_decay')
parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                    help='momentum')
parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,
                    metavar='W', help='weight decay (default: 1e-4)',
                    dest='weight_decay')
parser.add_argument('-p', '--print-freq', default=10, type=int,
                    metavar='N', help='print frequency (default: 10)')
parser.add_argument('--resume', default=False, action='store_true', dest='resume',
                    help='resume training')
parser.add_argument('-e', '--evaluate', default='', type=str, metavar='PATH',
                    help='evaluate model with given checkpoint on validation set (default: none)')
parser.add_argument('-c', '--config', default="./config.yaml", help='Path to configuration file')
parser.add_argument('-d', '--debug', default=False, help='debug logging', action='store_true', dest='debug')
parser.add_argument('--device', default='cuda', type=str, metavar='DEVICE',
                    help='Device to use [cpu, cuda].')
```
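For context on the `hwsloss` entry in the config above: with `learn: true` and initial values `sx: 0.`, `sq: -3.`, this looks like the learned homoscedastic uncertainty weighting of Kendall et al., i.e. L = L_x·exp(-sx) + sx + L_q·exp(-sq) + sq for translation loss L_x and rotation loss L_q. A minimal sketch under that assumption (not necessarily this repo's exact implementation):

```python
import torch
import torch.nn as nn

class HWSLoss(nn.Module):
    """Homoscedastic weighting of translation (x) and rotation (q) losses.

    Sketch of the assumed Kendall-style formulation; sx and sq start at
    the config's initial values and are learned jointly with the network.
    """
    def __init__(self, sx=0.0, sq=-3.0, learn=True):
        super().__init__()
        self.sx = nn.Parameter(torch.tensor(sx), requires_grad=learn)
        self.sq = nn.Parameter(torch.tensor(sq), requires_grad=learn)

    def forward(self, loss_x, loss_q):
        return (loss_x * torch.exp(-self.sx) + self.sx
                + loss_q * torch.exp(-self.sq) + self.sq)
```

One side effect of the learned sx/sq terms is that a total loss built this way can legitimately become negative once the weighted terms get small.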

rginpan commented Aug 11, 2021

Could you please upload your trained model so I can try to get a good result? It does not have to be as good as the result in your paper.

P.S. I increased the epochs to 150; the loss does not decrease any further, still around 16. And the result is still bad... Looking forward to your guidance.

huazai665 commented:

> Due to my GPU limitation, the parameters are:
> sequence-size: 3 # must be >= 1
> combinations: [ [ 0, 1 ], [ 0, 2 ], [ 1, 2 ] ]
> batch-size: 2
> epochs: 90
>
> I trained the lidar+IMU network using the full training dataset from the original config. I have also plotted the loss curve and the test result on 2011-10-03: [27] for you to check.
>
> What is wrong with my implementation?
>
> Looking forward to your reply, thanks in advance!

Hello, I have the same question. The result is bad and the loss is negative. Did you solve the problem?

rginpan commented Nov 11, 2022

@huazai665 In the end I could not reproduce the result from the paper.

huazai665 commented:

@rginjapan Did you revise the liegroups library to use CUDA?

hu-xue commented Aug 21, 2024

@rginjapan Hi, have you solved the bad-result issue? I think I am running into the same problem as you. If you are still working on this, or on other interesting work in this field, maybe we can exchange contact information.
