Error:mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault) #1

windwm · 2019-07-15T11:28:09Z

Hello,

I can train MNIST and CIFAR-10 successfully on a single GPU using chainerkfac. But when I use chainerkfac to train MNIST and CIFAR-10 on multiple GPUs, I get this error:

mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault).

The command I used is as follows:
mpirun -np 4 python train.py --distributed
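
The mpirun message alone does not say where in the Python program the crash happened. One way to narrow it down is to enable the standard-library `faulthandler` at the top of `train.py`, so that each rank writes a traceback when it receives signal 11. This is a sketch that assumes the launcher is Open MPI, which exports `OMPI_COMM_WORLD_RANK` to each process. The helper name `enable_crash_traceback` and the log file names are hypothetical and are not part of chainerkfac.

```python
import faulthandler
import os


def enable_crash_traceback(log_dir="."):
    """Dump a Python traceback to a per-rank file if the process crashes."""
    # Assumption: Open MPI sets OMPI_COMM_WORLD_RANK; fall back to "0" otherwise.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "0")
    path = os.path.join(log_dir, "segfault_rank{}.log".format(rank))
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling the helper before the first `import chainer` also covers crashes that happen during import. Alternatively, `python -X faulthandler train.py --distributed` enables the same handler on stderr without any code change.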

y1r · 2019-07-18T01:02:46Z

Hi @windwm,
Thank you for trying our K-FAC implementation.

In my environment, I can train mnist with two GPUs using chainerkfac.

$ pip freeze
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc2
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6

First of all, can you train MNIST on multiple GPUs without K-FAC, like this?

$ cd chainer/examples/chainermn/mnist
$ mpirun -np 2 python train_mnist.py --communicator pure_nccl --gpu

windwm · 2019-07-19T06:17:59Z

Thank you for your answer. Following your advice, I tried this command:

cd chainerkfac/examples/mnist
mpirun -np 2 python train.py --communicator pure_nccl --gpu

Then I got the error:

usage: train.py [-h] [--batch_size BATCH_SIZE] [--num_epochs NUM_EPOCHS]
                [--snapshot_interval SNAPSHOT_INTERVAL] [--no_cuda]
                [--out OUT] [--resume RESUME] [--optimizer OPTIMIZER]
                [--arch {mlp,cnn}] [--plot] [--distributed]
train.py: error: unrecognized arguments: --communicator pure_nccl --gpu
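
The usage message explains the failure: `--communicator` and `--gpu` belong to the ChainerMN example script, while the chainerkfac `train.py` only accepts the flags listed above. The sketch below rebuilds a parser from that usage text alone to show the behavior. It is not the actual `train.py`, and the argument types and defaults are guesses.

```python
import argparse


def build_parser():
    # Reconstructed from the usage message only; types and defaults are guesses.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--num_epochs", type=int)
    parser.add_argument("--snapshot_interval", type=int)
    parser.add_argument("--no_cuda", action="store_true")
    parser.add_argument("--out")
    parser.add_argument("--resume")
    parser.add_argument("--optimizer")
    parser.add_argument("--arch", choices=["mlp", "cnn"])
    parser.add_argument("--plot", action="store_true")
    parser.add_argument("--distributed", action="store_true")
    return parser


parser = build_parser()

# parse_known_args collects unrecognized tokens instead of exiting with an error.
_, unknown = parser.parse_known_args(["--communicator", "pure_nccl", "--gpu"])
print(unknown)

# The flag this script does accept for multi-GPU runs parses cleanly.
args = parser.parse_args(["--distributed"])
print(args.distributed)
```

With this parser, all three tokens from the failing command end up in `unknown`, which matches the "unrecognized arguments" error shown above.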

My environment:

chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc1
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
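
The two environment listings in this thread can be compared mechanically rather than by eye. The sketch below parses both `pip freeze` listings as posted and reports any pins that differ. The helper names `parse_freeze` and `diff_pins` are hypothetical. It only reports the mismatch and does not establish that the mismatch causes the segmentation fault.

```python
def parse_freeze(text):
    """Map package name -> pinned version from `pip freeze` style lines."""
    pins = {}
    for line in text.split():
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


def diff_pins(a, b):
    """Return {package: (version_in_a, version_in_b)} for every mismatch."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Listings copied from the two comments in this thread.
Y1R_ENV = """
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc2
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
"""

WINDWM_ENV = """
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc1
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
"""

print(diff_pins(parse_freeze(Y1R_ENV), parse_freeze(WINDWM_ENV)))
# → {'numpy': ('1.17.0rc2', '1.17.0rc1')}
```

The only difference between the two listings is the NumPy release candidate. Note that `pip freeze` does not show the MPI, CUDA, or NCCL versions installed on the system, so those would have to be compared separately.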

y1r · 2019-07-19T06:20:19Z

Please try without chainerkfac.
I mean the multi-GPU MNIST training example provided by Chainer.

windwm · 2019-07-19T08:17:09Z

Yes, I trained MNIST successfully using 2 GPUs without K-FAC. But when I try to train MNIST, CIFAR-10, and ImageNet with chainerkfac, I still get this error:
mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault).

y1r self-assigned this Jul 18, 2019
