Error: mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault) #1

Open
windwm opened this issue Jul 15, 2019 · 4 comments

windwm commented Jul 15, 2019

Hello,

I trained MNIST and CIFAR-10 successfully on a single GPU using chainerkfac. But when I use chainerkfac to train MNIST and CIFAR-10 on multiple GPUs, I get this error:

mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault).

The command I used is as follows:
mpirun -np 4 python train.py --distributed

y1r self-assigned this Jul 18, 2019

Collaborator

y1r commented Jul 18, 2019

Hi @windwm,
Thank you for trying our K-FAC implementation.

In my environment, I can train MNIST with two GPUs using chainerkfac:

$ pip freeze
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc2
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6

First of all, can you train MNIST on multiple GPUs without K-FAC, like this?

$ cd chainer/examples/chainermn/mnist
$ mpirun -np 2 python train_mnist.py --communicator pure_nccl --gpu

Author

windwm commented Jul 19, 2019

Thank you for your answer. Following your advice, I tried this command:

cd chainerkfac/examples/mnist
mpirun -np 2 python train.py --communicator pure_nccl --gpu

Then I got the error:

usage: train.py [-h] [--batch_size BATCH_SIZE] [--num_epochs NUM_EPOCHS]
                [--snapshot_interval SNAPSHOT_INTERVAL] [--no_cuda]
                [--out OUT] [--resume RESUME] [--optimizer OPTIMIZER]
                [--arch {mlp,cnn}] [--plot] [--distributed]
train.py: error: unrecognized arguments: --communicator pure_nccl --gpu

My environment:

chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc1
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
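
The two `pip freeze` listings in this thread are easy to compare mechanically. The following is a minimal sketch, not part of chainerkfac; the helper names `parse_freeze` and `diff_environments` are hypothetical. It shows that the listings differ only in the NumPy release candidate. Whether that difference has anything to do with the crash is not established in this thread.

```python
def parse_freeze(text):
    """Parse `pip freeze` output into a {package: version} dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned with '=='.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_environments(a, b):
    """Return {package: (version_in_a, version_in_b)} for every mismatch."""
    pa, pb = parse_freeze(a), parse_freeze(b)
    return {
        name: (pa.get(name), pb.get(name))
        for name in sorted(set(pa) | set(pb))
        if pa.get(name) != pb.get(name)
    }


# Listings copied from the two comments above.
COLLABORATOR_ENV = """
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc2
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
"""

AUTHOR_ENV = COLLABORATOR_ENV.replace("numpy==1.17.0rc2", "numpy==1.17.0rc1")

print(diff_environments(COLLABORATOR_ENV, AUTHOR_ENV))
# prints {'numpy': ('1.17.0rc2', '1.17.0rc1')}
```

Note that `pip freeze` does not report the MPI, CUDA, or NCCL builds installed on the system, so identical Python packages do not imply identical environments.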

Collaborator

y1r commented Jul 19, 2019

Please try without chainerkfac.
I mean that you can try the multi-GPU MNIST training example provided by Chainer.

Author

windwm commented Jul 19, 2019

Yes, I trained MNIST successfully using 2 GPUs without K-FAC. But when I try to train MNIST, CIFAR-10, and ImageNet with chainerkfac, I still get this error:
mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault).
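
The `mpirun` message only reports that a rank died on signal 11; it does not say where. One way to narrow this down is Python's standard `faulthandler` module, which dumps the Python traceback when the process receives a fatal signal such as SIGSEGV. The sketch below is an illustration, not something taken from chainerkfac: the function name `enable_crash_traceback` and the log file naming are hypothetical, and reading the rank from `OMPI_COMM_WORLD_RANK` assumes Open MPI's `mpirun` (other launchers set different variables).

```python
import faulthandler
import os
import sys


def enable_crash_traceback(log_dir=None):
    """Dump Python tracebacks of all threads on a fatal signal (e.g. SIGSEGV).

    With log_dir=None the dump goes to stderr. Otherwise each rank writes to
    its own file, and the open file object is returned: the caller must keep
    a reference to it so the file stays open for the lifetime of the process.
    """
    if log_dir is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None

    # Assumption: Open MPI's mpirun exports OMPI_COMM_WORLD_RANK.
    # Falls back to "0" when the variable is absent (e.g. single-process runs).
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "0")
    os.makedirs(log_dir, exist_ok=True)
    log_file = open(os.path.join(log_dir, "fault_rank{}.log".format(rank)), "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traceback("fault_logs")` at the very top of `train.py`, before the other imports, would leave one log per rank showing the Python frame that was active at the crash. The same handler can be enabled without editing the script through the interpreter option `-X faulthandler`, for example `mpirun -np 4 python -X faulthandler train.py --distributed`.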
