This guide only covers launching distributed training with singularity. Before you start, make sure your singularity installation works by checking INSTALL.md.
- obtain the mxnet launcher and place it in the parent directory of the simpledet working directory
git clone https://github.com/RogerChern/mxnet-dist-lancher.git lancher
- mv data, pretrain_model, and experiments outside of simpledet and symlink them back. This step avoids an unnecessary rsync of large binary files in the working directory during launching.
- after steps 1 and 2, your directory should look as follows
lancher/
simpledet/
    data -> /path/to/data
    pretrain_model -> /path/to/pretrain_model
    experiments -> /path/to/experiments
    ...
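The move-and-symlink step can be sketched as a small shell helper. This is only an illustration: `relocate_and_link` is a hypothetical name, and the storage location is an assumption you should replace with an absolute path on your own system (a relative path would produce broken symlinks).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of simpledet): move the large directories
# out of the working directory and symlink them back.
#   $1: simpledet working directory
#   $2: absolute path of the storage directory (assumed, choose your own)
relocate_and_link() {
    local workdir="$1" storage="$2" name
    mkdir -p "$storage"
    for name in data pretrain_model experiments; do
        # Only handle real directories; skip missing ones and existing symlinks.
        if [ -d "$workdir/$name" ] && [ ! -L "$workdir/$name" ]; then
            # Refuse to overwrite something already present in the storage dir.
            if [ -e "$storage/$name" ]; then
                echo "skip: $storage/$name already exists" >&2
                continue
            fi
            mv "$workdir/$name" "$storage/$name"
            ln -s "$storage/$name" "$workdir/$name"
        fi
    done
}

# Example (paths are placeholders):
#   relocate_and_link "$PWD/simpledet" /path/to/storage
```

Running `ls -l simpledet` afterwards should show the three entries as symlinks pointing into the storage directory.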
- make a hostfile, simpledet/hostfile.txt, containing the hostnames of all nodes. The launch node must be able to reach each of these nodes by ssh without a password.
node1
node2
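Before launching, it may help to confirm that every host in the hostfile accepts ssh without a password. The sketch below is an assumption-laden check, not part of the launcher: `check_hosts` is a hypothetical name, and it relies on the standard OpenSSH options `BatchMode` (fail instead of prompting for a password) and `ConnectTimeout`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try a non-interactive ssh login to each host listed
# in the hostfile (one hostname per line) and report the result.
check_hosts() {
    local hostfile="$1" host failed=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r host || [ -n "$host" ]; do
        [ -z "$host" ] && continue   # ignore blank lines
        # -n keeps ssh from consuming the hostfile on stdin;
        # BatchMode=yes makes ssh fail rather than ask for a password.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done < "$hostfile"
    return "$failed"
}

# Example:
#   check_hosts simpledet/hostfile.txt
```

The function returns a non-zero status if any host fails, so it can gate a launch script.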
- change the singularity mount point in scripts/dist_worker.sh
- launch distributed training with the launch script
bash scripts/launch.sh config/mask_r50v1_fpn_1x.py node1,node2
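If the comma-separated host list on the command line is meant to match the hostfile (an assumption; the guide does not state this), it can be derived from the file rather than typed by hand. `hosts_csv` below is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join the non-empty lines of a hostfile with commas,
# e.g. "node1" and "node2" become "node1,node2".
hosts_csv() {
    grep -v '^$' "$1" | paste -sd, -
}

# Example (run from the simpledet working directory):
#   bash scripts/launch.sh config/mask_r50v1_fpn_1x.py "$(hosts_csv hostfile.txt)"
```

Keeping the hostfile as the single source of host names avoids the two lists drifting apart when nodes are added or removed.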