How do I set up multi-GPU parallel training? #388
Comments
If I use:

```python
super(RegressionModel, mod).train(max_epochs=100, batch_size=2500, train_size=1, lr=0.002, device=4)
```

the run fails with the following traceback:

```
File ~/anaconda3/envs/cellana/envs/cell2loc_env/lib/python3.10/site-packages/scvi/model/base/_pyromixin.py:194, in PyroSviTrainMixin.train(self, max_epochs, accelerator, device, train_size, validation_size, shuffle_set_split, batch_size, early_stopping, lr, training_plan, datasplitter_kwargs, plan_kwargs, **trainer_kwargs)
File ~/anaconda3/envs/cellana/envs/cell2loc_env/lib/python3.10/site-packages/scvi/train/_trainrunner.py:96, in TrainRunner.call(self)
File ~/anaconda3/envs/cellana/envs/cell2loc_env/lib/python3.10/site-packages/scvi/train/_trainer.py:201, in Trainer.fit(self, *args, **kwargs)
File ~/anaconda3/envs/cellana/envs/cell2loc_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:538, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
File ~/anaconda3/envs/cellana/envs/cell2loc_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:46, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
File ~/anaconda3/envs/cellana/envs/cell2loc_env/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/multiprocessing.py:144, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
File ~/anaconda3/envs/cellana/envs/cell2loc_env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:170, in ProcessContext.join(self, timeout)

ProcessExitedException: process 3 terminated with signal SIGSEGV
```
Multi-GPU training is not needed for the regression model (the batch size is never too large for most GPUs), and it is quite non-trivial for the cell2location model: you would need to keep location-specific parameters, not just location-specific data, on different GPU devices, and it has to be full-data rather than minibatch training.
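Since a single GPU is sufficient for the regression model, a minimal sketch of requesting one GPU rather than spawning multiple processes might look like this. The argument names follow the `PyroSviTrainMixin.train` signature visible in the traceback; note that recent scvi-tools releases spell the parameter `devices` rather than `device`, so check the version you have installed:

```python
# Sketch, assuming `mod` is an already set-up cell2location RegressionModel.
# Passing device=4 asks Lightning to launch 4 worker processes, which is
# what triggered the multiprocessing SIGSEGV above; requesting a single
# GPU avoids the multi-process launcher entirely.
mod.train(
    max_epochs=100,
    batch_size=2500,
    train_size=1,
    lr=0.002,
    accelerator="gpu",  # run on GPU
    devices=1,          # one device: no process spawning
)
```

If you need a specific card on a multi-GPU machine, restricting visibility with `CUDA_VISIBLE_DEVICES=3` before launching Python is a common alternative to passing a device index.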