I am trying to get multi-GPU training working when tuning with Ray Tune. However, I am getting the following error:

Looking at the following lines in `RayLauncher._recover_results_in_main_process()`:

```python
if ray_output.weights_path is not None:
    state_stream = ray_output.weights_path
    # DDPSpawnPlugin.__recover_child_process_weights begin
    # Difference here is that instead of writing the model weights to a
    # file and loading it, we use the state dict of the model directly.
    state_dict = load_state_stream(state_stream, to_gpu=self._strategy.use_gpu)
    # Set the state for PTL using the output from remote training.
    trainer.lightning_module.load_state_dict(state_dict)
```

If I probe what `state_dict` is vs. `trainer.lightning_module.state_dict()`, the latter is completely empty; it just returns `OrderedDict()`. The former contains all of the weights listed in the error, with actual data. So for some reason the LightningModule is not being set up (or something along those lines) and ends up with no state. This is not an issue when I don't use ray_lightning and instead run 1 GPU per trial with plain ray[tune].
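To make the probe concrete, a small helper along these lines (hypothetical debugging code, not part of ray_lightning) reproduces the comparison described above; it assumes you pass in the dict returned by `load_state_stream()` and `trainer.lightning_module` at that point in the launcher:

```python
def compare_state_dicts(recovered_state_dict, lightning_module):
    """Compare a recovered state dict against the live module's state dict.

    `recovered_state_dict` would be the dict produced by load_state_stream(),
    and `lightning_module` would be trainer.lightning_module at the probe
    point inside RayLauncher._recover_results_in_main_process().
    """
    live = lightning_module.state_dict()
    print(f"recovered state_dict entries: {len(recovered_state_dict)}")
    print(f"lightning_module.state_dict() entries: {len(live)}")  # reportedly 0, i.e. OrderedDict()

    # Any key present remotely but absent locally would be rejected as
    # "unexpected" by a strict load_state_dict() call on an empty module.
    only_remote = sorted(set(recovered_state_dict) - set(live))
    for key in only_remote[:10]:
        print(f"  missing locally: {key}")
```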
For reference, this is how I'm running tuning.
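For context only (this is not the reporter's script, which is not shown above): with pytorch-lightning 1.6.5, the matching ray_lightning release wires into Ray Tune roughly as in the sketch below. The model, metric, and search-space choices are hypothetical placeholders, and the exact strategy class name (`RayStrategy` vs. the older `RayPlugin`) depends on the installed ray_lightning version.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayStrategy
from ray_lightning.tune import TuneReportCallback, get_tune_resources


class TinyModule(pl.LightningModule):
    """Minimal stand-in model; the real LightningModule goes here."""

    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def loaders():
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    return DataLoader(data, batch_size=16), DataLoader(data, batch_size=16)


num_workers = 1   # Ray workers per trial
use_gpu = True    # one GPU per worker


def train_fn(config):
    train_dl, val_dl = loaders()
    trainer = pl.Trainer(
        max_epochs=2,
        strategy=RayStrategy(num_workers=num_workers, use_gpu=use_gpu),
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(TinyModule(lr=config["lr"]), train_dl, val_dl)


if __name__ == "__main__":
    tune.run(
        train_fn,
        config={"lr": tune.loguniform(1e-4, 1e-1)},
        num_samples=4,
        resources_per_trial=get_tune_resources(num_workers=num_workers, use_gpu=use_gpu),
    )
```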
Other info:
- Python 3.9.15
- OS: Ubuntu 18.04.4 LTS (Bionic Beaver)
- cudatoolkit=10.2
- pytorch=1.12.1
- pytorch-lightning=1.6.5
- cudnn=7.6.5

Specs:
- 4x GeForce RTX 2080 Ti
- 32 CPUs (x86_64)