-
I use MetaAdam for the inner training loop. Everything looked right: all tensors and models are on CUDA before `self.qf1_optimizer.step(qf1_loss)`, and printing the device of `qf1_loss` confirms it is on CUDA, so I don't know why MetaAdam found a tensor on the CPU. Here is my code using MetaAdam:

```python
meta_optimizer_class = TorchOpt.MetaAdam
self.policy_optimizer = meta_optimizer_class(
    self.policy, lr=policy_lr, betas=(beta_1, 0.999), moment_requires_grad=False
)
```

And here is the error information:

```
  File "/NAS2020/Workspaces/DRLGroup/lymao/DLproject/h_divergence_meta_learning/ILSwiss/rlkit/torch/algorithms/bmg/bmg.py", line 153, in train_step
    self.qf1_optimizer.step(qf1_loss)
  File "/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torchopt/_src/optimizer/meta/base.py", line 69, in step
    updates, new_state = self.impl.update(
  File "/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torchopt/_src/transform.py", line 66, in update_fn
    flattened_updates, state = inner.update(
  File "/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torchopt/_src/base.py", line 183, in update_fn
    updates, new_s = fn(updates, s, params=params, inplace=inplace)
  File "/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torchopt/_src/transform.py", line 318, in update_fn
    mu = _update_moment(
  File "/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torchopt/_src/transform.py", line 221, in _update_moment
    return map_flattened(f, updates, moments)
  File "/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torchopt/_src/transform.py", line 51, in map_flattened
    return list(map(func, *args))
  File "/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torchopt/_src/transform.py", line 218, in f
    return t.mul(decay).add_(g, alpha=1 - decay) if g is not None else t
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
-
Hi @maoliyuan, I was wondering which version of torchopt produced this error report? You can run the following snippet to find out:

```python
import torchopt, numpy, sys

print(torchopt.__version__, numpy.__version__, sys.version, sys.platform)
```
-
@maoliyuan Hi, could you try your code with our latest dev version? We have built new wheels via GitHub Actions; the artifacts can be found here: Build #393. Download the `wheels` artifact (py38 / py39 / py310).

Please follow the instructions on https://pytorch.org/ to upgrade your `torch` installation to 1.13.0:

```bash
pip3 install torch torchvision torchaudio
```

Then install the wheel:

```bash
pip3 install torchopt-0.5.1.dev49+ga89bd4e-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```

Thanks.
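Once installed, a quick way to confirm the dev wheel is picked up, reusing the version check from above (the expected string is taken from the wheel's filename):

```python
import torchopt

print(torchopt.__version__)  # expect 0.5.1.dev49+ga89bd4e
```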
-
Thanks for the reply! That problem is solved; I'm sorry I didn't pay attention to the order of defining the optimizer and moving the model to CUDA (a minimal sketch of the fix is just below).
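For reference, here is a minimal sketch of what I got wrong (with a toy `nn.Linear` standing in for my actual policy network):

```python
import torch
import torch.nn as nn
import torchopt

net = nn.Linear(4, 2)

# What I had: the meta-optimizer was defined before the model was moved
# to CUDA, so part of its state stayed on the CPU.
# opt = torchopt.MetaAdam(net, lr=1e-4, moment_requires_grad=False)
# net.cuda()

# The fix: move the model to CUDA first, then define the meta-optimizer.
net = net.cuda()
opt = torchopt.MetaAdam(net, lr=1e-4, moment_requires_grad=False)

x = torch.randn(8, 4, device="cuda")
loss = net(x).pow(2).mean()
opt.step(loss)  # every tensor involved now lives on cuda:0
```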
But I get another problem when using MetaAdam. My code doesn't use any sqrt function to compute the loss, yet in the outer loop, when I call backward and use torch.optim.Adam, I get "RuntimeError: Function 'SqrtBackward0' returned nan values in its 0th output." When I switch MetaAdam to MetaSGD, nothing goes wrong. Is there anything wrong with how I use MetaAdam? Here are the full error information and my script.

Below is the full error information:

```
/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in SqrtBackward0. No forward pass information available. Enable detect anomaly during forward pass for more information. (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:92.)
  Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "run_scripts/bmg_exp_script.py", line 176, in <module>
    experiment(exp_specs)
  File "run_scripts/bmg_exp_script.py", line 134, in experiment
    algorithm.train(start_epoch=epoch)
  File "/NAS2020/Workspaces/DRLGroup/lymao/DLproject/h_divergence_meta_learning/ILSwiss/rlkit/core/base_algorithm.py", line 162, in train
    self.start_training(start_epoch=start_epoch)
  File "/NAS2020/Workspaces/DRLGroup/lymao/DLproject/h_divergence_meta_learning/ILSwiss/rlkit/core/base_algorithm.py", line 290, in start_training
    self._try_to_train(epoch)
  File "/NAS2020/Workspaces/DRLGroup/lymao/DLproject/h_divergence_meta_learning/ILSwiss/rlkit/core/base_algorithm.py", line 302, in _try_to_train
    self._do_training(epoch)
  File "/NAS2020/Workspaces/DRLGroup/lymao/DLproject/h_divergence_meta_learning/ILSwiss/rlkit/torch/algorithms/torch_meta_rl_algorithm.py", line 49, in _do_training
    self.trainer.train_step(self.get_batch(), self.inner_train_steps_total, avg_reward_per_iter)
  File "/NAS2020/Workspaces/DRLGroup/lymao/DLproject/h_divergence_meta_learning/ILSwiss/rlkit/torch/algorithms/bmg/bmg.py", line 210, in train_step
    matching_loss.backward()
  File "/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/lymao/anaconda3/envs/meta/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'SqrtBackward0' returned nan values in its 0th output.
```
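As the warning at the top of the trace suggests, enabling anomaly detection during the forward pass names the op that produced the NaN. A tiny standalone snippet (not my training code, just the same error class) that reproduces it:

```python
import torch

x = torch.zeros(1, requires_grad=True)

# The forward pass is fine: 0 * sqrt(0) = 0. In the backward pass,
# SqrtBackward computes grad_out / (2 * sqrt(x)) = 0 / 0 = nan at x = 0,
# so anomaly mode raises the same RuntimeError shown above.
with torch.autograd.detect_anomaly():
    loss = (x * torch.sqrt(x)).sum()
    loss.backward()
```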
Below is my training script. I define the optimizer outside of the trainer:

```python
meta_optimizer_class = torchopt.MetaAdam
algorithm.trainer.policy_optimizer = meta_optimizer_class(
    policy, lr=0.0001, moment_requires_grad=False
)
```

The train loop corresponds to the Bootstrapped Meta-Gradients update: the policy is updated (using MetaAdam) with a loss that doesn't contain any sqrt:
```python
if n_train_step_total % self.num_steps_per_loop == self.inner_loop_steps - 1:
    self.k_state_dict = TorchOpt.extract_state_dict(self.policy)
if n_train_step_total % self.num_steps_per_loop == self.total_steps_per_loop - 2:
    self.k_l_m1_state_dict = TorchOpt.extract_state_dict(self.policy)
if n_train_step_total % self.num_steps_per_loop == self.total_steps_per_loop - 1:
    matching_loss = self.matching_function(
        self.policy_k, self.policy, obs, self.k_state_dict
    )
```

Here the policy is just an MLP without any sqrt function:

```python
def matching_function(self, policy_k, tb, meta_observations, policy_k_state_dict):
    with torch.no_grad():  # I don't want to compute grads of self.policy here
        policy_outputs_tb = tb(meta_observations)
        policy_mean_tb, policy_log_std_tb = policy_outputs_tb[1], policy_outputs_tb[2]
    TorchOpt.recover_state_dict(policy_k, policy_k_state_dict)
    policy_outputs_k = policy_k(meta_observations)
    policy_mean_k, policy_log_std_k = policy_outputs_k[1], policy_outputs_k[2]
    div = (
        self.matching_mean_coef * self.matching_loss(policy_mean_tb, policy_mean_k)
        + self.matching_std_coef * self.matching_loss(policy_log_std_tb, policy_log_std_k)
    )
    return div
```

I'm sorry that I can't give you prettier code and error formatting because I'm not familiar with GitHub Discussions, but I still hope you can help me!
-
You can refer to issue #26 for the NaN bug in MetaAdam and the reasons for getting NaN. You can either set `use_accelerated_op` to `True`, or register a hook to filter out the NaNs. Here is an example:

```python
impl = torchopt.chain(
    torchopt.hook.register_hook(torchopt.hook.zero_nan_hook),
    torchopt.adam(1e-1),
)
inner_opt = torchopt.MetaOptimizer(net, impl)
```
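For completeness, a sketch of the first remedy as well (assuming the dev wheel from above; `net` stands in for the policy network):

```python
import torch.nn as nn
import torchopt

net = nn.Linear(4, 2).cuda()

# use_accelerated_op=True swaps in the fused Adam op, whose hand-written
# backward avoids the NaN produced by autograd's SqrtBackward0 (see #26).
opt = torchopt.MetaAdam(net, lr=1e-4, use_accelerated_op=True)
```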