BUG VPG: pi_info_old seems to be the same as pi_info #424

Open
tesla-cat opened this issue Oct 20, 2024 · 1 comment

Comments

@tesla-cat

        # Get loss and info values before update
        pi_l_old, pi_info_old = compute_loss_pi(data)    # line A
        pi_l_old = pi_l_old.item()
        v_l_old = compute_loss_v(data).item()

        # Train policy with a single step of gradient descent
        pi_optimizer.zero_grad()
        loss_pi, pi_info = compute_loss_pi(data)         # line B
        loss_pi.backward()
        mpi_avg_grads(ac.pi)    # average grads across MPI processes
        pi_optimizer.step()                              # line C
  • I think the parameter update happens at line C, right? If so, the network parameters do not change between line A and line B, so pi_info_old should be identical to pi_info.

  • Similarly, the policy parameters, obs, and act are all unchanged between the time logp_old was stored and the time logp is computed, so shouldn't those be equal as well, making approx_kl exactly zero? See compute_loss_pi below and the sketch after it.

    def compute_loss_pi(data):
        obs, act, adv, logp_old = data['obs'], data['act'], data['adv'], data['logp']

        # Policy loss
        pi, logp = ac.pi(obs, act)
        loss_pi = -(logp * adv).mean()

        # Useful extra info
        approx_kl = (logp_old - logp).mean().item()
        ent = pi.entropy().mean().item()
        pi_info = dict(kl=approx_kl, ent=ent)

        return loss_pi, pi_info
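
To make the point concrete, here is a minimal, self-contained sketch. ToyPolicy is a hypothetical stand-in for ac.pi, and the shapes and values are made up; the structure of compute_loss_pi mirrors the excerpt above. With no optimizer step between the two calls, both calls should return identical values, and kl should come out as exactly zero because logp was stored under the same parameters.

    import torch
    import torch.nn as nn
    from torch.distributions import Normal

    torch.manual_seed(0)

    # Hypothetical stand-in for ac.pi: a tiny Gaussian policy head that
    # returns a distribution and the log-prob of the given actions.
    class ToyPolicy(nn.Module):
        def __init__(self, obs_dim=3, act_dim=2):
            super().__init__()
            self.mu = nn.Linear(obs_dim, act_dim)
            self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

        def forward(self, obs, act):
            pi = Normal(self.mu(obs), self.log_std.exp())
            return pi, pi.log_prob(act).sum(-1)

    pi_net = ToyPolicy()
    obs, act, adv = torch.randn(8, 3), torch.randn(8, 2), torch.randn(8)

    # logp is stored at collection time, with the same parameters that
    # are still in place when the update begins.
    with torch.no_grad():
        _, logp_stored = pi_net(obs, act)

    def compute_loss_pi(data):
        obs, act, adv, logp_old = data['obs'], data['act'], data['adv'], data['logp']
        pi, logp = pi_net(obs, act)
        loss_pi = -(logp * adv).mean()
        approx_kl = (logp_old - logp).mean().item()
        ent = pi.entropy().mean().item()
        return loss_pi, dict(kl=approx_kl, ent=ent)

    data = dict(obs=obs, act=act, adv=adv, logp=logp_stored)

    pi_l_old, pi_info_old = compute_loss_pi(data)   # "line A": no step taken yet
    loss_pi, pi_info = compute_loss_pi(data)        # "line B": still no step taken

    print(pi_info_old)   # kl is 0.0, since the parameters are unchanged
    print(pi_info)       # identical to pi_info_old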
@tesla-cat
Author

After looking at the PPO implementation, one can confirm that this indeed does not make sense when the policy is trained for only a single iteration: the old-vs-new comparison only becomes meaningful with multiple iterations of policy updates, as in PPO.
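
For contrast, a simplified sketch of the PPO-style policy update (not the exact Spinning Up code, which also averages kl across MPI processes and logs the early stop): after the first pi_optimizer.step(), the current logp drifts away from the stored logp_old, so approx_kl becomes informative and is used for early stopping.

    # Several gradient steps on the same batch, unlike VPG's single step.
    train_pi_iters = 80
    target_kl = 0.01

    for i in range(train_pi_iters):
        pi_optimizer.zero_grad()
        loss_pi, pi_info = compute_loss_pi(data)
        if pi_info['kl'] > 1.5 * target_kl:
            break            # stop early once the policy has moved too far
        loss_pi.backward()
        pi_optimizer.step()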

I made a repo to share a cleaned-up, minimalist version of the Spinning Up implementations:

https://github.com/tesla-cat/minRL
