```python
# Get loss and info values before update
pi_l_old, pi_info_old = compute_loss_pi(data)  # line A
pi_l_old = pi_l_old.item()
v_l_old = compute_loss_v(data).item()

# Train policy with a single step of gradient descent
pi_optimizer.zero_grad()
loss_pi, pi_info = compute_loss_pi(data)  # line B
loss_pi.backward()
mpi_avg_grads(ac.pi)    # average grads across MPI processes
pi_optimizer.step()  # line C
```
I think the parameter updates happen at line C, right? Therefore the NN params don't change between line A and line B, so `pi_info_old` should be the same as `pi_info`.

Similarly, the policy params, `obs`, and `act` are all unchanged when computing `logp_old` and `logp`, so shouldn't those be the same as well?
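To sanity-check the reasoning, here is a minimal toy sketch (a made-up linear categorical policy and random data, not the Spinning Up code; `compute_loss_pi` here only mirrors the name in the snippet above): before the optimizer steps, two forward passes return identical log-probs, and they only diverge after the step.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Toy stand-ins (hypothetical, not the Spinning Up objects): a tiny categorical
# policy plus a fixed batch of observations, actions, and advantages.
torch.manual_seed(0)
policy = nn.Linear(4, 3)
obs = torch.randn(8, 4)
act = torch.randint(0, 3, (8,))
adv = torch.randn(8)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def compute_loss_pi():
    """Plain policy-gradient loss; returns the loss and the per-sample log-probs."""
    logp = Categorical(logits=policy(obs)).log_prob(act)
    return -(logp * adv).mean(), logp

loss_old, logp_old = compute_loss_pi()       # "line A": before any update
optimizer.zero_grad()
loss_pi, logp = compute_loss_pi()            # "line B": params still untouched
print(torch.allclose(logp_old, logp))        # True -- no optimizer step yet
loss_pi.backward()
optimizer.step()                             # "line C": params change only here
_, logp_after = compute_loss_pi()
print(torch.allclose(logp_old, logp_after))  # False -- values differ after the step
```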
After looking at the PPO implementation, one can confirm that this indeed doesn't make sense when the policy is trained for only one iteration; keeping separate "old" values only makes sense with multiple iterations of policy updates per batch, as in PPO (see the sketch below).
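For illustration, a hedged sketch of that multi-iteration structure, again on a made-up toy policy and random batch rather than the Spinning Up source, using a clipped-ratio objective in a PPO-like inner loop: `logp_old` is frozen before the loop, while `logp` is recomputed with the already-updated parameters each iteration, so the two only coincide on the first pass.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Same toy setup as before (hypothetical names, not the Spinning Up source).
torch.manual_seed(0)
policy = nn.Linear(4, 3)
obs = torch.randn(8, 4)
act = torch.randint(0, 3, (8,))
adv = torch.randn(8)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

clip_ratio = 0.2
train_pi_iters = 10

# Snapshot the log-probs under the pre-update policy; this is the "old" quantity.
with torch.no_grad():
    logp_old = Categorical(logits=policy(obs)).log_prob(act)

for i in range(train_pi_iters):
    logp = Categorical(logits=policy(obs)).log_prob(act)
    ratio = torch.exp(logp - logp_old)        # exactly 1 only on the first iteration
    clip_adv = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    loss_pi = -torch.min(ratio * adv, clip_adv).mean()
    optimizer.zero_grad()
    loss_pi.backward()
    optimizer.step()                          # params change here, so the next logp drifts from logp_old
```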
I made a repo to share a cleaned, minimalist version of the Spinning Up implementations.