New A2C example with entropy #26
base: master
Conversation
I've just pushed 3 more commits; the changes to the first one are:
I followed the paper Asynchronous Methods for Deep Reinforcement Learning, and with these changes the implementation should be identical to the baselines one.
…ckward through the bootstrap path
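For context, a minimal sketch (not code from this PR) of what "don't backward through the bootstrap path" typically means in A2C: the bootstrap value that seeds the n-step returns is detached so the return targets carry no gradient back into the critic. The function name and tensor layout here are assumptions.

```python
import torch

def discounted_returns(rewards, dones, bootstrap_value, gamma=0.99):
    # rewards, dones: sequences of scalar tensors of length T
    # bootstrap_value: V(s_T) predicted by the critic
    R = bootstrap_value.detach()  # cut the gradient through the bootstrap path
    returns = []
    for r, d in zip(reversed(rewards), reversed(dones)):
        R = r + gamma * R * (1.0 - d)
        returns.append(R)
    returns.reverse()
    return torch.stack(returns)
```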
Thanks again for a nice PR @galatolofederico! I would like to keep a copy of the original actor-critic example, as it's useful for newcomers to compare against the PyTorch examples implementation. What do you think about creating a folder? When you say it's identical to the baselines one, have you tried it on Atari or MuJoCo/PyBullet? I've never gotten around to thoroughly benchmarking our A2C implementation on Atari, and it doesn't work out of the box for PyBullet. If yours did, it would be interesting to update it or have dedicated scripts for that. Finally, could you also add one line in CHANGELOG.md about this PR? Thanks!
Sure, it would be nice to have both versions!
Hi, I spent a couple of days benchmarking and testing my A2C implementation, and I came to the conclusion that there are some bugs in cherry when using vectorized environments. I wrote some wrappers and a replay memory from scratch (with the same interface as cherry) and used my vectorized A2C implementation (slightly different from the one in this PR): it works with my wrappers and replay memory, but it does not with cherry. I don't have time right now, but as soon as I have some spare time I'll try to write an MWE. I will also implement other algorithms using my classes and release everything as open-source code, in the hope that it will help with debugging cherry.
Hi, it's me again 😅
Since I'm working with cherry these days, I thought of sharing my implementation of A2C as the new actor_critic_cartpole.py example. My implementation is the same as the baselines one, with the entropy loss and using the mean as the reduction instead of the sum. It is divided into two classes (one for the A2C logic and one child class for the actual policy). I have also used only the declarative interface of PyTorch (just because I like it more than the functional one 😃).
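To make the described design concrete, here is an illustrative sketch (assumed structure, not the PR's actual code) of an A2C module with an entropy bonus, mean reduction, and the declarative torch.nn interface; the class, method, and coefficient names are placeholders.

```python
import torch
import torch.nn as nn

class A2C(nn.Module):
    def __init__(self, obs_size, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_size, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, n_actions)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h)

    def loss(self, obs, actions, returns, vf_coef=0.5, ent_coef=0.01):
        dist, values = self(obs)
        values = values.squeeze(-1)
        advantages = returns - values.detach()
        policy_loss = -(dist.log_prob(actions) * advantages).mean()  # mean, not sum
        value_loss = (returns - values).pow(2).mean()
        entropy = dist.entropy().mean()                              # entropy bonus
        return policy_loss + vf_coef * value_loss - ent_coef * entropy
```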
I think that a different coding style with respect to the policy gradient example might be useful to have in the examples folder.
I have tested it and it should be safe to merge, but of course more benchmarking is still required (there is never enough 😃).
Edit: If you like it, wait before merging. The rewards should not be normalized, and I should run the environment for just a fixed number of steps, but even with those fixes I am getting some inconsistent values for the value_loss compared to the ones from baselines.
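As a rough illustration of the two fixes mentioned in the edit, here is a hypothetical rollout helper that collects a fixed number of steps and leaves the rewards un-normalized; it assumes the classic Gym step API (obs, reward, done, info) and an agent with the forward signature from the sketch above.

```python
import torch

def collect_rollout(env, agent, obs, n_steps=5):
    # Run the environment for exactly n_steps, keeping raw rewards (no normalization).
    rewards, dones, log_probs, values = [], [], [], []
    for _ in range(n_steps):
        dist, value = agent(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        rewards.append(torch.tensor(float(reward)))   # raw reward
        dones.append(torch.tensor(float(done)))
        log_probs.append(dist.log_prob(action))
        values.append(value.squeeze(-1))
        if done:
            obs = env.reset()
    return obs, rewards, dones, log_probs, values
```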