New A2C example with entropy #26

Open · wants to merge 10 commits into master
Conversation

galatolofederico (Contributor) commented Feb 18, 2020

Hi, it's me again 😅
Since I'm working with cherry these days, I thought I'd share my implementation of A2C as the new actor_critic_cartpole.py example.
My implementation matches the baselines one, with the entropy loss and using the mean as the reduction instead of the sum. It is divided into two classes (one for the A2C logic and one child class for the actual policy). I have also used only the declarative interface of PyTorch (just because I like it more than the functional one 😃).
I think a coding style different from the policy gradient example might be useful to have in the examples folder.
I have tested it and it should be safe to merge, but of course more benchmarking is still required (that is never enough 😃).

Edit: If you like it, wait before merging. The rewards should not be normalized and I should run the environment only for some steps, but even with those fixes I am getting values for the value_loss that are inconsistent with the ones from baselines.
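
For reference, here is a minimal sketch of the loss described above (policy gradient term, value term, and entropy bonus, all with mean reduction). The function name, coefficients, and tensor handling are illustrative assumptions, not the PR's actual code:

```python
import torch

def a2c_loss(log_probs, values, returns, entropies,
             value_coef=0.5, entropy_coef=0.01):
    # Advantage: how much better the taken action was than the value estimate.
    advantages = returns - values
    # Policy gradient term (advantages detached so they act only as weights).
    policy_loss = -(log_probs * advantages.detach()).mean()
    # Value regression term, mean reduction instead of sum.
    value_loss = advantages.pow(2).mean()
    # Entropy bonus encourages exploration (subtracted from the total loss).
    entropy_loss = entropies.mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_loss
```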

galatolofederico (Contributor, Author) commented Feb 19, 2020

I've just pushed 3 more commits; the changes relative to the first one are:

  • Multiple workers (as in A3C)
  • RMSprop optimizer instead of Adam
  • Avoid using the Runner wrapper (I think there is a bug when the environment is vectorized)
  • No reward normalization

I followed the paper Asynchronous Methods for Deep Reinforcement Learning, and with these changes the implementation should be identical to the baselines one (a rough sketch of the changes follows).
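
A rough sketch of what those changes could look like; the environment, network sizes, and hyperparameters are assumptions in the spirit of the A3C paper and the baselines defaults, not the PR's exact values:

```python
import gym
import torch
from torch import nn, optim

num_workers = 16  # multiple synchronous workers, one environment each
envs = [gym.make('CartPole-v0') for _ in range(num_workers)]

# Simple actor network; CartPole has 4 observation dims and 2 actions.
policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))

# RMSprop instead of Adam, with baselines-style defaults (assumed values).
optimizer = optim.RMSprop(policy.parameters(), lr=7e-4, alpha=0.99, eps=1e-5)

# Rewards are used as-is: no normalization before computing returns.
```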

galatolofederico mentioned this pull request Feb 21, 2020
seba-1511 (Member) commented:

Thanks again for a nice PR, @galatolofederico!

I would like to keep a copy of the original actor-critic example, as it's useful for newcomers to compare against the PyTorch examples implementation. What do you think about creating a folder examples/cartpole and putting the reinforce, actor-critic, and your example (actor_critic_declarative.py?) in there?

When you say it's identical to the baselines one, have you tried it on Atari or MuJoCo/PyBullet? I've never gotten around to thoroughly benchmarking our A2C implementation on Atari, and it doesn't work out of the box for PyBullet. If yours does, it would be interesting to update or add dedicated scripts for that.

Finally, could you also add a line to CHANGELOG.md about this PR? Thanks!

galatolofederico (Contributor, Author) commented:

Sure, it would be nice to have both versions!
I am running some Atari benchmarks right now; I should have the results in a couple of days.
For now let's leave this PR open: I have made some improvements and I still have to push the changes.

galatolofederico (Contributor, Author) commented:

Hi, I spent a couple of days benchmarking and testing my A2C implementation, and I came to the conclusion that there are some bugs in cherry when using vectorized environments. I wrote some wrappers and a replay memory from scratch (with the same interface as cherry) and used my vectorized A2C implementation (slightly different from the one in this PR): it works with my wrappers and replay memory, but it does not with cherry. I do not have time right now, but as soon as I have some spare time I'll try to write a MWE. I will also implement other algorithms using my classes and release everything as open source, in the hope that it will help with debugging cherry.
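
For what it's worth, a hypothetical shape for such a from-scratch replay; the class and method names below are made up to illustrate the "append transitions, read them back as stacked tensors" pattern described above, and are not cherry's actual API:

```python
import torch

class MiniReplay:
    """Toy replay memory storing transitions from (possibly vectorized) steps."""

    def __init__(self):
        self._storage = []

    def append(self, state, action, reward, next_state, done):
        # Store one transition per environment step.
        self._storage.append((state, action, reward, next_state, done))

    def field(self, index):
        # Stack one field of every stored transition into a single tensor.
        return torch.stack([torch.as_tensor(t[index]) for t in self._storage])

    def empty(self):
        # Clear the memory between rollouts.
        self._storage.clear()
```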
