New A2C example with entropy #26
base: master
Conversation
I've just pushed 3 more commits; the changes to the first one are:
I followed the paper Asynchronous Methods for Deep Reinforcement Learning, and with these changes the implementation should be identical to the baselines one.
…ckward through the bootstrap path
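For context, a minimal sketch (not code from this PR) of what "don't backward through the bootstrap path" typically means in A2C: the bootstrap value that seeds the n-step returns is detached so the return targets carry no gradient back into the critic. The function name and tensor layout here are assumptions.

```python
import torch

def discounted_returns(rewards, dones, bootstrap_value, gamma=0.99):
    # rewards, dones: sequences of scalar tensors of length T
    # bootstrap_value: V(s_T) predicted by the critic
    R = bootstrap_value.detach()  # cut the gradient through the bootstrap path
    returns = []
    for r, d in zip(reversed(rewards), reversed(dones)):
        R = r + gamma * R * (1.0 - d)
        returns.append(R)
    returns.reverse()
    return torch.stack(returns)
```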
Thanks again for a nice PR @galatolofederico! I would like to keep a copy of the original actor-critic example, as it's useful for newcomers to compare against the PyTorch examples implementation. What do you think about creating a folder? When you say it's identical to the baselines one, have you tried it on Atari or MuJoCo/PyBullet? I've never gotten around to thoroughly benchmarking our A2C implementation on Atari, and it doesn't work out of the box for PyBullet. If yours did, it would be interesting to update it or have dedicated scripts for that. Finally, could you also add one line in CHANGELOG.md about this PR? Thanks!
Sure, it would be nice to have both versions!
Hi, I spent a couple of days benchmarking and testing my A2C implementation, and I came to the conclusion that there are some bugs in cherry when using vectorized environments. I wrote some wrappers and a replay memory from scratch (with the same interface as cherry) and used my vectorized A2C implementation (slightly different from the one in this PR): it works with my wrappers and replay memory, but it does not with cherry. I don't have time right now, but as soon as I have some spare time I'll try to write an MWE. I will also implement other algorithms using my classes and release everything as open-source code, in the hope that it will help with debugging cherry.
Hi, it's me again 😅
Since I'm working with cherry these days, I thought of sharing my implementation of A2C as the new actor_critic_cartpole.py example. My implementation is the same as the baselines one, with the entropy loss and using the mean as the reduction instead of the sum. It is divided into two classes (one for the A2C logic and one child class for the actual policy). I have also used only the declarative interface of PyTorch (just because I like it more than the functional one 😃).
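To make the described design concrete, here is an illustrative sketch (assumed structure, not the PR's actual code) of an A2C module with an entropy bonus, mean reduction, and the declarative torch.nn interface; the class, method, and coefficient names are placeholders.

```python
import torch
import torch.nn as nn

class A2C(nn.Module):
    def __init__(self, obs_size, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_size, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, n_actions)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h)

    def loss(self, obs, actions, returns, vf_coef=0.5, ent_coef=0.01):
        dist, values = self(obs)
        values = values.squeeze(-1)
        advantages = returns - values.detach()
        policy_loss = -(dist.log_prob(actions) * advantages).mean()  # mean, not sum
        value_loss = (returns - values).pow(2).mean()
        entropy = dist.entropy().mean()                              # entropy bonus
        return policy_loss + vf_coef * value_loss - ent_coef * entropy
```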
I think that a different coding style with respect to the policy gradient example might be useful to have in the examples folder.
I have tested it and it should be safe to merge, but of course more benchmarking is still required (there is never enough 😃).
Edit: If you like it, wait before merging. The rewards should not be normalized, and I should run the environment for just a fixed number of steps, but even with those fixes I am getting some inconsistent values for the value_loss compared to the ones from baselines.
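As a rough illustration of the two fixes mentioned in the edit, here is a hypothetical rollout helper that collects a fixed number of steps and leaves the rewards un-normalized; it assumes the classic Gym step API (obs, reward, done, info) and an agent with the forward signature from the sketch above.

```python
import torch

def collect_rollout(env, agent, obs, n_steps=5):
    # Run the environment for exactly n_steps, keeping raw rewards (no normalization).
    rewards, dones, log_probs, values = [], [], [], []
    for _ in range(n_steps):
        dist, value = agent(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        rewards.append(torch.tensor(float(reward)))   # raw reward
        dones.append(torch.tensor(float(done)))
        log_probs.append(dist.log_prob(action))
        values.append(value.squeeze(-1))
        if done:
            obs = env.reset()
    return obs, rewards, dones, log_probs, values
```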