couple a3c questions / recommendations for generalizing beyond Atari #76

M00NSH0T opened this issue Feb 19, 2018 · 3 comments

M00NSH0T commented Feb 19, 2018

First, thanks for making this. It's very easy to get started with and has really helped me move things forward on a personal project of mine I've been struggling with for months. This is really awesome work. Thanks again.

In my efforts to tweak the code from your A3C CartPole implementation to work with my own custom OpenAI Gym environment, I've discovered a few things that I think can help it generalize a bit more.

  1. The paper says all layers except the output are generally shared between the actor and critic. I'm curious: why do both your actor and critic networks have a private hidden layer before the output? Mine has four shared ReLU layers with 1600 neurons each, then the actor gets a softmax output and the critic gets a linear output, and this has really helped with my stability issues. My environment is quite a bit more complicated than Atari, though, so maybe I'm missing some advantage to each network having its own private layer. (There's a rough sketch of my layout after this list.)
  2. My environment has a massive discrete action space (1547 actions), which is what led me here. One change I made was to add an action filter to your `get_action` function: a vector of ones and zeros, output by my environment, that zeroes out the probabilities of invalid actions. I then renormalize the actor's output so the probabilities sum to 1 again, and it's really sped things up. This is probably unnecessary for small action spaces, but crucial for large ones; you can always penalize invalid actions instead, but I've found that adds a ton of time to training. Anyway, here's what I did:
    ```python
    def get_action(self, state, actionfilter):
        # action probabilities from the actor for the current state
        policy = self.actor.predict(np.reshape(state, [1, self.state_size]))[0]
        # zero out invalid actions, then renormalize so the probabilities sum to 1
        policy = np.multiply(policy, actionfilter)
        probs = policy / np.sum(policy)
        action = np.random.choice(self.action_size, 1, p=probs)[0]
        return action
    ```
    where `actionfilter` is provided by a custom function on the environment. It would be easy enough to use the filter only when one is passed, or to default it to a vector of ones the same size as the action space.
  3. I'm in the process of giving each actor a different epsilon that determines how much it explores, which will also be fed into the `get_action` function. The original paper claims that giving each agent a different exploration policy can really help with stability, so I'm hoping this helps a bit more. To date, I've had to incrementally adjust my learning rate to find one that works (for me, I had to go all the way down to 1e-10).
    Anything greater than that can cause an actor to return NaNs, and then the whole thing falls apart; anything lower and it just inches along at a glacial pace. I wish I could take it up a bit, but the NaNs are killing me. I tried gradient clipping, but it's really hard to find a good threshold. Anyway, implementing different exploration policies should be pretty easy to do... might be worth checking out. I suppose it would also be possible to randomly pick a more abstract exploration type during initialization: one actor pure greedy, another epsilon-greedy with some random epsilon, and maybe a couple of other policy types thrown in for kicks. I'm going to test this out this week to see if it has any effect, and can report back if you're interested.
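
For concreteness, here's a rough Keras sketch of the shared-trunk layout from point 1. It assumes `state_size` and `action_size` are already defined, and the four 1600-unit layers are just what I happen to use, not a recommendation:

```python
from keras.layers import Dense, Input
from keras.models import Model

# shared trunk: every hidden layer is used by both the actor and the critic
state_input = Input(shape=(state_size,))
shared = Dense(1600, activation='relu')(state_input)
shared = Dense(1600, activation='relu')(shared)
shared = Dense(1600, activation='relu')(shared)
shared = Dense(1600, activation='relu')(shared)

# only the output heads are separate
policy = Dense(action_size, activation='softmax')(shared)  # actor output
value = Dense(1, activation='linear')(shared)              # critic output

actor = Model(inputs=state_input, outputs=policy)
critic = Model(inputs=state_input, outputs=value)
```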

M00NSH0T commented Feb 20, 2018

Fixed the NaN issue, for the most part. I had to change the second-to-last activation function from ReLU to tanh; apparently feeding a ReLU layer directly into a softmax can cause NaNs in large networks.
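
For anyone hitting the same thing, in the layout I sketched above the change is just the activation on the layer that feeds the softmax (a sketch; `shared` is the preceding trunk layer):

```python
# the layer feeding the softmax now uses tanh instead of relu
shared = Dense(1600, activation='tanh')(shared)            # was activation='relu'
policy = Dense(action_size, activation='softmax')(shared)  # actor output
```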

Also, I can confirm the findings of the paper: using a different policy for each agent can significantly help stability / learning. My algorithm was getting stuck on a local solution before, but now one of the agents follows a pure uniform random policy, another one is purely greedy, and the rest are greedy X% of the time (X varies by agent) and use the probabilities spit out by the network to determine actions otherwise. This has really helped break through the plateau it was hitting before.
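
Roughly, the per-agent logic lives in `get_action`. This is a sketch: `self.exploration` and `self.epsilon` are per-agent attributes I added, not part of the original code:

```python
import numpy as np

def get_action(self, state, actionfilter):
    # masked, renormalized probabilities from the actor, as in the earlier snippet
    policy = self.actor.predict(np.reshape(state, [1, self.state_size]))[0]
    policy = np.multiply(policy, actionfilter)
    probs = policy / np.sum(policy)

    if self.exploration == 'random':
        # one agent: uniform over the currently valid actions
        return np.random.choice(np.flatnonzero(actionfilter))
    if self.exploration == 'greedy' or np.random.rand() < self.epsilon:
        # one agent is always greedy; the rest are greedy X% of the time
        return np.argmax(probs)
    # otherwise, sample from the network's probabilities
    return np.random.choice(self.action_size, p=probs)
```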

Also, I'm building a couple of additional custom policies that follow a set of rules, and a couple more that focus on specific subsets of my larger action space (I have 12 threads to play with). I'm hoping that will help drive things along a bit more quickly as well. Your implementation makes this all really easy to do within the `get_action` function, to which I can pass environment variables from the loop that calls it.

keon commented Feb 20, 2018

Thanks for the feedback! This really helps a lot.

I'll summon the other guys to answer your question :)
@chris-chris @wooridle @dnddnjs @Hyeokreal @jcwleo @zzing0907

M00NSH0T commented

Thanks. I've been struggling with my own project for months now... this has definitely helped a ton, but I'm not quite there yet. As I tweak things, I'm still running into stability issues with my network occasionally returning NaNs. This can happen 5 hours into training, which is extremely aggravating. I'm tweaking your code to break out of training when that happens and to make a copy of the weight files every N episodes so I don't lose too much progress (I'm expecting this will take 1-2 weeks to train in the end).
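
In case it's useful to anyone else, the guard looks roughly like this. It's a sketch: `run_one_episode`, `max_episodes`, and the checkpoint interval are placeholders from my own loop, not part of this repo:

```python
import numpy as np

SAVE_EVERY = 50  # placeholder checkpoint interval

for episode in range(max_episodes):
    policy = run_one_episode(agent, env)  # stand-in for the existing episode loop

    # bail out as soon as the actor starts producing NaNs,
    # before they get written into the saved weights
    if np.any(np.isnan(policy)):
        print("NaN in policy at episode %d - stopping this run" % episode)
        break

    # keep numbered backup copies so a later NaN doesn't cost all progress
    if episode % SAVE_EVERY == 0:
        agent.actor.save_weights("actor_ep%d.h5" % episode)
        agent.critic.save_weights("critic_ep%d.h5" % episode)
```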

I guess one additional question I have while I'm on the subject is about the actor and critic optimizers. The paper says the optimizer they used (RMSProp) worked best when the elementwise squared gradients were shared across threads. It still works without doing that, just not nearly as well in large networks, if I'm understanding their results correctly. I wonder if there's a way to do that here?
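
For reference, the shared statistic the paper describes is just a running average of the elementwise squared gradients that every thread reads and updates. Here's a plain-NumPy sketch of that update, not tied to this repo's Keras optimizers, with illustrative hyperparameters:

```python
import threading
import numpy as np

class SharedRMSProp(object):
    """One squared-gradient accumulator g shared by all worker threads,
    as in the paper's 'shared RMSProp' variant."""

    def __init__(self, shape, lr=7e-4, decay=0.99, epsilon=0.1):
        self.g = np.zeros(shape)  # shared elementwise squared-gradient average
        self.lr, self.decay, self.epsilon = lr, decay, epsilon
        self.lock = threading.Lock()  # the paper runs lock-free; a lock keeps the sketch simple

    def step(self, params, grad):
        with self.lock:
            self.g = self.decay * self.g + (1.0 - self.decay) * grad ** 2
            params -= self.lr * grad / np.sqrt(self.g + self.epsilon)
        return params
```

My guess is the closest Keras equivalent is having every worker push gradients through one global optimizer instance instead of building one per thread, since the accumulator lives with the optimizer, but I haven't verified that against this code.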

Also, I can't find any good articles / discussion about how the learning rates should compare between the actor and the critic. Or if I want to use gradient clipping, how might I find a good threshold to use for each? This process of iterative tweaks is taking far too long in my case because it takes 10 minutes for my first set of episodes to complete.
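
For concreteness, what I'm iterating over is just per-network learning rates plus `clipnorm` on the Keras optimizers. These values are placeholders I'm testing, not recommendations:

```python
from keras.optimizers import RMSprop

# placeholder values - the actor/critic ratio and the clipnorm threshold
# are exactly the things I'm still trying to tune
actor_optimizer = RMSprop(lr=1e-5, rho=0.99, epsilon=0.01, clipnorm=40.0)
critic_optimizer = RMSprop(lr=1e-4, rho=0.99, epsilon=0.01, clipnorm=40.0)
```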
