legal_action filtering in predict() of DQN #215
@zedwei Thanks a lot! It is a very good point that we didn't notice before. I agree with you that it would be better to remove illegal actions before the argmax. We will do an internal comparison between removing illegal actions before and after the argmax and let you know. BTW, you mentioned that you observe better performance on your own card game when applying the filter before the best-action selection. May I ask what game you are working on and how large the action space is? In addition, please keep in mind that we welcome contributions of new game implementations to enrich the toolkit if you are interested :)
I'm working on the Tractor card game (拖拉机, also known as 升级 or "80 points") :). The action space is a bit tricky: in Tractor you can 甩牌 (roughly, throwing out a multi-card combination) and 贴牌 (roughly, discarding arbitrary cards when you cannot follow the lead), and especially the latter can be any combination of cards, which would lead to a super large action space. For now I'm temporarily ignoring 甩牌 and turning 贴牌 into a single "pass" action that follows some rules. With this trick, the action space is around 200 at the moment, but the set of legal actions is usually much smaller, especially when you're NOT the first player in a round (i.e. fewer than 20). I guess it's a similar case in Doudizhu? I initially tried NFSP and it didn't work well (before fixing the issue I mentioned in this thread). Then I debugged into it, simplified the training to plain DQN, and found this potential issue. After changing the logic, training became much more efficient (at least the agent can now beat the naive rule-based agent I developed). I guess the agent was simply "exploring" at random before the fix and seldom took the "best action". Now I'm going back to NFSP, and may consider trying Deep CFR as well. Yes, I'm definitely happy to contribute once I clean up/refactor my code a bit (if no one else is working on the Tractor game).
@zedwei Tractor should be very similar to Doudizhu (a large action space, of which only a small subset is legal at any time). I don't think anyone is working on Tractor, so contributions are greatly welcome! We have been working on Doudizhu for a while, and we find that making use of action features can efficiently handle the large action space. This feature will be supported in our new version; hopefully it will help in Tractor as well. We will also move the removal of illegal actions into the predict function in the new version. Thanks again for pointing this out!
This is great to know! Although my trained agent can now beat the rule-based agent, it is still far from competing with humans. E.g., it was able to learn to play "A"s (aces), pairs, and tractors first, but it couldn't (yet) learn to 吊主牌 (lead a trump) to pass the turn to its partner when it is out of good moves, and it couldn't learn to exhaust one suit first so it can "kill" others' hands when they later play that suit, etc. These are the kind of "implicit" strategies human players have summarized from experience, and they sometimes require the partner's collaboration to work (team play). I'm not sure if today's RL/DL algorithms are powerful enough to learn these strategies by self-play. If so, what would be the ...
I think so. Given enough training, the agent should gradually learn these skills.
@daochenzha - Similarly, this line in the DQN train() function probably also needs to be reconsidered/experimented with. Theoretically, it makes more sense (at least to me) to apply the legal_action filter to the "next best action" as well during training; right now the code takes the argmax over all actions without legal_action filtering. Let me know if that makes sense. I'm still running experiments, but initial evidence shows that adding legal_action filtering in training leads to better results in my Tractor game.
@zedwei Thanks for the feedback! Yes, I agree with you, but I am hesitant about the most efficient way to implement this. I expect we need to save the legal actions to the buffer as well and sample them from the buffer; that way, we don't need to obtain the legal actions every time we sample a transition, and we can take the argmax based on the indices of the legal actions. Is that how you implemented it? You may take a look at the code in the dmc branch, where the legal actions are applied in predict as you suggest: https://github.com/datamllab/rlcard/blob/dmc/rlcard/agents/dqn_agent.py#L177-L180 Please let me know if you have any suggestions.
Yes, I did push the legal_actions of the next state into the buffer in my own implementation, basically adding another element to the Transition tuple. This is much faster than re-computing the legal actions during training.
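For illustration, a minimal sketch of such an extended transition, assuming a namedtuple-style buffer entry (field names are illustrative, not the actual rlcard definitions):

```python
from collections import namedtuple

# Transition extended with the legal actions of the next state, so the training
# step can restrict the target argmax to legal actions without re-running the
# game logic. Field names here are illustrative.
Transition = namedtuple(
    'Transition',
    ['state', 'action', 'reward', 'next_state', 'next_legal_actions', 'done'])
```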
@zedwei I have implemented this. See https://github.com/datamllab/rlcard/blob/dmc/rlcard/agents/dqn_agent.py#L195-L200 Interestingly, when I ran it on Dou Dizhu, I didn't observe a clear difference.
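The general idea being discussed here is to mask out illegal actions before taking the argmax over the next-state Q-values. A hedged sketch of one vectorized way to do that (not necessarily the linked implementation; names are assumptions):

```python
import numpy as np

def best_legal_next_actions(q_values_next, next_legal_actions):
    """Row-wise argmax restricted to each transition's legal next actions.

    q_values_next: (batch_size, num_actions) array of Q-values.
    next_legal_actions: one list of legal action indices per transition.
    """
    # Illegal entries are set to -inf so the argmax can only pick legal actions.
    masked = np.full(q_values_next.shape, -np.inf)
    for i, legal in enumerate(next_legal_actions):
        masked[i, legal] = q_values_next[i, legal]
    return np.argmax(masked, axis=1)
```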
Interesting. I'll give your implementation a try in my game when I have time and compare it with my own. Looking at the code, they are doing the same thing. BTW, below is my naive implementation without NumPy matrix operations; you may give it a try in Doudizhu as well.
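Sketched here with illustrative names (the exact snippet may differ), a non-vectorized target computation along those lines, assuming transitions that also store next_legal_actions:

```python
import numpy as np

def compute_targets(batch, q_values_next_target, discount_factor):
    """Per-sample DQN targets with legal_action filtering on the next state.

    batch: sampled transitions, each carrying `next_legal_actions`.
    q_values_next_target: (batch_size, num_actions) Q-values of the next
        states from the target network.
    """
    targets = []
    for i, t in enumerate(batch):
        if t.done:
            targets.append(t.reward)
            continue
        # Restrict the max to the legal actions of the next state.
        best_q = max(q_values_next_target[i][a] for a in t.next_legal_actions)
        targets.append(t.reward + discount_factor * best_q)
    return np.array(targets, dtype=np.float32)
```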
First of all, thanks a lot for open-sourcing this excellent framework. It helped me a lot to learn RL and develop the AI strategy for my own card game.
During debugging, I noticed that the current DQN agent doesn't apply legal_action filtering in its predict() function: the epsilon-greedy probabilities and the best action are computed over the full action space, and illegal actions are only removed afterwards. This leads to two "potential" problems in my opinion: the action selected by the argmax may itself be illegal, and the exploration probability epsilon is spread over all actions rather than only the legal ones. For example, with epsilon = 0.1 and 10 legal actions, IMO the result should be 9 actions with probability 0.01 each and the 1 best (legal) action with 0.91 instead.
I think this comes down to whether to apply legal_action filtering before or after the best-action selection. I'm not sure if there is any theory to support either way, but in my own card game, applying the filtering BEFORE the best-action selection works much better and leads to quicker/more meaningful learning.
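For reference, a minimal sketch of epsilon-greedy selection that applies the filter before the best-action selection, matching the 0.01/0.91 example above (illustrative code with assumed names, not the rlcard implementation):

```python
import numpy as np

def predict_with_legal_filter(q_values, legal_actions, epsilon):
    """Epsilon-greedy over legal actions only.

    q_values: 1-D array of Q-values over the full action space.
    legal_actions: list of legal action indices for the current state.
    """
    num_actions = len(q_values)
    # Mask illegal actions so the argmax can only land on a legal one.
    masked_q = np.full(num_actions, -np.inf)
    masked_q[legal_actions] = q_values[legal_actions]
    best_action = int(np.argmax(masked_q))

    # Spread epsilon over the legal actions only: with epsilon = 0.1 and
    # 10 legal actions, each gets 0.01 and the best one gets 0.01 + 0.90 = 0.91.
    probs = np.zeros(num_actions)
    probs[legal_actions] = epsilon / len(legal_actions)
    probs[best_action] += 1.0 - epsilon
    return np.random.choice(num_actions, p=probs), probs
```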