Bug in Reward function #72
Comments
Just to be clear, the problem you mention happens when self.reward_only_positive is True, right? If so, that branch could use max(0, ...) instead of abs():

    if self.reward_only_positive:
        reward = max(0, delta_enemy + delta_deaths)  # shield regeneration
    else:
        reward = delta_enemy + delta_deaths - delta_ally
Yes! This happens especially in maps with Protoss units, which can regenerate shields. In the map "3s5z_vs_3s6z" I found that the allies can learn a strategy that increases the reward without winning the game. Specifically, they learn to injure the enemies a little, immediately run away and hide in a corner, wait for the enemies to recover, and then repeat. However, the problem can be solved by modifying the reward function to this:

    if self.reward_only_positive:
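The hit-and-run pattern described above can be sketched with made-up numbers. This is only an illustration, not the SMAC code: `hit_and_run_deltas`, the damage value, and the regeneration rate are all hypothetical, and each entry stands for `delta_enemy + delta_deaths` in one step with no enemy ever dying.

```python
def hit_and_run_deltas(cycles, damage=40, regen_per_step=10):
    """Per-step (delta_enemy + delta_deaths) for a hypothetical hit-and-run loop.

    Each cycle deals `damage` once, then the enemy shield regenerates by
    `regen_per_step` per step until it is full again (negative deltas).
    All numbers are invented for illustration.
    """
    regen_steps = damage // regen_per_step
    one_cycle = [damage] + [-regen_per_step] * regen_steps
    return one_cycle * cycles


deltas = hit_and_run_deltas(cycles=5)

# Positive-only reward with abs(): regeneration steps are rewarded too.
return_with_abs = sum(abs(d) for d in deltas)

# Positive-only reward with max(0, ...): regeneration steps yield zero.
return_with_clamp = sum(max(0, d) for d in deltas)

print(return_with_abs)    # 400
print(return_with_clamp)  # 200
```

In this toy episode, half of the return under abs() comes from steps in which the enemy is recovering rather than being damaged. With the clamp, only the damage itself is rewarded.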
I sent a quick PR with this tiny change. Check if it solves the issues you have in your experiments.
Thanks! I think this will do. I will check this again and get back to you soon.
Thanks both for pointing this out and for sending PR #76. We're going to see how best to integrate this fix into the upcoming SMAC versions to avoid confusion. One issue we've noticed is that some people compare results obtained with different SMAC/StarCraft II versions and report unfair comparisons between methods in their work.
I have just seen similar behavior in maps that have Medivacs (MMM and MMM2). Medivacs can heal, and my agents have learned to wait for the Medivac to heal the last unit before killing it, sometimes at the expense of the match.
Yes! I've also noticed this in MMM and MMM2. Your solution of changing the positive-only reward to "reward = max(0, delta_enemy + delta_deaths)" fixes this issue.
Why not merge #76?
Given how much the benchmark has been used by the community, fixing this issue now would result in unfair comparisons with existing work. Therefore, we will not merge it into the main branch of this repo. If you really want to use that particular version of SMAC, you are welcome to use the branch of #76: https://github.com/douglasrizzo/smac/tree/patch-1. But you must make it clear that this is not the standard version of the benchmark when presenting those results. Lastly, this issue is resolved in SMACv2, the second version of the benchmark, which I encourage you to use instead: https://github.com/oxwhirl/smacv2.
I found a bug in the reward function in the file "./env/starcraft2/starcraft2.py", line 729. The bug is that when the enemies heal or regenerate shields, the allies receive rewards. The bug is located in the function "reward_battle(self)", Line 729:

    reward = abs(delta_enemy + delta_deaths)  # shield regeneration

In Line 729, the variable "delta_enemy" is negative when the enemy heals or regenerates shields. However, Line 729 uses abs() to convert "delta_enemy + delta_deaths" to a positive reward. This means the allies are rewarded when the enemy heals.
One consequence of this problem is that the allies may learn only to hurt the enemies but never to kill them, so that they receive rewards when the enemies heal afterwards. In that case, a policy can learn to increase rewards but never win the game.
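To make the sign problem concrete, here is a minimal sketch of the two positive-only variants discussed in this thread. The function names `positive_reward_abs` and `positive_reward_clamped` and the numbers are hypothetical; only the two expressions themselves come from the issue.

```python
def positive_reward_abs(delta_enemy, delta_deaths):
    """Behaviour reported in this issue: abs() discards the sign of the total."""
    return abs(delta_enemy + delta_deaths)


def positive_reward_clamped(delta_enemy, delta_deaths):
    """Proposed change: a negative total (enemy healed) is clamped to zero."""
    return max(0, delta_enemy + delta_deaths)


# Hypothetical step in which an enemy regenerates 30 points of shield:
# delta_enemy is negative and nobody died.
healing_step = {"delta_enemy": -30, "delta_deaths": 0}

# Hypothetical step in which the allies deal 30 points of damage.
damage_step = {"delta_enemy": 30, "delta_deaths": 0}

print(positive_reward_abs(**healing_step))      # 30: healing is rewarded
print(positive_reward_clamped(**healing_step))  # 0: healing is ignored
print(positive_reward_abs(**damage_step))       # 30
print(positive_reward_clamped(**damage_step))   # 30
```

Both variants agree whenever the total is non-negative. They differ only on steps where the enemy recovers more than it loses.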