Problems with MaskablePPO #195
Comments
I tried modifying the source code to change validate_args to False, and it works.
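A minimal sketch of that workaround without editing the installed PyTorch files, assuming the error comes from the distribution's argument validation (the Simplex constraint check) at construction time:

```python
# Sketch: disable distribution argument validation globally instead of
# editing the torch/distributions source. This skips the Simplex check
# that raises the error, but it also hides genuinely invalid
# probabilities (e.g. NaNs coming out of the policy network).
from torch.distributions import Distribution

Distribution.set_default_validate_args(False)
```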
The dtype also causes problems sometimes.
The error stops occurring in my case too, but the agent doesn't learn with this change.
Same problem here; I think this needs some support from the maintainers to solve.
Did you manage to solve the problem using the solution from this issue?
Yes, at least my env works!
class _Simplex(Constraint):
It looks like this in my case, and my mask doesn't work after this change. The agent keeps making invalid moves.
That is PyTorch's source code; you can see the answer in pytorch/pytorch#87468 (comment). Edit the source code and it works. My env can learn without errors, and the performance is the same as before this change.
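For reference, the class being discussed lives in torch/distributions/constraints.py. A sketch of the kind of edit people apply there (the relaxed tolerance below is an assumption; the linked comment may use a different value):

```python
import torch
from torch.distributions.constraints import Constraint

class _Simplex(Constraint):
    """Constrain to the unit simplex in the innermost (rightmost) dimension,
    i.e. x >= 0 and x.sum(-1) == 1 (up to a numerical tolerance)."""
    event_dim = 1

    def check(self, value):
        # Loosening the tolerance (stock PyTorch uses 1e-6) makes the check
        # more forgiving of float32 round-off in the action probabilities.
        return torch.all(value >= 0, dim=-1) & ((value.sum(-1) - 1).abs() < 1e-4)
```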
If the agent takes invalid actions, maybe you should check your action mask.
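A small sanity check along those lines (this assumes a Discrete action space and that the env exposes the action_masks() method that sb3-contrib looks up):

```python
import numpy as np

def check_action_mask(env):
    # The mask should be a boolean vector with one entry per discrete action,
    # and at least one action must remain valid in every state.
    mask = np.asarray(env.action_masks(), dtype=bool)
    assert mask.shape == (env.action_space.n,), f"unexpected mask shape {mask.shape}"
    assert mask.any(), "no valid action available in this state"
    return mask
```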
I did change the source code, and that is what it looks like after the change. You are right, my action mask had an error. Now it seems to be learning, but the reward is increasing slowly. I still have the problem that the mean episode length keeps increasing, even though I know the agent can win quickly in some cases. I'll give feedback on whether it managed to learn.
If the reward is increasing slowly, maybe the problem is the reward function.
🐛 Bug
Hi
I had problems with MaskablePPO, which I described in DLR-RM/stable-baselines3#1596. I thought I had found a solution in one of the issues, #81 (comment). The problem is that the error stopped occurring, but at the same time the agent lost its ability to learn. Below are screenshots of mean rewards: the 150k and 260k timestep runs are the case with the error, and the 4M timestep run is the case without the error.
Unfortunately, I don't have screenshots of the learning process where the agent managed to reach a mean reward of ~-0.75 before the error.
Code example
The only thing I changed in the code since the last issue is the solution from #81.
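For context, a sketch of where that kind of change sits in a typical MaskablePPO setup; MyEnv and mask_fn are placeholders, and the global validate_args switch below stands in for the exact edit from #81:

```python
# Sketch of a typical MaskablePPO setup around which the validate_args /
# constraint change from this thread is applied. MyEnv is a placeholder
# for the actual environment.
from torch.distributions import Distribution
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

Distribution.set_default_validate_args(False)  # workaround discussed in this thread

def mask_fn(env):
    # Must return a boolean array of shape (n_actions,), True = action allowed.
    return env.action_masks()

env = ActionMasker(MyEnv(), mask_fn)
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```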
Relevant log output / Error message
No response
System Info
No response