Problems with MaskablePPO #195
Comments
I tried modifying the source code to change validate_args to False, and it works.
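A minimal sketch of that workaround without editing the installed PyTorch files, assuming the error comes from the distribution's argument validation (the Simplex constraint check) at construction time:

```python
# Sketch: disable distribution argument validation globally instead of
# editing the torch/distributions source. This skips the Simplex check
# that raises the error, but it also hides genuinely invalid
# probabilities (e.g. NaNs coming out of the policy network).
from torch.distributions import Distribution

Distribution.set_default_validate_args(False)
```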
The dtype also causes problems sometimes.
The error stops occurring in my case too, but the agent doesn't learn with this change.
Same problem here; I think this needs some support from the maintainers to solve.
Did you manage to solve the problem using the solution from this issue?
Yes, at least my env works!
class _Simplex(Constraint):
It looks like this in my case, and my mask doesn't work after this change. The agent keeps making invalid moves.
That is PyTorch's source code; you can see the answer in pytorch/pytorch#87468 (comment). Edit the source code and it works. My env can learn without errors, and the performance is the same as before this change.
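For reference, the class being discussed lives in torch/distributions/constraints.py. A sketch of the kind of edit people apply there (the relaxed tolerance below is an assumption; the linked comment may use a different value):

```python
import torch
from torch.distributions.constraints import Constraint

class _Simplex(Constraint):
    """Constrain to the unit simplex in the innermost (rightmost) dimension,
    i.e. x >= 0 and x.sum(-1) == 1 (up to a numerical tolerance)."""
    event_dim = 1

    def check(self, value):
        # Loosening the tolerance (stock PyTorch uses 1e-6) makes the check
        # more forgiving of float32 round-off in the action probabilities.
        return torch.all(value >= 0, dim=-1) & ((value.sum(-1) - 1).abs() < 1e-4)
```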
If the agent takes invalid actions, maybe you should check your action mask.
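A small sanity check along those lines (this assumes a Discrete action space and that the env exposes the action_masks() method that sb3-contrib looks up):

```python
import numpy as np

def check_action_mask(env):
    # The mask should be a boolean vector with one entry per discrete action,
    # and at least one action must remain valid in every state.
    mask = np.asarray(env.action_masks(), dtype=bool)
    assert mask.shape == (env.action_space.n,), f"unexpected mask shape {mask.shape}"
    assert mask.any(), "no valid action available in this state"
    return mask
```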
I did change the source code, and that is what it looks like after the change. You are right, my action mask had an error. Now it seems to be learning, but the reward is increasing slowly. I still have the problem that the mean episode length keeps increasing, even though I know the agent can win quickly in some cases. I'll give feedback on whether it managed to learn.
If the reward is increasing slowly, maybe the problem is the reward function.
🐛 Bug
Hi
I had problems with MaskablePPO, which I described in DLR-RM/stable-baselines3#1596. I thought I had found a solution in one of the issues, #81 (comment). The problem is that the error stopped occurring, but at the same time the agent lost its ability to learn. Below are screenshots of mean rewards: the 150k and 260k timestep runs are the case with the error, and the 4M timestep run is the case without the error.
Unfortunately, I don't have screenshots of the learning process where the agent managed to reach a mean reward of ~-0.75 before the error.
Code example
The only thing I changed in the code since the last issue is the solution from #81.
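For context, a sketch of where that kind of change sits in a typical MaskablePPO setup; MyEnv and mask_fn are placeholders, and the global validate_args switch below stands in for the exact edit from #81:

```python
# Sketch of a typical MaskablePPO setup around which the validate_args /
# constraint change from this thread is applied. MyEnv is a placeholder
# for the actual environment.
from torch.distributions import Distribution
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

Distribution.set_default_validate_args(False)  # workaround discussed in this thread

def mask_fn(env):
    # Must return a boolean array of shape (n_actions,), True = action allowed.
    return env.action_masks()

env = ActionMasker(MyEnv(), mask_fn)
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```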
Relevant log output / Error message
No response
System Info
No response