-
-
Notifications
You must be signed in to change notification settings - Fork 414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the TicTacToe environment #1192
Conversation
Rewards are only set once, only need to accumulate them once. No need to modify the accumulated rewards if they ae not set.
For unknown reasons, this is required for stable baselines training to work correctly. It does not appear to impact other usage as the agents are still looped over correctly after the game ends - just in a different order.
This makes the env code less cluttered and better encapsulates the board behavior. It also expands the checks for a valid move.
These never change. There is no reason to recalculate them constantly.
removes the duplicate calls to check winner
This keeps the env from needing to access the internals of the board to get the moves available.
This causes problems with test cases that have invalid board configurations to test win lines.
AgileRL tutorial is broken again. I think I can fix that and I'll pin the version so it won't be an issue in the future |
Tests are failing due to AgileRL version change. #1193 should resolve that. |
board.play_turn(0, outside_space) | ||
|
||
# 2) move by unknown agent should be rejected | ||
for unknown_agent in [-1, 2]: | ||
with pytest.raises(BadTicTacToeMoveException): | ||
with pytest.raises(AssertionError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why pytest raises? Haven’t seen that before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know, that's the recommended pytest way to confirm that a block of code raises a specific assertion. Pytest itself doesn't raise the exception. It catches and ignores that exception (because it's expected) but raises an error if the code in that block doesn't raise the expected error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok fair, we should probably also go and make the code adhere to that as well, because I’m pretty sure elsewhere that isn’t done, but I can see it being good practice. Not a problem here tbh since it’s so small I think it’s fine to have here and then we can do an issue or future prs to fix other stuff to adhere to this practice
This now always switches agents in each step. The previous change was a bug that "fixed" the behaviour due to a bug in the SB3 tutorial. The agent should be swapped every step as is done now.
The change to v4 was due to the agent handling at the end of an episode The other changes to the env don't change the behaviour of the env, so it is left at v3.
Apparently the failure to train with SB3 wasn't a bug with the env, it was a SB3 tutorial bug. I removed the change that "fixed" the env because it was wrong. I believe the rest of the changes are valid, but I don't think it requires a version bump, so I moved it back to v3. |
Description
Update the TicTacToe environment to v4
Major changes:
Fix termination handling to work with stable baselines3. This now trains as expected with the SB3 PPO codeBump version to 4Minor changes:
Type of change
Checklist:
pre-commit
checks withpre-commit run --all-files
(seeCONTRIBUTING.md
instructions to set it up)pytest -v
and no errors are present.pytest -v
has generated that are related to my code to the best of my knowledge.