Add support for NormalizationRewardMixin #17
Comments
Hi @nobodyPerfecZ! I've implemented reward normalization on the main branch (see here). All algorithms should run without any errors, and I manually checked that rewards have mean 0 and std 1 after some training. My testing has been by no means comprehensive, but I'm positive everything works (since it follows the exact same logic as obsnorm). Let me know, and feel free to reopen the issue if you have any feedback or encounter any problems.
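For context, here is a minimal sketch of this kind of per-step reward standardization with running statistics, written with NumPy; the class and method names are illustrative and not the repository's actual code:

```python
# Hypothetical sketch: per-step reward normalization with running statistics,
# following the same logic as is typically used for observation normalization.
import numpy as np

class RunningRewardNorm:
    def __init__(self, eps: float = 1e-8):
        self.mean, self.var, self.count, self.eps = 0.0, 1.0, 0.0, eps

    def update(self, rewards: np.ndarray) -> None:
        # Fold a batch of rewards into the running mean/variance (parallel Welford merge).
        b_mean, b_var, n = rewards.mean(), rewards.var(), rewards.size
        delta = b_mean - self.mean
        total = self.count + n
        self.mean += delta * n / total
        self.var = (self.var * self.count + b_var * n
                    + delta**2 * self.count * n / total) / total
        self.count = total

    def normalize(self, rewards: np.ndarray) -> np.ndarray:
        # After enough updates, normalized rewards have roughly mean 0 and std 1.
        return (rewards - self.mean) / np.sqrt(self.var + self.eps)
```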
Hi @keraJLi! I'd like to suggest reopening this issue, as it could be more beneficial to normalize rewards by considering the corresponding returns over each batch instead of relying solely on the "plain rewards" (see here).
You are right; the method I have implemented seems to be non-standard. PureJaxRL, Gymnasium and Stable Baselines3 implement reward scaling as described in Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. They rely on a specific backwards-discounted sum of episode rewards, which is hard for me to interpret. I believe Gymnasium got it wrong as well; see my related issue. I will update the method in due time. Thank you for your suggestions! They are very valuable.
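For readers unfamiliar with that scheme, here is a hedged sketch of the return-based reward scaling described in the paper (and implemented, with variations, in Gymnasium and Stable Baselines3): a backwards-discounted return accumulator is kept per environment, and rewards are divided by the standard deviation of that accumulator without mean subtraction. The class name, the reset-on-done handling, and the parameter defaults below are assumptions, not any library's verbatim code:

```python
# Sketch of return-based reward scaling ("Implementation Matters in Deep
# Policy Gradients"): keep R_t = gamma * R_{t-1} + r_t per environment and
# divide each reward by the running std of R_t (no centering).
import numpy as np

class ReturnBasedRewardScaler:
    def __init__(self, num_envs: int, gamma: float = 0.99, eps: float = 1e-8):
        self.gamma, self.eps = gamma, eps
        self.returns = np.zeros(num_envs)             # discounted return per env
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def scale(self, rewards: np.ndarray, dones: np.ndarray) -> np.ndarray:
        # Accumulate the backwards-discounted return, resetting finished episodes.
        self.returns = self.returns * self.gamma * (1.0 - dones) + rewards
        # Fold the accumulated returns into the running mean/variance.
        b_mean, b_var, n = self.returns.mean(), self.returns.var(), self.returns.size
        delta = b_mean - self.mean
        total = self.count + n
        self.mean += delta * n / total
        self.var = (self.var * self.count + b_var * n
                    + delta**2 * self.count * n / total) / total
        self.count = total
        # Only scale, don't center, so the sign of each reward is preserved.
        return rewards / np.sqrt(self.var + self.eps)
```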
The changes can be found here. Let me know if this is what you were thinking about. Thanks!
I suggest improving the implementation by replacing the use of …
I do not believe that this is a duplication of hyperparameters. While I am aware that the aforementioned paper uses the same value to discount both the episodic returns and the reward normalization factor, I don't think there are any specific theoretical reasons that justify this. I chose to keep them separate because I couldn't see why they should generally match at all. I am happy to be proven wrong, though! If you know of any reference that explains the relationship between the two parameters, please do let me know! In that case, I will make …
Introduce a NormalizationRewardMixin to standardize reward scaling across multiple environments, similar to the existing NormalizationObservationMixin in mixins.py. As a reference for implementation, the approach used in PureJaxRL can be adapted.
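As a purely illustrative sketch of what the requested mixin's interface could look like (class, method, and attribute names are hypothetical, the running-std bookkeeping is deliberately simplified, and this is neither the repository's nor PureJaxRL's actual code):

```python
# Hypothetical interface sketch for a NormalizationRewardMixin; algorithms
# would call normalize_reward on the rewards returned by every vectorized
# environment step before using them in the update.
import numpy as np

class NormalizationRewardMixin:
    def __init__(self, num_envs: int, gamma: float = 0.99, eps: float = 1e-8):
        self.gamma, self.eps = gamma, eps
        self.returns = np.zeros(num_envs)   # discounted return accumulator per env
        self.ret_sum, self.ret_sumsq, self.ret_count = 0.0, 0.0, 1e-4

    def normalize_reward(self, rewards: np.ndarray, dones: np.ndarray) -> np.ndarray:
        # Update the discounted returns, resetting environments whose episodes ended.
        self.returns = self.returns * self.gamma * (1.0 - dones) + rewards
        # Track first and second moments for a simple running std estimate.
        self.ret_sum += float(self.returns.sum())
        self.ret_sumsq += float(np.square(self.returns).sum())
        self.ret_count += self.returns.size
        mean = self.ret_sum / self.ret_count
        var = max(self.ret_sumsq / self.ret_count - mean**2, 0.0)
        # Scale rewards by the std of the discounted returns seen so far.
        return rewards / np.sqrt(var + self.eps)
```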