
Add support for NormalizationRewardMixin #17

Open
nobodyPerfecZ opened this issue Dec 9, 2024 · 6 comments

@nobodyPerfecZ

Introduce a NormalizationRewardMixin to standardize reward scaling across multiple environments, similar to the existing NormalizationObservationMixin in mixins.py.
As a reference for implementation, the approach used in PureJaxRL can be adapted.
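
Roughly, the PureJaxRL wrapper keeps a discounted reward accumulator per environment and scales each reward by the standard deviation of that accumulator, rather than standardizing the rewards directly. A minimal sketch of the idea (all names, e.g. RewardNormState and normalize_reward, are illustrative, not an actual rejax API):

```python
from typing import NamedTuple

import jax.numpy as jnp


class RewardNormState(NamedTuple):
    mean: jnp.ndarray   # running mean of the reward accumulator
    var: jnp.ndarray    # running variance of the reward accumulator
    count: jnp.ndarray  # number of samples seen so far
    ret: jnp.ndarray    # discounted reward accumulator, one per env


def normalize_reward(state, reward, done, gamma=0.99, eps=1e-8):
    # Backwards-discounted accumulator, reset when an episode ends.
    ret = state.ret * gamma * (1.0 - done) + reward

    # Welford-style merge of the batch statistics of `ret`.
    batch_mean, batch_var, batch_count = ret.mean(), ret.var(), ret.size
    delta = batch_mean - state.mean
    tot_count = state.count + batch_count
    new_mean = state.mean + delta * batch_count / tot_count
    m2 = (
        state.var * state.count
        + batch_var * batch_count
        + delta**2 * state.count * batch_count / tot_count
    )
    new_state = RewardNormState(new_mean, m2 / tot_count, tot_count, ret)

    # Scale (but do not center) the reward by the std of the accumulator.
    return new_state, reward / jnp.sqrt(new_state.var + eps)
```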

@keraJLi
Owner

keraJLi commented Dec 10, 2024

Hi @nobodyPerfecZ! I've implemented reward normalization on the main branch (see here). All algorithms should run without any errors, and I manually checked that rewards have mean 0 and std 1 after some training. My testing has been by no means comprehensive, but I'm positive everything works (since it follows the exact same logic as obsnorm). Let me know and feel free to reopen the issue if you have any feedback or encounter any problems.
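
To be concrete, the logic is roughly this (a simplified sketch, not the exact code on main):

```python
import jax.numpy as jnp


# Standardize each raw reward with running statistics, exactly like
# observation normalization does; running_mean and running_var are
# maintained with the same Welford-style updates as for observations.
def normalize_reward(reward, running_mean, running_var, eps=1e-8):
    return (reward - running_mean) / jnp.sqrt(running_var + eps)
```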

@keraJLi keraJLi closed this as completed Dec 10, 2024
@nobodyPerfecZ
Author

Hi @keraJLi! I'd like to suggest reopening this issue: it would be more beneficial to normalize rewards based on the corresponding discounted returns over each batch rather than the plain rewards alone (see here).
The proposed normalization wrapper would work slightly differently from NormalizationObservationMixin.
I've also implemented the concept of NormalizationRewardMixin in a fork (see here), though it currently has some conflicts that I still need to resolve.

@keraJLi
Owner

keraJLi commented Dec 13, 2024

You are right, the method I implemented seems to be non-standard. PureJaxRL, Gymnasium, and Stable Baselines3 implement reward scaling as described in Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. They rely on a specific backwards-discounted sum of episode rewards, which is hard for me to interpret. I believe Gymnasium got it wrong as well; see my related issue. I will update the method in due time.
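
For reference, my reading of the scheme in the paper: keep a backwards-discounted accumulator of past rewards and divide each reward by its standard deviation, without centering,

$$R_t = \gamma R_{t-1} + r_t, \qquad \tilde{r}_t = \frac{r_t}{\operatorname{std}(R) + \varepsilon}.$$

Note that $R_t$ discounts *past* rewards rather than future ones, so it is not the usual return estimate.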

Thank you for your suggestions! They are very valuable.

@keraJLi keraJLi reopened this Dec 13, 2024
@keraJLi
Owner

keraJLi commented Dec 13, 2024

The changes can be found here. Let me know if this is what you were thinking about. Thanks!

@nobodyPerfecZ
Author

I suggest improving the implementation by replacing the use of reward_normalization_discount in mixins.py with gamma from algorithm.py to eliminate unnecessary duplication of hyperparameters.

@keraJLi
Owner

keraJLi commented Dec 16, 2024

I do not believe that this is a duplication of hyperparameters. While I am aware that the aforementioned paper uses the same value to discount both the episodic returns and the reward normalization factor, I don't think there is any specific theoretical reason that justifies this. I chose to keep them separate because I see no general reason why they should match. I am happy to be proven wrong though!
As a reference, both Stable Baselines and Gymnasium allow for separate discounts. Gymnasium describes the gamma parameter of its wrapper as "The discount factor that is used in the exponential moving average", without mentioning the return. I believe they ambiguously named both parameters gamma to adhere to the notation in the paper.

If you know of any reference that explains the relationship between both parameters, please do let me know! In that case, I will make gamma the default value of reward_normalization_discount.
