
Add support for NormalizationRewardMixin #17

Open
nobodyPerfecZ opened this issue Dec 9, 2024 · 6 comments

@nobodyPerfecZ

Introduce a NormalizationRewardMixin to standardize reward scaling across multiple environments, similar to the existing NormalizationObservationMixin in mixins.py.
As a reference for implementation, the approach used in PureJaxRL can be adapted.
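
Roughly, the PureJaxRL wrapper keeps a discounted reward accumulator per environment and scales each reward by the standard deviation of that accumulator, rather than standardizing the rewards directly. A minimal sketch of the idea (all names, e.g. RewardNormState and normalize_reward, are illustrative, not an actual rejax API):

```python
from typing import NamedTuple

import jax.numpy as jnp


class RewardNormState(NamedTuple):
    mean: jnp.ndarray   # running mean of the reward accumulator
    var: jnp.ndarray    # running variance of the reward accumulator
    count: jnp.ndarray  # number of samples seen so far
    ret: jnp.ndarray    # discounted reward accumulator, one per env


def normalize_reward(state, reward, done, gamma=0.99, eps=1e-8):
    # Backwards-discounted accumulator, reset when an episode ends.
    ret = state.ret * gamma * (1.0 - done) + reward

    # Welford-style merge of the batch statistics of `ret`.
    batch_mean, batch_var, batch_count = ret.mean(), ret.var(), ret.size
    delta = batch_mean - state.mean
    tot_count = state.count + batch_count
    new_mean = state.mean + delta * batch_count / tot_count
    m2 = (
        state.var * state.count
        + batch_var * batch_count
        + delta**2 * state.count * batch_count / tot_count
    )
    new_state = RewardNormState(new_mean, m2 / tot_count, tot_count, ret)

    # Scale (but do not center) the reward by the std of the accumulator.
    return new_state, reward / jnp.sqrt(new_state.var + eps)
```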

@keraJLi
Owner

keraJLi commented Dec 10, 2024

Hi @nobodyPerfecZ! I've implemented reward normalization on the main branch (see here). All algorithms should run without any errors, and I manually checked that rewards have mean 0 and std 1 after some training. My testing has been by no means comprehensive, but I'm positive everything works (since it follows the exact same logic as obsnorm). Let me know and feel free to reopen the issue if you have any feedback or encounter any problems.
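
To be concrete, the logic is roughly this (a simplified sketch, not the exact code on main):

```python
import jax.numpy as jnp


# Standardize each raw reward with running statistics, exactly like
# observation normalization does; running_mean and running_var are
# maintained with the same Welford-style updates as for observations.
def normalize_reward(reward, running_mean, running_var, eps=1e-8):
    return (reward - running_mean) / jnp.sqrt(running_var + eps)
```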

@keraJLi keraJLi closed this as completed Dec 10, 2024
@nobodyPerfecZ
Author

Hi @keraJLi! I'd like to suggest reopening this issue: it would be more beneficial to normalize rewards based on the corresponding discounted returns over each batch rather than the plain rewards alone (see here).
The proposed normalization wrapper would work slightly differently from NormalizationObservationMixin.
I've also implemented the concept of NormalizationRewardMixin in a fork (see here), though it currently has some conflicts that I still need to resolve.

@keraJLi
Owner

keraJLi commented Dec 13, 2024

You are right, the method I implemented seems to be non-standard. PureJaxRL, Gymnasium, and Stable Baselines3 implement reward scaling as described in Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. They rely on a specific backwards-discounted sum of episode rewards, which is hard for me to interpret. I believe Gymnasium got it wrong as well; see my related issue. I will update the method in due time.
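
For reference, my reading of the scheme in the paper: keep a backwards-discounted accumulator of past rewards and divide each reward by its standard deviation, without centering,

$$R_t = \gamma R_{t-1} + r_t, \qquad \tilde{r}_t = \frac{r_t}{\operatorname{std}(R) + \varepsilon}.$$

Note that $R_t$ discounts *past* rewards rather than future ones, so it is not the usual return estimate.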

Thank you for your suggestions! They are very valuable.

@keraJLi keraJLi reopened this Dec 13, 2024
@keraJLi
Owner

keraJLi commented Dec 13, 2024

The changes can be found here. Let me know if this is what you were thinking about. Thanks!

@nobodyPerfecZ
Author

I suggest improving the implementation by replacing the use of reward_normalization_discount in mixins.py with gamma from algorithm.py to eliminate unnecessary duplication of hyperparameters.

@keraJLi
Owner

keraJLi commented Dec 16, 2024

I do not believe that this is a duplication of hyperparameters. While I am aware that the aforementioned paper uses the same value to discount both the episodic returns and the reward normalization factor, I don't think there is any specific theoretical reason that justifies this. I chose to keep them separate because I see no general reason why they should match. I am happy to be proven wrong though!
As a reference, both Stable Baselines and Gymnasium allow for separate discounts. Gymnasium describes the gamma parameter of its wrapper as "The discount factor that is used in the exponential moving average", without mentioning the return. I believe they ambiguously named both parameters gamma to adhere to the notation in the paper.

If you know of any reference that explains the relationship between both parameters, please do let me know! In that case, I will make gamma the default value of reward_normalization_discount.
