[Proposal/Question] Incorrect documentation of NormalizeReward wrapper #1272
Comments
@keraJLi Thanks for the issue. Point 3 seems to have originated in openai/gym#2784, sadly with no justification for it, so I won't be surprised if it is wrong.
Small questions
I thought about this again and believe this is what was meant: If the rewards are iid (which they are not generally),
I do not believe there is a fixed value, since the normalization factor is not directly related to the episodic return. After some more digging, I found that other people seem to have investigated the same question and came to the same conclusion: openai/gym#2387 (comment)
I will create a pull request in the coming days 🤗
Small update: After some calculations, I have confirmed that
Proposal
The documentation for NormalizeReward (here in code) is partially incorrect and unclear. It states:
1. The wrapper does not normalize discounted returns to have a mean of 0. Instead, the rewards are merely divided by the running standard deviation of a specific term (see here). The description should be changed accordingly.
2. I am also concerned about the claim that the discounted returns have a variance of 1. The term we are computing the running variance of is $\sum_{t=0}^T \gamma^{T-t} r_t$, where $T$ is an episode's current time step. I would interpret this as "backwards-discounting". Importantly, this is different from the discounted return ($\sum_{t=0}^\infty \gamma^t r_t$), or even what I would call the discounted sum of previous rewards up until time step $T$ ($\sum_{t=0}^T \gamma^t r_t$). I assume the latter is what the description means by "discounted returns", which should be clarified if that is the case.
   To me, it is also unclear how dividing by this term leads to a unit variance of either the discounted return or the discounted return up to $T$. In fact, I was not able to empirically confirm that it does. In contrast, I was able to confirm $\text{Var}\left[\sum_{t=0}^T \gamma^{T-t} r_t\right] = 1$.
   The reference in the wrapper's description does not mention any theoretical properties of this discounting scheme.
3. The description also says "The exponential moving average will have variance $(1 - \gamma)^2$". I do not understand what exactly this means (and shouldn't it depend on the reward?), and would be happy about any explanation or reference that explains this.
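For reference, here is a short derivation of how the backwards-discounted term behaves. It is only a sketch, and it rests on two assumptions that are not taken from the documentation: that the accumulated term follows the recursion $G_T = \gamma G_{T-1} + r_T$ with $G_{-1} = 0$, and that the rewards are independent with a common variance $\sigma^2$ (an idealization that generally does not hold in practice). The symbols $G_T$ and $\sigma^2$ are introduced here for illustration.

$$G_T = \gamma G_{T-1} + r_T, \quad G_{-1} = 0 \quad \Longrightarrow \quad G_T = \sum_{t=0}^{T} \gamma^{T-t} r_t$$

$$\text{Var}\left[G_T\right] = \sigma^2 \sum_{k=0}^{T} \gamma^{2k} = \sigma^2 \, \frac{1 - \gamma^{2(T+1)}}{1 - \gamma^2}$$

Under this idealization, the variance of the accumulated term scales with the reward variance $\sigma^2$ and changes with $T$, so it is not determined by $\gamma$ alone.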
I would be happy about any clarifications concerning points 2 and 3. I also suggest adding the relevant references to the documentation.
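The difference between the sums discussed above can also be checked numerically. The following is a minimal sketch, not the wrapper's actual code: the helper names `backward_discounted` and `forward_discounted` are hypothetical, the reward process (a single random reward at the last step of each episode) is a toy assumption chosen because it is clearly not iid, and a standard deviation pooled over all steps stands in for the running estimate (any small epsilon term is omitted).

```python
import numpy as np


def backward_discounted(rewards, gamma):
    """G_T = sum_{t<=T} gamma**(T-t) * r_t for every T, via G_T = gamma*G_{T-1} + r_T."""
    out = np.empty(len(rewards))
    acc = 0.0  # accumulator starts at zero at the beginning of each episode
    for i, r in enumerate(rewards):
        acc = gamma * acc + r
        out[i] = acc
    return out


def forward_discounted(rewards, gamma):
    """sum_{t<=T} gamma**t * r_t for every T (discounted sum of previous rewards)."""
    discounts = gamma ** np.arange(len(rewards))
    return np.cumsum(discounts * rewards)


rng = np.random.default_rng(0)
gamma, n_episodes, horizon = 0.99, 200, 100

# Toy assumption: zero reward everywhere except a random reward at the final step.
rewards = np.zeros((n_episodes, horizon))
rewards[:, -1] = rng.normal(loc=1.0, scale=1.0, size=n_episodes)

# Pooled std of the backwards-discounted term (stand-in for the running std).
backward = np.stack([backward_discounted(ep, gamma) for ep in rewards])
scale = backward.std()
normalized = rewards / scale  # rewards are only divided, never mean-shifted

var_backward = np.stack([backward_discounted(ep, gamma) for ep in normalized]).var()
var_forward = np.stack([forward_discounted(ep, gamma) for ep in normalized]).var()

print(f"variance of backwards-discounted sum: {var_backward:.4f}")
print(f"variance of forward-discounted sum:   {var_forward:.4f}")
```

Because the backwards-discounted sum is linear in the rewards, dividing the rewards by its standard deviation gives it unit variance by construction. In this toy setup the forward-discounted sum instead ends up with a variance of `gamma ** (2 * (horizon - 1))`, which is well below 1.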
Motivation
I am using reward scaling myself, and find Gymnasium to be an important reference implementation. Clear documentation is important to understand the methods that Gymnasium implements.