Adjust eweights calculation to avoid precision issues #509
Conversation
Codecov Report

```
@@           Coverage Diff            @@
##           master    #509     +/-  ##
=========================================
+ Coverage    90.1%   90.11%   +0.01%
=========================================
  Files          21       21
  Lines        2021     2024       +3
=========================================
+ Hits         1821     1824       +3
  Misses        200      200
```

Continue to review full report at Codecov.
Stylistic issues aside (I don't like spaces around `^` or explicit `return`s inside of `do` blocks), this seems fine, I guess?
Why the double negation in …?
Because those are different things? The goal was to minimize the potential impact on existing code. This approach simply scales the weights without changing their relative sizes, as noted above.
I strongly dislike this meaning.
FWIW, since …, I don't think it's a big deal, but it seems weird to me that you'd want to have a lower …
How does this fix precision issues? Note that this is breaking, so we need a strong rationale to change the existing definition. Is the new one more standard?
By limiting the range of values to between 0.0 and 1.0, we're minimizing the absolute floating-point error. This also provides a more consistent range of weights regardless of …
That's true. Unfortunately, I've managed to find at least 3-4 different formulations, and it seems like the preference is largely dependent on the application/field. I'd argue that compressing the weights between 0.0 and 1.0 is better for the reasons mentioned above. Also, most people won't notice the change because the relative strength of the weights hasn't changed, so all the existing weighted statistics methods will return the same result:
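As a quick illustration (a sketch with made-up data; it relies on StatsBase's weighted `mean`):

```julia
using StatsBase

x = collect(1.0:10.0)
λ = 0.3
w = [λ * (1 - λ)^(1 - i) for i in 1:10]  # unscaled exponential weights

# Scaling all weights by the same constant leaves weighted statistics unchanged.
mean(x, Weights(w)) ≈ mean(x, Weights(w ./ maximum(w)))  # true
```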
If we want to consider this a fully breaking change then I'm open to switching the meaning of …
OK. I don't have a strong preference, but can you check what's the most common behavior and definition of …?
From what I can tell, the current meaning of … matches:

- https://www.itl.nist.gov/div898/handbook/pmc/section3/pmc324.htm
- https://otexts.com/fpp2/ses.html

A few other implementations: …
OK, cool. What about the normalization to 0:1 that this PR introduces?
I'm having a harder time finding a consistent pattern for that. Many algorithms seem to just keep the large raw values, and others normalize the weights so that they sum to 1.0. I'm not sure which is the best approach, but I feel like having large floating-point weights probably isn't a good idea if we can compress them. I could be convinced that having them sum to 1.0 is a good idea.
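To make the two conventions concrete (an illustrative sketch; the weight formula matches the `eweights` docstring):

```julia
λ = 0.3
w = [λ * (1 - λ)^(1 - i) for i in 1:10]  # raw exponential weights

w ./ maximum(w)  # largest weight becomes 1.0 (what this PR does)
w ./ sum(w)      # weights sum to 1.0 (the alternative convention)
```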
Mmm, hard to decide then. Is there disagreement among major implementations (R, NumPy...)? An alternative possibility is to have a keyword argument to normalize, but keep it …
The issue is that most other libraries just have an …
OK, as you prefer. Having a keyword argument sounds good, whatever the default we retain.
It is important to document how the sum of the weights is affected by the normalization. If I'm not mistaken, the sum of the weights is not consistent and will depend on whether that weight type normalizes or not. For now, it should at least be mentioned in the docs.
FWIW, the initial version of this, which I wrote for our internal code base, would produce weights that summed to 1.0. I think @ararslan was the one who thought that was wrong, so maybe I can find the discussion?
Alright, couldn't find any discussion, but the original code we had internally looked like this:
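A minimal sketch of what such a sum-to-one variant might have looked like (hypothetical; `eweights_sum1` is an illustrative name, not the actual internal function):

```julia
# Hypothetical: exponential weights normalized so they sum to 1.0.
function eweights_sum1(n::Integer, λ::Real)
    w = [λ * (1 - λ)^(1 - i) for i in 1:n]
    return w ./ sum(w)
end
```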
Do you want to merge this in the end or not? With the prospect of moving features to Statistics (https://github.com/JuliaLang/Statistics.jl/issues/87), I'm checking which breaking changes we may want to make (as after that it will be too late).
Sorry, I wasn't sure whether further consensus was needed for this PR. Sure, I can fix this up for a breaking release. I think it's probably still worth having both the values and the smoothing factor stay between 0 and 1, given the links I posted above. That being said, we could do something like what pandas does and accept multiple parameters: either the normal smoothing parameter …
AFAICT there are two different issues here, right? …
Alright, I've revised this PR to be non-breaking by using a `scaled::DepBool` keyword argument.
The modified function is equivalent to dividing all weights by the largest value.

```
julia> x = [0.3, 0.42857142857142855, 0.6122448979591837, 0.8746355685131197,
            1.249479383590171, 1.7849705479859588, 2.549957925694227,
            3.642797036706039, 5.203995766722913, 7.434279666747019]
10-element Array{Float64,1}:
 0.3
 0.42857142857142855
 0.6122448979591837
 0.8746355685131197
 1.249479383590171
 1.7849705479859588
 2.549957925694227
 3.642797036706039
 5.203995766722913
 7.434279666747019

julia> x ./ last(x)
10-element Array{Float64,1}:
 0.04035360699999998
 0.057648009999999965
 0.08235429999999996
 0.11764899999999996
 0.16806999999999994
 0.24009999999999995
 0.34299999999999997
 0.49
 0.7
 1.0

julia> using StatsBase
[ Info: Recompiling stale cache file /Users/rory/.julia/compiled/v1.0/StatsBase/EZjIG.ji for StatsBase [2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91]

julia> eweights(1:10, 0.3)
10-element Weights{Float64,Float64,Array{Float64,1}}:
 0.04035360699999998
 0.05764800999999997
 0.08235429999999996
 0.11764899999999996
 0.16806999999999994
 0.24009999999999995
 0.3429999999999999
 0.48999999999999994
 0.7
 1.0
```
…h a `scaled::DepBool` kwarg.
src/weights.jl
Outdated
````
- https://en.wikipedia.org/wiki/Exponential_smoothing
```
"""
function eweights(t::AbstractVector{<:Integer}, λ::Real, n::Integer; scaled::DepBool=nothing)
````
Maybe call this `scale`? We already use this term for transformations (https://juliastats.org/StatsBase.jl/latest/transformations/).
src/weights.jl
Outdated
```
eweights(t::AbstractVector{<:Integer}, λ::Real)
eweights(t::AbstractVector{T}, r::StepRange{T}, λ::Real) where T
eweights(n::Integer, λ::Real)
eweights(t::AbstractVector{<:Integer}, λ::Real, [n::Integer]; scaled=false)
```
Why do we need to allow passing `n` separately here? Isn't it always equal to `length(t)`?
That doesn't always hold for cases where you're skipping elements (e.g., `[1, 2, 4, 5]`, but you want `n` to be `5`). The most common use case for us is passing a set of actual dates `t` and specifying our expected date range `r`. The previous code wouldn't handle missing entries in `t` correctly, hence I added this and didn't document it on the first pass (used with `eweights(t, r, λ)`).
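For example (hypothetical values, using the `(t, λ, n)` method this PR adds):

```julia
using StatsBase

t = [1, 2, 4, 5]          # time index 3 is skipped
w = eweights(t, 0.3, 5)   # n = 5, even though length(t) == 4
```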
Force-pushed from cf18392 to 621a856.
NOTE: Nightly seems to be failing for an unrelated reason.
src/weights.jl
Outdated
```
@@ -227,37 +231,55 @@ For each element `i` in `t` the weight value is computed as:

As this value approaches 0, the resulting weights will be almost equal,
while values closer to 1 will put greater weight on the tail elements of the vector.

# Keyword arguments

- `scale::Bool`: Whether are not to return the scaled
```
Incomplete sentence. Also repeat the default value.
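One possible completion (a suggestion only, assuming the default stays `false`):

```
- `scale::Bool`: Whether or not to return the scaled weights (defaults to `false`)
```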
src/weights.jl
Outdated
```
If an integer `n` is provided, weights are generated for values from 1 to `n`
(equivalent to `t = 1:n`).
```
Suggested change:

```diff
-If an integer `n` is provided, weights are generated for values from 1 to `n`
-(equivalent to `t = 1:n`).
+If an integer `n` is provided and `t` is omitted, weights are generated for values from 1 to `n`
+(equivalent to `t = 1:n`).
```
Should also describe what the default value of `n` is when `t` is provided.

BTW, the order of `t`, `n`, and `λ` is inconsistent across signatures, but I guess we can't do much about it without breaking backward compatibility?
Hmm, good point. I added the `(t, λ, n)` method more out of convenience, but if you think it would be more consistent I could merge it with the existing `(t, λ)` method, which just calls `extrema` to get the default `n` value?
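Something like this, perhaps (a sketch of that merge, not the code in the PR):

```julia
# Sketch: fold the (t, λ, n) method into (t, λ) by deriving the default n
# from extrema(t); kwargs are passed through to the existing method.
function eweights(t::AbstractVector{<:Integer}, λ::Real; kwargs...)
    lo, hi = extrema(t)
    eweights(t, λ, hi - lo + 1; kwargs...)
end
```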
Okay, I believe the signatures are the same now, apart from the `scaled` kwarg.
OK, thanks. The value of `n` when only `t` is provided still needs documentation though.
Sure. FWIW, I'm starting to dislike the `n` and `r` convenience constructors. The range syntax is pretty terse already, and `something.(indexin(t, r))` doesn't seem that verbose.
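For reference, a tiny example of that pattern (made-up values; `indexin` and `something` are from Base):

```julia
t = [1, 2, 4, 5]                 # observed times
r = 1:5                          # expected superset range
idx = something.(indexin(t, r))  # == [1, 2, 4, 5]; throws if some t[i] is not in r
```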
Feel free to make a deprecation PR if you would prefer dropping this in the next breaking release. Though I'm not sure they really hurt, as long as they don't introduce inconsistencies or ambiguities.
No, they don't hurt, and they are logical extensions of the primary method; it's just that this docstring feels a little verbose now :)

FWIW, I think we can remove the breaking label for this PR. I'm pretty sure we're just introducing a deprecation now.
src/weights.jl
Outdated
```
The integer value of `n` represents the number of past observations to consider, which defaults to:

1. `maximum(t) - minimum(t) + 1` if only `t` is passed in and the elements are integers
2. `length(r)` if a superset range `r` is passed in
3. `n` is explicitly passed instead of `t` (equivalent to `t = 1:n`)
```
This describes default values for both `n` and `t`, so I suggest changing the wording a bit. For example:
```diff
-The integer value of `n` represents the number of past observations to consider, which defaults to:
-1. `maximum(t) - minimum(t) + 1` if only `t` is passed in and the elements are integers
-2. `length(r)` if a superset range `r` is passed in
-3. `n` is explicitly passed instead of `t` (equivalent to `t = 1:n`)
+The integer value `n` represents the number of past observations to consider.
+`n` defaults to `maximum(t) - minimum(t) + 1` if only `t` is passed in
+and the elements are integers, and to `length(r)` if a superset range `r`
+is also passed in.
+If `n` is explicitly passed instead of `t`, `t` defaults to `1:n`.
```
Thanks, let's merge this and then we can file an issue/PR making the breaking change for the next breaking release.
Okay, if there aren't any other comments, I can merge this tomorrow. Do we typically bump the release version in the PR here? I also appear to have merge permissions, but I don't know if I can register the change?
Feel free to bump the patch version in the PR. Registering it should also work.