Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empirical CDF enhancements #590

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open

Conversation

alyst
Copy link
Member

@alyst alyst commented Aug 9, 2020

The PR improves the performance of ECDF and adds interpolation.

  • The major problem with the current code is that in the weighted case the partial sums have to be recalculated for each input. But this is not necessary, since all CDF values for weighted and non-weighted cases could be precalculated.
  • I suppose ECDF was written in pre-broadcast epoch, so it does not overload Base.Broadcast.broadcasted() for providing enhanced support for vectors. The PR deprecates ecdf(v::AbstractVector) in favor of standard dot notation. I've kept customized broadcasting, but now it just caches the CDF value of the last vector element.
  • Optionally, ECDF can enable interpolation (ecdf(..., interpolate=true)), which is handy for continuous empirical distributions. (It just linearly interpolates between the CDFs of adjacent values).

@alyst alyst mentioned this pull request Oct 5, 2020
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay. This looks nice. I can't comment on the math but here are remarks about the code.

@@ -121,8 +121,8 @@ aweights(vs::RealArray) = AnalyticWeights(vec(vs))
s = w.sum

if corrected
sum_sn = sum(x -> (x / s) ^ 2, w)
1 / (s * (1 - sum_sn))
sum_w2 = sum(abs2, w)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't this here to avoid overflow?


### Deprecate August 2020 (v0.33)
function (ecdf::ECDF)(v::RealArray)
depwarn("(ecdf::ECDF)(v::RealArray) is deprecated, use `ecdf.(v)` broadcasting instead", "(ECDF::ecdf)(v::RealArray)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More common, and possibly faster:

Suggested change
depwarn("(ecdf::ECDF)(v::RealArray) is deprecated, use `ecdf.(v)` broadcasting instead", "(ECDF::ecdf)(v::RealArray)")
depwarn("(ecdf::ECDF)(v::RealArray) is deprecated, use `ecdf.(v)` broadcasting instead", :ecdf)

@@ -1,68 +1,142 @@
# Empirical estimation of CDF and PDF

## Empirical CDF
"""
Empirical Cumulative Distribution Function (ECDF).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Empirical Cumulative Distribution Function (ECDF).
ECDF{T <: Real, W <: Real, I}
Empirical Cumulative Distribution Function (ECDF).

# - `ECDF(x[i])`
# - `1/(x[i+1] - x[i])`
# - `ECDF(x[i+1]) - ECDF(x[i])` (the weight of `x[i+1]`)
sorted_values::Vector{Tuple{T, W, W, W}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it less efficient to store values as a vector of tuples instead of as several vectors? In terms of memory use AFAIK alignment may require padding between entries.

r[ord[i]] = weightsum
i += 1
# broadcasts ecdf() over an array
# caches the last calculated value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "caches"?

any(isnan, X) && throw(ArgumentError("ecdf can not include NaN values"))
isempty(weights) || length(X) == length(weights) || throw(ArgumentError("data and weight vectors must be the same size," *
"got $(length(X)) and $(length(weights))"))
evenweights = isnothing(weights) || isempty(weights)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inference works better with ===:

Suggested change
evenweights = isnothing(weights) || isempty(weights)
evenweights = weights === nothing || isempty(weights)

"got $(length(X)) and $(length(weights))"))
T = eltype(X)
W0 = evenweights ? Int : eltype(weights)
W = isnothing(weights) ? Float64 : eltype(one(W0)/sum(weights))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Store sum(weights) to avoid calling it twice?

Comment on lines +117 to +123
push_valprev!() = push!(sorted_vals, (valprev, min(wsumprev/wsum, one(W)),
inv(val - valprev), valw/wsum))

@inbounds for i in ord
valnew = X[i]
if (val != valnew) || (i == last(ord))
(wsumprev > 0) && push_valprev!()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this anonymous function trigger boxing of these variables, making the loop quite slow?

Comment on lines +37 to +39
@test_skip fnecdf.weights == fnecdfalt.weights
@test_skip fnecdf.weights != w1 # check that w wasn't accidently modified in place
@test_skip fnecdfalt.weights != w2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

show(iobuf, obj)
return String(take!(iobuf))
end

@testset "ECDF" begin
x = randn(10000000)
fnecdf = ecdf(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd change all of these to ecdf.(x), and copy this to deprecated.jl.

@ParadaCarleton
Copy link
Contributor

Is there anything here that's still worth it, or should I close?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants