Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weighted sem #754

Merged
merged 66 commits into from
Feb 6, 2022
Merged
Show file tree
Hide file tree
Changes from 61 commits
Commits
Show all changes
66 commits
Select commit Hold shift + click to select a range
f2f83cf
Add weighted `sem`
ParadaCarleton Jan 15, 2022
46be952
add weighted sem method
ParadaCarleton Jan 15, 2022
158fb35
Correct missing method
ParadaCarleton Jan 17, 2022
43c7419
bug fix
ParadaCarleton Jan 17, 2022
e770708
Bug fix
ParadaCarleton Jan 17, 2022
e5a03a8
Update src/scalarstats.jl
ParadaCarleton Jan 17, 2022
82e4de2
Update src/scalarstats.jl
ParadaCarleton Jan 17, 2022
5eba9d4
Update src/scalarstats.jl
ParadaCarleton Jan 17, 2022
2ef18eb
change μ to mean
ParadaCarleton Jan 18, 2022
1307200
Update src/scalarstats.jl
ParadaCarleton Jan 18, 2022
7aafe38
broadcast weights
ParadaCarleton Jan 18, 2022
aed6531
Update src/scalarstats.jl
ParadaCarleton Jan 19, 2022
e49430c
Work with arbitrary iterators, fuse loops
ParadaCarleton Jan 19, 2022
99d8505
Update src/scalarstats.jl
ParadaCarleton Jan 19, 2022
5cd2e4d
Update src/scalarstats.jl
ParadaCarleton Jan 19, 2022
43e354f
Apply review comments
ParadaCarleton Jan 19, 2022
44d0b0e
RealArray instead of Array{<:Number}
ParadaCarleton Jan 19, 2022
094c30c
Improve citation
ParadaCarleton Jan 22, 2022
a3c7e61
Apply code review suggestions
ParadaCarleton Jan 23, 2022
5d25c7a
whitespace
ParadaCarleton Jan 23, 2022
59f8b8e
Update src/scalarstats.jl
ParadaCarleton Jan 23, 2022
73cfb64
Update src/scalarstats.jl
ParadaCarleton Jan 23, 2022
af78ffa
Update src/scalarstats.jl
ParadaCarleton Jan 23, 2022
9264d0a
Update src/scalarstats.jl
ParadaCarleton Jan 23, 2022
50ee853
Update src/scalarstats.jl
ParadaCarleton Jan 23, 2022
c83beea
Update src/scalarstats.jl
ParadaCarleton Jan 23, 2022
4f25268
Update src/scalarstats.jl
ParadaCarleton Jan 23, 2022
9ef970a
Apply suggestions
ParadaCarleton Jan 23, 2022
1ea8ef6
formatting
ParadaCarleton Jan 23, 2022
3f2361f
Update src/scalarstats.jl
ParadaCarleton Jan 23, 2022
b882c49
Drop real requirement
ParadaCarleton Jan 24, 2022
18cf255
Update src/scalarstats.jl
ParadaCarleton Jan 24, 2022
8b556bc
Test mean keyword with FrequencyWeights
ParadaCarleton Jan 24, 2022
dcff0b1
Update src/scalarstats.jl
ParadaCarleton Jan 24, 2022
162631b
Update src/scalarstats.jl
ParadaCarleton Jan 24, 2022
d213c93
Update src/scalarstats.jl
ParadaCarleton Jan 24, 2022
a93a15d
Use original `sem` implementation
ParadaCarleton Jan 24, 2022
74d2c3c
use original SEM implementation
ParadaCarleton Jan 24, 2022
adaf472
original sem implementation
ParadaCarleton Jan 24, 2022
acf81ea
original sem implementation
ParadaCarleton Jan 24, 2022
303c514
test fix
ParadaCarleton Jan 24, 2022
5bd4a91
Dispatch on mean
ParadaCarleton Jan 27, 2022
8d904b5
remove internal method for weights
ParadaCarleton Jan 27, 2022
f762e37
Bug
ParadaCarleton Jan 27, 2022
529cd21
Apply code review
ParadaCarleton Jan 29, 2022
8afc950
Update src/scalarstats.jl
ParadaCarleton Jan 29, 2022
989a383
Fix tests
ParadaCarleton Jan 29, 2022
b469d6d
Return NaN for FrequencyWeights
ParadaCarleton Jan 29, 2022
e25c4b0
Type instability fix
ParadaCarleton Jan 30, 2022
d56a98f
Fix array version to return NaN instead of error
ParadaCarleton Jan 30, 2022
22c8cc0
Remove useless tests
ParadaCarleton Jan 30, 2022
6b90e77
Update src/scalarstats.jl
ParadaCarleton Jan 31, 2022
f3c9714
Update src/scalarstats.jl
nalimilan Feb 1, 2022
cfba9a1
Update test/scalarstats.jl
nalimilan Feb 1, 2022
d024475
Update src/scalarstats.jl
ParadaCarleton Feb 2, 2022
b64d44d
Add test
ParadaCarleton Feb 2, 2022
e31748a
Update src/scalarstats.jl
ParadaCarleton Feb 2, 2022
72d1e48
Fix type instability
ParadaCarleton Feb 2, 2022
6ceeac4
Bug fix
ParadaCarleton Feb 2, 2022
5779144
Test for type stability
ParadaCarleton Feb 2, 2022
a520d20
Fix type instability
ParadaCarleton Feb 3, 2022
7b40f1c
mean type fix
ParadaCarleton Feb 3, 2022
4683ec4
test for type inference
ParadaCarleton Feb 3, 2022
ab29aa7
Update src/scalarstats.jl
ParadaCarleton Feb 5, 2022
1f32003
Update src/scalarstats.jl
ParadaCarleton Feb 5, 2022
632e212
Update src/scalarstats.jl
ParadaCarleton Feb 5, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115 changes: 90 additions & 25 deletions src/scalarstats.jl
Original file line number Diff line number Diff line change
Expand Up @@ -261,36 +261,101 @@ realXcY(x::Real, y::Real) = x*y
realXcY(x::Complex, y::Complex) = real(x)*real(y) + imag(x)*imag(y)

"""
sem(x)
sem(x; mean=nothing)
sem(x::AbstractArray[, weights::AbstractWeights]; mean=nothing)

Return the standard error of the mean of collection `x`,
i.e. `sqrt(var(x, corrected=true) / length(x))`.
Return the standard error of the mean for a collection `x`.
A pre-computed `mean` may be provided.

When not using weights, this is the (sample) standard deviation
divided by the sample size. If weights are used, the
variance of the sample mean is calculated as follows:

* `AnalyticWeights`: Not implemented.
nalimilan marked this conversation as resolved.
Show resolved Hide resolved
* `FrequencyWeights`: ``\\frac{\\sum_{i=1}^n w_i (x_i - \\bar{x_i})^2}{(\\sum w_i) (\\sum w_i - 1)}``
* `ProbabilityWeights`: ``\\frac{n}{n-1} \\frac{\\sum_{i=1}^n w_i^2 (x_i - \\bar{x_i})^2}{\\left( \\sum w_i \\right)^2}``
ParadaCarleton marked this conversation as resolved.
Show resolved Hide resolved

The standard error is then the square root of the above quantities.

# References

Carl-Erik Särndal, Bengt Swensson, Jan Wretman (1992). Model Assisted Survey Sampling.
New York: Springer. pp. 51-53.
"""
function sem(x)
y = iterate(x)
if y === nothing
function sem(x; mean=nothing)
if isempty(x)
# Return the NaN of the type that we would get for a nonempty x
T = eltype(x)
# Return the NaN of the type that we would get, had this collection
# contained any elements (this is consistent with std)
return oftype(sqrt((abs2(zero(T)) + abs2(zero(T)))/2), NaN)
end
count = 1
value, state = y
y = iterate(x, state)
# Use Welford algorithm as seen in (among other places)
# Knuth's TAOCP, Vol 2, page 232, 3rd edition.
M = value / 1
S = real(zero(M))
while y !== nothing
_mean = (mean === nothing) ? zero(T) / 1 : mean
ParadaCarleton marked this conversation as resolved.
Show resolved Hide resolved
z = abs2(zero(T) - _mean)
return oftype((z + z) / 2, NaN)
elseif mean === nothing
n = 0
y = iterate(x)
value, state = y
# Use Welford algorithm as seen in (among other places)
# Knuth's TAOCP, Vol 2, page 232, 3rd edition.
_mean = value / 1
sse = real(zero(_mean))
while y !== nothing
value, state = y
y = iterate(x, state)
n += 1
new_mean = _mean + (value - _mean) / n
sse += realXcY(value - _mean, value - new_mean)
_mean = new_mean
end
else
n = 1
y = iterate(x)
value, state = y
y = iterate(x, state)
count += 1
new_M = M + (value - M) / count
S = S + realXcY(value - M, value - new_M)
M = new_M
sse = abs2(value - mean)
while (y = iterate(x, state)) !== nothing
value, state = y
n += 1
sse += abs2(value - mean)
end
end
variance = sse / (n - 1)
return sqrt(variance / n)
end

function sem(x::AbstractArray; mean=nothing)
if isempty(x)
# Return the NaN of the type that we would get for a nonempty x
T = eltype(x)
_mean = (mean === nothing) ? zero(T) / 1 : mean
ParadaCarleton marked this conversation as resolved.
Show resolved Hide resolved
z = abs2(zero(T) - _mean)
return oftype((z + z) / 2, NaN)
end
return sqrt(var(x; mean=mean, corrected=true) / length(x))
end

function sem(x::AbstractArray, weights::UnitWeights; mean=nothing)
ParadaCarleton marked this conversation as resolved.
Show resolved Hide resolved
if length(x) ≠ length(weights)
throw(DimensionMismatch("array and weights do not have the same length"))
ParadaCarleton marked this conversation as resolved.
Show resolved Hide resolved
end
return sem(x; mean=mean)
end


# Weighted methods for the above
sem(x::AbstractArray, weights::FrequencyWeights; mean=nothing) =
sqrt(var(x, weights; mean=mean, corrected=true) / sum(weights))

function sem(x::AbstractArray, weights::ProbabilityWeights; mean=nothing)
if isempty(x)
# Return the NaN of the type that we would get for a nonempty x
return var(x, weights; mean=nothing, corrected=true) / 0
ParadaCarleton marked this conversation as resolved.
Show resolved Hide resolved
else
_mean = (mean === nothing) ? Statistics.mean(x, weights) : mean
ParadaCarleton marked this conversation as resolved.
Show resolved Hide resolved
# sum of squared errors = sse
sse = sum(Broadcast.instantiate(Broadcast.broadcasted(x, weights) do x_i, w
return abs2(w * (x_i - _mean))
end))
n = count(!iszero, weights)
return sqrt(sse * n / (n - 1)) / sum(weights)
end
var = S / (count - 1)
return sqrt(var/count)
end

# Median absolute deviation
Expand Down
29 changes: 25 additions & 4 deletions test/scalarstats.jl
Original file line number Diff line number Diff line change
Expand Up @@ -107,10 +107,31 @@ z2 = [8. 2. 3. 1.; 24. 10. -1. -1.; 20. 12. 1. -2.]
@test variation([1:5;]) ≈ 0.527046276694730
@test variation(skipmissing([missing; 1:5; missing])) ≈ 0.527046276694730

@test sem([1:5;]) ≈ 0.707106781186548
@test sem(skipmissing([missing; 1:5; missing])) ≈ 0.707106781186548
@test sem(Int[]) === NaN
@test sem(skipmissing(Union{Int,Missing}[missing, missing])) === NaN
@test @inferred(sem([1:5;])) ≈ 0.707106781186548
@test @inferred(sem(skipmissing([missing; 1:5; missing]))) ≈ 0.707106781186548
@test @inferred(sem(skipmissing([missing; 1:5; missing]), mean=3.0)) ≈ 0.707106781186548
@test @inferred(sem([1:5;], UnitWeights{Int}(5))) ≈ 0.707106781186548
@test @inferred(sem([1:5;], UnitWeights{Int}(5); mean=mean(1:5))) ≈ 0.707106781186548
@test_throws DimensionMismatch sem(1:5, UnitWeights{Int}(4))
@test @inferred(sem([1:5;], ProbabilityWeights([1:5;]))) ≈ 0.6166 rtol=.001
μ = mean(1:5, ProbabilityWeights([1:5;]))
@test @inferred(sem([1:5;], ProbabilityWeights([1:5;]); mean=μ)) ≈ 0.6166 rtol=.001
@test @inferred(sem([10; 1:5;], ProbabilityWeights([0; 1:5;]); mean=μ)) ≈ 0.6166 rtol=.001
x = sort!(vcat([5:-1:i for i in 1:5]...))
μ = mean(x)
@test @inferred(sem([1:5;], FrequencyWeights([1:5;]))) ≈ sem(x)
@test @inferred(sem([1:5;], FrequencyWeights([1:5;]); mean=μ)) ≈ sem(x)

@inferred sem([1:5f0;]; mean=μ) ≈ sem(x)
@inferred sem([1:5f0;], ProbabilityWeights([1:5;]); mean=μ)
@inferred sem([1:5f0;], FrequencyWeights([1:5;]); mean=μ)
# Broken: Bug to do with implementation of
# @inferred sem([1:5f0;], UnitWeights{Int}(5); mean=μ)

@test isnan(sem(Int[]))
@test isnan(sem(Int[], FrequencyWeights(Int[])))
@test isnan(sem(Int[], ProbabilityWeights(Int[])))
@test isnan(sem(skipmissing(Union{Int,Missing}[missing, missing])))
@test_throws MethodError sem(Any[])
@test_throws MethodError sem(skipmissing([missing]))

Expand Down