Weighted Arrays? #776

ParadaCarleton · 2022-03-23T15:45:35Z

I've been considering this for a while. Would it make sense to define a new struct, a weighted_array, which contains both an array and a set of weights? The primary advantages are as follows:

The weighted array can be stored contiguously in memory as an array of (element, weight) tuples. Weights and array elements are almost always accessed together, so this allows for faster access.
Allows weighted_arrays to be passed as a single argument in place of an array.
The user can conveniently manipulate weights together with observations. For example, dropping missing values would also automatically drop the weights associated with them.
(The old interface can also be kept.)
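To make the proposal concrete, here is a minimal sketch of what such a type might look like. Everything in it is an assumption for illustration: `WeightedArray`, `drop_missing` and `weighted_mean` are hypothetical names, not existing StatsBase API, and the sketch copies its inputs into a vector of tuples.

```julia
# Hypothetical type: observations and weights stored together as (element, weight) tuples.
struct WeightedArray{T,W<:Real}
    data::Vector{Tuple{T,W}}
end

# Pair each observation with its weight; the two inputs must have the same length.
function WeightedArray(x::AbstractVector{T}, w::AbstractVector{W}) where {T,W<:Real}
    length(x) == length(w) || throw(DimensionMismatch("values and weights differ in length"))
    return WeightedArray{T,W}(Tuple{T,W}[(xi, wi) for (xi, wi) in zip(x, w)])
end

# Dropping missing observations drops their weights with them.
drop_missing(wa::WeightedArray) =
    WeightedArray(filter(p -> !ismissing(first(p)), wa.data))

# Weighted mean computed from the paired storage.
weighted_mean(wa::WeightedArray) =
    sum(x * w for (x, w) in wa.data) / sum(w for (_, w) in wa.data)

wa = WeightedArray([1.0, missing, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4])
clean = drop_missing(wa)          # three (element, weight) pairs remain
m = weighted_mean(clean)
```

The point of the sketch is only that a single object carries both pieces, so an operation such as filtering cannot leave the weights out of sync with the observations.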

nalimilan · 2022-03-26T13:58:49Z

There's been some discussion about something similar at JuliaLang/julia#33310 (comment) (and following comments) and https://github.com/JuliaLang/Statistics.jl/issues/88. It could be an interesting alternative to passing weights as a separate argument. But I find the syntax a bit weird with functions that take several arguments, like cor(weighted(w, x), weighted(w, y)) or (more compact but weirder) cor(weighted(w, x, y)) -- and ideally we want to have a consistent syntax for single- and multiple-argument functions. Of course we could support two different syntaxes, but for now I'd rather focus on getting a single syntax to work correctly in all cases (notably skipping missing values).

Regarding performance, it would probably not be faster:

It's not clear that it would be faster to store (element, weight) tuples. When processing arrays in loops, AFAIK it's easier to get the compiler to use SIMD instructions when working on two separate arrays. And if you combine e.g. an Int8 value with a Float64 weight, you have to add some padding in the array to ensure all elements are aligned:

julia> Base.summarysize(fill((Int8(1), 1.0), 10_000))
160040

julia> Base.summarysize(fill(Int8(1), 10_000)) + Base.summarysize(fill(1.0, 10_000))
90080

Even if it were faster, having weighted_array make a copy of the values and weights to allocate a vector of (element, weight) tuples would be prohibitively slow if you need to compute weighted stats on different variables. That said, we could implement such a wrapper as a view of the inputs (like AbstractWeights currently).
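A rough sketch of the view-style wrapper mentioned above, assuming a hypothetical `WeightedView` type (not part of StatsBase) that only keeps references to its inputs, so the same weight vector can be reused across several variables without copying:

```julia
# Hypothetical wrapper: holds references to the inputs instead of copying them.
struct WeightedView{X<:AbstractVector,W<:AbstractVector{<:Real}}
    vals::X
    weights::W
    function WeightedView(vals::X, weights::W) where {X<:AbstractVector,W<:AbstractVector{<:Real}}
        length(vals) == length(weights) ||
            throw(DimensionMismatch("values and weights differ in length"))
        return new{X,W}(vals, weights)
    end
end

# Weighted mean over the two separate arrays; no tuple array is ever allocated.
view_mean(wv::WeightedView) =
    sum(x * w for (x, w) in zip(wv.vals, wv.weights)) / sum(wv.weights)

w = [0.5, 0.25, 0.25]
a = WeightedView([1.0, 2.0, 3.0], w)
b = WeightedView([10.0, 20.0, 30.0], w)

shared = a.weights === b.weights   # both wrappers point at the same weight vector
ma = view_mean(a)
mb = view_mean(b)
```

Values and weights stay in two separate arrays here, so the padding concern above does not arise, at the cost of giving up the contiguous tuple layout from the original proposal.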

perrette · 2024-06-13T09:53:16Z

It's a semantic issue rather than a syntax issue. Think about x = WeightedArray(a, w) and y = WeightedArray(b, v). What should cov(x, y) return?

If one could define a meaningful operation then the semantic problem would be gone entirely.

E.g.

x_centered = a - ... # uses w weight
y_centered = b - ... # uses v weight 
cov = ... # uses sqrt(v * w) ???

Here the definition of sqrt(v * w) does the trick of making it consistent with cov(x, x), but it is not satisfying because I don't think it has a statistical meaning. There might be a solution to this, but until one is found, I would simply not extend cov or other similarly problematic multivariate operators to weighted arrays.
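For illustration only, one way to spell out the sqrt(v * w) idea on plain vectors. The names `wmean`, `wvar` and `cov_geomean` are hypothetical, and the choice to normalize by the sum of the combined weights is my assumption; nothing here is an established estimator.

```julia
# Weighted mean and (uncorrected) weighted variance on plain vectors.
wmean(a, w) = sum(a .* w) / sum(w)
wvar(a, w) = sum(w .* (a .- wmean(a, w)) .^ 2) / sum(w)

# Hypothetical covariance: center each variable with its own weights, then
# combine the two weight vectors by their elementwise geometric mean.
function cov_geomean(a, w, b, v)
    u = sqrt.(w .* v)            # combined weights; equals w when v == w
    ac = a .- wmean(a, w)        # centered with w
    bc = b .- wmean(b, v)        # centered with v
    return sum(u .* ac .* bc) / sum(u)
end

a = [1.0, 2.0, 4.0, 7.0]
w = [0.1, 0.2, 0.3, 0.4]

# With identical inputs the result agrees with the weighted variance.
same = isapprox(cov_geomean(a, w, a, w), wvar(a, w))
```

This only checks the consistency property with cov(x, x); it says nothing about whether the quantity is statistically meaningful when the two weight vectors differ.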

But I'm not a fan of hiding things by putting everything into classes. Long live Vector functions!

Coming from Python, I find that having to define classes for weights and for weighted arrays is not very transparent, because it is a judgement about the "quality" of the objects rather than their "function", which I find problematic in general, in coding (lack of transparency) as in society. I would much prefer separate functions such as wmedian (possibly with type 1, ..., 7?? as an additional parameter) that take Vector-typed values and weights as separate arguments, as is the case now, but without the Weight classes (or that at least also accept a Vector and convert it to Weight internally?). In any case, the distinction between Frequency and Probability Weights already goes too far IMO.

sethaxen mentioned this issue May 9, 2023

Make the API consistent with Statistics.jl/StatsBase.jl circstat/CircStats.jl#9

