Split `AnalyticWeights` into several types #758
Comments
Indeed, we really need to clarify the definition of weights, though I'm not sure whether we should deprecate it. Also, before adding a new weights type, I'd like to know whether it exists in other implementations. If not, we have to consider very seriously why. FWIW, here's the link to SAS's documentation.
The current implementation uses the following definition/assumptions:
I haven't seen extra weight types like this in other languages. Why that is varies; sometimes it's solved using keyword arguments, sometimes by clearly documenting in each function what the weights mean.
OK. Do you have examples of either case?
Why does the result depend on the scale of weights if the number of observations is taken to be the length? :-/
For the current implementation, it doesn't. However, for most implementations outside of Hmisc (e.g. Stata) the scale does matter. The docstring also implies the scale should matter in the examples it gives.
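To make the scale question concrete, here's a small sketch (in Python for illustration; StatsBase itself is Julia) contrasting two denominator conventions for the weighted variance. The analytic-style correction shown here matches, to my understanding, what StatsBase's `var` currently uses for `AnalyticWeights`; the function names are mine, not the library's.

```python
from math import isclose

def wmean(x, w):
    """Weighted mean: sum(w*x) / sum(w)."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def wvar_freq(x, w):
    """Frequency-weight convention: denominator sum(w) - 1 (scale-dependent)."""
    m = wmean(x, w)
    ss = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x))
    return ss / (sum(w) - 1)

def wvar_analytic(x, w):
    """Analytic-weight convention (as I understand StatsBase's current var):
    denominator sum(w) - sum(w^2)/sum(w), which is invariant to rescaling w."""
    m = wmean(x, w)
    ss = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x))
    sw = sum(w)
    return ss / (sw - sum(wi ** 2 for wi in w) / sw)

x = [1.0, 2.0, 4.0]
w = [1.0, 2.0, 3.0]
w2 = [10 * wi for wi in w]  # same weights, different scale

# Rescaling changes the frequency-style estimate...
print(wvar_freq(x, w), wvar_freq(x, w2))
# ...but leaves the analytic-style estimate unchanged.
print(isclose(wvar_analytic(x, w), wvar_analytic(x, w2)))  # True
```

This is why, with the current correction, the answer can't depend on the scale of the weights, only on their relative sizes.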
As an example of documenting this, Stata explains here how the weights behave.
Hmisc uses a keyword argument that controls the sample size used for a weighted variance.
Interesting link about Stata. So what I understand from there is that analytic weights in Stata are really used for the case where cases are averages of independent observations, but the estimate of the mean squared error this yields is a separate question.

Our docstring is modeled after Stata's analytic weights, so it inherits their strengths and weaknesses; AFAICT it's not plainly wrong.

So what I seem to understand is that we could keep them.
Yes, although even here we run up against several ambiguities. I believe Stata's choice is mostly motivated by how they define the standard error.
I personally think the best solution is to create several types of weights we can document unambiguously, while leaving the existing type in place.
OK. Could you write a detailed proposal regarding which formulas these new weights types would use in the relevant functions?
Sure; but actually, let me think about this a little bit more, since I'm starting to wonder how much of this should go in StatsBase and how much should go in a different package for meta-analysis.
OK, thinking about it, I think most of the weights types can be folded into one. The only type I can't figure out how to fold in is inverse-variance weights that are known exactly ahead of time, but that's definitely an extremely rare use case.
OK. So you suggest we store the sample size as a separate field with a sensible default?
Correct.
@gragusa This discussion is probably relevant for JuliaStats/GLM.jl#487.
The more I've thought about it, the more I think the simplest solution might be best: clearly defining what the existing weights mean. This would require deprecating any variance methods that allow passing them, though.
This definition sounds fine, as it's quite precise. Though as long as the estimator makes some sense, I wouldn't completely deprecate it, but would instead add a warning to the docs. Also, are you sure all our current methods are consistent with that definition? They would have to be scale-dependent, right? Do you still think we would need to add another weights type?
The problem is that asking for the variance here is ambiguous: are you asking for the variance across the aggregates, or across individual cases? One is much bigger than the other. There's also the problem that lots of users won't read the docs from top to bottom, especially since they think they already know what the weights mean.
I don't think we'd have to, given that situations with a precisely-known variance are rare outside of textbook problems, but there's not much reason to exclude them. They're just not a very high priority IMO.
Methods would have to be scale-dependent for that definition, yes.
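Under the scale-dependent reading, where each weight counts the underlying observations behind an averaged value, the total sample size is `sum(w)`, so rescaling the weights necessarily changes the uncertainty estimate. A minimal sketch of that reading (Python for illustration; the helper name is mine, and a common known per-observation variance `s2` is assumed for simplicity):

```python
from math import sqrt

def batch_mean_and_se(means, w, s2):
    """Sketch of the scale-dependent reading: each means[i] is the average
    of w[i] underlying observations, each with variance s2. The total
    sample size is then sum(w), so rescaling w changes the answer."""
    n = sum(w)                                        # underlying observations
    m = sum(wi * mi for wi, mi in zip(w, means)) / n  # overall mean
    se = sqrt(s2 / n)                                 # SE of the overall mean
    return m, se

# Doubling the weights leaves the mean alone but shrinks the standard
# error, because it claims twice as many underlying observations.
m1, se1 = batch_mean_and_se([1.0, 2.0], [10.0, 30.0], 4.0)
m2, se2 = batch_mean_and_se([1.0, 2.0], [20.0, 60.0], 4.0)
print(m1 == m2, se1 > se2)  # True True
```

This is exactly the behavior the current scale-invariant correction cannot express.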
As mentioned when working on #754, it might make sense to split `AnalyticWeights` into two separate types of weights:

- `PrecisionWeights`, which represent observations with a known variance. The weight for each observation should be equal to `1/σ^2`.
- `BatchWeights`, which represent observations that are themselves averages of several observations (as used in e.g. meta-analysis). Each weight should be an integer equal to the number of observations in the batch.

It's possible to break these down further: `BatchWeights` might be broken down based on whether they're homogeneous (identical means) or heterogeneous (differing means), and `PrecisionWeights` could be broken down based on whether each observation has a variance that is completely known or known only up to a constant.

This would solve the current problem where the ambiguity of `AnalyticWeights` can result in nonsense answers. For instance, `var` currently returns an estimate that assumes a sample size equal to `length(weights)`, which is incorrect if each observation is actually an average of several sub-observations (the way the docstring for `AnalyticWeights` says they should be used).
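For the first of the two proposed types, the textbook computation is fixed-effect (inverse-variance) pooling, sketched below in Python for illustration; the function name is mine, and this is only what a `PrecisionWeights`-style estimator would presumably compute, not an agreed API:

```python
def precision_weighted_mean(x, sigma2):
    """Fixed-effect (inverse-variance) pooling, the textbook use of
    precision weights w_i = 1/sigma_i^2: the more precise an estimate,
    the more it counts, and the pooled variance is exactly 1/sum(w)."""
    w = [1.0 / s2 for s2 in sigma2]
    sw = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    return mean, 1.0 / sw

# Two equally precise estimates: the pooled mean is their midpoint,
# and the pooled variance is half the individual variance.
m, v = precision_weighted_mean([1.0, 3.0], [1.0, 1.0])
print(m, v)  # 2.0 0.5
```

Note that here the variance of the result is known exactly rather than estimated from the data, which is what distinguishes this case from the batch-weights case.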