-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a skipmissing argument to corkendall #683
Comments
Thanks for the proposal. I think we first need to find a solution for |
Starting from where I've got to, I think it would be fairly easy to create a function
where
What I'm not sure about is whether such a solution could live in StatsBase, or whether it would have to be in Statistics, which is perhaps lower in the hierarchy. Can two packages call functions from one another in Julia? Could something along these lines serve as the solution you are looking for? |
The problem is not so much the internal implementation. It's rather whether we want to add a |
OK I understand. So I had a go at creating a method of my To be explicit, given a function I've tested it lightly and it does appear to work but I've not done any performance testing at all (it certainly sacrifices the trick to speedup
Apart from the definition of |
I've simplified the code quite a bit and renamed the function to and it's now in a different file: skipmissingpairwise.jl |
Following #627 I think this issue should be closed. I'll open a new issue suggesting how to get |
With help from @nalimilan, back in February I contributed a re-write of
corkendall
that yielded quite useful speedups. (see this commit. But I need a version ofcorkendall
that supports missing elements, whichStatsBase.corkendall
doesn't support at present. Hence my (public, but not registered) package KendallTau.I would like to propose that
corkendall
in StatsBase be replaced by code copied from KendallTau, but I know that how best to handle missing values in StatsBase has been a topic of various discussions, including a more general approach than my proposal here so I'd like to know if a pull request on these lines might be accepted.This is the approach I have taken:
written an auxiliary function
skipmissingpairs
that takes a pair of arguments each of one of the above two types and does the needful to implement both a "pairwise" and "complete" handling of missings, i.e. a return a pair of vectors\matrices with the missings (or rows containing missings) removed and also type narrowed toRealVector
orRealMatrix
.amended the methods of
corkendall
to have signatures such as:The proposed change would be non-breaking on existing code, which must be passing arguments of type
RealVector
and\orRealMatrix
. Also when one or more of the arguments is of typeRealOrMissing...
thenskipmissing
defaults to:undefined
thus forcing the user to decide whether:pairwise
or:complete
is the better approach for the data they have.Other changes I have made:
Slightly improved consistency with
corspearman
, for example errors when dimensions of arguments don't match now align between the two functions. Obviously this proposal will takecorkendall
"ahead" ofcorspearman
, but I think it would be straightforward to make equivalent changes tocorspearman
, hopefully without code duplication.Some code refactoring that gives an approx 10% further speedup to corkendall, and approx 40% reduction in memory allocation. The trick was to reduce the number of calls to
permute!
in all but the vector-vector case.Amended the tests to test behaviour against missing values. I also have a test suite testing corkendall against
corkendal_naive
that, for simplicity, does not use Knight's algorithm. That gives confidence that results fromcorkendall
are correct for a wide variety of argument types and values but I don't propose to port that test suite to StatsBase - it's too complicated.Could a member of StatsBase let me know their thoughts?
The text was updated successfully, but these errors were encountered: