Description
With help from @nalimilan, back in February I contributed a re-write of corkendall
that yielded quite useful speedups. (see this commit. But I need a version of corkendall
that supports missing elements, which StatsBase.corkendall
doesn't support at present. Hence my (public, but not registered) package KendallTau.
I would like to propose that corkendall
in StatsBase be replaced by code copied from KendallTau, but I know that how best to handle missing values in StatsBase has been a topic of various discussions, including a more general approach than my proposal here so I'd like to know if a pull request on these lines might be accepted.
This is the approach I have taken:
- Defined types:
const RealOrMissingVector{T <: Real} = AbstractArray{<:Union{T, Missing},1}
const RealOrMissingMatrix{T <: Real} = AbstractArray{<:Union{T, Missing},2}
-
written an auxiliary function
skipmissingpairs
that takes a pair of arguments each of one of the above two types and does the needful to implement both a "pairwise" and "complete" handling of missings, i.e. a return a pair of vectors\matrices with the missings (or rows containing missings) removed and also type narrowed toRealVector
orRealMatrix
. -
amended the methods of
corkendall
to have signatures such as:
function corkendall(X::Union{RealMatrix,RealOrMissingMatrix},
y::Union{RealVector,RealOrMissingVector};
skipmissing::Symbol=:undefined)
The proposed change would be non-breaking on existing code, which must be passing arguments of type RealVector
and\or RealMatrix
. Also when one or more of the arguments is of type RealOrMissing...
then skipmissing
defaults to :undefined
thus forcing the user to decide whether :pairwise
or :complete
is the better approach for the data they have.
Other changes I have made:
-
Slightly improved consistency with
corspearman
, for example errors when dimensions of arguments don't match now align between the two functions. Obviously this proposal will takecorkendall
"ahead" ofcorspearman
, but I think it would be straightforward to make equivalent changes tocorspearman
, hopefully without code duplication. -
Some code refactoring that gives an approx 10% further speedup to corkendall, and approx 40% reduction in memory allocation. The trick was to reduce the number of calls to
permute!
in all but the vector-vector case. -
Amended the tests to test behaviour against missing values. I also have a test suite testing corkendall against
corkendal_naive
that, for simplicity, does not use Knight's algorithm. That gives confidence that results fromcorkendall
are correct for a wide variety of argument types and values but I don't propose to port that test suite to StatsBase - it's too complicated.
Could a member of StatsBase let me know their thoughts?