Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a skipmissing argument to corkendall #683

Closed
PGS62 opened this issue Apr 20, 2021 · 6 comments
Closed

Add a skipmissing argument to corkendall #683

PGS62 opened this issue Apr 20, 2021 · 6 comments

Comments

@PGS62
Copy link
Contributor

PGS62 commented Apr 20, 2021

With help from @nalimilan, back in February I contributed a re-write of corkendall that yielded quite useful speedups. (see this commit. But I need a version of corkendall that supports missing elements, which StatsBase.corkendall doesn't support at present. Hence my (public, but not registered) package KendallTau.

I would like to propose that corkendall in StatsBase be replaced by code copied from KendallTau, but I know that how best to handle missing values in StatsBase has been a topic of various discussions, including a more general approach than my proposal here so I'd like to know if a pull request on these lines might be accepted.

This is the approach I have taken:

  1. Defined types:
const RealOrMissingVector{T <: Real} = AbstractArray{<:Union{T, Missing},1}
const RealOrMissingMatrix{T <: Real} = AbstractArray{<:Union{T, Missing},2}
  1. written an auxiliary function skipmissingpairs that takes a pair of arguments each of one of the above two types and does the needful to implement both a "pairwise" and "complete" handling of missings, i.e. a return a pair of vectors\matrices with the missings (or rows containing missings) removed and also type narrowed to RealVector or RealMatrix.

  2. amended the methods of corkendall to have signatures such as:

function corkendall(X::Union{RealMatrix,RealOrMissingMatrix},
                    y::Union{RealVector,RealOrMissingVector};
                    skipmissing::Symbol=:undefined)

The proposed change would be non-breaking on existing code, which must be passing arguments of type RealVector and\or RealMatrix. Also when one or more of the arguments is of type RealOrMissing... then skipmissing defaults to :undefined thus forcing the user to decide whether :pairwise or :complete is the better approach for the data they have.

Other changes I have made:

  1. Slightly improved consistency with corspearman, for example errors when dimensions of arguments don't match now align between the two functions. Obviously this proposal will take corkendall "ahead" of corspearman, but I think it would be straightforward to make equivalent changes to corspearman, hopefully without code duplication.

  2. Some code refactoring that gives an approx 10% further speedup to corkendall, and approx 40% reduction in memory allocation. The trick was to reduce the number of calls to permute! in all but the vector-vector case.

  3. Amended the tests to test behaviour against missing values. I also have a test suite testing corkendall against corkendal_naive that, for simplicity, does not use Knight's algorithm. That gives confidence that results from corkendall are correct for a wide variety of argument types and values but I don't propose to port that test suite to StatsBase - it's too complicated.

Could a member of StatsBase let me know their thoughts?

@nalimilan
Copy link
Member

Thanks for the proposal. I think we first need to find a solution for cor in Statistics. Then we'll just have to follow the same pattern in StatsBase. Unfortunately this issue is quite tricky...

@PGS62
Copy link
Contributor Author

PGS62 commented Apr 21, 2021

Starting from where I've got to, I think it would be fairly easy to create a function corgeneral with signature:

function corgeneral(fn::Function, X::Union{RealMatrix,RealOrMissingMatrix},
                    y::Union{RealVector,RealOrMissingVector};
                    skipmissing::Symbol=:undefined)

where fn is any function that takes two RealVectors and returns a Float64. So it would apply for cor, corkendall, corspearman or other correlations such as those described here. Possibly, there could be one additional argument, maybe sortfirst::Bool which could implement the "sort outside the loop" trick that I used to speedup corkendall, or more generally a applythisfunctionoutsidetheloop::Function argument.

corgeneral could either be documented and exported from StatsBase or else be an internal function used to give corspearman, corkendall etc. the capability to handle missings in both :pairwise and :complete style (or both exported and used...).

What I'm not sure about is whether such a solution could live in StatsBase, or whether it would have to be in Statistics, which is perhaps lower in the hierarchy. Can two packages call functions from one another in Julia?

Could something along these lines serve as the solution you are looking for?

@nalimilan
Copy link
Member

The problem is not so much the internal implementation. It's rather whether we want to add a skipmissing argument (which currently doesn't exist anywhere in Base nor Statistics) or whether we can use a wrapper like skipmissings.

@PGS62
Copy link
Contributor Author

PGS62 commented Apr 21, 2021

OK I understand. So I had a go at creating a method of my skipmissingpairs function that does what you're after, or at least it does if I've understood the problem correctly.

To be explicit, given a function fn (example: StatsBase.corkendall) skipmissingpairs returns a function with the five methods we need (Matrix, Matrix-Matrix, Vector-Matrix, etc) that does exactly what we are after, i.e. pair-wise skipping of missing elements and calling fn as the correlation aka reduction function.

I've tested it lightly and it does appear to work but I've not done any performance testing at all (it certainly sacrifices the trick to speedup corkendall).

julia> x = [missing;1;2;5;4;missing];

julia> y = [1;3;missing;2;3;6];

julia> hcat(x,y)
6×2 Matrix{Union{Missing, Int64}}:
  missing  1
 1         3
 2          missing
 5         2
 4         3
  missing  6

julia> KendallTau.skipmissingpairs(StatsBase.corkendall)(x,y)
-0.8164965809277261

julia> StatsBase.corkendall([1;5;4],[3;2;3])
-0.8164965809277261

Apart from the definition of RealOrMissingVector and RealOrMissingMatrix, all the code is in this one file.

@PGS62
Copy link
Contributor Author

PGS62 commented Apr 22, 2021

I've simplified the code quite a bit and renamed the function to skipmissingpairwise (longer name but clearer I hope).

and it's now in a different file: skipmissingpairwise.jl

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 9, 2023

Following #627 I think this issue should be closed. I'll open a new issue suggesting how to get pairwise to play nice with corkendall and corspearman.

@PGS62 PGS62 closed this as completed Jan 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants