Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeding up Kendall Correlation corkendall #634

Closed
PGS62 opened this issue Jan 15, 2021 · 19 comments
Closed

Speeding up Kendall Correlation corkendall #634

PGS62 opened this issue Jan 15, 2021 · 19 comments

Comments

@PGS62
Copy link
Contributor

PGS62 commented Jan 15, 2021

I need to calculate the Kendall correlation of an array of size approximately 1,000 x 10,000 and calculate some statistics of the resulting 10,000 x 10,000 correlation matrix.

The current implementation of corkendall, even though it uses Knight's algorithm, is quite slow in this case. I estimate about five hours on my PC. I hope I'll be able to get that time down using multiple threads. Should be straightforward no?

But first I had a go at improving the current not-threaded implementation and have achieved a 4 to 5 times speed improvement. I think the tests could be beefed up a bit too. An obvious way to do that is to test against a very simple order N^2 implementation.

Would a pull request containing my suggested replacement of the current corkendall likely be accepted?

If it would help, I could make my code available as a standalone package as an interim step.

Thanks,

Philip

@nalimilan
Copy link
Member

Well without more details about potential downsides, it's hard to say "no" to this kind of offer. :-) Please copy the code somewhere or make a PR if you want more specific comments.

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 17, 2021

OK, I will do that within the next few days. The code is already in GitHub in a private repository of mine, so when I'm happy to share it I will make the repo public, or I could make a pull request.

It's fair to raise the question about potential downsides. I don't think there are any downsides to my re-write, but it's always possible. For example, I need to look into performance testing against slightly "pathological" input data such as that with very large numbers of repeats. Another question is memory usage. My version does allocate more memory (as reported by @btime) and I confess I haven't really worried about that.

Since making my previous post I have written code that can compare the execution speed and results from arbitrary implementations of Kendall Tau, against randomly generated input data. That testing revealed some bugs in my code (no surprise, now fixed) but also a couple of minor ones in the existing code.

For example, StatsBase.corkendall fails if passed a single argument being a matrix of integers, but does not fail if passed two matrices of integers:

julia> X = rand(1:100,5,5)
5×5 Array{Int64,2}:
  59  44  25  11  20
 100  60  66   6  65
  85  36  18  11  32
  20  86  31  98  61
  39  69   1  90  75

julia> StatsBase.corkendall(X,X)
5×5 Array{Float64,2}:
  1.0       -0.6        0.2       -0.948683  0.0
 -0.6        1.0        0.2        0.527046  0.4
  0.2        0.2        1.0       -0.316228  0.0
 -0.948683   0.527046  -0.316228   1.0       0.105409
  0.0        0.4        0.0        0.105409  1.0

julia> StatsBase.corkendall(X)
ERROR: InexactError: Int64(-0.5999999999999999)
Stacktrace:
 [1] Int64 at .\float.jl:710 [inlined]
 [2] convert at .\number.jl:7 [inlined]
 [3] setindex! at .\array.jl:849 [inlined]
 [4] corkendall(::Array{Int64,2}) at C:\Users\phili\.julia\packages\StatsBase\EA8Mh\src\rankcorr.jl:116
 [5] top-level scope at REPL[32]:1

The fix for that problem is at line 113, which currently reads:

C = Matrix{eltype(X)}(I, n, n)

but should read:

C = Matrix{float(eltype(X))}(I, n, n)

Also the function can return values outside the range -1 to 1, which is less than ideal:

julia> StatsBase.corkendall(collect(1:15),collect(1:15))
1.0000000000000002

I think a fix for that is to change line 93 from reading:

return (nD - nT - nU + nV - 2swaps!(y)) / (sqrt(nD - nT) * sqrt(nD - nU))

to instead read:

return (nD - nT - nU + nV - 2swaps!(y)) / sqrt((nD - nT) * (nD - nU))

which is obviously the same mathematically. The only way Kendall Tau can be one is for the ordering of x and y to be the same with no ties. In this case, nT == nU == nV == swaps!(y) == 0 so the new version of line 93 is guaranteed to return exactly 1 (I think).

I have also been looking at using threads to get yet better speed in the function, but so far I've been rather disappointed by my results - about a two-fold speedup on an eight core machine - much less than I was hoping for... I don't even know whether putting multi-threaded routines in StatsBase is considered a good idea. Perhaps best to try walking before running when it comes to making contributions to the Julia code base...

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 18, 2021

Hello again,

I have made the code public on GitHub (public, but not registered)
https://github.com/PGS62/KendallTau.jl

Perhaps a member of StatsBase could have a look and let me know their thoughts.

Philip

@nalimilan
Copy link
Member

Thanks. Looks interesting. A few remarks:

  • Could you explain in broad terms how the new algorithm differs from the old one? Do you have references about it?
  • How much more memory does the new algorithm use (not just in terms of allocations reported by @time but as the total memory used at a single point in time, notably via the number of copies made).
  • Why not use the sorting functions provided by Julia?
  • I see you tested with lots of repetitions via vectors with only 1 and 2. Do you get the same performance with a few more unique values (which is more realistic)?

Regarding multithreading, it could be an interesting addition to a more general pairwise function (#627). That way we wouldn't have to reimplement it for each function. But first better improve the single-thread performance.

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 18, 2021

Here goes:

Could you explain in broad terms how the new algorithm differs from the old one? Do you have references about it?

In some senses the new algorithm is the same. The aim is to calculate:

  1. Number of pairs, which is easy: n(n-1)/2 where n = length(x)
  2. Number of double ties i.e. how many pairs i,j exist such that x[i] = x[j] and y[i] = y[j]
  3. Number of ties in x i.e. how many pairs i,j exist such that x[i] = x[j]
  4. Number of ties in y i.e. how many pairs i,j exist such that y[i] = y[j]
  5. The number of "swaps".

Having those five numbers yields Kendall Tau via Knight's observation that

Tau = (npairs + ndoubleties - ntiesx - ntiesy - 2 * nswaps) /
sqrt((npairs - ntiesx) * (npairs - ntiesy))

What does "swaps" mean? The answer is how many swap operations would the (inefficient) bubble sort algorithm require to morph x's ranking into y's ranking. More concretely think of first forming the two column array z = hcat(x,y) then sorting (using whatever algorithm you like) on the first column. Next calculate how many bubble sort operations are required to sort z on its second column. If x and y had identical ordering to begin with then the number of bubble sort operations is zero.

The new algorithm is faster for two principal reasons:

Reason 1) The new corkendall! takes an optional third argument permx which defaults to sortperm(x). That's an advantage because when at least one argument to corkendall is a matrix, a call to sortperm can be taken outside the loop of calls to corkendall!. (See for example lines 80 to 85 of rankcorr.jl.). This observation explains why the new code provides a greater multiplicative speed up when at least one of the arguments is a matrix.

Reason 2) But the new code is also faster in the vector-vector case, when reason 1 doesn't apply. This is because in the new corkendall!, the call to the method mergesort! at line 44 does double duty. It sorts its first argument (necessary so that the call to countties on the following line will work) and also returns the number of swaps (needed for "Knight's formula").

By contrast the old code doesn't play this "double-duty" trick. It does three sorting operations, at lines 44 and 47 (obvious) but also at line 93 (not so obvious) in the call to swaps! which is really the merge sort algorithm in disguise but doing only "single duty"!

How much more memory does the new algorithm use (not just in terms of allocations reported by @time but as the total memory used at a single point in time, notably via the number of copies made).

I'm not really sure how to investigate that. Can you point me to a web page this some hints?

Why not use the sorting functions provided by Julia?

I think my ramblings above answered this. I'm using my implementation of mergesort because it does "double duty", which the julia implementation does not (why would it?). I did look at the implementation of mergesort in Julia but only after I wrote my implementation after looking at the Wikipedia page https://en.wikipedia.org/wiki/Merge_sort.

I see you tested with lots of repetitions via vectors with only 1 and 2. Do you get the same performance with a few more unique values (which is more realistic)?

No, but I will have a look soon.

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 19, 2021

An update.

Inspired by the mergesort implementation in Julia base (line 581 here) I have reduced memory usage (by which I mean total allocations reported by @btime) from about double that of StatsBase.corkendall to about 40% more.

I also discovered that the default value of small_threshold which is the vector length below which mergesort! resorts to calling insertionsort! has a marked effect on both memory usage and the speed of mergesort! (and hence of corkendall). I have chosen a default value of 64 which by my brief experimentation maximises speed. But a somewhat higher value (perhaps 128?) might be a better compromise between speed benefit and memory penalty versus the existing code. Also the equivalent value SMALL_THRESHOLD in base/sort.jl is set as low as 20 so I think that there might be other reasons why the new code is a bit more memory hungry than the old code.

I also investigated the impact of the "density of repeats" as @nalimilan suggested. This is via a new test harness called speedtest_repeatdensity that compares the speed of the two versions of corkendall when run against vectors of length (say) 2,000 whose elements are drawings from successively larger sets, first 1:4, next 1:16, next 1:64 etc up to 1:65536 (at which stage repeats are rare).

Here are the results

julia> KendallTau.speedtest_repeatdensity([StatsBase.corkendall,KendallT
###################################################################
Executing speedtest_repeatdensity 2021-01-19T18:14:41.358
--------------------------------------------------
maxelement = 4
length(vectorwithrepeats1) = 2000
length(unique(vectorwithrepeats1)) = 4
length(unique(vectorwithrepeats2)) = 4
StatsBase.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  492.401 μs (12 allocations: 157.53 KiB)
KendallTau.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  196.400 μs (86 allocations: 306.16 KiB)
timeratio = 2.507133401221996
resultsidentical = true
--------------------------------------------------
maxelement = 16
length(vectorwithrepeats1) = 2000
length(unique(vectorwithrepeats1)) = 16
length(unique(vectorwithrepeats2)) = 16
StatsBase.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  565.600 μs (12 allocations: 157.53 KiB)
KendallTau.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  217.100 μs (94 allocations: 374.20 KiB)
timeratio = 2.605251036388761
resultsidentical = true
--------------------------------------------------
maxelement = 64
length(vectorwithrepeats1) = 2000
length(unique(vectorwithrepeats1)) = 64
length(unique(vectorwithrepeats2)) = 64
StatsBase.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  669.499 μs (12 allocations: 157.53 KiB)
KendallTau.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  249.200 μs (110 allocations: 731.00 KiB)
timeratio = 2.6865930979133226
resultsidentical = true
--------------------------------------------------
maxelement = 256
length(vectorwithrepeats1) = 2000
length(unique(vectorwithrepeats1)) = 256
length(unique(vectorwithrepeats2)) = 256
StatsBase.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  737.401 μs (12 allocations: 157.53 KiB)
KendallTau.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  287.101 μs (300 allocations: 2.20 MiB)
timeratio = 2.5684375881658372
resultsidentical = true
--------------------------------------------------
maxelement = 1024
length(vectorwithrepeats1) = 2000
length(unique(vectorwithrepeats1)) = 857
length(unique(vectorwithrepeats2)) = 879
StatsBase.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  744.900 μs (12 allocations: 157.53 KiB)
KendallTau.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  319.500 μs (638 allocations: 4.84 MiB)
timeratio = 2.3314553990610327
resultsidentical = true
--------------------------------------------------
maxelement = 4096
length(vectorwithrepeats1) = 2000
length(unique(vectorwithrepeats1)) = 1555
length(unique(vectorwithrepeats2)) = 1599
StatsBase.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  723.701 μs (12 allocations: 157.53 KiB)
KendallTau.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  288.800 μs (437 allocations: 3.27 MiB)
timeratio = 2.505889889196676
resultsidentical = true
--------------------------------------------------
maxelement = 16384
length(vectorwithrepeats1) = 2000
length(unique(vectorwithrepeats1)) = 1887
length(unique(vectorwithrepeats2)) = 1880
StatsBase.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  712.699 μs (12 allocations: 157.53 KiB)
KendallTau.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  230.601 μs (155 allocations: 1.07 MiB)
timeratio = 3.090615391954068
resultsidentical = true
--------------------------------------------------
maxelement = 65536
length(vectorwithrepeats1) = 2000
length(unique(vectorwithrepeats1)) = 1969
length(unique(vectorwithrepeats2)) = 1978
StatsBase.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  709.399 μs (12 allocations: 157.53 KiB)
KendallTau.corkendall(vectorwithrepeats1,vectorwithrepeats2)
  201.100 μs (76 allocations: 459.00 KiB)
timeratio = 3.527593237195425
resultsidentical = true
###################################################################

I think these results look good. The new code is consistently faster, by a factor of 2.5 when both vectors contain values only in the range 1:4 to a factor of 3.5 when when values are in the range 1:65536. Recall that the vector-vector use of corkendall is the usage with the lowest speedup - we can't take a call to sortperm outside the loop in that case, as I described in my previous post.

One more thing. I mention here one deficiency of corkendall versus its opposite numbers in Python and R, namely the lack of control as to how nan values are handled. For my own objectives I need something similar to use = "pairwise.complete.obs" in R's cor function documented here.

What should my next steps be? It would be great to be able to contribute what seems to be an improved version of corkendall to StatsBase, by raising a PR which gets accepted.

Philip

https://github.com/PGS62/KendallTau.jl

@nalimilan
Copy link
Member

nalimilan commented Jan 20, 2021

Thanks for the detailed explanation and additional checks!

Regarding mergesort!, I wonder whether we could use the standard sort! function from Base, but passing it a custom isless function that would increase a counter when !isless(x, y) to count the number of swaps. Do you think that could work? I'm asking because duplicating sort functions increases the risk of bugs in corner cases and we don't have a lot of workforce to check and maintain this.

Regarding the issue of NaN values, it looks like you're using them to represent missing values? #627 supports skipmissing=:listwise which is equivalent to use = "pairwise.complete.obs" in R. My goal is that instead of duplicating code to compute pairwise correlations or other functions, we use a common code in pairwise. We could add specialized pairwise methods for corkendall to take advantage of your sortperm trick (but that requires JuliaLang/julia#32310 to be able to dispatch on eachcol(mat)).

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 21, 2021

Well, I implemented two sorting functions: insertionsort! and mergesort! and only after implementing them did I look at equivalent code in base/sort.jl. That was a mistake. So I've reworked the code for each of those two methods by copying code from base, pasting to rankcorr.jl and then making the minimal edits to arrange that the functions return the "bubblesort distance".

The upshot was a further speedup and quite a big drop in memory usage. KendalTau.corkendall now uses ~30% less memory than StatsBase.corrkendall. See results for v1.3 here.

As for your idea to use sort! with a custom isless function I think that might work to eliminate insertionsort!, since we need to increase the counter by one whenever isless returns true. Presumably isless must return a Boolean. How should it also increment a counter? Could you explain please?

By contrast, I don't think the isless idea can work to eliminate mergesort!. That's because isless is (presumably) passed two values, but just knowing that pair of values is not enough to know by how many to increment the counter.

So we would end up with at least one copied-pasted-edited function (bad) but in mitigation the existing StatsBase.corkendall already has that. It has an implementation of the mergesort algorithm cunningly disguised as the function swaps!. So the count of unpleasantly duplicated functions need not go up!

On NaN values, I haven't made a start and it's good to hear that skipmissing=:listwise is in the works.

@nalimilan
Copy link
Member

As for your idea to use sort! with a custom isless function I think that might work to eliminate insertionsort!, since we need to increase the counter by one whenever isless returns true. Presumably isless must return a Boolean. How should it also increment a counter? Could you explain please?

I haven't thought too much about it, but something like this might work:

counter = 0
function myisless(x, y)
    x = isless(x, y)
    counter += x
    return x
end

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 22, 2021

Ah, thanks, easier than I thought. But there's a but...

As of yesterday, insertionsort! looked like this:

Version 1

"""
    insertionsort!(v::AbstractVector, lo::Integer, hi::Integer)

Mutates `v` by sorting elements `x[lo:hi]` using the insertionsort algorithm. 
This method is a copy-paste-edit of sort! in base/sort.jl (the method specialised on InsertionSortAlg),
but amended to return the bubblesort distance.
"""
function insertionsort!(v::AbstractVector, lo::Integer, hi::Integer)
    if lo == hi return 0 end
    nswaps = 0
    @inbounds for i = lo + 1:hi
        j = i
        x = v[i]
        while j > lo
            if x < v[j - 1]
                nswaps += 1
                v[j] = v[j - 1]
                j -= 1
                continue
            end
            break
        end
        v[j] = x
    end
    return nswaps
end

So, following your suggestion, I switched to insertionsort! being a wrap of sort!:

Version 2

function insertionsort!(v::AbstractVector, lo::Integer, hi::Integer)
    nswaps = 0
    function myisless(x, y)
        x = isless(x, y)
        nswaps += x
        return x
    end
    sort!(view(v, lo:hi), alg=Base.Sort.InsertionSort, lt=myisless)
    return nswaps
end

Good news is that the version 2 code passes all tests. Bad news is that the change kills performance. Instead of being >6 times faster than StatsBase.corkendall it's only 1.5 times faster, and instead of using 40% less memory, it uses 40% more.

Version 1

julia> KendallTau.speedtest([StatsBase.corkendall,KendallTau.corkendall],2000,10)
###################################################################
Executing speedtest 2021-01-22T15:55:37.676
--------------------------------------------------
size(matrix1) = (2000, 10)
StatsBase.corkendall(matrix1)
  34.889 ms (451 allocations: 5.54 MiB)
Main.KendallTau.corkendall(matrix1)
  5.602 ms (298 allocations: 3.40 MiB)
Speed ratio Main.KendallTau.corkendall vs StatsBase.corkendall: 6.227438240753963
Ratio of memory allocated Main.KendallTau.corkendall vs StatsBase.corkendall: 0.6130525086357451
Results from all 2 functions identical? true
--------------------------------------------------

Version 2

julia> KendallTau.speedtest([StatsBase.corkendall,KendallTau.corkendall],2000,10)
###################################################################
Executing speedtest 2021-01-22T15:51:10.97
--------------------------------------------------
size(matrix1) = (2000, 10)
StatsBase.corkendall(matrix1)
  35.072 ms (451 allocations: 5.54 MiB)
KendallTau.corkendall(matrix1)
  22.775 ms (294933 allocations: 7.89 MiB)
Speed ratio KendallTau.corkendall vs StatsBase.corkendall: 1.5399234042706402
Ratio of memory allocated KendallTau.corkendall vs StatsBase.corkendall: 1.4246610435616183
Results from all 2 functions identical? true
--------------------------------------------------

I don't understand why this performance drop-off happens, though I do remember reading that keyword arguments suffer a performance penalty in Julia, but I also thought that had been fixed. If you have any inspiration then let me know.

Otherwise I feel I'm getting to the point where I have done as good a job as I can of providing a better version of corkendall.

Relative to StatsBase.corkendall it is:

  1. Between 3 and 7 times faster.
  2. Uses about 40% less memory (as reported by @btime)
  3. Comes equipped with a thorough test harness against a naive implementation.

On the other hand:

  1. It involves undesirable copy-paste-edit of two methods of sort!
  2. It's written by an individual (me) without a track record of contributing to Julia.

What are your thoughts?

Philip

@nalimilan
Copy link
Member

Ah, that's due to nswaps being boxed since it's used in an anonymous function. Doing nswaps::Int = 0 improves performance. But this version seems to be almost as fast as without the custom isless:

function insertionsort!(v::AbstractVector, lo::Integer, hi::Integer)
    nswaps = Ref(0)
    function myisless(x, y)
        x = isless(x, y)
        nswaps[] += x
        return x
    end
    sort!(view(v, lo:hi), alg=Base.Sort.InsertionSort, lt=myisless)
    return nswaps[]
end

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 22, 2021

Thank you for spotting that. I read about boxing ages ago, but had obviously forgotten.

So if we call that version 3, we can compare against version 1:

using BenchmarkTools
data1 = rand(1:100,20)
data3 = copy(data1)
@btime insertionsort_v1!(data1,1,length(data1))
@btime insertionsort_v3!(data3,1,length(data3))
data1 == data3

  57.699 ns (0 allocations: 0 bytes)
  89.967 ns (1 allocation: 16 bytes)
true

So version 3 is about 1.5 times slower than version 1 when operating on 20 element vectors (which is where I have SMALL_THRESHOLD now, in line with base/sort.jl). It's not obvious how much that matters in terms of impact on the speed of corkendall but unfortunately it seems to matter a fair amount:

With version 1

###################################################################
Executing speedtest 2021-01-22T18:14:42.016       
--------------------------------------------------
size(matrix1) = (2000, 10)
StatsBase.corkendall(matrix1)
  35.161 ms (451 allocations: 5.54 MiB)
Main.KendallTau.corkendall(matrix1)
  5.401 ms (298 allocations: 3.40 MiB)
Speed ratio Main.KendallTau.corkendall vs StatsBase.corkendall: 6.50995167651034
Ratio of memory allocated Main.KendallTau.corkendall vs StatsBase.corkendall: 0.6130525086357451
Results from all 2 functions identical? true
--------------------------------------------------

versus with version 3:

###################################################################
Executing speedtest 2021-01-22T18:18:22.288
--------------------------------------------------
size(matrix1) = (2000, 10)
StatsBase.corkendall(matrix1)
  33.784 ms (451 allocations: 5.54 MiB)
Main.KendallTau.corkendall(matrix1)
  8.368 ms (1738 allocations: 3.42 MiB)
Speed ratio Main.KendallTau.corkendall vs StatsBase.corkendall: 4.037430087480281
Ratio of memory allocated Main.KendallTau.corkendall vs StatsBase.corkendall: 0.6170191666712577
Results from all 2 functions identical? true
--------------------------------------------------

So a 6.5 times speedup versus a 4 times speedup on the basis of these test parameters. That needs to be set against your earlier statement:

I'm asking because duplicating sort functions increases the risk of bugs in corner cases and we don't have a lot of workforce to check and maintain this.

FWIW my view is that Base.sort! must be battle-hardened by now and it's a simple algorithm. I find it difficult to imagine a corner-cases popping up for which a bug becomes apparent. Therefore copying it with a small edit seems unlikely to cause problems in the future. It seems a shame to leave additional performance on the table by not going for version 1.

@nalimilan
Copy link
Member

Too bad. It would probably be possible to avoid the overhead by making myisless a custom struct which wraps the value, creating it in corkendall!, and passing that to mergesort! to avoid reallocating a Ref on each call to insertionsort!. But that wouldn't give a really simpler code. What would really reduce duplication would be to be able to call the merge sort implementation from Base, but as you noted I'm not sure that's possible...

I think it's fair to assume that Base.sort! has been well tested, but it might still improve in the future. For example it's not unlikely that merge sort will use multithreading at some point since that's not too hard to do. Somebody will have to remember that we have an internal copy of the code an update it if that's the case.

Can you please make a PR with your best implementation? Then I'll make more detailed comments. Just as a first remark, I've noticed several cases of type instabilities where a variable is initialized as 0 but is then added a non-integer value. Also you can use @inbounds after checking vector bounds to get a bit more performance.

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 23, 2021

Might it be possible to add comments in base/sort.jl about the (slightly shameful) existence of copied code as a reminder that enhancements could usefully be copied across? I noticed that you are the last person to commit that file.

I have made changes to try to avoid type instability. The variables are still initialised as 0 but now when I add to them I do something like ntiesx += div(k * (k + 1) , 2) which keeps things as integers I think. I've also added some @inbounds as you suggested with some safety checks beforehand - you may wish to inspect what I've done there.

As you suspected, those changes do provide some additional speed. That takes the final speed improvement to about x 4 if both arguments are vectors and x 7 if one or both are matrices. Quite pleasing...

I also investigated whether the correlation of the input x and y vectors influence the speed of corkendall. It doesn't; I think because the speed of the mergesort algorithm is independent of the initial ordering of the input vector.

I will make a PR, but one question: Would it make sense for me to include my changes to test/rankcorr.jl? That's where I put my corkendallnaive together with the function compare_implementations in turn called by @test at the last line of the file. I found this setup invaluable during development, to get confidence that the code was correct. But I worry that it might be overly-complicated for other developers to deal with.

Many thanks for the guidance you have given these last few days.

Philip

@nalimilan
Copy link
Member

Might it be possible to add comments in base/sort.jl about the (slightly shameful) existence of copied code as a reminder that enhancements could usefully be copied across? I noticed that you are the last person to commit that file.

I'm afraid not. While StatsBase is kind of an essential Julia package, we can't add comments to the Julia codebase about all packages that would decide to copy parts of it.

I will make a PR, but one question: Would it make sense for me to include my changes to test/rankcorr.jl? That's where I put my corkendallnaive together with the function compare_implementations in turn called by @test at the last line of the file. I found this setup invaluable during development, to get confidence that the code was correct. But I worry that it might be overly-complicated for other developers to deal with.

Good question... I'm not completely sure, but I'd say it's simpler to hardcode the result that you know is correct for a series of input values that cover all common and corner cases.

Thanks for working on this.

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 24, 2021

Yes, on reflection much better to hard code correct results. I needed to ensure that the vectors tested had lengths greater than SMALL_THRESHOLD or else method mergesort! would not have been properly tested at all.

I've just made the PR. My first to any open source Julia package...

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 25, 2021

I imagine you've seen this, but after I submitted my PR yesterday I received an email "CI: Some jobs were not successful" with a link to here.

I'm looking into what went wrong, and I do think there's a bug in the code.

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 25, 2021

Rookie error I'm afraid.

The last line of corkendall! read:

 (npairs + ndoubleties - ntiesx - ntiesy - 2 * nswaps) /
     sqrt((npairs - ntiesx) * (npairs - ntiesy))

Taking the simple case where the numbers of ties are zero, that will fail if npairs^2 is greater than the maximum allowed integer. On my 64 bit PC that translates to a maximum vector length of 77936. On a 32 bit OS it translates to a maximum vector length of only 304. My tests had a maximum vector length of 8,000 so passed on 64 bit but failed on 32 bit.

So the fix I propose is to change that line to:

 (npairs + ndoubleties - ntiesx - ntiesy - 2 * nswaps) /
     sqrt(float(npairs - ntiesx) * float(npairs - ntiesy))

but I'm also concerned about lines such as:

ntiesx += div(k * (k + 1), 2)

which I think could cause errors on a 32 bit os if an element appears more than 64k times in one of the input vectors.

By the way it was pure luck that I decided to test on vectors as long as 8,000 - what was in my mind was to test on vectors longer than 64 since that's where I have SMALL_THRESHOLD - i.e. I was thinking about an entirely different issue.

@PGS62
Copy link
Contributor Author

PGS62 commented Jan 26, 2021

I think I have it fixed now and will try another PR later today. I discovered that the existing StatsBase implementation of corkendall also fails on 32 bit:

x = [1,2,3,4,5];
y = [5,3,4,2,5];
julia>StatsBase.corkendall(x, y)  -1 / sqrt(90)
true
julia>StatsBase.corkendall(repeat(x,9268), repeat(y,9268))  -1 / sqrt(90)
true
julia>StatsBase.corkendall(repeat(x,9269), repeat(y,9269))  -1 / sqrt(90)
ERROR: DomainError with -1.288340038e9:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(::Symbol, ::Float64) at .\math.jl:33
 [2] sqrt at .\math.jl:573 [inlined]
 [3] sqrt at .\math.jl:599 [inlined]
 [4] corkendall!(::Array{Float64,1}, ::Array{Float64,1}) at C:\Users\phili\.julia\packages\StatsBase\EA8Mh\src\rankcorr.jl:93
 [5] corkendall(::Array{Int32,1}, ::Array{Int32,1}) at C:\Users\phili\.julia\packages\StatsBase\EA8Mh\src\rankcorr.jl:103
 [6] top-level scope at REPL[183]:1

nalimilan pushed a commit that referenced this issue Feb 8, 2021
New version of corkendall is approx 4 times faster if both arguments are vectors and 7 times faster if at least one is a matrix. See issue #634 for details.
@PGS62 PGS62 closed this as completed Apr 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants