Speeding up Kendall Correlation corkendall
#634
Well, without more details about potential downsides, it's hard to say "no" to this kind of offer. :-) Please copy the code somewhere or make a PR if you want more specific comments.
OK, I will do that within the next few days. The code is already on GitHub in a private repository of mine, so when I'm happy to share it I will make the repo public, or I could make a pull request.

It's fair to raise the question about potential downsides. I don't think there are any downsides to my re-write, but it's always possible. For example, I need to look into performance testing against slightly "pathological" input data, such as data with very large numbers of repeats. Another question is memory usage: my version does allocate more memory, as reported by the benchmark output.

Since making my previous post I have written code that can compare the execution speed and results of arbitrary implementations of Kendall Tau against randomly generated input data. That testing revealed some bugs in my code (no surprise, now fixed) but also a couple of minor ones in the existing code.
The fix for one of those problems is a one-line change at line 113. Also, the function can return values outside the range -1 to 1, which is less than ideal; I think a fix for that is a small change to the expression at line 93, replacing it with one that is equivalent mathematically but better behaved numerically.
The only way Kendall Tau can be exactly one is for the ordering of the two arguments to match exactly.

I have also been looking at using threads to get yet better speed in the function, but so far I've been rather disappointed by my results - about a two-fold speedup on an eight-core machine - much less than I was hoping for... I don't even know whether putting multi-threaded routines in StatsBase is considered a good idea. Perhaps best to try walking before running when it comes to making contributions to the Julia code base...
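For illustration, a minimal sketch of the kind of threaded version I have been experimenting with (hypothetical names only - `kendall_pair` stands in for the single-pair kernel, and the real code differs):

```julia
# Hypothetical sketch: parallelise the matrix method over columns of the result.
# Each entry depends only on two columns of X, so columns of the result can be
# computed independently; this naive version ignores the symmetry of the result.
function corkendall_threaded(X::AbstractMatrix{<:Real})
    n = size(X, 2)
    C = Matrix{Float64}(undef, n, n)
    Threads.@threads for j in 1:n
        for i in 1:n
            C[i, j] = i == j ? 1.0 : kendall_pair(view(X, :, i), view(X, :, j))
        end
    end
    return C
end
```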
Hello again. I have made the code public on GitHub (public, but not registered). Perhaps a member of StatsBase could have a look and let me know their thoughts.

Philip
Thanks. Looks interesting. A few remarks:
Regarding multithreading, it could be an interesting addition to a more general mechanism.
Here goes:
In some senses the new algorithm is the same. The aim is to calculate five numbers: npairs (the number of pairs of elements, i.e. n(n-1)/2), ntiesx (the number of pairs tied in x), ntiesy (the number of pairs tied in y), ndoubleties (the number of pairs tied in both x and y) and nswaps (explained below).
Having those five numbers yields Kendall Tau via Knight's observation that

Tau = (npairs + ndoubleties - ntiesx - ntiesy - 2 * nswaps) / sqrt((npairs - ntiesx) * (npairs - ntiesy))

What does "swaps" mean? The answer is how many swap operations the (inefficient) bubble sort algorithm would require to morph x's ranking into y's ranking. More concretely, think of first forming the two-column array whose first column is x and whose second column is y, then sorting the rows of that array so that the first column is in ascending order; nswaps is the number of adjacent-row swaps that bubble sort would then need to bring the second column into ascending order too.

The new algorithm is faster for two principal reasons.

Reason 1) When the arguments are matrices, the new code computes the sort permutation of each column only once and re-uses it across all the pairings involving that column, rather than re-sorting from scratch for every pair of columns.

Reason 2) But the new code is also faster in the vector-vector case, when reason 1 doesn't apply. This is because in the new code the merge sort "does double duty": the same pass that sorts the data also counts the swaps that Knight's formula needs.

By contrast the old code doesn't play this "double-duty" trick. It does three sorting operations, at lines 44 and 47 (obvious) but also at line 93 (not so obvious) in the call to the function that counts the swaps.
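As an illustration of that definition of "swaps" (a sketch only, with hypothetical names, and ignoring ties so that the formula simplifies):

```julia
# Count the exchanges bubble sort needs to sort v, i.e. the number of inversions.
function bubble_swaps(v)
    v = copy(v)
    n, swaps = length(v), 0
    for pass in 1:n-1, j in 1:n-pass
        if v[j] > v[j+1]
            v[j], v[j+1] = v[j+1], v[j]
            swaps += 1
        end
    end
    return swaps
end

# Knight's formula in the no-ties case: reorder y into x's sort order, count swaps.
function tau_noties(x, y)
    npairs = length(x) * (length(x) - 1) ÷ 2
    nswaps = bubble_swaps(y[sortperm(x)])
    return (npairs - 2 * nswaps) / npairs
end
```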
I'm not really sure how to investigate that. Can you point me to a web page with some hints?
I think my ramblings above answered this. I'm using my own implementation of mergesort because it does "double duty", which the Julia implementation does not (why would it?). I did look at the implementation of mergesort in Julia, but only after I had written my own implementation based on the Wikipedia page https://en.wikipedia.org/wiki/Merge_sort.
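A sketch of what "double duty" means (illustrative only; the names are hypothetical and the real implementation differs in detail): every time the merge step takes an element from the right-hand half, that element jumps over all the remaining left-hand elements, and the sum of those jumps is exactly the bubble-sort distance.

```julia
# Sort v[lo:hi] in place and return the number of inversions it contained,
# i.e. the number of swaps bubble sort would have needed.
function mergesort_swaps!(v::AbstractVector, lo::Int=1, hi::Int=length(v),
                          buf::AbstractVector=similar(v))
    lo >= hi && return 0
    mid = (lo + hi) >>> 1
    nswaps = mergesort_swaps!(v, lo, mid, buf) + mergesort_swaps!(v, mid + 1, hi, buf)
    copyto!(buf, lo, v, lo, mid - lo + 1)   # stash the left half
    i, j, k = lo, mid + 1, lo
    while i <= mid && j <= hi
        if v[j] < buf[i]                    # strict comparison: ties are not swaps
            nswaps += mid - i + 1           # v[j] jumps over all remaining left elements
            v[k] = v[j]; j += 1
        else
            v[k] = buf[i]; i += 1
        end
        k += 1
    end
    while i <= mid                          # copy back any leftovers from the left half
        v[k] = buf[i]; i += 1; k += 1
    end
    return nswaps
end
```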
No, but I will have a look soon.
An update. Inspired by the mergesort implementation in Julia Base (around line 581 of base/sort.jl) I have reduced memory usage, by which I mean total allocations reported by the benchmark output. I also experimented with the default value of SMALL_THRESHOLD, the length below which insertion sort is used instead of merge sort.

I also investigated the impact of the "density of repeats", as @nalimilan suggested, via a new test harness in the repo. Here are the results:
I think these results look good. The new code is consistently faster, by a factor of about 2.5 even when both vectors contain values restricted to a small range (so that repeats are very common).

One more thing: I have noted one deficiency of the existing corkendall.

What should my next steps be? It would be great to be able to contribute what seems to be an improved version of corkendall.

Philip
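A sketch of the kind of repeat-density comparison described above (illustrative only; it assumes the unregistered KendallTau package mentioned earlier is available, and the speedtest harness in the repo is more elaborate): draw both vectors from 1:k for several values of k, so that small k means a high density of repeats, check the two implementations agree, and time them.

```julia
using BenchmarkTools, StatsBase, KendallTau

for k in (2, 10, 100, 10_000)
    x, y = rand(1:k, 2_000), rand(1:k, 2_000)
    @assert KendallTau.corkendall(x, y) ≈ StatsBase.corkendall(x, y)
    told = @belapsed StatsBase.corkendall($x, $y)
    tnew = @belapsed KendallTau.corkendall($x, $y)
    println("k = ", k, ": speed ratio = ", round(told / tnew, digits = 2))
end
```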
Thanks for the detailed explanation and additional checks!
Well, I implemented two sorting functions: a merge sort, plus an insertion sort used for short (sub-)vectors. The upshot was a further speedup and quite a big drop in memory usage.

As for your idea of re-using the sort functions in Base: the difficulty is that, as written, they don't report the number of swaps, which is exactly what Knight's algorithm needs. So we would end up with at least one copied-pasted-edited function (bad).
I haven't thought too much about it, but something like this might work:

```julia
counter = 0

function myisless(x, y)
    x = isless(x, y)
    counter += x   # intended to be used as a closure, with `counter` captured
    return x
end
```
Ah, thanks, easier than I thought. But there's a but...

As of yesterday, Version 1:

```julia
"""
    insertionsort!(v::AbstractVector, lo::Integer, hi::Integer)

Mutates `v` by sorting elements `v[lo:hi]` using the insertionsort algorithm.
This method is a copy-paste-edit of sort! in base/sort.jl (the method specialised
on InsertionSortAlg), but amended to return the bubblesort distance.
"""
function insertionsort!(v::AbstractVector, lo::Integer, hi::Integer)
    if lo == hi return 0 end
    nswaps = 0
    @inbounds for i = lo + 1:hi
        j = i
        x = v[i]
        while j > lo
            if x < v[j - 1]
                nswaps += 1
                v[j] = v[j - 1]
                j -= 1
                continue
            end
            break
        end
        v[j] = x
    end
    return nswaps
end
```
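A quick sanity check of Version 1 (illustrative values only): the returned value should equal the number of inversions, i.e. the bubble-sort distance.

```julia
v = [3, 1, 2]
ninversions = sum(v[i] > v[j] for i in 1:length(v)-1 for j in i+1:length(v))  # 2
insertionsort!(v, 1, length(v)) == ninversions   # true, and v is now sorted
```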
So, following your suggestion, I switched to Version 2:

```julia
function insertionsort!(v::AbstractVector, lo::Integer, hi::Integer)
    nswaps = 0
    function myisless(x, y)
        x = isless(x, y)
        nswaps += x
        return x
    end
    sort!(view(v, lo:hi), alg=Base.Sort.InsertionSort, lt=myisless)
    return nswaps
end
```

Good news is that the Version 2 code passes all tests. Bad news is that the change kills performance: instead of being more than 6 times faster than StatsBase.corkendall, the new corkendall is only about 1.5 times faster.

Version 1:

```julia
julia> KendallTau.speedtest([StatsBase.corkendall,KendallTau.corkendall],2000,10)
###################################################################
Executing speedtest 2021-01-22T15:55:37.676
--------------------------------------------------
size(matrix1) = (2000, 10)
StatsBase.corkendall(matrix1)
34.889 ms (451 allocations: 5.54 MiB)
Main.KendallTau.corkendall(matrix1)
5.602 ms (298 allocations: 3.40 MiB)
Speed ratio Main.KendallTau.corkendall vs StatsBase.corkendall: 6.227438240753963
Ratio of memory allocated Main.KendallTau.corkendall vs StatsBase.corkendall: 0.6130525086357451
Results from all 2 functions identical? true
--------------------------------------------------
```
Version 2:

```julia
julia> KendallTau.speedtest([StatsBase.corkendall,KendallTau.corkendall],2000,10)
###################################################################
Executing speedtest 2021-01-22T15:51:10.97
--------------------------------------------------
size(matrix1) = (2000, 10)
StatsBase.corkendall(matrix1)
35.072 ms (451 allocations: 5.54 MiB)
KendallTau.corkendall(matrix1)
22.775 ms (294933 allocations: 7.89 MiB)
Speed ratio KendallTau.corkendall vs StatsBase.corkendall: 1.5399234042706402
Ratio of memory allocated KendallTau.corkendall vs StatsBase.corkendall: 1.4246610435616183
Results from all 2 functions identical? true
--------------------------------------------------
```
I don't understand why this performance drop-off happens, though I do remember reading that keyword arguments suffer a performance penalty in Julia - but I also thought that had been fixed. If you have any inspiration then let me know.

Otherwise I feel I'm getting to the point where I have done as good a job as I can of providing a better version of corkendall. Relative to the existing code, the new version is faster and allocates less memory. On the other hand, it is more code, including a copied-and-edited version of a sorting function from Base.

What are your thoughts?

Philip
Ah, that's due to nswaps being boxed, because it is captured and reassigned by the closure. Using a Ref instead avoids that:

```julia
function insertionsort!(v::AbstractVector, lo::Integer, hi::Integer)
    nswaps = Ref(0)
    function myisless(x, y)
        x = isless(x, y)
        nswaps[] += x
        return x
    end
    sort!(view(v, lo:hi), alg=Base.Sort.InsertionSort, lt=myisless)
    return nswaps[]
end
```
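A toy illustration of why the Ref helps (not the package code): a captured variable that is reassigned inside a closure gets wrapped in a Core.Box, which defeats type inference, whereas a Ref is only ever read and written through getindex/setindex!, so it stays concretely typed.

```julia
function count_boxed(v)
    c = 0
    f = x -> (c += 1; x)    # `c` is reassigned inside the closure, so it is boxed
    foreach(f, v)
    return c
end

function count_ref(v)
    c = Ref(0)
    f = x -> (c[] += 1; x)  # `c` itself is never rebound, so no boxing occurs
    foreach(f, v)
    return c[]
end
```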
Thank you for spotting that. I read about boxing ages ago, but had obviously forgotten. So if we call that version 3, we can compare against version 1:

```julia
using BenchmarkTools
data1 = rand(1:100,20)
data3 = copy(data1)
@btime insertionsort_v1!(data1,1,length(data1))
@btime insertionsort_v3!(data3,1,length(data3))
data1 == data3
```

```
57.699 ns (0 allocations: 0 bytes)
89.967 ns (1 allocation: 16 bytes)
true
```
So version 3 is about 1.5 times slower than version 1 when operating on 20-element vectors (which is where I have SMALL_THRESHOLD now, in line with base/sort.jl). It's not obvious how much that matters in terms of impact on the speed of corkendall itself, so I re-ran the speed test with version 1 and then with version 3. The result: a 6.5 times speedup (version 1) versus a 4 times speedup (version 3) on the basis of these test parameters. That needs to be set against your earlier reservation about copying code out of base/sort.jl.

FWIW my view is that the extra speed of version 1 justifies living with the copied-and-edited insertion sort.
Too bad. It would probably be possible to avoid the remaining overhead, but I think it's fair to assume that the insertion sort code in base/sort.jl is stable enough that a copied-and-edited version won't be a maintenance burden.

Can you please make a PR with your best implementation? Then I'll make more detailed comments. Just as a first remark, I've noticed several cases of type instabilities where a variable is initialized as an integer but may later hold a value of another type.
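For the record, a generic illustration (not taken from the PR) of the kind of instability meant: a variable that starts life as an Int but may later hold a Float64 cannot be inferred to a concrete type.

```julia
function unstable_sum(x)
    t = 0              # t is an Int here...
    for v in x
        t += v / 2     # ...but becomes a Float64 here, so its type is unstable
    end
    return t
end

function stable_sum(x)
    t = zero(eltype(x)) / 2   # initialise with the type t will actually hold
    for v in x
        t += v / 2
    end
    return t
end
```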
Might it be possible to add comments in base/sort.jl about the (slightly shameful) existence of copied code, as a reminder that enhancements could usefully be copied across? I noticed that you are the last person to commit to that file.

I have made changes to try to avoid type instability. The variables are still initialised as integers. As you suspected, those changes do provide some additional speed. That takes the final speed improvement to about x 4 if both arguments are vectors and x 7 if one or both are matrices. Quite pleasing...

I also investigated whether the correlation of the input vectors affects the relative speed of the two implementations.

I will make a PR, but one question: would it make sense for me to include my changes to test/rankcorr.jl? That's where I put my test that compares the new corkendall against a naive implementation.

Many thanks for the guidance you have given these last few days.

Philip
I'm afraid not. While StatsBase is kind of an essential Julia package, we can't add comments to the Julia codebase about all packages that would decide to copy parts of it.
Good question... I'm not completely sure, but I'd say it's simpler to hardcode the result that you know is correct for a series of input values that cover all common and corner cases. Thanks for working on this. |
Yes, on reflection much better to hard-code correct results. I needed to ensure that the vectors tested had lengths greater than SMALL_THRESHOLD, or else the merge sort code path would never be exercised.

I've just made the PR. My first to any open source Julia package...
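For example, a hard-coded test might look like this (illustrative values; the repeat factor is arbitrary, chosen so the vectors are comfortably longer than SMALL_THRESHOLD, and repeating both vectors leaves tau unchanged):

```julia
using Test, StatsBase

x = [1, 2, 3, 4, 5]
y = [5, 3, 4, 2, 5]
@test corkendall(x, y) ≈ -1 / sqrt(90)                           # short vectors: insertion sort path
@test corkendall(repeat(x, 100), repeat(y, 100)) ≈ -1 / sqrt(90) # long vectors: merge sort path
```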
I imagine you've seen this, but after I submitted my PR yesterday I received an email "CI: Some jobs were not successful" with a link to the failed run. I'm looking into what went wrong, and I do think there's a bug in the code.
Rookie error I'm afraid. The last line of my corkendall code does the final Knight's-formula arithmetic in integers. Taking the simple case where the numbers of ties are zero, that will fail through integer overflow once the vectors are long enough, because the product inside the square root exceeds typemax(Int). So the fix I propose is to change that line to do the arithmetic in floating point.

But I'm also concerned about the lines that accumulate the tie counts, which I think could cause errors on a 32-bit OS if an element appears more than about 64k times in one of the input vectors.

By the way, it was pure luck that I decided to test on vectors as long as 8,000 - what was in my mind was to test on vectors longer than 64 since that's where I have SMALL_THRESHOLD - i.e. I was thinking about an entirely different issue.
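To make the overflow concrete (a hypothetical illustration, using Int32 to stand in for Int on a 32-bit system):

```julia
n = 8_000
npairs = Int32(n) * Int32(n - 1) ÷ Int32(2)   # 31_996_000 pairs - still representable
npairs * npairs                                # wraps: the true value (~1.0e15) exceeds typemax(Int32)
sqrt(float(npairs) * float(npairs))            # doing the arithmetic in Float64 avoids the wrap
```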
I think I have it fixed now and will try another PR later today. I discovered that the existing StatsBase implementation of corkendall is also vulnerable to integer overflow:

```julia
x = [1,2,3,4,5];
y = [5,3,4,2,5];

julia> StatsBase.corkendall(x, y) ≈ -1 / sqrt(90)
true

julia> StatsBase.corkendall(repeat(x,9268), repeat(y,9268)) ≈ -1 / sqrt(90)
true

julia> StatsBase.corkendall(repeat(x,9269), repeat(y,9269)) ≈ -1 / sqrt(90)
ERROR: DomainError with -1.288340038e9:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(::Symbol, ::Float64) at .\math.jl:33
 [2] sqrt at .\math.jl:573 [inlined]
 [3] sqrt at .\math.jl:599 [inlined]
 [4] corkendall!(::Array{Float64,1}, ::Array{Float64,1}) at C:\Users\phili\.julia\packages\StatsBase\EA8Mh\src\rankcorr.jl:93
 [5] corkendall(::Array{Int32,1}, ::Array{Int32,1}) at C:\Users\phili\.julia\packages\StatsBase\EA8Mh\src\rankcorr.jl:103
 [6] top-level scope at REPL[183]:1
```
I need to calculate the Kendall correlation of an array of size approximately 1,000 x 10,000 and calculate some statistics of the resulting 10,000 x 10,000 correlation matrix.
The current implementation of corkendall, even though it uses Knight's algorithm, is quite slow in this case: I estimate about five hours on my PC. I hope I'll be able to get that time down using multiple threads. Should be straightforward, no? But first I had a go at improving the current not-threaded implementation and have achieved a 4 to 5 times speed improvement.

I think the tests could be beefed up a bit too. An obvious way to do that is to test against a very simple order N^2 implementation, as sketched below.
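A possible shape for that naive check (a sketch only, not part of any PR): count concordant, discordant and tied pairs directly and apply the tau-b formula.

```julia
# O(n^2) reference implementation of Kendall's tau-b, for testing only.
function corkendall_naive(x::AbstractVector, y::AbstractVector)
    n = length(x)
    nc = nd = tiesx = tiesy = 0
    for i in 1:n-1, j in i+1:n
        sx, sy = sign(x[j] - x[i]), sign(y[j] - y[i])
        if sx == 0 || sy == 0
            tiesx += sx == 0          # pair tied in x
            tiesy += sy == 0          # pair tied in y (a pair can be tied in both)
        elseif sx == sy
            nc += 1                   # concordant
        else
            nd += 1                   # discordant
        end
    end
    npairs = n * (n - 1) ÷ 2
    return (nc - nd) / sqrt((npairs - tiesx) * (npairs - tiesy))
end
```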
Would a pull request containing my suggested replacement of the current corkendall likely be accepted? If it would help, I could make my code available as a standalone package as an interim step.
Thanks,
Philip