-
Notifications
You must be signed in to change notification settings - Fork 11
Conversation
Replace all occurrences of the term and rename files accordingly. Also remove NEWS.md and update README.md with new links.
This reverts commit 15e5da6.
Nice! Any idea whether performance improves further using |
Cool! Since you have written benchmarks, it would be great to use the PkgBenchmark.jl framework and add them to the repo. WRT sorting, I would have a look at what e.g. dplyr does. I would think the number of groups is relatively small in general, so that sorting by default doesn't have a large performance impact? The case of categorical arrays is very special. A bit off-topic, but I wonder whether we should allow comparisons for any categorical arrays, ordered or not. Indeed, strings can be tested for order, so you're not protected for meaningless comparisons on nominal variables after importing them from CSV or a database either. If we do that, the ordered/unordered distinction would lose most of its interest, which would simplify the code a bit. I think my preferred solution would be to stop supporting |
The majority of the time is spent on sorting. If we drop sorting by default then the savings are 50-90% julia> head(small)
6×4 DataTables.DataTable
│ Row │ A │ B │ C │ D │
├─────┼───────┼────┼───┼───┤
│ 1 │ #NULL │ :a │ b │ b │
│ 2 │ #NULL │ :b │ b │ b │
│ 3 │ 4 │ :b │ b │ b │
│ 4 │ 2 │ :a │ b │ b │
│ 5 │ 2 │ :a │ a │ a │
│ 6 │ #NULL │ :a │ a │ a │
julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
memory estimate: 84.78 KiB
allocs estimate: 2022
--------------
minimum time: 229.404 μs (0.00% GC)
median time: 277.827 μs (0.00% GC)
mean time: 308.226 μs (4.49% GC)
maximum time: 5.708 ms (91.73% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(small, [:A, :B], sort=false)
BenchmarkTools.Trial:
memory estimate: 24.45 KiB
allocs estimate: 780
--------------
minimum time: 29.001 μs (0.00% GC)
median time: 33.321 μs (0.00% GC)
mean time: 39.536 μs (9.36% GC)
maximum time: 4.111 ms (97.13% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> head(large)
6×6 DataTables.DataTable
│ Row │ A │ B │ C │ D │ E │ F │
├─────┼──────────┼──────────┼──────────┼────────────┼──────────┼───────────┤
│ 1 │ 0.371919 │ 0.351571 │ 0.839413 │ 0.00928037 │ 0.637762 │ 0.911623 │
│ 2 │ 0.371919 │ 0.351571 │ 0.839413 │ 0.00928037 │ 0.637762 │ 0.552918 │
│ 3 │ 0.203329 │ 0.227815 │ 0.776044 │ 0.0780413 │ 0.326902 │ 0.0208977 │
│ 4 │ 0.203329 │ 0.227815 │ 0.776044 │ 0.0780413 │ 0.326902 │ 0.326447 │
│ 5 │ 0.219975 │ 0.645889 │ 0.890473 │ 0.638325 │ 0.772582 │ 0.691228 │
│ 6 │ 0.219975 │ 0.645889 │ 0.890473 │ 0.638325 │ 0.772582 │ 0.290886 │
julia> @benchmark groupby(large, [:A, :B])
BenchmarkTools.Trial:
memory estimate: 121.64 MiB
allocs estimate: 3626397
--------------
minimum time: 370.119 ms (7.40% GC)
median time: 392.355 ms (9.03% GC)
mean time: 396.227 ms (10.19% GC)
maximum time: 479.304 ms (26.35% GC)
--------------
samples: 13
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large, [:A, :B], sort=false)
BenchmarkTools.Trial:
memory estimate: 26.73 MiB
allocs estimate: 1193286
--------------
minimum time: 41.327 ms (9.48% GC)
median time: 47.171 ms (9.35% GC)
mean time: 47.492 ms (10.68% GC)
maximum time: 60.089 ms (17.36% GC)
--------------
samples: 106
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
memory estimate: 125.58 MiB
allocs estimate: 4912160
--------------
minimum time: 274.978 ms (10.81% GC)
median time: 296.645 ms (11.31% GC)
mean time: 300.621 ms (12.86% GC)
maximum time: 384.436 ms (32.78% GC)
--------------
samples: 17
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large, [:A, :B, :C, :D, :E], sort=false)
BenchmarkTools.Trial:
memory estimate: 57.25 MiB
allocs estimate: 2681427
--------------
minimum time: 80.231 ms (14.07% GC)
median time: 92.527 ms (16.50% GC)
mean time: 94.756 ms (18.43% GC)
maximum time: 164.173 ms (53.89% GC)
--------------
samples: 53
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00% not sorting also fixes the categorical issue. julia> df2 = DataTable(v1 = categorical(collect(1:1000)), v2 = categorical(fill(1, 1000)));
julia> groupby(df2, [:v1, :v2], sort=false)
DataTables.GroupedDataTable 1000 groups with keys: Symbol[:v1,:v2]
First Group:
1×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ v1 │ v2 │
├─────┼─────┼────┤
│ 1 │ 644 │ 1 │
⋮
Last Group:
1×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ v1 │ v2 │
├─────┼─────┼────┤
│ 1 │ 181 │ 1 │ I'm tracking the groups with a |
Now retains ordering and checks for categoricals julia> df2 = DataTable(v1 = categorical(collect(1:1000)), v2 = categorical(fill(1, 1000)));
julia> groupby(df2, [:v1, :v2])
DataTables.GroupedDataTable 1000 groups with keys: Symbol[:v1,:v2]
First Group:
1×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ v1 │ v2 │
├─────┼────┼────┤
│ 1 │ 1 │ 1 │
⋮
Last Group:
1×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ v1 │ v2 │
├─────┼──────┼────┤
│ 1 │ 1000 │ 1 │ |
I think that's a great idea but I'd like to make that a separate commit. I'd also like to make sure everything works before worrying about automated benchmarking.
Given the serious speedup of not sorting, I've swapped sorting to be off by default. What are everyone's thoughts on not sorting in groupby at all? I implemented sorting to maintain consistency with the current implementation, but @kprybol raised the valid point that because the user can pre-sort before groupby, sorting within groupby is functional redundancy and probably should not be part of this function
This too can be addressed by letting the user pre-sort the DataTable as they see fit, although I'm currently disabling sort if CategoricalArrays are present
|
current julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
memory estimate: 11.95 KiB
allocs estimate: 185
--------------
minimum time: 37.121 μs (0.00% GC)
median time: 43.135 μs (0.00% GC)
mean time: 44.989 μs (2.79% GC)
maximum time: 2.659 ms (97.18% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large, [:A, :B])
BenchmarkTools.Trial:
memory estimate: 1.26 GiB
allocs estimate: 388362
--------------
minimum time: 1.227 s (13.77% GC)
median time: 1.288 s (19.05% GC)
mean time: 1.277 s (18.02% GC)
maximum time: 1.303 s (19.08% GC)
--------------
samples: 4
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
ERROR: InexactError()
in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at ./array.jl:415
in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:146
in ##core#281() at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:283
in ##sample#282(::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:289
in #_run#3(::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:317
in (::BenchmarkTools.#kw##_run)(::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at ./<missing>:0
in anonymous at ./<missing>:?
in #run_result#16(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:33
in (::BenchmarkTools.#kw##run_result)(::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at ./<missing>:0
in #run#17(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:36
in (::Base.#kw##run)(::Array{Any,1}, ::Base.#run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at ./<missing>:0
in warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:71 proposed julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
memory estimate: 19.86 KiB
allocs estimate: 670
--------------
minimum time: 24.267 μs (0.00% GC)
median time: 27.229 μs (0.00% GC)
mean time: 32.302 μs (8.62% GC)
maximum time: 3.849 ms (97.77% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large, [:A, :B])
BenchmarkTools.Trial:
memory estimate: 23.51 MiB
allocs estimate: 1032444
--------------
minimum time: 36.850 ms (7.93% GC)
median time: 40.266 ms (9.30% GC)
mean time: 40.605 ms (9.79% GC)
maximum time: 55.644 ms (8.94% GC)
--------------
samples: 124
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
memory estimate: 50.27 MiB
allocs estimate: 2340416
--------------
minimum time: 75.170 ms (10.23% GC)
median time: 81.381 ms (13.28% GC)
mean time: 83.314 ms (14.56% GC)
maximum time: 138.363 ms (49.04% GC)
--------------
samples: 60
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00% just removed: sorting within groupby (as it was a few commits ago) julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
memory estimate: 125.58 MiB
allocs estimate: 4912160
--------------
minimum time: 274.978 ms (10.81% GC)
median time: 296.645 ms (11.31% GC)
mean time: 300.621 ms (12.86% GC)
maximum time: 384.436 ms (32.78% GC)
--------------
samples: 17
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00% a reason why it was removed: faster to presort and feed to non-sorting groupby julia> @benchmark groupby(sort(large), [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
memory estimate: 93.92 MiB
allocs estimate: 4158227
--------------
minimum time: 112.816 ms (10.97% GC)
median time: 120.869 ms (14.16% GC)
mean time: 122.530 ms (15.05% GC)
maximum time: 186.425 ms (46.27% GC)
--------------
samples: 41
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00% |
src/groupeddatatable/grouping.jl
Outdated
row = DataTableRow(intersect, i) | ||
if !haskey(mappings, row) | ||
mappings[row] = [i] | ||
elseif haskey(mappings, row) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The haskey
check is redundant.
test/grouping.jl
Outdated
@@ -38,7 +38,7 @@ module TestGrouping | |||
@test groupby(df2, [:v2, :v1]).starts == collect(1:1000) | |||
|
|||
# grouping empty frame | |||
@test groupby(DataTable(A=Int[]), :A).starts == Int[] | |||
# @test groupby(DataTable(A=Int[]), :A).starts == Int[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you comment out these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This re-write doesn't support empty DataTables
julia> groupby(DataTable(A=Int[]), :A).starts
ERROR: BoundsError: attempt to access 0-element Array{Array{Int64,1},1} at index [1]
in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:136
in groupby(::DataTables.DataTable, ::Symbol) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:147
I could add a check at the top of the function if nrow(d) == 0...
but I don't see any reason to. Throwing an error seems like a good idea here. If someone is trying to groupby
a DataTable then they probably also expect that DataTable to have some values in it.
end | ||
(idx, starts) = groupsort_indexer(x, ngroups) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also remove the function above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is also used in join
. Should we move the function to that file since it's no longer used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be more logical, yes.
How is it possible that sorting beforehand is much faster? Doesn't that mean the sorting implementation was sub-optimal? I would have thought sorting a Anyway, I would keep the keyword argument even if set to false by default: with real data sets, you have much more variables, and it's annoying to have to repeat the list of variables according to which sorting must be done. Even though dplyr does not seem to support this, Pandas does, and sorts by default. As regards unordered categorical arrays, I'd be inclined to support Could you say a word about why your implementation is so much faster than the old one? Is that still the case when grouping on categorical arrays? I'm wondering because the code for |
redundant elseif check
Yes, the sorting implementation I wrote was definitely suboptimal. I also don't know how to speed it up (besides just pre-sorting). The issue is that DataTables don't keep track of the original row index (to my knowledge), so when I sort each group I have to make a copy of the DataTable subset (the group), add a new column to keep track of the original rows, and then I use that column post-sort to re-order my indices. Here's an example that shows the sorting process (you'll have to imagine it's a real group). julia> d = DataTable(A = collect('z':-1:'A'));
julia> temp = head(d)
6×1 DataTables.DataTable
│ Row │ A │
├─────┼─────┤
│ 1 │ 'z' │
│ 2 │ 'y' │
│ 3 │ 'x' │
│ 4 │ 'w' │
│ 5 │ 'v' │
│ 6 │ 'u' │
# let's pretend we have a group and the members of that group are at these rows
julia> group_indices = [1, 5, 8, 10]
4-element Array{Int64,1}:
1
5
8
10
julia> temp = d[group_indices, :]
4×1 DataTables.DataTable
│ Row │ A │
├─────┼─────┤
│ 1 │ 'z' │
│ 2 │ 'v' │
│ 3 │ 's' │
│ 4 │ 'q' │
# I need to make a new column to track order, which I assume is making a copy of the DataTable
julia> temp[:rowid] = collect(1:nrow(temp))
4-element Array{Int64,1}:
1
2
3
4
julia> sort(temp)
4×2 DataTables.DataTable
│ Row │ A │ rowid │
├─────┼─────┼───────┤
│ 1 │ 'q' │ 4 │
│ 2 │ 's' │ 3 │
│ 3 │ 'v' │ 2 │
│ 4 │ 'z' │ 1 │
# I can now use the temporary column to reorder my indices and these get passed
# to the GroupedDataFrame constructor
julia> group_indices = group_indices[sort(temp)[:rowid]]
4-element Array{Int64,1}:
10
8
5
1 So, sorting groups during the function is VERY inefficient. And we cannot sort the grouped DataTable julia> small = DataTable(A = a, B = b, C = c, D = c);
julia> groupby(small, [:A, :B])
DataTables.GroupedDataTable 7 groups with keys: Symbol[:A,:B]
First Group:
3×4 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ A │ B │ C │ D │
├─────┼───────┼────┼───┼───┤
│ 1 │ #NULL │ :a │ b │ b │
│ 2 │ #NULL │ :a │ a │ a │
│ 3 │ #NULL │ :a │ b │ b │
⋮
Last Group:
1×4 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ A │ B │ C │ D │
├─────┼───┼────┼───┼───┤
│ 1 │ 2 │ :b │ b │ b │
julia> sort(ans)
ERROR: MethodError: no method matching sort(::DataTables.GroupedDataTable)
Closest candidates are:
sort(::AbstractUnitRange{T}) at range.jl:843
sort(::Range{T}) at range.jl:846
sort{Tv,Ti}(::SparseVector{Tv,Ti}; kws...) at sparse/sparsevector.jl:1716 There might be other options for how to sort within We could
I rewrote julia> A = repeat(rand(10), inner=1000);
julia> B = repeat(rand(10), inner=1000);
julia> C = repeat(rand(10), inner=1000);
julia> D = repeat(rand(10), inner=1000);
julia> E = repeat(rand(10), inner=1000);
julia> F = repeat(rand(10000));
julia> large = DataTable(A = A, B = B, C = C, D = D, E = E, F = F);
julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
memory estimate: 23.44 MiB
allocs estimate: 1094488
--------------
minimum time: 28.802 ms (8.28% GC)
median time: 30.470 ms (8.21% GC)
mean time: 31.400 ms (8.80% GC)
maximum time: 37.249 ms (8.10% GC)
--------------
samples: 160
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
Pandas sorts by default but it also does not automatically reset the row indices, which is precisely the limitation that prevents sorting from being efficient within my Python 3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul 2 2016, 17:52:12)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import Pandas as pd
>>> import numpy as np
>>> pd.DataFrame(np.random.randn(6,4))
0 1 2 3
0 1.617402 1.279826 -1.272431 0.786087
1 -1.637442 1.152224 0.267500 -1.536474
2 0.969473 0.168263 1.022955 -0.769514
3 -0.386099 -0.493472 -1.106044 1.284097
4 -0.712302 1.951609 1.714415 -0.448146
5 1.586177 2.684248 -1.141071 2.184125
>>> d.index
RangeIndex(start=0, stop=6, step=1)
>>> d.sort(0)
0 1 2 3
0 -1.451303 -1.077141 -0.769869 0.244627
2 -1.324284 0.742099 -1.347156 2.949335
5 -0.114870 -0.156239 -0.435413 -0.980006
1 -0.020488 -0.791392 -0.412640 -2.084760
3 1.026864 1.214448 -0.279307 0.911531
4 1.987862 2.016643 1.352183 1.010149
>>> d.sort(0).index
Int64Index([0, 2, 5, 1, 3, 4], dtype='int64')
This sounds great. I agree that supporting a reasonable amount of sorting functionality of levels within
I can't speak as to why this is faster than the Pandas algorithm because I don't quite understand the Pandas algorithm. Maybe @johnmyleswhite could be more helpful? It seems that this implementation is performing fewer calculations than the Pandas algorithm. It only looks at each piece of relevant data once and keeps track of no more than is required to properly generate a call to |
As a humbling comparison, here's the same thing benchmarked in Python. import pandas as pd
import numpy as np
a = np.random.choice([1, 2, np.NaN, 4], 20)
b = np.random.choice(["a", "b"], 20)
c = np.random.choice([True, False], 20)
d = np.random.choice(np.random.rand(2), 20)
A = np.repeat(np.random.rand(8891), 2)
B = np.repeat(np.random.rand(8891), 2)
C = np.repeat(np.random.rand(8891), 2)
D = np.repeat(np.random.rand(8891), 2)
E = np.repeat(np.random.rand(8891), 2)
F = np.random.rand(17782)
small = pd.DataFrame({'A' : a, 'B' : b, 'C' : c, 'D' : d})
large = pd.DataFrame({'A' : A, 'B' : B, 'C' : C, 'D' : D, 'E' : E, 'F' : F})
small.head()
large.head() results In [11]: %timeit small.groupby(['A', 'B'])
The slowest run took 44.56 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 62.5 µs per loop
In [12]: %timeit large.groupby(['A', 'B'])
The slowest run took 6.38 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 62.4 µs per loop
In [13]: %timeit large.groupby(['A', 'B', 'C', 'D', 'E'])
The slowest run took 4.45 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 117 µs per loop so we probably should use the Pandas approach again in the future, but I think this rewrite is a suitable stepping stone to utilize for the time being |
Given the very good Pandas timings, I'm reluctant to move away from their algorithm without a deeper examination of what they are doing. Could you have a look at their code, and in particular at how they work around the limitation on the number of groups? I've looked a bit in their code base, and I finally found this function which seems to match our current algorithm. Looks like there's logic to stop before the number of groups gets too large, by dropping empty levels and starting again, but I haven't investigated enough to understand how that works.
What makes the Pandas algorithm more efficient is likely that for each row, only an array indexing operation is needed (repeated for each column), which is much cheaper than a dictionary lookup. The number of calculations does not mean much if you don't take into account how costly they are. That's why I asked for a benchmark when all columns are categorical arrays: in that case, the integer codes have already been computed and no dictionary lookup is needed at all. I would be surprised if your algorithm is faster than the current one in that case -- that would be an indication that something is clearly wrong in that code. When the input is not a categorical array, the existing code creates a new one, which is costly (and which I haven't benchmarked seriously yet, so could be a bottleneck). It would also be interesting to know what Pandas does for e.g. string columns: does it perform the conversion as we do, or does it use a different code path? |
The broken link in the existing function I think is supposed to reference this page which describes the high-level concepts of the algorithm design. Should help us make sense of the actual code.
Sorry, I missed that you had asked for that benchmark. The current implementation is the same speed on an ordered test and slower/broken for unordered test, although that may not be a fair benchmark because the rewrite doesn't sort. (and I can't code using DataTables
using BenchmarkTools
small_ordered = DataTable(v1 = categorical(collect(1:1000)), v2 = categorical(fill(1, 1000)));
small_unordered = DataTable(v1 = categorical(rand(1000)), v2 = categorical(rand(1000)));
large_ordered = DataTable(v1 = categorical(collect(1:100000)), v2 = categorical(fill(1, 100000)));
large_unordered = DataTable(v1 = categorical(rand(100000)), v2 = categorical(rand(100000)));
wide_ordered = DataTable(v1 = categorical(collect(1:100000)),
v2 = categorical(fill(1, 100000)),
v3 = categorical(collect(1:100000)),
v4 = categorical(collect(1:100000)));
@benchmark groupby(small_ordered, [:v1, :v2])
@benchmark groupby(small_unordered, [:v1, :v2])
@benchmark groupby(large_ordered, [:v1, :v2])
@benchmark groupby(large_unordered, [:v1, :v2])
@benchmark groupby(wide_ordered, [:v1, :v2, :v3, :v4]) current julia> @benchmark groupby(small_ordered, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 329.38 KiB
allocs estimate: 9412
--------------
minimum time: 1.826 ms (0.00% GC)
median time: 1.961 ms (0.00% GC)
mean time: 2.071 ms (2.09% GC)
maximum time: 7.185 ms (49.60% GC)
--------------
samples: 2406
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(small_unordered, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 16.72 MiB
allocs estimate: 15905
--------------
minimum time: 10.202 ms (0.00% GC)
median time: 12.068 ms (14.76% GC)
mean time: 11.990 ms (12.14% GC)
maximum time: 16.225 ms (13.57% GC)
--------------
samples: 417
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large_ordered, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 44.40 MiB
allocs estimate: 1890360
--------------
minimum time: 213.324 ms (6.65% GC)
median time: 223.528 ms (7.69% GC)
mean time: 223.929 ms (7.53% GC)
maximum time: 240.368 ms (7.97% GC)
--------------
samples: 23
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large_unordered, [:v1, :v2])
ERROR: InexactError()
in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at ./array.jl:415
in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:146
in ##core#284() at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:283
in ##sample#285(::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:289
in #_run#4(::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:317
in (::BenchmarkTools.#kw##_run)(::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at ./<missing>:0
in anonymous at ./<missing>:?
in #run_result#16(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:33
in (::BenchmarkTools.#kw##run_result)(::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at ./<missing>:0
in #run#17(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:36
in (::Base.#kw##run)(::Array{Any,1}, ::Base.#run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at ./<missing>:0
in warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:71
julia> @benchmark groupby(wide_ordered, [:v1, :v2, :v3, :v4])
ERROR: InexactError()
in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at ./array.jl:415
in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:146
in ##core#287() at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:283
in ##sample#288(::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:289
in #_run#5(::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:317
in (::BenchmarkTools.#kw##_run)(::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at ./<missing>:0
in anonymous at ./<missing>:?
in #run_result#16(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:33
in (::BenchmarkTools.#kw##run_result)(::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at ./<missing>:0
in #run#17(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:36
in (::Base.#kw##run)(::Array{Any,1}, ::Base.#run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at ./<missing>:0
in warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:71 proposed note the addition of julia> @benchmark groupby(small_ordered, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 923.17 KiB
allocs estimate: 34720
--------------
minimum time: 1.093 ms (0.00% GC)
median time: 1.202 ms (0.00% GC)
mean time: 1.448 ms (12.50% GC)
maximum time: 8.744 ms (79.89% GC)
--------------
samples: 3432
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(small_unordered, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 939.94 KiB
allocs estimate: 35487
--------------
minimum time: 1.203 ms (0.00% GC)
median time: 1.328 ms (0.00% GC)
mean time: 1.575 ms (11.69% GC)
maximum time: 8.675 ms (81.17% GC)
--------------
samples: 3156
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large_ordered, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 104.09 MiB
allocs estimate: 4357825
--------------
minimum time: 220.576 ms (15.41% GC)
median time: 280.141 ms (24.22% GC)
mean time: 301.521 ms (31.36% GC)
maximum time: 394.553 ms (43.66% GC)
--------------
samples: 17
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(large_unordered, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 104.07 MiB
allocs estimate: 4357201
--------------
minimum time: 239.889 ms (13.99% GC)
median time: 304.983 ms (24.08% GC)
mean time: 325.885 ms (29.55% GC)
maximum time: 432.313 ms (43.14% GC)
--------------
samples: 16
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark groupby(wide_ordered, [:v1, :v2, :v3, :v4])
BenchmarkTools.Trial:
memory estimate: 175.00 MiB
allocs estimate: 7968344
--------------
minimum time: 399.089 ms (17.33% GC)
median time: 441.455 ms (21.55% GC)
mean time: 463.932 ms (25.80% GC)
maximum time: 573.032 ms (37.77% GC)
--------------
samples: 11
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00% One more point I'd like to bring into the discussion is where And leaving this here for future reference: Pandas benchmarks |
Thanks for the link. It indeed seems to explain how it works (though I still have to read it more carefully). This part in particular seems to imply we really need the complex logic to be fast:
To sort categorical arrays, for now just pass Depending on how generic we can make them, the grouping features may end up in AbstractTables or in a query package, but I'm not sure that will be possible. Anyway for now having a good implementation in this package would be a great step. |
In recognition that this rewrite is an improvement over the current implementation (it doesn't break), but also is inferior to the ideal implementation (a better translation of the Pandas algorithm) what are your thoughts on merging this and opening an issue to improve further? |
I think we need to be careful, because the slowness of the current implementation is suspicious. Indeed, a quick investigation showed that it is type-unstable. With #12, I get a much better performance when all columns are categorical arrays; the improvement is more limited with other column types, but it's no longer so clear that this PR's implementation would have the same performance. |
Ah, and I forgot there was JuliaData/DataFrames.jl#850 (see in particular these lines). |
Looks like there are several ways we can improve upon the current groupby! I've been looking around for papers on the topic to see if any consensus is out there on what the best algorithm/approach is. Ideally, that paper would have the pseudocode to make implementing it easy as well. I'm still looking. I've got family in town and won't be able to review this in detail for a few days. I'll follow along with any progress and jump back in next week to learn more about JuliaData/DataFrames.jl#850 and #12 work by stepping through the code and benchmarking. In general, I'm ok with any approach that alleviates failures on large numbers of groups |
It would be great if you can find such a paper, but if not I would think Pandas approach is a good model to follow. |
n = length(x) | ||
# counts = x.pool | ||
counts = fill(0, ngroups + 1) | ||
for i = 1:n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to @inbounds
this loop (and the others in this function)
end | ||
|
||
# mark the start of each contiguous group of like-indexed data | ||
where = fill(1, ngroups + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how much it really matters, but it might be best to avoid using where
as a variable name since it has syntactic meaning in 0.6.
Closing since #17 replaced this. |
This pull request addresses JuliaData/DataFrames.jl#985 and also offers some significant performance gains as the size of the DataTable increases. It's about an order of magnitude slower on small DataTables.
test code used to compare implementations
current implementation
proposed rewrite
of note, this change results in more errors because categorial DataTables cannot be sorted. That may be something to handle seperately, but I thought I'd ask for feedback on this before moving forward. I think we could have a keyword variable
sort = true
, which the user can turn off to save runtime. We can also detect which columns are categoricals and avoid sorting those columns.Thanks for any feedback!