Skip to content
This repository was archived by the owner on May 5, 2019. It is now read-only.

rewrite groupby #3

Closed
wants to merge 14 commits into from
Closed

rewrite groupby #3

wants to merge 14 commits into from

Conversation

cjprybol
Copy link
Contributor

This pull request addresses JuliaData/DataFrames.jl#985 and also offers some significant performance gains as the size of the DataTable increases. It's about an order of magnitude slower on small DataTables.

test code used to compare implementations

using DataTables
using BenchmarkTools

srand(1);
a = rand([1, 2, Nullable(), 4], 20);
b = rand([:a, :b], 20);
c = rand(["a", "b"], 20);
d = rand(rand(2), 20);

A = repeat(rand(8891), inner=2);
B = repeat(rand(8891), inner=2);
C = repeat(rand(8891), inner=2);
D = repeat(rand(8891), inner=2);
E = repeat(rand(8891), inner=2);
F = repeat(rand(17782));
small = DataTable(A = a, B = b, C = c, D = c);
large = DataTable(A = A, B = B, C = C, D = D, E = E, F = F);

head(small)
@benchmark groupby(small, [:A, :B])
head(large)
@benchmark groupby(large, [:A, :B])
@benchmark groupby(large, [:A, :B, :C, :D, :E])

current implementation

julia> head(small)
6×4 DataTables.DataTable
│ Row │ A     │ B  │ C │ D │
├─────┼───────┼────┼───┼───┤
│ 1#NULL │ :a │ b │ b │2#NULL │ :b │ b │ b │34:b │ b │ b │
│ 42:a │ b │ b │
│ 52:a │ a │ a │
│ 6#NULL │ :a │ a │ a │

julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  11.95 KiB
  allocs estimate:  185
  --------------
  minimum time:     37.121 μs (0.00% GC)
  median time:      43.135 μs (0.00% GC)
  mean time:        44.989 μs (2.79% GC)
  maximum time:     2.659 ms (97.18% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> head(large)
6×6 DataTables.DataTable
│ Row │ A        │ B        │ C        │ D          │ E        │ F         │
├─────┼──────────┼──────────┼──────────┼────────────┼──────────┼───────────┤
│ 10.3719190.3515710.8394130.009280370.6377620.911623  │
│ 20.3719190.3515710.8394130.009280370.6377620.552918  │
│ 30.2033290.2278150.7760440.07804130.3269020.0208977 │
│ 40.2033290.2278150.7760440.07804130.3269020.326447  │
│ 50.2199750.6458890.8904730.6383250.7725820.691228  │
│ 60.2199750.6458890.8904730.6383250.7725820.290886  │

julia> @benchmark groupby(large, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  1.26 GiB
  allocs estimate:  388362
  --------------
  minimum time:     1.227 s (13.77% GC)
  median time:      1.288 s (19.05% GC)
  mean time:        1.277 s (18.02% GC)
  maximum time:     1.303 s (19.08% GC)
  --------------
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
ERROR: InexactError()
 in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at ./array.jl:415
 in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
 in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:146
 in ##core#281() at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:283
 in ##sample#282(::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:289
 in #_run#3(::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:317
 in (::BenchmarkTools.#kw##_run)(::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in anonymous at ./<missing>:?
 in #run_result#16(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:33
 in (::BenchmarkTools.#kw##run_result)(::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in #run#17(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:36
 in (::Base.#kw##run)(::Array{Any,1}, ::Base.#run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:71

proposed rewrite

julia> head(small)
6×4 DataTables.DataTable
│ Row │ A     │ B  │ C │ D │
├─────┼───────┼────┼───┼───┤
│ 1#NULL │ :a │ b │ b │2#NULL │ :b │ b │ b │34:b │ b │ b │
│ 42:a │ b │ b │
│ 52:a │ a │ a │
│ 6#NULL │ :a │ a │ a │

julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  84.78 KiB
  allocs estimate:  2022
  --------------
  minimum time:     216.266 μs (0.00% GC)
  median time:      248.696 μs (0.00% GC)
  mean time:        268.508 μs (4.96% GC)
  maximum time:     6.397 ms (91.62% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> head(large)
6×6 DataTables.DataTable
│ Row │ A        │ B        │ C        │ D          │ E        │ F         │
├─────┼──────────┼──────────┼──────────┼────────────┼──────────┼───────────┤
│ 10.3719190.3515710.8394130.009280370.6377620.911623  │
│ 20.3719190.3515710.8394130.009280370.6377620.552918  │
│ 30.2033290.2278150.7760440.07804130.3269020.0208977 │
│ 40.2033290.2278150.7760440.07804130.3269020.326447  │
│ 50.2199750.6458890.8904730.6383250.7725820.691228  │
│ 60.2199750.6458890.8904730.6383250.7725820.290886  │

julia> @benchmark groupby(large, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  121.64 MiB
  allocs estimate:  3626397
  --------------
  minimum time:     341.754 ms (6.99% GC)
  median time:      360.711 ms (8.74% GC)
  mean time:        365.040 ms (9.95% GC)
  maximum time:     434.053 ms (24.62% GC)
  --------------
  samples:          14
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
  memory estimate:  125.58 MiB
  allocs estimate:  4912160
  --------------
  minimum time:     260.275 ms (9.85% GC)
  median time:      283.596 ms (11.17% GC)
  mean time:        289.176 ms (12.46% GC)
  maximum time:     352.749 ms (12.26% GC)
  --------------
  samples:          18
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

of note, this change results in more errors because categorial DataTables cannot be sorted. That may be something to handle seperately, but I thought I'd ask for feedback on this before moving forward. I think we could have a keyword variable sort = true, which the user can turn off to save runtime. We can also detect which columns are categoricals and avoid sorting those columns.

julia> df2 = DataTable(v1 = categorical(collect(1:1000)), v2 = categorical(fill(1, 1000)));

julia> head(df2)
6×2 DataTables.DataTable
│ Row │ v1 │ v2 │
├─────┼────┼────┤
│ 111  │
│ 221  │
│ 331  │
│ 441  │
│ 551  │
│ 661  │

julia> groupby(df2, [:v1, :v2]).starts
ERROR: Unordered CategoricalValue objects cannot be tested for order; use the ordered! function on the parent array to change this
 in isless(::CategoricalArrays.CategoricalValue{Int64,UInt32}, ::CategoricalArrays.CategoricalValue{Int64,UInt32}) at /Users/Cameron/.julia/v0.5/CategoricalArrays/src/value.jl:69
 in isless(::Nullable{CategoricalArrays.CategoricalValue{Int64,UInt32}}, ::Nullable{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /Users/Cameron/.julia/v0.5/NullableArrays/src/operators.jl:154
 in lt(::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}, ::Int64, ::Int64) at /Users/Cameron/.julia/v0.5/DataTables/src/abstractdatatable/sort.jl:89
 in sort!(::Array{Int64,1}, ::Int64, ::Int64, ::Base.Sort.InsertionSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}) at ./sort.jl:223
 in sort!(::Array{Int64,1}, ::Int64, ::Int64, ::Base.Sort.MergeSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}, ::Array{Int64,1}) at ./sort.jl:312
 in sort!(::Array{Int64,1}, ::Int64, ::Int64, ::Base.Sort.MergeSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}, ::Array{Int64,1}) at ./sort.jl:317 (repeats 6 times)
 in sort!(::Array{Int64,1}, ::Base.Sort.MergeSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}) at ./sort.jl:406
 in sort!(::DataTables.DataTable, ::Base.Sort.MergeSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}) at /Users/Cameron/.julia/v0.5/DataTables/src/datatable/sort.jl:13
 in #sort!#117(::Array{Symbol,1}, ::Void, ::Function, ::Function, ::Bool, ::Base.Order.ForwardOrdering, ::Function, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.5/DataTables/src/datatable/sort.jl:9
 in (::Base.#kw##sort!)(::Array{Any,1}, ::Base.#sort!, ::DataTables.DataTable) at ./<missing>:0
 in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:141

julia> sort!(df2)
ERROR: Unordered CategoricalValue objects cannot be tested for order; use the ordered! function on the parent array to change this
 in isless(::CategoricalArrays.CategoricalValue{Int64,UInt32}, ::CategoricalArrays.CategoricalValue{Int64,UInt32}) at /Users/Cameron/.julia/v0.5/CategoricalArrays/src/value.jl:69
 in isless(::Nullable{CategoricalArrays.CategoricalValue{Int64,UInt32}}, ::Nullable{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /Users/Cameron/.julia/v0.5/NullableArrays/src/operators.jl:154
 in lt(::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}, ::Int64, ::Int64) at /Users/Cameron/.julia/v0.5/DataTables/src/abstractdatatable/sort.jl:89
 in sort!(::Array{Int64,1}, ::Int64, ::Int64, ::Base.Sort.InsertionSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}) at ./sort.jl:223
 in sort!(::Array{Int64,1}, ::Int64, ::Int64, ::Base.Sort.MergeSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}, ::Array{Int64,1}) at ./sort.jl:312
 in sort!(::Array{Int64,1}, ::Int64, ::Int64, ::Base.Sort.MergeSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}, ::Array{Int64,1}) at ./sort.jl:317 (repeats 6 times)
 in sort!(::Array{Int64,1}, ::Base.Sort.MergeSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}) at ./sort.jl:406
 in sort!(::DataTables.DataTable, ::Base.Sort.MergeSortAlg, ::DataTables.DFPerm{Base.Order.ForwardOrdering,DataTables.DataTable}) at /Users/Cameron/.julia/v0.5/DataTables/src/datatable/sort.jl:13
 in #sort!#117(::Array{Any,1}, ::Void, ::Function, ::Function, ::Bool, ::Base.Order.ForwardOrdering, ::Function, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.5/DataTables/src/datatable/sort.jl:9
 in sort!(::DataTables.DataTable) at /Users/Cameron/.julia/v0.5/DataTables/src/datatable/sort.jl:3

Thanks for any feedback!

nalimilan and others added 5 commits February 11, 2017 15:59
Replace all occurrences of the term and rename files accordingly.
Also remove NEWS.md and update README.md with new links.
@cjprybol cjprybol changed the title rewrite groupby [WIP] rewrite groupby Feb 11, 2017
@ararslan
Copy link
Member

Nice! Any idea whether performance improves further using @inbounds as appropriate?

@nalimilan
Copy link
Member

Cool! Since you have written benchmarks, it would be great to use the PkgBenchmark.jl framework and add them to the repo.

WRT sorting, I would have a look at what e.g. dplyr does. I would think the number of groups is relatively small in general, so that sorting by default doesn't have a large performance impact?

The case of categorical arrays is very special. isless throws an error when for unordered arrays to protect users from meaningless comparisons. Yet, for presentation purposes, the order of levels makes a lot of sense. So you could just special case them. In general I think you should wrap the call to sort inside a try... catch block so that using a type which cannot be sorted does make everything fail.

A bit off-topic, but I wonder whether we should allow comparisons for any categorical arrays, ordered or not. Indeed, strings can be tested for order, so you're not protected for meaningless comparisons on nominal variables after importing them from CSV or a database either. If we do that, the ordered/unordered distinction would lose most of its interest, which would simplify the code a bit. I think my preferred solution would be to stop supporting < on strings...

@cjprybol
Copy link
Contributor Author

The majority of the time is spent on sorting. If we drop sorting by default then the savings are 50-90%

julia> head(small)
6×4 DataTables.DataTable
│ Row │ A     │ B  │ C │ D │
├─────┼───────┼────┼───┼───┤
│ 1#NULL │ :a │ b │ b │2#NULL │ :b │ b │ b │34:b │ b │ b │
│ 42:a │ b │ b │
│ 52:a │ a │ a │
│ 6#NULL │ :a │ a │ a │

julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  84.78 KiB
  allocs estimate:  2022
  --------------
  minimum time:     229.404 μs (0.00% GC)
  median time:      277.827 μs (0.00% GC)
  mean time:        308.226 μs (4.49% GC)
  maximum time:     5.708 ms (91.73% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(small, [:A, :B], sort=false)
BenchmarkTools.Trial:
  memory estimate:  24.45 KiB
  allocs estimate:  780
  --------------
  minimum time:     29.001 μs (0.00% GC)
  median time:      33.321 μs (0.00% GC)
  mean time:        39.536 μs (9.36% GC)
  maximum time:     4.111 ms (97.13% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> head(large)
6×6 DataTables.DataTable
│ Row │ A        │ B        │ C        │ D          │ E        │ F         │
├─────┼──────────┼──────────┼──────────┼────────────┼──────────┼───────────┤
│ 10.3719190.3515710.8394130.009280370.6377620.911623  │
│ 20.3719190.3515710.8394130.009280370.6377620.552918  │
│ 30.2033290.2278150.7760440.07804130.3269020.0208977 │
│ 40.2033290.2278150.7760440.07804130.3269020.326447  │
│ 50.2199750.6458890.8904730.6383250.7725820.691228  │
│ 60.2199750.6458890.8904730.6383250.7725820.290886  │

julia> @benchmark groupby(large, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  121.64 MiB
  allocs estimate:  3626397
  --------------
  minimum time:     370.119 ms (7.40% GC)
  median time:      392.355 ms (9.03% GC)
  mean time:        396.227 ms (10.19% GC)
  maximum time:     479.304 ms (26.35% GC)
  --------------
  samples:          13
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large, [:A, :B], sort=false)
BenchmarkTools.Trial:
  memory estimate:  26.73 MiB
  allocs estimate:  1193286
  --------------
  minimum time:     41.327 ms (9.48% GC)
  median time:      47.171 ms (9.35% GC)
  mean time:        47.492 ms (10.68% GC)
  maximum time:     60.089 ms (17.36% GC)
  --------------
  samples:          106
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
  memory estimate:  125.58 MiB
  allocs estimate:  4912160
  --------------
  minimum time:     274.978 ms (10.81% GC)
  median time:      296.645 ms (11.31% GC)
  mean time:        300.621 ms (12.86% GC)
  maximum time:     384.436 ms (32.78% GC)
  --------------
  samples:          17
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large, [:A, :B, :C, :D, :E], sort=false)
BenchmarkTools.Trial:
  memory estimate:  57.25 MiB
  allocs estimate:  2681427
  --------------
  minimum time:     80.231 ms (14.07% GC)
  median time:      92.527 ms (16.50% GC)
  mean time:        94.756 ms (18.43% GC)
  maximum time:     164.173 ms (53.89% GC)
  --------------
  samples:          53
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

not sorting also fixes the categorical issue.

julia> df2 = DataTable(v1 = categorical(collect(1:1000)), v2 = categorical(fill(1, 1000)));

julia> groupby(df2, [:v1, :v2], sort=false)
DataTables.GroupedDataTable  1000 groups with keys: Symbol[:v1,:v2]
First Group:
1×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ v1  │ v2 │
├─────┼─────┼────┤
│ 16441
Last Group:
1×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ v1  │ v2 │
├─────┼─────┼────┤
│ 11811

I'm tracking the groups with a Dict so these groupings are coming out in random order. If we utilize an OrderedDict then this example would retain the initial 1:1000 ordering.

@cjprybol
Copy link
Contributor Author

Now retains ordering and checks for categoricals

julia> df2 = DataTable(v1 = categorical(collect(1:1000)), v2 = categorical(fill(1, 1000)));

julia> groupby(df2, [:v1, :v2])
DataTables.GroupedDataTable  1000 groups with keys: Symbol[:v1,:v2]
First Group:
1×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ v1 │ v2 │
├─────┼────┼────┤
│ 111
Last Group:
1×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ v1   │ v2 │
├─────┼──────┼────┤
│ 110001

@cjprybol
Copy link
Contributor Author

I think that's a great idea but I'd like to make that a separate commit. I'd also like to make sure everything works before worrying about automated benchmarking.

Cool! Since you have written benchmarks, it would be great to use the PkgBenchmark.jl framework and add them to the repo.

Given the serious speedup of not sorting, I've swapped sorting to be off by default. What are everyone's thoughts on not sorting in groupby at all? I implemented sorting to maintain consistency with the current implementation, but @kprybol raised the valid point that because the user can pre-sort before groupby, sorting within groupby is functional redundancy and probably should not be part of this function

WRT sorting, I would have a look at what e.g. dplyr does. I would think the number of groups is relatively small in general, so that sorting by default doesn't have a large performance impact?

This too can be addressed by letting the user pre-sort the DataTable as they see fit, although I'm currently disabling sort if CategoricalArrays are present

The case of categorical arrays is very special. isless throws an error when for unordered arrays to protect users from meaningless comparisons. Yet, for presentation purposes, the order of levels makes a lot of sense. So you could just special case them. In general I think you should wrap the call to sort inside a try... catch block so that using a type which cannot be sorted does make everything fail.

@cjprybol
Copy link
Contributor Author

cjprybol commented Feb 12, 2017

current

julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  11.95 KiB
  allocs estimate:  185
  --------------
  minimum time:     37.121 μs (0.00% GC)
  median time:      43.135 μs (0.00% GC)
  mean time:        44.989 μs (2.79% GC)
  maximum time:     2.659 ms (97.18% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  1.26 GiB
  allocs estimate:  388362
  --------------
  minimum time:     1.227 s (13.77% GC)
  median time:      1.288 s (19.05% GC)
  mean time:        1.277 s (18.02% GC)
  maximum time:     1.303 s (19.08% GC)
  --------------
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
ERROR: InexactError()
 in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at ./array.jl:415
 in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
 in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:146
 in ##core#281() at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:283
 in ##sample#282(::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:289
 in #_run#3(::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:317
 in (::BenchmarkTools.#kw##_run)(::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in anonymous at ./<missing>:?
 in #run_result#16(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:33
 in (::BenchmarkTools.#kw##run_result)(::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in #run#17(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:36
 in (::Base.#kw##run)(::Array{Any,1}, ::Base.#run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#280")}) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:71

proposed

julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  19.86 KiB
  allocs estimate:  670
  --------------
  minimum time:     24.267 μs (0.00% GC)
  median time:      27.229 μs (0.00% GC)
  mean time:        32.302 μs (8.62% GC)
  maximum time:     3.849 ms (97.77% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large, [:A, :B])
BenchmarkTools.Trial:
  memory estimate:  23.51 MiB
  allocs estimate:  1032444
  --------------
  minimum time:     36.850 ms (7.93% GC)
  median time:      40.266 ms (9.30% GC)
  mean time:        40.605 ms (9.79% GC)
  maximum time:     55.644 ms (8.94% GC)
  --------------
  samples:          124
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
  memory estimate:  50.27 MiB
  allocs estimate:  2340416
  --------------
  minimum time:     75.170 ms (10.23% GC)
  median time:      81.381 ms (13.28% GC)
  mean time:        83.314 ms (14.56% GC)
  maximum time:     138.363 ms (49.04% GC)
  --------------
  samples:          60
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

just removed: sorting within groupby (as it was a few commits ago)

julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
  memory estimate:  125.58 MiB
  allocs estimate:  4912160
  --------------
  minimum time:     274.978 ms (10.81% GC)
  median time:      296.645 ms (11.31% GC)
  mean time:        300.621 ms (12.86% GC)
  maximum time:     384.436 ms (32.78% GC)
  --------------
  samples:          17
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

a reason why it was removed: faster to presort and feed to non-sorting groupby

julia> @benchmark groupby(sort(large), [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
  memory estimate:  93.92 MiB
  allocs estimate:  4158227
  --------------
  minimum time:     112.816 ms (10.97% GC)
  median time:      120.869 ms (14.16% GC)
  mean time:        122.530 ms (15.05% GC)
  maximum time:     186.425 ms (46.27% GC)
  --------------
  samples:          41
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

row = DataTableRow(intersect, i)
if !haskey(mappings, row)
mappings[row] = [i]
elseif haskey(mappings, row)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The haskey check is redundant.

test/grouping.jl Outdated
@@ -38,7 +38,7 @@ module TestGrouping
@test groupby(df2, [:v2, :v1]).starts == collect(1:1000)

# grouping empty frame
@test groupby(DataTable(A=Int[]), :A).starts == Int[]
# @test groupby(DataTable(A=Int[]), :A).starts == Int[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you comment out these tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This re-write doesn't support empty DataTables

julia> groupby(DataTable(A=Int[]), :A).starts
ERROR: BoundsError: attempt to access 0-element Array{Array{Int64,1},1} at index [1]
 in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:136
 in groupby(::DataTables.DataTable, ::Symbol) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:147

I could add a check at the top of the function if nrow(d) == 0... but I don't see any reason to. Throwing an error seems like a good idea here. If someone is trying to groupby a DataTable then they probably also expect that DataTable to have some values in it.

end
(idx, starts) = groupsort_indexer(x, ngroups)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also remove the function above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is also used in join. Should we move the function to that file since it's no longer used here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be more logical, yes.

@nalimilan
Copy link
Member

How is it possible that sorting beforehand is much faster? Doesn't that mean the sorting implementation was sub-optimal? I would have thought sorting a GroupedDataTable only implied sorting the groups, which should be negligible in most cases (since typically the number of groups is much smaller than the number of rows). Also, could you check with a smaller number of unique values for large? Typically you won't group on variables with thousands of levels. (I wouldn't care about small at all since 20-row datasets is not a use case we should care about.)

Anyway, I would keep the keyword argument even if set to false by default: with real data sets, you have much more variables, and it's annoying to have to repeat the list of variables according to which sorting must be done. Even though dplyr does not seem to support this, Pandas does, and sorts by default.

As regards unordered categorical arrays, I'd be inclined to support isless but not <, so that sort works but people are still protected from comparison attempts. isless already orders NaN and Nullable(), so it's not like it only worked for meaningful comparisons.

Could you say a word about why your implementation is so much faster than the old one? Is that still the case when grouping on categorical arrays? I'm wondering because the code for groupsort_indexer says it's adapted from Pandas, so I guess it was reasonably optimized. Categorical arrays are very special because they allow for very fast indexing without a dictionary lookup. Of course the old code also suffered from the explosion of the number of possible levels when crossing variables, but maybe we could find a middle ground taking advantage of categorical array integer codes where possible.

@cjprybol
Copy link
Contributor Author

How is it possible that sorting beforehand is much faster? Doesn't that mean the sorting implementation was sub-optimal?

Yes, the sorting implementation I wrote was definitely suboptimal. I also don't know how to speed it up (besides just pre-sorting). The issue is that DataTables don't keep track of the original row index (to my knowledge), so when I sort each group I have to make a copy of the DataTable subset (the group), add a new column to keep track of the original rows, and then I use that column post-sort to re-order my indices. Here's an example that shows the sorting process (you'll have to imagine it's a real group).

julia> d = DataTable(A = collect('z':-1:'A'));

julia> temp = head(d)
6×1 DataTables.DataTable
│ Row │ A   │
├─────┼─────┤
│ 1'z' │
│ 2'y' │
│ 3'x' │
│ 4'w' │
│ 5'v' │
│ 6'u'# let's pretend we have a group and the members of that group are at these rows
julia> group_indices = [1, 5, 8, 10]
4-element Array{Int64,1}:
  1
  5
  8
 10

julia> temp = d[group_indices, :]
4×1 DataTables.DataTable
│ Row │ A   │
├─────┼─────┤
│ 1'z' │
│ 2'v' │
│ 3's' │
│ 4'q'# I need to make a new column to track order, which I assume is making a copy of the DataTable
julia> temp[:rowid] = collect(1:nrow(temp))
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> sort(temp)
4×2 DataTables.DataTable
│ Row │ A   │ rowid │
├─────┼─────┼───────┤
│ 1'q'4     │
│ 2's'3     │
│ 3'v'2     │
│ 4'z'1# I can now use the temporary column to reorder my indices and these get passed 
# to the GroupedDataFrame constructor
julia> group_indices = group_indices[sort(temp)[:rowid]]
4-element Array{Int64,1}:
 10
  8
  5
  1

So, sorting groups during the function is VERY inefficient. And we cannot sort the grouped DataTable

julia> small = DataTable(A = a, B = b, C = c, D = c);

julia> groupby(small, [:A, :B])
DataTables.GroupedDataTable  7 groups with keys: Symbol[:A,:B]
First Group:
3×4 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ A     │ B  │ C │ D │
├─────┼───────┼────┼───┼───┤
│ 1#NULL │ :a │ b │ b │2#NULL │ :a │ a │ a │3#NULL │ :a │ b │ b │

Last Group:
1×4 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ A │ B  │ C │ D │
├─────┼───┼────┼───┼───┤
│ 12:b │ b │ b │

julia> sort(ans)
ERROR: MethodError: no method matching sort(::DataTables.GroupedDataTable)
Closest candidates are:
  sort(::AbstractUnitRange{T}) at range.jl:843
  sort(::Range{T}) at range.jl:846
  sort{Tv,Ti}(::SparseVector{Tv,Ti}; kws...) at sparse/sparsevector.jl:1716

There might be other options for how to sort within groupby, but I feel strongly that the best option is to sort the DataTable at the very beginning and just keep track of the order using an OrderedDict.

We could sort within the function as the very first step in groupby, but if we do so then we're also responsible for supporting all of the potential cases of using sort that could be problematic like the case of CategoricalArrays. I'm worried that trying to support many styles of sorting within groupby could become a maintenance headache. If sorting beforehand is faster, conceptually straightforward, and easier to maintain, then I think it's the best way forward. I should add a note to the docstrings saying that pre-sorting is necessary if sorted output is desired, but I think it's a very reasonable expectation that the user can sort their DataTable. We also don't have to worry about breaking anyone's code because I don't think anyone is using this combination of NullableArrays, DataTable, groupby and there will be several breaking changes for users swapping from DataFrames to DataTables

Also, could you check with a smaller number of unique values for large? Typically you won't group on variables with thousands of levels.

I rewrote groupby specifically because I had a dataset like this! I couldn't move forward on my research project because I have a DataTable that has 8891 groups that could not be grouped. But yes, of course, I can (and should) check a smaller number of groups. Here is 10 groups repeated 1000x

julia> A = repeat(rand(10), inner=1000);

julia> B = repeat(rand(10), inner=1000);

julia> C = repeat(rand(10), inner=1000);

julia> D = repeat(rand(10), inner=1000);

julia> E = repeat(rand(10), inner=1000);

julia> F = repeat(rand(10000));

julia> large = DataTable(A = A, B = B, C = C, D = D, E = E, F = F);

julia> @benchmark groupby(large, [:A, :B, :C, :D, :E])
BenchmarkTools.Trial:
  memory estimate:  23.44 MiB
  allocs estimate:  1094488
  --------------
  minimum time:     28.802 ms (8.28% GC)
  median time:      30.470 ms (8.21% GC)
  mean time:        31.400 ms (8.80% GC)
  maximum time:     37.249 ms (8.10% GC)
  --------------
  samples:          160
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Anyway, I would keep the keyword argument even if set to false by default: with real data sets, you have much more variables, and it's annoying to have to repeat the list of variables according to which sorting must be done. Even though dplyr does not seem to support this, Pandas does, and sorts by default.

Pandas sorts by default but it also does not automatically reset the row indices, which is precisely the limitation that prevents sorting from being efficient within my groupby. If Julia had the same functionality than I could just use the row index

Python 3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import Pandas as pd
>>> import numpy as np
>>> pd.DataFrame(np.random.randn(6,4))
          0         1         2         3
0  1.617402  1.279826 -1.272431  0.786087
1 -1.637442  1.152224  0.267500 -1.536474
2  0.969473  0.168263  1.022955 -0.769514
3 -0.386099 -0.493472 -1.106044  1.284097
4 -0.712302  1.951609  1.714415 -0.448146
5  1.586177  2.684248 -1.141071  2.184125
>>> d.index
RangeIndex(start=0, stop=6, step=1)
>>> d.sort(0)
          0         1         2         3
0 -1.451303 -1.077141 -0.769869  0.244627
2 -1.324284  0.742099 -1.347156  2.949335
5 -0.114870 -0.156239 -0.435413 -0.980006
1 -0.020488 -0.791392 -0.412640 -2.084760
3  1.026864  1.214448 -0.279307  0.911531
4  1.987862  2.016643  1.352183  1.010149
>>> d.sort(0).index
Int64Index([0, 2, 5, 1, 3, 4], dtype='int64')

As regards unordered categorical arrays, I'd be inclined to support isless but not <, so that sort works but people are still protected from comparison attempts. isless already orders NaN and Nullable(), so it's not like it only worked for meaningful comparisons.

This sounds great. I agree that supporting a reasonable amount of sorting functionality of levels within CategoricalArrays is a good idea.

Could you say a word about why your implementation is so much faster than the old one? Is that still the case when grouping on categorical arrays? I'm wondering because the code for groupsort_indexer says it's adapted from Pandas, so I guess it was reasonably optimized. Categorical arrays are very special because they allow for very fast indexing without a dictionary lookup. Of course the old code also suffered from the explosion of the number of possible levels when crossing variables, but maybe we could find a middle ground taking advantage of categorical array integer codes where possible.

I can't speak as to why this is faster than the Pandas algorithm because I don't quite understand the Pandas algorithm. Maybe @johnmyleswhite could be more helpful? It seems that this implementation is performing fewer calculations than the Pandas algorithm. It only looks at each piece of relevant data once and keeps track of no more than is required to properly generate a call to GroupedDataTable. It looks like the TODO note to reduce the number of potential groups has been around since the function was first written, and only that user powerdistribution likely knows how to implement a middle ground approach that takes advantage of categorical array integer codes. I certainly don't know how to implement it.

@cjprybol
Copy link
Contributor Author

As a humbling comparison, here's the same thing benchmarked in Python.

import pandas as pd
import numpy as np

a = np.random.choice([1, 2, np.NaN, 4], 20)
b = np.random.choice(["a", "b"], 20)
c = np.random.choice([True, False], 20)
d = np.random.choice(np.random.rand(2), 20)

A = np.repeat(np.random.rand(8891), 2)
B = np.repeat(np.random.rand(8891), 2)
C = np.repeat(np.random.rand(8891), 2)
D = np.repeat(np.random.rand(8891), 2)
E = np.repeat(np.random.rand(8891), 2)
F = np.random.rand(17782)
small = pd.DataFrame({'A' : a, 'B' : b, 'C' : c, 'D' : d})
large = pd.DataFrame({'A' : A, 'B' : B, 'C' : C, 'D' : D, 'E' : E, 'F' : F})
small.head()
large.head()

results

In [11]: %timeit small.groupby(['A', 'B'])
The slowest run took 44.56 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 62.5 µs per loop

In [12]: %timeit large.groupby(['A', 'B'])
The slowest run took 6.38 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 62.4 µs per loop

In [13]: %timeit large.groupby(['A', 'B', 'C', 'D', 'E'])
The slowest run took 4.45 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 117 µs per loop

so we probably should use the Pandas approach again in the future, but I think this rewrite is a suitable stepping stone to utilize for the time being

@cjprybol cjprybol changed the title [WIP] rewrite groupby rewrite groupby Feb 14, 2017
@nalimilan
Copy link
Member

Given the very good Pandas timings, I'm reluctant to move away from their algorithm without a deeper examination of what they are doing. Could you have a look at their code, and in particular at how they work around the limitation on the number of groups? I've looked a bit in their code base, and I finally found this function which seems to match our current algorithm. Looks like there's logic to stop before the number of groups gets too large, by dropping empty levels and starting again, but I haven't investigated enough to understand how that works.

I can't speak as to why this is faster than the Pandas algorithm because I don't quite understand the Pandas algorithm. Maybe @johnmyleswhite could be more helpful? It seems that this implementation is performing fewer calculations than the Pandas algorithm. It only looks at each piece of relevant data once and keeps track of no more than is required to properly generate a call to GroupedDataTable. It looks like the TODO note to reduce the number of potential groups has been around since the function was first written, and only that user powerdistribution likely knows how to implement a middle ground approach that takes advantage of categorical array integer codes. I certainly don't know how to implement it.

What makes the Pandas algorithm more efficient is likely that for each row, only an array indexing operation is needed (repeated for each column), which is much cheaper than a dictionary lookup. The number of calculations does not mean much if you don't take into account how costly they are.

That's why I asked for a benchmark when all columns are categorical arrays: in that case, the integer codes have already been computed and no dictionary lookup is needed at all. I would be surprised if your algorithm is faster than the current one in that case -- that would be an indication that something is clearly wrong in that code. When the input is not a categorical array, the existing code creates a new one, which is costly (and which I haven't benchmarked seriously yet, so could be a bottleneck).

It would also be interesting to know what Pandas does for e.g. string columns: does it perform the conversion as we do, or does it use a different code path?

@cjprybol
Copy link
Contributor Author

Given the very good Pandas timings, I'm reluctant to move away from their algorithm without a deeper examination of what they are doing. Could you have a look at their code, and in particular at how they work around the limitation on the number of groups?

The broken link in the existing function I think is supposed to reference this page which describes the high-level concepts of the algorithm design. Should help us make sense of the actual code.

That's why I asked for a benchmark when all columns are categorical arrays

Sorry, I missed that you had asked for that benchmark. The current implementation is the same speed on an ordered test and slower/broken for unordered test, although that may not be a fair benchmark because the rewrite doesn't sort. (and I can't sort the categorical arrays so I can't groupby(sort(small_unordered), [:v1, :v2]) to make it fair as I did in other tests).

code

using DataTables
using BenchmarkTools

small_ordered = DataTable(v1 = categorical(collect(1:1000)), v2 = categorical(fill(1, 1000)));
small_unordered = DataTable(v1 = categorical(rand(1000)), v2 = categorical(rand(1000)));
large_ordered = DataTable(v1 = categorical(collect(1:100000)), v2 = categorical(fill(1, 100000)));
large_unordered = DataTable(v1 = categorical(rand(100000)), v2 = categorical(rand(100000)));
wide_ordered = DataTable(v1 = categorical(collect(1:100000)),
                         v2 = categorical(fill(1, 100000)),
                         v3 = categorical(collect(1:100000)),
                         v4 = categorical(collect(1:100000)));
@benchmark groupby(small_ordered, [:v1, :v2])
@benchmark groupby(small_unordered, [:v1, :v2])
@benchmark groupby(large_ordered, [:v1, :v2])
@benchmark groupby(large_unordered, [:v1, :v2])
@benchmark groupby(wide_ordered, [:v1, :v2, :v3, :v4])

current

julia> @benchmark groupby(small_ordered, [:v1, :v2])
BenchmarkTools.Trial:
  memory estimate:  329.38 KiB
  allocs estimate:  9412
  --------------
  minimum time:     1.826 ms (0.00% GC)
  median time:      1.961 ms (0.00% GC)
  mean time:        2.071 ms (2.09% GC)
  maximum time:     7.185 ms (49.60% GC)
  --------------
  samples:          2406
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(small_unordered, [:v1, :v2])
BenchmarkTools.Trial:
  memory estimate:  16.72 MiB
  allocs estimate:  15905
  --------------
  minimum time:     10.202 ms (0.00% GC)
  median time:      12.068 ms (14.76% GC)
  mean time:        11.990 ms (12.14% GC)
  maximum time:     16.225 ms (13.57% GC)
  --------------
  samples:          417
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large_ordered, [:v1, :v2])
BenchmarkTools.Trial:
  memory estimate:  44.40 MiB
  allocs estimate:  1890360
  --------------
  minimum time:     213.324 ms (6.65% GC)
  median time:      223.528 ms (7.69% GC)
  mean time:        223.929 ms (7.53% GC)
  maximum time:     240.368 ms (7.97% GC)
  --------------
  samples:          23
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large_unordered, [:v1, :v2])
ERROR: InexactError()
 in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at ./array.jl:415
 in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
 in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:146
 in ##core#284() at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:283
 in ##sample#285(::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:289
 in #_run#4(::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:317
 in (::BenchmarkTools.#kw##_run)(::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in anonymous at ./<missing>:?
 in #run_result#16(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:33
 in (::BenchmarkTools.#kw##run_result)(::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in #run#17(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:36
 in (::Base.#kw##run)(::Array{Any,1}, ::Base.#run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#283")}) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:71

julia> @benchmark groupby(wide_ordered, [:v1, :v2, :v3, :v4])
ERROR: InexactError()
 in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at ./array.jl:415
 in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
 in groupby(::DataTables.DataTable, ::Array{Symbol,1}) at /Users/Cameron/.julia/v0.5/DataTables/src/groupeddatatable/grouping.jl:146
 in ##core#287() at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:283
 in ##sample#288(::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:289
 in #_run#5(::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:317
 in (::BenchmarkTools.#kw##_run)(::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in anonymous at ./<missing>:?
 in #run_result#16(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:33
 in (::BenchmarkTools.#kw##run_result)(::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in #run#17(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:36
 in (::Base.#kw##run)(::Array{Any,1}, ::Base.#run, ::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}, ::BenchmarkTools.Parameters) at ./<missing>:0
 in warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#286")}) at /Users/Cameron/.julia/v0.5/BenchmarkTools/src/execution.jl:71

proposed note the addition of sort should probably be used to make it a truly fair comparison, but sort doesn't work on CategoricalArrays

julia> @benchmark groupby(small_ordered, [:v1, :v2])
BenchmarkTools.Trial:
  memory estimate:  923.17 KiB
  allocs estimate:  34720
  --------------
  minimum time:     1.093 ms (0.00% GC)
  median time:      1.202 ms (0.00% GC)
  mean time:        1.448 ms (12.50% GC)
  maximum time:     8.744 ms (79.89% GC)
  --------------
  samples:          3432
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(small_unordered, [:v1, :v2])
BenchmarkTools.Trial:
  memory estimate:  939.94 KiB
  allocs estimate:  35487
  --------------
  minimum time:     1.203 ms (0.00% GC)
  median time:      1.328 ms (0.00% GC)
  mean time:        1.575 ms (11.69% GC)
  maximum time:     8.675 ms (81.17% GC)
  --------------
  samples:          3156
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large_ordered, [:v1, :v2])
BenchmarkTools.Trial:
  memory estimate:  104.09 MiB
  allocs estimate:  4357825
  --------------
  minimum time:     220.576 ms (15.41% GC)
  median time:      280.141 ms (24.22% GC)
  mean time:        301.521 ms (31.36% GC)
  maximum time:     394.553 ms (43.66% GC)
  --------------
  samples:          17
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(large_unordered, [:v1, :v2])
BenchmarkTools.Trial:
  memory estimate:  104.07 MiB
  allocs estimate:  4357201
  --------------
  minimum time:     239.889 ms (13.99% GC)
  median time:      304.983 ms (24.08% GC)
  mean time:        325.885 ms (29.55% GC)
  maximum time:     432.313 ms (43.14% GC)
  --------------
  samples:          16
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(wide_ordered, [:v1, :v2, :v3, :v4])
BenchmarkTools.Trial:
  memory estimate:  175.00 MiB
  allocs estimate:  7968344
  --------------
  minimum time:     399.089 ms (17.33% GC)
  median time:      441.455 ms (21.55% GC)
  mean time:        463.932 ms (25.80% GC)
  maximum time:     573.032 ms (37.77% GC)
  --------------
  samples:          11
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

One more point I'd like to bring into the discussion is where groupby should live. The goal is still to transfer from Data(Frames|Tables)Meta to StructuredQueries and/or Query, correct? Will those projects be worked on as GSoC projects? Should we focus an effort regarding an optimized groupby to live within those frameworks?

And leaving this here for future reference: Pandas benchmarks

@nalimilan
Copy link
Member

Thanks for the link. It indeed seems to explain how it works (though I still have to read it more carefully). This part in particular seems to imply we really need the complex logic to be fast:

You might say, well, this seems like a lot of work and maybe we should just zip the keys (forming an array of Python tuples) and do a dumb algorithm? The speed difference ends up being something like an order of magnitude or more faster by being careful in this way and working with indirect integer index arrays.

To sort categorical arrays, for now just pass ordered=true on construction.

Depending on how generic we can make them, the grouping features may end up in AbstractTables or in a query package, but I'm not sure that will be possible. Anyway for now having a good implementation in this package would be a great step.

@cjprybol
Copy link
Contributor Author

In recognition that this rewrite is an improvement over the current implementation (it doesn't break), but also is inferior to the ideal implementation (a better translation of the Pandas algorithm) what are your thoughts on merging this and opening an issue to improve further?

@cjprybol cjprybol changed the base branch from nl/rename to master February 15, 2017 20:23
@nalimilan
Copy link
Member

I think we need to be careful, because the slowness of the current implementation is suspicious. Indeed, a quick investigation showed that it is type-unstable. With #12, I get a much better performance when all columns are categorical arrays; the improvement is more limited with other column types, but it's no longer so clear that this PR's implementation would have the same performance.

@nalimilan
Copy link
Member

Ah, and I forgot there was JuliaData/DataFrames.jl#850 (see in particular these lines).

@cjprybol
Copy link
Contributor Author

Looks like there are several ways we can improve upon the current groupby! I've been looking around for papers on the topic to see if any consensus is out there on what the best algorithm/approach is. Ideally, that paper would have the pseudocode to make implementing it easy as well. I'm still looking.

I've got family in town and won't be able to review this in detail for a few days. I'll follow along with any progress and jump back in next week to learn more about JuliaData/DataFrames.jl#850 and #12 work by stepping through the code and benchmarking. In general, I'm ok with any approach that alleviates failures on large numbers of groups

@nalimilan
Copy link
Member

It would be great if you can find such a paper, but if not I would think Pandas approach is a good model to follow.

n = length(x)
# counts = x.pool
counts = fill(0, ngroups + 1)
for i = 1:n
Copy link
Member

@ararslan ararslan Feb 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to @inbounds this loop (and the others in this function)

end

# mark the start of each contiguous group of like-indexed data
where = fill(1, ngroups + 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how much it really matters, but it might be best to avoid using where as a variable name since it has syntactic meaning in 0.6.

@nalimilan
Copy link
Member

Closing since #17 replaced this.

@nalimilan nalimilan closed this Mar 6, 2017
@cjprybol cjprybol deleted the cjp/groupby branch March 7, 2017 04:59
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants