Skip to content
This repository was archived by the owner on May 5, 2019. It is now read-only.

WIP: Fix type instability in groupby() #12

Closed
wants to merge 4 commits into from
Closed

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Feb 16, 2017

This ensures the loop over rows is type stable, which increases performance
a lot.

This is WIP because it must be balanced with #3. It's mainly intended to allow for performance comparisons.

The same strategy could be applied to DataFrames, which are faster than DataTables currently, but much slower than with this PR.

using DataTables
using BenchmarkTools

dt1 = DataFrame(v1 = PooledDataArray(repeat(1:10, inner=1000)), v2 = PooledDataArray(repeat(1:10, outer=1000)));
dt2 = DataFrame(v1 = PooledDataArray(repeat(1:100, inner=100)), v2 = PooledDataArray(repeat(1:100, outer=100)));
dt3 = hcat(dt1, dt2)

@benchmark groupby(dt1, [:v1, :v2])
@benchmark groupby(dt2, [:v1, :v2])
@benchmark groupby(dt3, [:v1, :v2, :v1_1, :v2_1])


# BEFORE
julia> @benchmark groupby(dt1, [:v1, :v2])

BenchmarkTools.Trial: 
  memory estimate:  1.22 MiB
  allocs estimate:  66536
  --------------
  minimum time:     4.527 ms (0.00% GC)
  median time:      4.747 ms (0.00% GC)
  mean time:        5.047 ms (2.08% GC)
  maximum time:     12.827 ms (18.36% GC)
  --------------
  samples:          988
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(dt2, [:v1, :v2])
BenchmarkTools.Trial: 
  memory estimate:  1.99 MiB
  allocs estimate:  85972
  --------------
  minimum time:     5.063 ms (0.00% GC)
  median time:      5.518 ms (0.00% GC)
  mean time:        6.066 ms (3.17% GC)
  maximum time:     13.451 ms (23.09% GC)
  --------------
  samples:          821
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(dt3, [:v1, :v2, :v1_1, :v2_1])
BenchmarkTools.Trial: 
  memory estimate:  20.45 MiB
  allocs estimate:  238751
  --------------
  minimum time:     17.675 ms (0.00% GC)
  median time:      20.564 ms (7.95% GC)
  mean time:        20.652 ms (7.97% GC)
  maximum time:     25.163 ms (6.69% GC)
  --------------
  samples:          242
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

# AFTER:
julia> @benchmark groupby(dt1, [:v1, :v2])
BenchmarkTools.Trial: 
  memory estimate:  122.64 KiB
  allocs estimate:  21
  --------------
  minimum time:     63.710 μs (0.00% GC)
  median time:      69.333 μs (0.00% GC)
  mean time:        76.896 μs (5.47% GC)
  maximum time:     971.994 μs (91.85% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(dt2, [:v1, :v2])
BenchmarkTools.Trial: 
  memory estimate:  520.89 KiB
  allocs estimate:  29
  --------------
  minimum time:     148.724 μs (0.00% GC)
  median time:      163.476 μs (0.00% GC)
  mean time:        188.497 μs (8.29% GC)
  maximum time:     1.161 ms (74.62% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark groupby(dt3, [:v1, :v2, :v1_1, :v2_1])
BenchmarkTools.Trial: 
  memory estimate:  16.56 MiB
  allocs estimate:  36
  --------------
  minimum time:     7.442 ms (0.00% GC)
  median time:      8.863 ms (5.16% GC)
  mean time:        8.971 ms (7.86% GC)
  maximum time:     13.647 ms (22.29% GC)
  --------------
  samples:          556
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This ensures the loop over rows is type stable, which increases performance
a lot.
@nalimilan nalimilan mentioned this pull request Feb 16, 2017
This increases performance a bit more.
Using convert() avoids creating a copy when the column is already
categorical. Also hoist the access to the refs field and avoid adding
anynull - 1 at each iteration. This makes the code a bit faster.
Use @inbounds where applicable. Avoid unnecessary copies.
@nalimilan
Copy link
Member Author

Closing in favor of #17.

@nalimilan nalimilan closed this Feb 22, 2017
@nalimilan nalimilan deleted the nl/groupby branch February 22, 2017 14:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant