Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timings compared to new grouping code meant for DataFrames #3

Open
tshort opened this issue Nov 30, 2015 · 3 comments
Open

Timings compared to new grouping code meant for DataFrames #3

tshort opened this issue Nov 30, 2015 · 3 comments

Comments

@tshort
Copy link

tshort commented Nov 30, 2015

I was intrigued by the use of ht_keyindex, so I compared timings of freqtable to some new DataFrames code I'm experimenting with (see JuliaData/DataFrames.jl#894). Feel free to close this issue; I just thought you might like to see the timings. freqtable is allocating quite a bit.

using DataFrames,DataFramesMeta
using FreqTables
n=10_000_000
y=[string("id",i) for i in rand(1:10,n)];
x=rand(1:10,n);
@time pda=PooledDataArray(y,UInt8);
d=DataFrame(x=P(x),y=P(y),pda=pda);

Here are timings from FreqTables:

freqtable(x) : 15.6 secs
freqtable(y) : 16.7 secs
freqtable(pda) : 1.5 secs
freqtable(x,pda) : 26 secs

The equivalent timings from DataFramesMeta:

freqtable(x) : 0.36 secs
freqtable(y) : 8.3 secs
freqtable(pda) : 0.38 secs
freqtable(x,pda) : 2.4 secs

Here is an edited transcript:

julia> using DataFrames,DataFramesMeta

julia> using FreqTables

julia> n=10_000_000
10000000

julia> y=[string("id",i) for i in rand(1:10,n)];

julia> x=rand(1:10,n);

julia> @time pda=PooledDataArray(y,UInt8);
  5.779360 seconds (20.19 M allocations: 401.625 MB, 18.34% gc time)

julia> @time f=freqtable(x)
 15.603074 seconds (120.68 M allocations: 2.264 GB, 21.57% gc time)
10-element NamedArrays.NamedArray{Int64,1,Array{Int64,1},Tuple{Dict{Int64,Int64}}}
1  999111
2  998631
3  1000254
4  1000311
5  1000433
6  1000310
7  1001538
8  999610
9  1000517
10 999285


julia> @time f=freqtable(y)
 16.711632 seconds (130.15 M allocations: 2.391 GB, 17.37% gc time)
10-element NamedArrays.NamedArray{Int64,1,Array{Int64,1},Tuple{Dict{ByteString,Int64}}}
id1  1001500
id10 1001339
id2  998962
id3  1000494
id4  998943
id5  997610
id6  1001378
id7  999609
id8  999185
id9  1000980


julia> @time f=freqtable(pda)
  1.533855 seconds (20.27 M allocations: 317.998 MB, 31.14% gc time)
10-element NamedArrays.NamedArray{Int64,1,Array{Int64,1},Tuple{Dict{ByteString,Int64}}}
id1  1001500
id10 1001339
id2  998962
id3  1000494
id4  998943
id5  997610
id6  1001378
id7  999609
id8  999185
id9  1000980


julia> @time f=freqtable(x,pda)
 26.057288 seconds (170.19 M allocations: 3.886 GB, 11.94% gc time)
10x10 NamedArrays.NamedArray{Int64,2,Array{Int64,2},Tuple{Dict{Int64,Int64},Dict{ByteString,Int64}}}
Dim1 \ Dim2 id1    id10   id2    id3      id6    id7    id8    id9
1           100063 100237 99591  99190    100529 100125 99834  100022
2           100004 99640  99586  100140   100252 99735  100073 99410
3           99753  100040 100119 100105   100021 100308 99679  100057
4           100907 99607  100086 99991    99869  100125 100177 100274
5           100523 100327 100114 100446   100679 99593  99815  99843
6           100028 100773 100367 100116   99554  100015 99684  99877
7           100194 100236 99505  100534   100135 100268 100320 100515
8           100069 100080 99976  99531    99368  99635  100247 100916
9           99941  100095 100321 100263   100257 99765  99773  100261
10          100018 100304 99297  100178   100714 100040 99583  99805

julia> d=DataFrame(x=P(x),y=P(y),pda=pda);

julia> @time DataFramesMeta.@aggregate(DataFramesMeta.groupby(d,:x),count=length(:pda))
  0.361674 seconds (577 allocations: 85.866 MB, 56.09% gc time)
10x2 DataFrames.DataFrame
| Row | x  | count   |
|-----|----|---------|
| 1   | 1  | 999111  |
| 2   | 2  | 998631  |
| 3   | 3  | 1000254 |
| 4   | 4  | 1000311 |
| 5   | 5  | 1000433 |
| 6   | 6  | 1000310 |
| 7   | 7  | 1001538 |
| 8   | 8  | 999610  |
| 9   | 9  | 1000517 |
| 10  | 10 | 999285  |

julia> @time DataFramesMeta.@aggregate(DataFramesMeta.groupby(d,:y),count=length(:pda))
  8.267301 seconds (20.22 M allocations: 517.192 MB, 42.02% gc time)
10x2 DataFrames.DataFrame
| Row | y      | count   |
|-----|--------|---------|
| 1   | "id1"  | 1001500 |
| 2   | "id10" | 1001339 |
| 3   | "id2"  | 998962  |
| 4   | "id3"  | 1000494 |
| 5   | "id4"  | 998943  |
| 6   | "id5"  | 997610  |
| 7   | "id6"  | 1001378 |
| 8   | "id7"  | 999609  |
| 9   | "id8"  | 999185  |
| 10  | "id9"  | 1000980 |

julia> @time DataFramesMeta.@aggregate(DataFramesMeta.groupby(d,:pda),count=length(:pda))
  0.377314 seconds (580 allocations: 85.866 MB, 59.59% gc time)
10x2 DataFrames.DataFrame
| Row | pda    | count   |
|-----|--------|---------|
| 1   | "id1"  | 1001500 |
| 2   | "id10" | 1001339 |
| 3   | "id2"  | 998962  |
| 4   | "id3"  | 1000494 |
| 5   | "id4"  | 998943  |
| 6   | "id5"  | 997610  |
| 7   | "id6"  | 1001378 |
| 8   | "id7"  | 999609  |
| 9   | "id8"  | 999185  |
| 10  | "id9"  | 1000980 |

julia> @time r=DataFramesMeta.@aggregate(DataFramesMeta.groupby(d,[:x,:pda]),count=length(:y));
  2.363423 seconds (9.95 M allocations: 380.801 MB, 12.60% gc time)

julia> unstack(r,:pda,:count)
10x11 DataFrames.DataFrame
| Row | x  | id1    | id10   | id2    | id3    | id4    | id5    | id6    |
|-----|----|--------|--------|--------|--------|--------|--------|--------|
| 1   | 1  | 100063 | 100237 | 99591  | 99190  | 99871  | 99649  | 100529 |
| 2   | 2  | 100004 | 99640  | 99586  | 100140 | 100149 | 99642  | 100252 |
| 3   | 3  | 99753  | 100040 | 100119 | 100105 | 100163 | 100009 | 100021 |
| 4   | 4  | 100907 | 99607  | 100086 | 99991  | 99778  | 99497  | 99869  |
| 5   | 5  | 100523 | 100327 | 100114 | 100446 | 99635  | 99458  | 100679 |
| 6   | 6  | 100028 | 100773 | 100367 | 100116 | 100118 | 99778  | 99554  |
| 7   | 7  | 100194 | 100236 | 99505  | 100534 | 99528  | 100303 | 100135 |
| 8   | 8  | 100069 | 100080 | 99976  | 99531  | 99618  | 100170 | 99368  |
| 9   | 9  | 99941  | 100095 | 100321 | 100263 | 100006 | 99835  | 100257 |
| 10  | 10 | 100018 | 100304 | 99297  | 100178 | 100077 | 99269  | 100714 |

| Row | id7    | id8    | id9    |
|-----|--------|--------|--------|
| 1   | 100125 | 99834  | 100022 |
| 2   | 99735  | 100073 | 99410  |
| 3   | 100308 | 99679  | 100057 |
| 4   | 100125 | 100177 | 100274 |
| 5   | 99593  | 99815  | 99843  |
| 6   | 100015 | 99684  | 99877  |
| 7   | 100268 | 100320 | 100515 |
| 8   | 99635  | 100247 | 100916 |
| 9   | 99765  | 99773  | 100261 |
| 10  | 100040 | 99583  | 99805  |
@tshort
Copy link
Author

tshort commented Nov 30, 2015

Inspired by your use of ht_keyindex, I wrote a constructor for PooledDataArrays that's more than twice as fast as the normal constructor. It does it by doing less dict lookups as your code does. See here for my code:

https://github.com/JuliaStats/DataFramesMeta.jl/blob/ts/grouping/src/df-replacements.jl#L136-L166

In particular, it helped to separate the code that loops through the vector into its own function. That helped Julia figure out types.

@nalimilan
Copy link
Owner

Interesting. I find the very slow timings quite surprising, as I remember doing some careful optimization when I wrote the code. I wonder whether I recently introduced a type instability when porting to 0.4, for which I hacked a workaround for the one-dimensional table case. I'll have a deeper look next week or so.

@nalimilan
Copy link
Owner

nalimilan commented Apr 22, 2016

I've had a look at this, and indeed my code suffered from type instability. Not sure how I missed that. Anyway, now the timings are much better on git master, and even faster than DataFramesMeta for the PDA case (almost no allocations!). Though some cases are still slower, I need to investigate why.

Anyway, I have some code here to use DataFramesMeta when a DataFrame is passed to freqtable. This limits code duplication a lot, and will make it easier to support any kind of data source (including SQL databases). I'll push this when it's ready.

Times are for a second run, on 0.5. To easy copy/paste, the gist is here: https://gist.github.com/nalimilan/905624dd5f44b4c020d57c16fcaab498

julia> using DataFrames,DataFramesMeta, FreqTables

julia> n=1000_000
1000000

julia> y=ASCIIString[string("id",i) for i in rand(1:10,n)];


julia> x=rand(1:10,n);

julia> @time pda=PooledDataArray(y,UInt8);
  0.445467 seconds (999.53 k allocations: 24.075 MB)

julia> @time f=freqtable(x);
  0.033819 seconds (81 allocations: 5.047 KB)

julia> @time f=freqtable(y);
  0.207490 seconds (2.00 M allocations: 45.783 MB)

julia> @time f=freqtable(pda);
  0.003743 seconds (47 allocations: 3.016 KB)

julia> @time f=freqtable(x, pda);
  2.345581 seconds (4.00 M allocations: 91.574 MB, 48.86% gc time)

julia> d=DataFrame(x=P(x),y=P(y),pda=pda);

julia> @time @by(d, :x, N=length(:x));
  0.268315 seconds (1.01 M allocations: 57.985 MB, 15.64% gc time)

julia> @time @by(d, :y, N=length(:x));
  0.520084 seconds (1.01 M allocations: 57.986 MB, 9.09% gc time)

julia> @time @by(d, :pda, N=length(:x));
  0.077855 seconds (12.55 k allocations: 25.328 MB, 7.45% gc time)

julia> @time @by(d, (:x, :pda), N=length(:x));
  1.034521 seconds (4.01 M allocations: 98.190 MB, 7.37% gc time)

UPDATE: With new fixes for the general case, the timings are now always better than DataFramesMeta, except when crossing a PDA with an array. This use case isn't the most interesting IMHO, though I could try doing something about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants