Timings compared to new grouping code meant for DataFrames #3
Comments
Inspired by your use of https://github.com/JuliaStats/DataFramesMeta.jl/blob/ts/grouping/src/df-replacements.jl#L136-L166 In particular, it helped to separate the code that loops through the vector into its own function. That helped Julia figure out types.
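The point about moving the loop into its own function is usually called a function barrier. Here is a minimal sketch of the idea, written in current Julia syntax rather than the 0.4/0.5 syntax used elsewhere in this thread. The names `count_values!` and `count_column` are hypothetical and are not taken from FreqTables or DataFramesMeta.

```julia
# Inner kernel: Julia compiles a specialized method for each concrete
# element type T, so the loop body is type-stable.
function count_values!(counts::Dict{T,Int}, v::AbstractVector{T}) where {T}
    for x in v
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

# Outer function: `cols` is a Vector{Any}, so the compiler only knows
# `col` as `Any` here. Passing it to `count_values!` is the function
# barrier: dispatch happens once, then the loop runs on a concrete type.
function count_column(cols::Vector{Any}, i::Int)
    col = cols[i]
    return count_values!(Dict{eltype(col),Int}(), col)
end

cols = Any[[1, 2, 2, 3], ["a", "b", "a"]]
int_counts = count_column(cols, 1)
str_counts = count_column(cols, 2)
```

If the loop were written directly inside `count_column`, every iteration would have to work with a value of unknown type, which is the kind of slowdown discussed in this issue.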
Interesting. I find the very slow timings quite surprising, as I remember doing some careful optimization when I wrote the code. I wonder whether I recently introduced a type instability when porting to 0.4, for which I hacked a workaround for the one-dimensional table case. I'll have a deeper look next week or so.
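For readers unfamiliar with the term, a type instability means the compiler cannot infer a single concrete return type for a call. One way to check for it, sketched here in current Julia with the standard `Test` library, is the `@inferred` macro, which throws when the actual return type differs from the inferred one. The functions `unstable` and `stable` are hypothetical examples, not code from FreqTables.

```julia
using Test

# Returns an Int on one branch and a Float64 on the other, so the
# inferred return type is a Union rather than a single concrete type.
unstable(x) = x > 0 ? 1 : 1.0

# Both branches return a Float64, so inference yields one concrete type.
stable(x) = x > 0 ? 1.0 : -1.0

# Passes silently and returns the value of the call.
stable_result = @inferred stable(2)

# `@inferred` throws for the unstable function; record that in a flag.
flagged = try
    @inferred unstable(2)
    false
catch
    true
end
println("unstable flagged: ", flagged)
```

Running `@code_warntype unstable(2)` at the REPL is another way to inspect the inferred types interactively.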
I've had a look at this, and indeed my code suffered from type instability. Not sure how I missed that. Anyway, now the timings are much better on git master, and even faster than DataFramesMeta.
Anyway, I have some code here to use DataFramesMeta when a
Times are for a second run, on 0.5. To ease copy/paste, the gist is here: https://gist.github.com/nalimilan/905624dd5f44b4c020d57c16fcaab498
julia> using DataFrames, DataFramesMeta, FreqTables
julia> n=1000_000
1000000
julia> y=ASCIIString[string("id",i) for i in rand(1:10,n)];
julia> x=rand(1:10,n);
julia> @time pda=PooledDataArray(y,UInt8);
0.445467 seconds (999.53 k allocations: 24.075 MB)
julia> @time f=freqtable(x);
0.033819 seconds (81 allocations: 5.047 KB)
julia> @time f=freqtable(y);
0.207490 seconds (2.00 M allocations: 45.783 MB)
julia> @time f=freqtable(pda);
0.003743 seconds (47 allocations: 3.016 KB)
julia> @time f=freqtable(x, pda);
2.345581 seconds (4.00 M allocations: 91.574 MB, 48.86% gc time)
julia> d=DataFrame(x=P(x),y=P(y),pda=pda);
julia> @time @by(d, :x, N=length(:x));
0.268315 seconds (1.01 M allocations: 57.985 MB, 15.64% gc time)
julia> @time @by(d, :y, N=length(:x));
0.520084 seconds (1.01 M allocations: 57.986 MB, 9.09% gc time)
julia> @time @by(d, :pda, N=length(:x));
0.077855 seconds (12.55 k allocations: 25.328 MB, 7.45% gc time)
julia> @time @by(d, (:x, :pda), N=length(:x));
1.034521 seconds (4.01 M allocations: 98.190 MB, 7.37% gc time)
UPDATE: With new fixes for the general case, the timings are now always better than DataFramesMeta, except when crossing a PDA with an array. This use case isn't the most interesting IMHO, though I could try doing something about it.
I was intrigued by the use of ht_keyindex, so I compared timings of freqtable to some new DataFrames code I'm experimenting with (see JuliaData/DataFrames.jl#894). Feel free to close this issue; I just thought you might like to see the timings. freqtable is allocating quite a bit.

Here are timings from FreqTables:
freqtable(x): 15.6 secs
freqtable(y): 16.7 secs
freqtable(pda): 1.5 secs
freqtable(x,pda): 26 secs

The equivalent timings from DataFramesMeta:
freqtable(x): 0.36 secs
freqtable(y): 8.3 secs
freqtable(pda): 0.38 secs
freqtable(x,pda): 2.4 secs

Here is an edited transcript: