Add benchmarks from R's data.table benchmarks #36

Merged: tshort merged 1 commit into master on Nov 13, 2015

Conversation

@tshort (Contributor) commented on Nov 13, 2015

These benchmarks are from my favorite R package:

https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping

We are way behind. All of the DataFrame grouping operations are at least 10X slower than data.table. On the plus side, the syntax compares favorably with either data.table or dplyr.
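
For reference, here is a minimal sketch of the kind of grouped aggregation these benchmarks time. The column names and sizes are made up, and it uses the by()/column-indexing style DataFrames had at the time:

```julia
using DataFrames

# Illustrative data: N rows grouped by a low-cardinality key.
N = 10^6
df = DataFrame(id = rand(1:100, N), v = rand(N))

# Grouped sum, in the by()-based syntax DataFrames used at the time.
# The data.table equivalent is roughly: DT[, sum(v), by = id]
@time by(df, :id, d -> sum(d[:, :v]))
```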

tshort added a commit that referenced this pull request on Nov 13, 2015: Add benchmarks from R's data.table benchmarks
tshort merged commit 00d2b4a into master on Nov 13, 2015
@alyst (Contributor) commented on Nov 13, 2015

My PR JuliaData/DataFrames.jl#850 addresses joining and grouping stability. I'll try to post results for this benchmark when I have a little more time. It would also be nice if someone reviewed that PR at some point; I'm using it on a daily basis for datasets of ~10^6 rows, because the multicolumn join in master has some problems.

As for syntax, IMO dplyr is a little better for joins: writing inner_join() is more readable than join(..., kind=:inner). Also, the requirement that (a) the joined column names must match and (b) they have to be specified explicitly is somewhat redundant and at times annoying.
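
To illustrate, a small sketch of the two call styles (table and column names are hypothetical; the dplyr line is shown only as a comment for comparison):

```julia
using DataFrames

# Hypothetical tables sharing a key column :id with the same name on both sides.
orders    = DataFrame(id = [1, 2, 3], qty = [10, 20, 30])
customers = DataFrame(id = [1, 2, 4], name = ["a", "b", "c"])

# DataFrames at the time: the join kind is a keyword argument and the key
# column has to be named explicitly (and match on both sides).
join(orders, customers, on = :id, kind = :inner)

# dplyr (R), for comparison:
#   inner_join(orders, customers)   # keys inferred from the shared column names
```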

@tshort (Contributor, Author) commented on Nov 13, 2015

@alyst, hi! I already tried your branch on these benchmarks. I was hopeful that using Arrays would make your PR faster, but it wasn't any faster; we're still quite slow. We probably have type stability issues. I think there's a lot of good stuff in your PR, though. Over the weekend I'll try to post to that issue to discuss more.
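
One generic way to chase the type-stability suspicion (a sketch, not taken from the PR or the benchmark script; names are made up):

```julia
using DataFrames

df = DataFrame(id = rand(1:100, 10^6), v = rand(10^6))

# Wrapping the operation in a function lets @code_warntype show the inferred
# types; Any/Union results in the output flag type-unstable code paths.
group_sum(df) = by(df, :id, d -> sum(d[:, :v]))

@code_warntype group_sum(df)

# Repeated @time calls show whether allocations dominate the runtime
# (the first call includes compilation).
@time group_sum(df)
@time group_sum(df)
```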

@alyst (Contributor) commented on Nov 13, 2015

@tshort Thanks! It would be nice to pinpoint the lines where these inefficiencies occur. In that PR I mentioned one case where, with the help of memory profiling, I identified the line that caused a serious regression; it was actually not even in DataFrames but in DataArrays.

I haven't run the new benchmark yet, but we have to take into account that it's synthetic, so it tests a few situations but not all of them. For example, there are cases where master would be faster than my PR, simply because you don't need to build a hash for a single-column join.

For a more comprehensive picture, it would be nice to have a benchmark based on a moderately sized DB (i.e. several tables, multicolumn keys, ~10^8 rows).
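
Something along these lines could serve as a starting point; the table layout and row counts are made up and scaled far below ~10^8 so the sketch runs quickly:

```julia
using DataFrames

# Scaled-down sketch of a multi-table, multicolumn-key benchmark.
n = 10^6
fact = DataFrame(region = rand(1:50, n),
                 item   = rand(1:1000, n),
                 sales  = rand(n))
dim  = DataFrame(region = repeat(1:50, inner = 1000),
                 item   = repeat(1:1000, outer = 50),
                 price  = rand(50_000))

# Multicolumn join followed by grouping -- the pattern such a benchmark
# would exercise.
joined = join(fact, dim, on = [:region, :item], kind = :inner)
by(joined, :region, d -> sum(d[:, :sales] .* d[:, :price]))
```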

tshort deleted the ts/data-table-timings branch on March 31, 2016
@nalimilan (Member) commented

@cjprybol These benchmarks could be interesting to have in DataTables to measure the speed of the new implementation from JuliaData/DataTables.jl#17 compared with data.table.

tshort restored the ts/data-table-timings branch on June 15, 2017
pdeffebach deleted the ts/data-table-timings branch on September 27, 2021