Add benchmarks from R's data.table benchmarks #36

Merged: tshort merged 1 commit into master on Nov 13, 2015

Conversation

@tshort (Contributor) commented on Nov 13, 2015

These benchmarks are from my favorite R package:

https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping

We are way behind. All of the DataFrame grouping operations are at least 10X slower than data.table. On the plus side, the syntax compares favorably with either data.table or dplyr.
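
For reference, here is a minimal sketch of the kind of grouped aggregation these benchmarks time. The column names and sizes are made up, and it uses the by()/column-indexing style DataFrames had at the time:

```julia
using DataFrames

# Illustrative data: N rows grouped by a low-cardinality key.
N = 10^6
df = DataFrame(id = rand(1:100, N), v = rand(N))

# Grouped sum, in the by()-based syntax DataFrames used at the time.
# The data.table equivalent is roughly: DT[, sum(v), by = id]
@time by(df, :id, d -> sum(d[:, :v]))
```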

tshort added a commit that referenced this pull request on Nov 13, 2015: Add benchmarks from R's data.table benchmarks
tshort merged commit 00d2b4a into master on Nov 13, 2015
@alyst (Contributor) commented on Nov 13, 2015

My PR JuliaData/DataFrames.jl#850 addresses joining and grouping stability. I'll try to post results for this benchmark when I have a little more time. It would also be nice if someone reviewed that PR at some point; I'm using it on a daily basis for datasets of ~10^6 rows, because the multicolumn join in master has some problems.

As for syntax, IMO dplyr is a little better for joins: writing inner_join() is more readable than join(..., kind=:inner). Also, the requirement that (a) the joined column names must match and (b) they have to be specified explicitly is somewhat redundant and at times annoying.
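
To illustrate, a small sketch of the two call styles (table and column names are hypothetical; the dplyr line is shown only as a comment for comparison):

```julia
using DataFrames

# Hypothetical tables sharing a key column :id with the same name on both sides.
orders    = DataFrame(id = [1, 2, 3], qty = [10, 20, 30])
customers = DataFrame(id = [1, 2, 4], name = ["a", "b", "c"])

# DataFrames at the time: the join kind is a keyword argument and the key
# column has to be named explicitly (and match on both sides).
join(orders, customers, on = :id, kind = :inner)

# dplyr (R), for comparison:
#   inner_join(orders, customers)   # keys inferred from the shared column names
```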

@tshort (Contributor, Author) commented on Nov 13, 2015

@alyst, hi! I already tried your branch on these benchmarks. I was hopeful that using Arrays would make your PR faster, but it wasn't any faster; we're still quite slow. We probably have type stability issues. I think there's a lot of good stuff in your PR, though. Over the weekend I'll try to post to that issue to discuss more.
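
One generic way to chase the type-stability suspicion (a sketch, not taken from the PR or the benchmark script; names are made up):

```julia
using DataFrames

df = DataFrame(id = rand(1:100, 10^6), v = rand(10^6))

# Wrapping the operation in a function lets @code_warntype show the inferred
# types; Any/Union results in the output flag type-unstable code paths.
group_sum(df) = by(df, :id, d -> sum(d[:, :v]))

@code_warntype group_sum(df)

# Repeated @time calls show whether allocations dominate the runtime
# (the first call includes compilation).
@time group_sum(df)
@time group_sum(df)
```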

@alyst (Contributor) commented on Nov 13, 2015

@tshort Thanks! It would be nice to pinpoint the lines where these inefficiencies occur. In that PR I mentioned one case where, with the help of memory profiling, I identified the line that caused a serious regression; it was actually not even in DataFrames but in DataArrays.

I haven't run the new benchmark yet, but we have to take into account that it's synthetic, so it tests a few situations but not all of them. For example, there are cases where master would be faster than my PR, simply because you don't need to build a hash for a single-column join.

For a more comprehensive picture, it would be nice to have a benchmark based on a moderately sized DB (i.e. several tables, multicolumn keys, ~10^8 rows).
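
Something along these lines could serve as a starting point; the table layout and row counts are made up and scaled far below ~10^8 so the sketch runs quickly:

```julia
using DataFrames

# Scaled-down sketch of a multi-table, multicolumn-key benchmark.
n = 10^6
fact = DataFrame(region = rand(1:50, n),
                 item   = rand(1:1000, n),
                 sales  = rand(n))
dim  = DataFrame(region = repeat(1:50, inner = 1000),
                 item   = repeat(1:1000, outer = 50),
                 price  = rand(50_000))

# Multicolumn join followed by grouping -- the pattern such a benchmark
# would exercise.
joined = join(fact, dim, on = [:region, :item], kind = :inner)
by(joined, :region, d -> sum(d[:, :sales] .* d[:, :price]))
```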

tshort deleted the ts/data-table-timings branch on March 31, 2016
@nalimilan (Member) commented

@cjprybol These benchmarks could be interesting to have in DataTables to measure the speed of the new implementation from JuliaData/DataTables.jl#17 compared with data.table.

tshort restored the ts/data-table-timings branch on June 15, 2017
pdeffebach deleted the ts/data-table-timings branch on September 27, 2021