groupby() and InexactError (again) #985
Comments
Good catch, though I wonder how often this can happen in practice. Have you observed this with real data? I guess we could drop unused levels from the pool first. I would have sworn DataArrays included a function to drop unused levels and recode the values accordingly, but it doesn't seem to exist. Not very hard to write, though. Would you give it a try?
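A rough sketch of what such a drop-unused-levels helper might look like, assuming a pooled column represented as an integer refs vector indexing 1-based into a pool of levels; the function name and signature here are hypothetical, not part of DataArrays:

```julia
# Hypothetical helper (not DataArrays API): keep only the levels that are
# actually referenced and recode the refs into the compacted pool.
# Assumes refs are 1-based indices into pool, with no sentinel for missing.
function droplevels_sketch(refs::Vector{Int}, pool::Vector)
    used = sort!(unique(refs))                          # level codes that actually occur
    remap = Dict(old => new for (new, old) in enumerate(used))
    new_refs = [remap[r] for r in refs]                 # recode into 1:length(used)
    new_pool = pool[used]                               # drop the unused levels
    return new_refs, new_pool
end
```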
I only find bugs when they arise in real-world data! In my case, it was a groupby() over 5 columns, one of which had about 1500 unique values, while the other 4 each had fewer than 100 unique values. In total, there are about half a million unique groups in the data, even though the cartesian product is on the order of 15 billion.
By the way, I don't think dropping unused levels before groupsort_indexer will work. The InexactError() I get happens on line 106, which is strictly before the call to groupsort_indexer.
I seem to be running into the same issue; is there a suggested (simple) hack other than running this section of our pipeline through R?
I don't think so. Fixing it shouldn't require too much work, but it's not a trivial change either.
Isn't this going to be fixed under the CategoricalArrays branch?
Not really, as that's mostly orthogonal (have a look at the current code in master). If you want to help, looking at how Pandas handles this would make it easier for somebody to implement a solution.
Something you can try is replacing UInt32 with UInt64 in the grouping code.
I tried that before finding this issue, as it's been a long-standing problem with using Julia for our data processing. We often (ad tech) have >100K-row dataframes with 5 or more variables, one of which usually has about 5K-10K unique values, the others <100. We need to group and summarize the data, and we regularly run that part of the process through R, which does the work in a blink. The Julia workaround involves a lot of extra code and loops, and adds significant runtime to the process. Using the example at https://gist.github.com/tcovert/6df691c5308e1ddd6c5103804cc2bb05, with UInt32 we get
and with UInt64
My simple workaround for the same issue when using by():

```julia
by_(d::AbstractDataFrame, cols, f::Function) = by(d, shift!(cols), isempty(cols) ? f : (x) -> by_(x, copy(cols), f))
by_(f::Function, d::AbstractDataFrame, cols) = by_(d, cols, f)
```

Probably not as efficient as it could be, but it got the job done. As a bonus, if you have some dimensions that you know can be grouped together without exceeding the limit, you can specify a "group-down" path, i.e. group some of the columns together in a single step (see the sketch below), and have fewer intermediate calls to by().
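A minimal sketch of what such a group-down call might look like. The data frame, column names, and aggregation below are made up for illustration, and the nested-vector grouping path is an assumption about how by_ would be called, not something from the original comment; it follows the same Julia 0.6-era DataFrames API (shift!, d[:value]) as the workaround above.

```julia
using DataFrames

# Hypothetical data: one high-cardinality column plus three small ones.
df = DataFrame(big_col = rand(1:1500, 100_000),
               a = rand(1:50, 100_000),
               b = rand(1:50, 100_000),
               c = rand(1:50, 100_000),
               value = rand(100_000))

# Group by big_col on its own first, then by a, b and c together in one step,
# so no single call has to index the full cartesian product of all pools.
path = Any[:big_col, [:a, :b, :c]]
result = by_(df, path, d -> DataFrame(total = sum(d[:value])))
```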
Fixed in DataTables: JuliaData/DataTables.jl#17
The fix in DataTables should be ported over here to address this issue.
Yeah, but that would take quite some work that I'd rather put into improving the new framework. The priority for DataFrames is to have it work on Julia 0.6...
Right, I just wanted to be sure this issue wasn't closed prematurely.
Closing, as the fix has been ported over here.
I've found another way to break groupby(): pass it a DataFrame and a set of columns for which the cartesian product is greater than 32 bits of addressable space, but for which the actual set of existing groups is smaller. This issue arises in line 106 of grouping.jl:
```julia
ngroups = ngroups * (length(dv.pool) + dv_has_nas)
```
It seems that JMW (or someone else) predicted this would eventually be a problem, as seen in the comment on the following line...
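As a rough, self-contained illustration of the arithmetic (the pool sizes below are made up to roughly match the case described in the comments above, and the exact spot where the failing conversion happens in grouping.jl may differ):

```julia
# Made-up pool sizes for 5 grouping columns: one with 1500 levels,
# four with 100 levels each.
pool_sizes = [1500, 100, 100, 100, 100]

# The loop effectively accumulates ngroups *= length(dv.pool) per column,
# so the theoretical group count is the product of all pool sizes ...
ngroups = prod(pool_sizes)      # 150_000_000_000

# ... which no longer fits in 32 bits, even if only a tiny fraction of
# those combinations actually occurs in the data.
ngroups > typemax(Int32)        # true
Int32(ngroups)                  # throws InexactError
```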
Here is an MWE: https://gist.github.com/tcovert/6df691c5308e1ddd6c5103804cc2bb05