groupby() and InexactError (again) #985
Comments
Good catch, though I wonder how often this can happen in practice. Have you observed this with real data? I guess we could drop unused levels from the pool first. I would have sworn DataArrays included a function to drop unused levels and recode the values accordingly, but it doesn't seem to exist. Not very hard to write, though. Would you give it a try?
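A rough sketch of what such a drop-unused-levels helper might look like, assuming a pooled column represented as an integer refs vector indexing 1-based into a pool of levels; the function name and signature here are hypothetical, not part of DataArrays:

```julia
# Hypothetical helper (not DataArrays API): keep only the levels that are
# actually referenced and recode the refs into the compacted pool.
# Assumes refs are 1-based indices into pool, with no sentinel for missing.
function droplevels_sketch(refs::Vector{Int}, pool::Vector)
    used = sort!(unique(refs))                          # level codes that actually occur
    remap = Dict(old => new for (new, old) in enumerate(used))
    new_refs = [remap[r] for r in refs]                 # recode into 1:length(used)
    new_pool = pool[used]                               # drop the unused levels
    return new_refs, new_pool
end
```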
I only find bugs when they arise in real-world data! In my case, it was a groupby() over 5 columns, one of which had about 1500 unique values, while the other 4 each had fewer than 100 unique values. In total, there are about half a million unique groups in the data, even though the cartesian product is on the order of 15 billion.
By the way, I don't think dropping unused levels before groupsort_indexer will work. The InexactError() I get happens on line 106, which is strictly before the call to groupsort_indexer.
I seem to be running into the same issue; is there a suggested (simple) hack other than running this section of our pipeline through R?
I don't think so. Fixing it shouldn't require too much work, but it's not a trivial change either.
Isn't this going to be fixed under the CategoricalArrays branch?
Not really, as that's mostly orthogonal (have a look at the current code in master). If you want to help, looking at how Pandas handles this would make it easier for somebody to implement a solution.
Something you can try is replacing UInt32 with UInt64 in the grouping code.
I tried that before finding this issue, as it's been a long-standing problem with using Julia for our data processing. We often (ad tech) have >100K-row dataframes with 5 or more variables, one of which usually has about 5K-10K unique values, the others <100. We need to group and summarize the data, and we regularly run that part of the process through R, which does the work in a blink. The Julia workaround involves a lot of extra code and loops, and adds significant runtime to the process. Using the example at https://gist.github.com/tcovert/6df691c5308e1ddd6c5103804cc2bb05, with UInt32 we get
and with UInt64
My simple workaround for the same issue when using by():

```julia
by_(d::AbstractDataFrame, cols, f::Function) = by(d, shift!(cols), isempty(cols) ? f : (x) -> by_(x, copy(cols), f))
by_(f::Function, d::AbstractDataFrame, cols) = by_(d, cols, f)
```

Probably not as efficient as it could be, but it got the job done. As a bonus, if you have some dimensions that you know can be grouped together without exceeding the limit, you can specify a "group-down" path, i.e. group some of the columns together in a single step (see the sketch below), and have fewer intermediate calls to by().
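A minimal sketch of what such a group-down call might look like. The data frame, column names, and aggregation below are made up for illustration, and the nested-vector grouping path is an assumption about how by_ would be called, not something from the original comment; it follows the same Julia 0.6-era DataFrames API (shift!, d[:value]) as the workaround above.

```julia
using DataFrames

# Hypothetical data: one high-cardinality column plus three small ones.
df = DataFrame(big_col = rand(1:1500, 100_000),
               a = rand(1:50, 100_000),
               b = rand(1:50, 100_000),
               c = rand(1:50, 100_000),
               value = rand(100_000))

# Group by big_col on its own first, then by a, b and c together in one step,
# so no single call has to index the full cartesian product of all pools.
path = Any[:big_col, [:a, :b, :c]]
result = by_(df, path, d -> DataFrame(total = sum(d[:value])))
```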
Fixed in DataTables: JuliaData/DataTables.jl#17
The fix in DataTables should be ported over here to address this issue.
Yeah, but that would take quite some work that I'd rather put into improving the new framework. The priority for DataFrames is to have it work on Julia 0.6...
Right, I just wanted to be sure this issue wasn't closed prematurely.
Closing, as the fix has been ported over here.
I've found another way to break groupby(): pass it a DataFrame and a set of columns for which the cartesian product is greater than 32 bits of addressable space, but for which the actual set of existing groups is smaller. This issue arises in line 106 of grouping.jl:
```julia
ngroups = ngroups * (length(dv.pool) + dv_has_nas)
```
It seems that JMW (or someone else) predicted this would eventually be a problem, as seen in the comment on the following line...
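As a rough, self-contained illustration of the arithmetic (the pool sizes below are made up to roughly match the case described in the comments above, and the exact spot where the failing conversion happens in grouping.jl may differ):

```julia
# Made-up pool sizes for 5 grouping columns: one with 1500 levels,
# four with 100 levels each.
pool_sizes = [1500, 100, 100, 100, 100]

# The loop effectively accumulates ngroups *= length(dv.pool) per column,
# so the theoretical group count is the product of all pool sizes ...
ngroups = prod(pool_sizes)      # 150_000_000_000

# ... which no longer fits in 32 bits, even if only a tiny fraction of
# those combinations actually occurs in the data.
ngroups > typemax(Int32)        # true
Int32(ngroups)                  # throws InexactError
```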
Here is an MWE: https://gist.github.com/tcovert/6df691c5308e1ddd6c5103804cc2bb05