Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@where does not preserve CategoricalArray #157

Open
alyst opened this issue Oct 11, 2017 · 7 comments
Open

@where does not preserve CategoricalArray #157

alyst opened this issue Oct 11, 2017 · 7 comments
Milestone

Comments

@alyst
Copy link
Contributor

alyst commented Oct 11, 2017

Julia v0.6, DataFrames, CategoricalArray and Query masters

julia> using DataFrames, Query

julia> df = DataFrame(a = categorical(["a", "b", "a", "c"]))
4×1 DataFrames.DataFrame
│ Row │ a │
├─────┼───┤
│ 1   │ a │
│ 2   │ b │
│ 3   │ a │
│ 4   │ c │

julia> typeof(df[:a])
CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}}

julia> dff = df |> @where(_.a != "b") |> DataFrame
3×1 DataFrames.DataFrame
│ Row │ a │
├─────┼───┤
│ 1   │ a │
│ 2   │ a │
│ 3   │ c │

julia> typeof(dff[:a])
Array{CategoricalArrays.CategoricalValue{String,UInt32},1}

IIUC Array{CategoricalValue} is suboptimal in terms of storage. Also CategoricalArrays API like levels(dff[:a]) is not available.

@nalimilan
Copy link

Indeed Array{CategoricalValue} isn't supposed to be used. It's kind of annoying that we don't have a way to specify that a given element type should always be stored in a given container type.

@alyst
Copy link
Contributor Author

alyst commented Oct 12, 2017

I have some minimal solution to this, by it still involves moving+extending Enumerable abstract type definitions from QueryOperators.jl to TableTraits.jl and using them from IterableTables.jl. So it's pretty invasive.
I can open PRs, but I suppose the core JuliaData contributors already have some strategy how to fix this and the other design problems (e.g. IterableTables.jl being dependent on every data-related package).

@davidanthoff
Copy link
Member

The proper way to implement this would be to make sure that at collection into a DataFrame any column of type CategoricalValue uses the proper container type. Essentially just another elseif case here. Having said that, I think anything like that would require a bit more thinking about larger implications. I was a couple of times close to just adding this, but in the end I'm just not yet sure whether I want this in the iterable tables interface or not (main tradeoff is complexity, I'm trying to keep iterable tables as simple as possible, and I'm not sure this would still fall into that category).

@davidanthoff
Copy link
Member

It's kind of annoying that we don't have a way to specify that a given element type should always be stored in a given container type.

YES, I completely agree. I think it would be absolutely fantastic if there was a standard way in base to say "for element type T, the default array type should be X", and if things like [a,b,c] (where a, b and c are instances of T) constructed an X, and and fill etc. all used that. I think a simple function like getdefaultarraytype(::Type{T}) where T = Array{T} in base would probably do it, where custom Ts could add a method and the various base array creation methods would rely on it.

I suggested something like that in JuliaLang/julia#22630, but probably doomed all chances of success of that proposal by embedding it in a crazy rename idea ;) I think a much less ambitious proposal along the lines of the above paragraph would still be a great idea.

@nalimilan
Copy link

Maybe we should just override similar(::Array, ::Type{<:CategoricalValue})?

@davidanthoff
Copy link
Member

Where would that be picked up?

I guess I was so far mostly worrying about [a,b,c] when a, b and c are of type CategoricalValue (or in my case DataValue) and how one can intercept that. I think we can already intercept CategoricalValue[a,b,c], and this line seems key to intercepting all cases.

I kind of feel that if someone is asking for an Array specifically, he/she should get that specific type, right? Not sure...

@nalimilan
Copy link

I kind of feel that if someone is asking for an Array specifically, he/she should get that specific type, right? Not sure...

In this case, the caller wouldn't require an Array, just an object "similar" to the input Array, so I'd say it wouldn't be absurd to return a CategoricalArray if the element type is a categorical value.

@davidanthoff davidanthoff added this to the Backlog milestone Aug 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants