Skip to content

Move CategoricalValue and CategoricalPool into separate package #64

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
davidanthoff opened this issue Apr 17, 2017 · 10 comments
Closed

Comments

@davidanthoff
Copy link

I'm starting to think about queryverse/IterableTables.jl#2, and the whole design would be a lot easier if packages could take a dependency on CategoricalValue without taking a dependency on the whole CategoricalArrays package. Maybe a package called CategoricalValues.jl would work?

@nalimilan
Copy link
Member

I'm really not a fan of this idea. The two parts are really tightly coupled, it would make further development more cumbersome to keep them synchronized.

What's the problem with depending on CategoricalArrays? It's a small pure-Julia dependency. I don't think you'd notice the size difference if we split the package in two parts.

@davidanthoff
Copy link
Author

My eventual plan for IterableTables is to move the integrations of various sources and sinks into the respective integrated packages. At that point packages like https://github.com/JuliaComputing/PooledArrays.jl and https://github.com/JuliaStats/DataArrays.jl would have to take a dependency on CategoricalArrays, which seems not ideal.

In some sense this is analog to the Nullable and NullableArrays situation: it is really good that one can use Nullable without having to take a dependency on NullableArrays.

@nalimilan
Copy link
Member

The difference with Nullable is that the type is defined in Base and it's very simple. OTOH CategoricalValue relies on CategoricalPool, so at that point you may as well include CategoricalArray in the package.

I don't really understand why PooledArrays or DataArrays would have to depend on CategoricalValue. Could you develop?

@davidanthoff
Copy link
Author

at that point you may as well include CategoricalArray in the package.

And that pulls in NullableArrays, and it would just be a lot nicer and cleaner if that didn't happen.

I don't really understand why PooledArrays or DataArrays would have to depend on CategoricalValue. Could you develop?

Actually, it would be DataFrames and IndexedTables that need to take on a dependency on CategoricalValue, not the array packages, sorry.

If the sink for DataFrames gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledDataArray for that column. Equally, if the sink for IndexedTables gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledArray for that column. So both DataFrames and IndexedTables would need to know CategoricalValue, but not CategoricalArray.

@nalimilan
Copy link
Member

And that pulls in NullableArrays, and it would just be a lot nicer and cleaner if that didn't happen.

If that's the problem, then it can easily be fixed. Actually, we've discussed getting rid of this dependency in the past, as it's only used for the return type of unique(::NullableCategoricalArray). We could return an Array{Nullable} there since the number of unique values isn't supposed to be very large.

If the sink for DataFrames gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledDataArray for that column. Equally, if the sink for IndexedTables gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledArray for that column. So both DataFrames and IndexedTables would need to know CategoricalValue, but not CategoricalArray.

I think we need a more general abstraction for CategoricalArray/PooledDataArray/PooledArray. CategoricalValue isn't the best way of streaming data from a CategoricalVector to a PDA, since it carries a lot of useless information for that purpose and is a very inefficient way to extract the integer code for a given entry. I think we would better define a way for columns to come with meta-data giving a mapping from integer code to levels, and a way to iterate over the integer codes. That would not only be useful to convert between these types, but also to read data from files in formats like Stata (which associates labels to codes). We also need a generic way to get the levels in order to create model matrices in StatsModels.

@davidanthoff
Copy link
Author

since it carries a lot of useless information for that purpose and is a very inefficient way to extract the integer code for a given entry.

Are you referring to the pointer to the pool?

I'll have to mull this whole area a bit more, it is not really clear to me what we want in this area... Especially when it comes to queries, it seems there is another lifting-like issue there, i.e. what to return if one applies some function onto a categorical value etc.

@nalimilan
Copy link
Member

Are you referring to the pointer to the pool?

Yes, but more generally it will be much more efficient to work directly with integer codes than creating a CategoricalValue (whatever its contents). Unless of course the compiler is able to optimize out the unnecessary operations, which I don't think occurs yet.

I'll have to mull this whole area a bit more, it is not really clear to me what we want in this area... Especially when it comes to queries, it seems there is another lifting-like issue there, i.e. what to return if one applies some function onto a categorical value etc.

Yeah, it's not very easy to decide what to do with operations on CategoricalValue. In general it seems automatic unwrapping should happen, except where a specific method is needed. In particular, it would make sense to have CategoricalValue{String} <: AbstractString if this kind of inheritance pattern was possible. Since it's not, I'm tempted to define CategoricalValue <: AbstractString, which will make it nicer to work with the most common case of CategoricalValue{String}. Then implementing the AbstractString interface with unwrapping methods should be all we need for a useful behavior.

@davidanthoff
Copy link
Author

Yes, but more generally it will be much more efficient to work directly with integer codes than creating a CategoricalValue (whatever its contents). Unless of course the compiler is able to optimize out the unnecessary operations, which I don't think occurs yet.

Hm, I would have thought that the compiler could optimize that away, but you probably have better insight into that than me. For the whole IterableTables and Query world we won't be able to stream just Ints, it will have to be some immutable in any case.

Here is a crazy idea: could the levels be encoded as a type parameter? Something like this:

immutable CategoricalValue{LEVELS}
    index::Int
end

getvalue{LEVELS}(a::CategoricalValue{LEVELS}) = string(LEVELS.parameters[a.index].parameters[1])

a = CategoricalValue{Tuple{Val{:level1},Val{:level2}}}(2)

println(getvalue(a))

It does feel like a gross abuse of the type system and I could easily imagine that this is a really terrible idea, but maybe worth investigating a bit?

I'm tempted to define CategoricalValue <: AbstractString, which will make it nicer to work with the most common case of CategoricalValue{String}. Then implementing the AbstractString interface with unwrapping methods should be all we need for a useful behavior.

I like that a lot.

@nalimilan
Copy link
Member

I think the way forward here is to start moving to Union{CategoricalValue, Null} instead of Nullable{CategoricalValue}, which will allow us to get rid of the dependency on NullableArrays. See JuliaData/DataTables.jl#62.

It does feel like a gross abuse of the type system and I could easily imagine that this is a really terrible idea, but maybe worth investigating a bit?

This would be essentially equivalent to creating an enum for each variable and storing an array of enum values. This can be efficient for some very specific use cases, but in general I don't think this would be a good idea as recompiling functions for each set of levels would be wasteful. This has been discussed before at JuliaStats/DataArrays.jl#50 and JuliaStats/DataArrays.jl#73.

@Nosferican
Copy link
Contributor

This seems a bit moot since the small unions era.

@nalimilan nalimilan closed this as not planned Won't fix, can't repro, duplicate, stale Apr 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants