-
Notifications
You must be signed in to change notification settings - Fork 43
Add CategoricalVector
and collapse
#88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Nightly failure is a Julia bug ( |
Julia bug is fixed now anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is excellent work. I have only one substantial change to suggest, but it's nontrivial. Don't hesitate to ask for help if my comments were confusing.
I should also add that I've not really used the categorical capabilities of AxisArrays, so it would be great if a second set of eyeballs could take a peek here.
But this is really nice, thanks (preliminarily) for the contribution!
test/categoricalvector.jl
Outdated
v = collect(zip([:a, :b, :c][rand(1:3,20)], [:x,:y][rand(1:2,20)], [:x,:y][rand(1:2,20)])) | ||
idx = sortperm(v) | ||
A = AxisArray(data[idx,:], CategoricalVector(v[idx]), [:a, :b]) | ||
@test A[:b, :] == A[5:12, :] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume these reflect the actual random numbers produced due to srand(1234)
? Would it be clearer to just hardcode v
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would; I was just following test/sortedvector.jl as much as possible. I can change it though :)
REQUIRE
Outdated
@@ -1,4 +1,5 @@ | |||
julia 0.5 | |||
IntervalSets | |||
Iterators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use IterTools.jl
instead, which was created to avoid the name conflicts that arose from having both Base.Iterators
and the Iterators
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, I'm actually in the middle of making PRs to transition everyone from Iterators to IterTools! Only missed this because it's still in my 0.5 directory.
src/combine.jl
Outdated
return _flatten(gca, As...; kwargs...) | ||
end | ||
|
||
function flatten(last_dim::Integer, As::AxisArray...; kwargs...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only sad thing here is that the result of flatten
won't be inferrable. Perhaps you don't care, but if there could be an inferrable API, it might be nicer. If you're just comparing AxisArrays
, you should be able to split the axes into "common" and "distinct" subsets. For example,
using AxisArrays
# split out the first axis of each axisarray into a separate tuple
@inline selectcommon(As::AxisArray...) = selectcommon(axes.(As)...)
@inline selectcommon(axs::Tuple...) = _selectcommon(first.(axs), Base.tail.(axs))
# If the next axis has common name, keep it and then keep going
@inline _selectcommon(axes1::NTuple{N, Axis{name}}, trailingaxes::NTuple{N, Any}) where {N,name} = (axes1, _selectcommon(first.(trailingaxes), Base.tail.(trailingaxes))...)
# If they don't have the same Axis name, we drop the rest
@inline _selectcommon(axes1, trailingaxes) = ()
A = AxisArray(rand(3,3,3), :x, :y, :z)
B = AxisArray(rand(3,3,3), :x, :y, :t)
cmn = selectcommon(A, B)
should generate a tuple-of-tuples, each element that containing one common axis from each input AxisArray (in this case, it generates (xtuple, ytuple)
where xtuple
contains the two :x
axes and ytuple
contains the two :y
axes). You can get the (non-common) trailing axes with a call to Base.IteratorsMD.split
, just using the N
in the NTuple
for the length of cmn
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to tweak this to add in the ability to limit the length of the common axes, maybe with Val{max_dim}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I fully understand the concern, can you point me to a use-case example?
If you really can't live without the arithmetic operations, then one can always define a MyVal
struct here. There's nothing remotely special about Val
per se, it's literally trivial. And there isn't any special support for it in inference.jl
either:
tim@diva:~/src/julia-1.0/base$ grep -F "Val{" inference.jl
tim@diva:~/src/julia-1.0/base$
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@generated decrement(v::Type{Val{N}}) where N = :(Val{$(N-1)})
# split out the first axis of each axisarray into a separate tuple
@inline selectcommon(v::Type{Val{D}}, As::AxisArray...) where D = selectcommon(v, axes.(As)...)
@inline selectcommon(v::Type{Val{D}}, axs::Tuple...) where D = _selectcommon(v, first.(axs), Base.tail.(axs))
# If the next axis has common name, keep it and then keep going
@inline _selectcommon(v::Type{Val{D}}, axes1::NTuple{N, Axis{name}}, trailingaxes::NTuple{N, Any}) where {D,N,name} = (axes1, _selectcommon(decrement(v), first.(trailingaxes), Base.tail.(trailingaxes))...)
# If we have limited the axes to a certain number, stop
@inline _selectcommon(::Type{Val{0}}, axes1::NTuple{N, Axis{name}}, trailingaxes::NTuple{N, Any}) where {N,name} = ()
# If they don't have the same Axis name, we drop the rest
@inline _selectcommon(::Type{Val{D}}, axes1, trailingaxes) where D = ()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like reasonable code, I just don't understand what you're trying to achieve. Under what circumstance would you flatten with fewer common axes than are available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, I'll try to explain our use case.
In an internal package we have an appendable feature set type which is index by some set of common indices. At any point we may want the "flattened" version. At any point we may push a new array to the set. We want the result of flattening a particular feature set to have the same axes always. We wouldn't want it to return different numbers of dimensions when there are only N arrays (where they share an extra axis) as when there are N+1 arrays (where array N+1 does not share the extra axis).
Does that make sense? Do you have alternate suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that I really have two separate problems:
- Flatten all axes after the common axes
- Flatten all axes after axis N, ensuring axes 1:N are common
Trying to solve both in one function is creating strange mutant hybrid functions and causing increasing hacks. I think splitting these apart will be easier.
test/combine.jl
Outdated
A1 = AxisArray(A1data, Axis{:X}(1:2), Axis{:Y}(1:2)) | ||
A2 = AxisArray(reshape(A2data, size(A2data)..., 1), Axis{:X}(1:2), Axis{:Y}(1:2), Axis{:Z}([:foo])) | ||
|
||
@test flatten(A1, A2; array_names=[:A1, :A2]) == AxisArray(cat(3, A1data, A2data), Axis{:X}(1:2), Axis{:Y}(1:2), Axis{:page}(CategoricalVector([(:A1,), (:A2, :foo)]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do make an inferrable version of flatten
, it would be great to add @inferred
around those calls.
Thanks Tim! I'll work on getting the inferrable |
Nightly failures are #92 |
I updated this to only include the fixed-dimension flatten. Hopefully this can get merged and I can do the select-flatten later. This PR requires 0.6. I think it would be reasonable to have a separate PR to remove support for 0.5. What do you think? |
Sorry I didn't notice this earlier.
Either way would be fine, as long as it's a separate commit. |
Seems fine to me. I'll let you decide how to handle the 0.6 issue and then I think this can be merged. |
Can we merge #98 and this? |
src/categoricalvector.jl
Outdated
@@ -0,0 +1,85 @@ | |||
|
|||
export CategoricalVector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use a different name for this, as CategoricalVector
is already used in CategoricalArrays, which replaced PooledDataArray
in DataTables (and soon in DataFrames). Why not CategoricalAxis
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not an Axis
, and it mirrors the SortedVector
type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, IIUC its only purpose is to treat it as a categorical axis, isn't it? Ideas about other possible names? It would be too bad to have conflicts when loading both AxisArrays and DataTables/DataFrames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could choose not to export it?
There aren't conflicts when you use both packages unless you also use CategoricalVector
.
The nomenclature used within AxisArrays is Categorical
, which is how CategoricalVector
came up.
How about DiscreteVector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very familiar with the AxisArray code, but flatten
sounds a bit unexpected for this function. Base.Iterators.flatten
does something very different. Also, this seems to mix two operations which are not necessarily related: concatenation and flattening. Shouldn't the latter be implemented as cat
? Then flattening could be enabled as an option to cat
.
src/combine.jl
Outdated
flatten(::Type{Val{N}}, ::Type{NewArrayType}, As::AxisArray...) -> AxisArray | ||
flatten(::Type{Val{N}}, ::Type{NewArrayType}, labels::Tuple, As::AxisArray...) -> AxisArray | ||
|
||
Concatenates AxisArrays with N equal leading axes into a single AxisArray. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backquotes around type names (here and elsewhere).
src/combine.jl
Outdated
* `As::AxisArray...`: AxisArrays to be flattened together. | ||
""" | ||
@generated function flatten{N, AN, LType, NewArrayType<:AbstractArray}( | ||
::Type{Val{N}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the conventional indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are only two existing cases of too-long method signatures and making this one look like those would mean pushing these arguments themselves too far to the right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd do it anyway for consistency, and knowing that with the new where
syntax the leading type parameters (which are the actual problem here) will be moved to the end of the line or to another line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use the where
syntax already (updated now)
src/combine.jl
Outdated
end | ||
|
||
""" | ||
flatten(::Type{Val{N}}, As::AxisArray...) -> AxisArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add examples? I find it hard to understand what this function does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example which has been added:
julia> price_data = AxisArray(rand(10), Axis{:time}(Date(2016,01,01):Day(1):Date(2016,01,10)))
1-dimensional AxisArray{Float64,1,...} with axes:
:time, 2016-01-01:1 day:2016-01-10
And data, a 10-element Array{Float64,1}:
0.885014
0.418562
0.609344
0.72221
0.43656
0.840304
0.455337
0.65954
0.393801
0.260207
julia> size_data = AxisArray(rand(10,2), Axis{:time}(Date(2016,01,01):Day(1):Date(2016,01,10)), Axis{:measure}([:area, :volume]))
2-dimensional AxisArray{Float64,2,...} with axes:
:time, 2016-01-01:1 day:2016-01-10
:measure, Symbol[:area, :volume]
And data, a 10×2 Array{Float64,2}:
0.159434 0.456992
0.344521 0.374623
0.522077 0.313256
0.994697 0.320953
0.95104 0.900526
0.921854 0.729311
0.000922581 0.148822
0.449128 0.761714
0.650277 0.135061
0.688773 0.513845
julia> flattened = flatten(Val{1}, (:price, :size), price_data, size_data)
2-dimensional AxisArray{Float64,2,...} with axes:
:time, 2016-01-01:1 day:2016-01-10
:flat, Tuple{Symbol,Vararg{Symbol,N} where N}[(:price,), (:size, :area), (:size, :volume)]
And data, a 10×3 Array{Float64,2}:
0.885014 0.159434 0.456992
0.418562 0.344521 0.374623
0.609344 0.522077 0.313256
0.72221 0.994697 0.320953
0.43656 0.95104 0.900526
0.840304 0.921854 0.729311
0.455337 0.000922581 0.148822
0.65954 0.449128 0.761714
0.393801 0.650277 0.135061
0.260207 0.688773 0.513845
julia> flattened[Axis{:flat}(:size)] == size_data
true
The difference from Also, something I want to do but have not yet done is allow merging of the leading axes, similar to |
src/categoricalvector.jl
Outdated
export CategoricalVector | ||
|
||
""" | ||
A CategoricalVector is an AbstractVector which is treated as a categorical axis regardless |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing backticks around types in this docstring.
CategoricalVector
and flatten
CategoricalVector
and collapse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. A couple of tiny changes, and then someone should hit merge.
src/combine.jl
Outdated
flat_axis_eltype = _flat_axis_eltype(LType, trailing_axes) | ||
flat_axis_type = CategoricalVector{flat_axis_eltype, Vector{flat_axis_eltype}} | ||
|
||
new_axes_type = Tuple{new_common_axes..., Axis{:flat, flat_axis_type}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:collapsed
?
src/combine.jl
Outdated
end | ||
|
||
@generated function collapse{N, AN, LType}(::Type{Val{N}}, labels::NTuple{AN, LType}, As::Vararg{AxisArray, AN}) | ||
flat_dim = Val{N + 1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not used?
src/combine.jl
Outdated
collapse(::Type{Val{N}}, ::Type{NewArrayType}, labels::Tuple, As::AxisArray...) -> AxisArray | ||
|
||
Collapses `AxisArray`s with `N` equal leading axes into a single `AxisArray`. | ||
All additional axes in any of the arrays are flattened into a single additional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collapsed
src/combine.jl
Outdated
|
||
Collapses `AxisArray`s with `N` equal leading axes into a single `AxisArray`. | ||
All additional axes in any of the arrays are flattened into a single additional | ||
`CategoricalVector{Tuple}` axis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment on the label of the new axis?
src/combine.jl
Outdated
arrays. The remaining axes are flattened. All `N` axes must be common | ||
to each input array, at the same dimension. Values from `0` up to the | ||
minimum number of dimensions across all input arrays are allowed. | ||
* `labels::Tuple`: (optional) a label for each `AxisArray` in `As` which is used in the flat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely clear. Maybe "Index assigned to each array in As
in the :collapsed axis"? Since it's optional, also specify what happens by default.
src/combine.jl
Outdated
julia> collapsed = collapse(Val{1}, (:price, :size), price_data, size_data) | ||
2-dimensional AxisArray{Float64,2,...} with axes: | ||
:time, 2016-01-01:1 day:2016-01-10 | ||
:flat, Tuple{Symbol,Vararg{Symbol,N} where N}[(:price,), (:size, :area), (:size, :volume)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to be fixed if you change the name to :collapsed
Thanks for your patience, and of course for the contribution! |
Thank you for the attention!! :D |
CategoricalVector
allows any type of vector to be a Categorical axis.flatten
collapse
concatenates AxisArrays with equal leading axes into a single AxisArray. All additional axes in any of the arrays are flattened into a single additionalCategoricalVector{Tuple}
axisI also changed
axistrait
to operate on types, with a fallback that callstypeof
, in order to use it in@generated
functions.