Skip to content
This repository was archived by the owner on May 5, 2019. It is now read-only.

WIP: Consolidate hcat, append, merge, vcat and split functional overlaps #70

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions src/DataTables.jl
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ export @~,
SubDataTable,

aggregate,
append,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I can see the point of implementing generic functions which exist in Base, I don't like the idea of creating append, with such a general name and so similar to append!. Especially since merge does the same thing.

by,
categorical!,
colwise,
Expand All @@ -52,6 +53,7 @@ export @~,
eachrow,
eltypes,
groupby,
merge,
melt,
meltdt,
names!,
Expand Down
30 changes: 19 additions & 11 deletions src/abstractdatatable/abstractdatatable.jl
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@ The following are normally implemented for AbstractDataTables:

* [`describe`](@ref) : summarize columns
* [`dump`](@ref) : show structure
* `hcat` : horizontal concatenation
* `merge` : horizontal concatenation
* `merge!` : horizontal concatenation, modifies first argument in-place
* `vcat` : vertical concatenation
* `names` : columns names
* [`names!`](@ref) : set columns names
Expand Down Expand Up @@ -649,22 +650,29 @@ without(dt::AbstractDataTable, c::Any) = without(dt, index(dt)[c])

##############################################################################
##
## Hcat / vcat
## merge/merge!/append/append!/vcat
##
##############################################################################

# hcat's first argument must be an AbstractDataTable
# Trailing arguments (currently) may also be NullableVectors, Vectors, or scalars.

# hcat! is defined in datatables/datatables.jl
# Its first argument (currently) must be a DataTable.
function Base.merge!(dt::AbstractDataTable, others::AbstractDataTable...)
for other in others
for (i, c) in enumerate(add_names(names(dt), names(other)))
dt[c] = other[i]
end
end
return dt
end

# catch-all to cover cases where indexing returns a DataTable and copy doesn't
Base.hcat(dt::AbstractDataTable, x) = hcat!(dt[:, :], x)
Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable) = hcat!(dt1[:, :], dt2)
Base.merge(dt::AbstractDataTable, dtn::AbstractDataTable...) = merge!(dt[:, :], dtn...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deepcopy would be more natural, even if the result is the same?


function Base.append!(dt1::AbstractDataTable, x::AbstractVector)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a one-liner here and below.

merge!(dt1, DataTable(Any[x]))
end

Base.hcat(dt::AbstractDataTable, x, y...) = hcat!(hcat(dt, x), y...)
Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable, dtn::AbstractDataTable...) = hcat!(hcat(dt1, dt2), dtn...)
function append(dt1::AbstractDataTable, x::AbstractVector)
merge(dt1, DataTable(Any[x]))
end

@generated function promote_col_type(cols::AbstractVector...)
elty = Base.promote_eltype(cols...)
Expand Down
2 changes: 1 addition & 1 deletion src/abstractdatatable/reshape.jl
Original file line number Diff line number Diff line change
Expand Up @@ -248,7 +248,7 @@ function unstack(dt::AbstractDataTable, colkey::Int, value::Int)
dt2[j][i] = valuecol[k]
end
end
hcat(dt1, dt2)
merge!(dt1, dt2)
end

unstack(dt::AbstractDataTable) = unstack(dt, :id, :variable, :value)
Expand Down
49 changes: 2 additions & 47 deletions src/datatable/datatable.jl
Original file line number Diff line number Diff line change
Expand Up @@ -63,8 +63,8 @@ dt1[:, [1,3]]
dt1[1:4, :]
dt1[1:4, :C]
dt1[1:4, :C] = 40. * dt1[1:4, :C]
[dt1; dt2] # vcat
[dt1 dt2] # hcat
vcat(dt1, dt2)
merge(dt1, dt2)
size(dt1)
```

Expand Down Expand Up @@ -635,15 +635,6 @@ function Base.insert!(dt::DataTable, col_ind::Int, item, name::Symbol)
insert!(dt, col_ind, upgrade_scalar(dt, item), name)
end

function Base.merge!(dt::DataTable, others::AbstractDataTable...)
for other in others
for n in _names(other)
dt[n] = other[n]
end
end
return dt
end

##############################################################################
##
## Copying
Expand Down Expand Up @@ -717,31 +708,6 @@ function deleterows!(dt::DataTable, ind::AbstractVector{Int})
dt
end

##############################################################################
##
## Hcat specialization
##
##############################################################################

# hcat! for 2 arguments
function hcat!(dt1::DataTable, dt2::AbstractDataTable)
u = add_names(index(dt1), index(dt2))
for i in 1:length(u)
dt1[u[i]] = dt2[i]
end
return dt1
end
hcat!(dt::DataTable, x::AbstractVector) = hcat!(dt, DataTable(Any[x]))

# hcat! for 1-n arguments
hcat!(dt::DataTable) = dt
hcat!(a::DataTable, b, c...) = hcat!(hcat!(a, b), c...)

# hcat
Base.hcat(dt::DataTable, x) = hcat!(copy(dt), x)
Base.hcat(dt1::DataTable, dt2::AbstractDataTable) = hcat!(copy(dt1), dt2)
Base.hcat(dt1::DataTable, dt2::AbstractDataTable, dtn::AbstractDataTable...) = hcat!(hcat(dt1, dt2), dtn...)

##############################################################################
##
## Nullability
Expand Down Expand Up @@ -787,17 +753,6 @@ function categorical!(dt::DataTable, compact::Bool=true)
dt
end

function Base.append!(dt1::DataTable, dt2::AbstractDataTable)
_names(dt1) == _names(dt2) || error("Column names do not match")
eltypes(dt1) == eltypes(dt2) || error("Column eltypes do not match")
ncols = size(dt1, 2)
# TODO: This needs to be a sort of transaction to be 100% safe
for j in 1:ncols
append!(dt1[j], dt2[j])
end
return dt1
end

function Base.convert(::Type{DataTable}, A::AbstractMatrix)
n = size(A, 2)
cols = Vector{Any}(n)
Expand Down
6 changes: 5 additions & 1 deletion src/groupeddatatable/grouping.jl
Original file line number Diff line number Diff line change
Expand Up @@ -201,7 +201,11 @@ function combine(ga::GroupApplied)
idx[j + (1:n)] = gd.idx[start]
j += n
end
hcat!(gd.parent[idx, gd.cols], valscat)
if isa(valscat, DataTable)
Copy link
Contributor Author

@cjprybol cjprybol May 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I broke this during my changes it helped me understand how combine and my changes to aggregate in #65 differ. This hcat! call was handling both DataTables and Vectors

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand. valscat = vcat(ga.vals) and ga.vals is necessarily a Vector{<:AbstractDataTable}. So valscat is always an AbstractDataTable. Why do you need to call append! when it's not a DataTable?

return merge!(gd.parent[idx, gd.cols], valscat)
else
return append!(gd.parent[idx, gd.cols], valscat)
end
end


Expand Down
6 changes: 4 additions & 2 deletions src/other/index.jl
Original file line number Diff line number Diff line change
Expand Up @@ -123,9 +123,11 @@ Base.getindex(x::AbstractIndex, idx::AbstractVector{Symbol}) = [x.lookup[i] for
# Helpers

function add_names(ind::Index, add_ind::Index)
u = names(add_ind)
add_names(_names(ind), names(add_ind))
end

seen = Set(_names(ind))
function add_names(a::Vector{Symbol}, u::Vector{Symbol})
seen = Set(a)
dups = Int[]

for i in 1:length(u)
Expand Down
32 changes: 18 additions & 14 deletions test/cat.jl
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ module TestCat
using DataTables

#
# hcat
# merge
#

nvint = NullableArray(Nullable{Int}[1, 2, Nullable(), 4])
Expand All @@ -14,35 +14,39 @@ module TestCat
dt4 = convert(DataTable, [1:4 1:4])
dt5 = DataTable(Any[NullableArray([1,2,3,4]), nvstr])

dth = hcat(dt3, dt4)
dth = merge(dt3, dt4)
@test size(dth, 2) == 3
@test names(dth) == [:x1, :x1_1, :x2]
@test isequal(dth[:x1], dt3[:x1])
@test isequal(dth, [dt3 dt4])
@test isequal(dth, DataTables.hcat!(DataTable(), dt3, dt4))
@test isequal(dth, merge(dt3, dt4))
@test isequal(dth, merge!(DataTable(), dt3, dt4))

dth3 = hcat(dt3, dt4, dt5)
dth3 = merge(dt3, dt4, dt5)
@test names(dth3) == [:x1, :x1_1, :x2, :x1_2, :x2_1]
@test isequal(dth3, hcat(dth, dt5))
@test isequal(dth3, DataTables.hcat!(DataTable(), dt3, dt4, dt5))
@test isequal(dth3, merge(dth, dt5))
@test isequal(dth3, merge!(DataTable(), dt3, dt4, dt5))

@test isequal(dt2, DataTables.hcat!(dt2))
@test isequal(dt2, merge!(dt2))

@testset "hcat ::AbstractDataTable" begin
@testset "merge ::AbstractDataTable" begin
dt = DataTable(A = repeat('A':'C', inner=4), B = 1:12)
gd = groupby(dt, :A)
answer = DataTable(A = fill('A', 4), B = 1:4, A_1 = 'B', B_1 = 5:8, A_2 = 'C', B_2 = 9:12)
@test hcat(gd...) == answer
@test merge(gd...) == answer
answer = answer[1:4]
@test hcat(gd[1], gd[2]) == answer
@test merge(gd[1], gd[2]) == answer
end

@testset "hcat ::Vectors" begin
@testset "append ::Vectors" begin
dt = DataTable()
DataTables.hcat!(dt, NullableCategoricalVector(1:10))
append!(dt, NullableCategoricalVector(1:10))
@test isequal(dt[1], NullableCategoricalVector(1:10))
DataTables.hcat!(dt, NullableArray(1:10))
append!(dt, NullableArray(1:10))
@test isequal(dt[2], NullableArray(1:10))
dt2 = append(dt, collect(1:10))
@test isequal(dt2[3], collect(1:10))
@test ncol(dt) == 2
@test ncol(dt2) == 3
end

#
Expand Down
4 changes: 2 additions & 2 deletions test/datatable.jl
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ module TestDataTable
dt = DataTable(a=[1, 2], b=[3., 4.])
dt2 = DataTable(b=["a", "b"], c=[:c, :d])
@test isequal(merge!(dt, dt2), dt)
@test isequal(dt, DataTable(a=[1, 2], b=["a", "b"], c=[:c, :d]))
@test isequal(dt, DataTable(a=[1, 2], b=[3., 4.], b_1=["a", "b"], c=[:c, :d]))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was another point of confusion. Depending on which method you used to horizontally concatenate, you could either have columns overwrite one another or have column names updated so duplicates are unique. Is everyone ok with defaulting to not overwriting column names?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generic merge(!) methods use the value from the last argument in case of conflict, so I think we should respect this. It could make sense to also support passing a combine function.

Maybe that's an argument to keep hcat, with a different behavior. For that function, the situation is very similar to that faced by NamedArray or AxisArray: duplicate column names are not allowed, so either an error must be thrown, or names have to be made unique. I would suggest throwing an error by default, with a boolean keyword argument to make names unique when needed; but other solutions can be considered (we could discuss that with the authors of the two named array packages).


#test_group("Empty DataTable constructors")
dt = DataTable(Nullable{Int}, 10, 3)
Expand Down Expand Up @@ -322,7 +322,7 @@ module TestDataTable
dt = DataTable(A = 1:10, B = 'A':'J')
@test !(dt[:,:] === dt)

@test append!(DataTable(A = 1:2, B = 1:2), DataTable(A = 3:4, B = 3:4)) == DataTable(A=1:4, B = 1:4)
@test vcat(DataTable(A = 1:2, B = 1:2), DataTable(A = 3:4, B = 3:4)) == DataTable(A=1:4, B = 1:4)
dt = DataTable(A = NullableArray(1:3), B = NullableArray(4:6))
@test all(c -> isa(c, NullableArray), categorical!(deepcopy(dt)).columns)
@test all(c -> isa(c, NullableCategoricalArray), categorical!(deepcopy(dt), [1,2]).columns)
Expand Down