-
Notifications
You must be signed in to change notification settings - Fork 17
Switch to DataTables #20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d79d1c7
to
61a9ea7
Compare
JuliaData/DataFrames.jl#1008 is not tagged yet, what is the best way to use the current master of |
I don't think you can... not sure though. I bet @tkelman would know. |
src/convert.jl
Outdated
_zero_nas{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v] | ||
|
||
# convert R factor into NullableCategoricalArray{String} | ||
# TODO option to convert into Symbol etc? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think converting to symbols would be useful. The strings are stored in a pool anyway, so sharing storage doesn't gain us much, and symbols aren't really supposed to be used for user data.
src/convert.jl
Outdated
end | ||
|
||
function sexp2julia(rex::RSEXPREC) | ||
warn("Conversion of $(typeof(rex)) to Julia is not implemented") | ||
return nothing | ||
end | ||
|
||
# FIXME remove when anynull(NullableCategoricalArray) would be available | ||
_anynull{T,N,R}(A::NullableCategoricalArray{T,N,R}) = any(A.refs == zero(R)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean .==
? Anyway, you can avoid an allocation using any(x -> x!=0, A.refs)
.
Adding anynull(x::NullableCategoricalArray)
would be a good idea, though that will require adding NullableArrays
as a dependency. Another approach would be to see whether we can get any(isnull, A)
to be optimized automatically. Anyway, PR welcome in the meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/convert.jl
Outdated
|
||
# convert nullable array without nulls into non-nullable array | ||
# `A` is expected to contain no nulls | ||
_dropnonulls(A::NullableArray) = A.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This already exists in NullableArrays.jl, it's called dropnull
(and it should also be added to CategoricalArrays).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is different from dropnull()
, it's for simple cases when there are no NAs, so no searches/(re)allocations are required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Yet the name is confusing, as it definitely doesn't drop non-null values.
src/convert.jl
Outdated
nas = namask(rv) | ||
hasna = any(nas) | ||
# FIXME forceNullable option to always convert to NullableArray | ||
jv = _nullable(rv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new approach seems a bit wasteful, since you build a Nullable(Categorical)Array
before potentially converting it back to a non-nullable array. You could continue checking namask
first, and only build the nullable array if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that _nullable()
handles the creation of a nullable vector of appropriate type (categorical/normal) and if it turns out there are no nulls, the conversion to a non-null version is done by _dropnonulls()
(no reallocation needed). This would be handy when/if forceNullable
is introduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could still pass force_nullable=false
to that function, and check for that before creating the nullable array.
test/RDA.jl
Outdated
df[:cplx] = Complex128[1.1+0.5im, 1.0im] | ||
@test isequal(sexp2julia(load("$testdir/data/types.rda",convert=false)["df"]), df) | ||
@test isequal(sexp2julia(load("$testdir/data/types_ascii.rda",convert=false)["df"]), df) | ||
|
||
df[2, :] = NA | ||
for col in DataFrames.columns(df) | ||
col[2] = Nullable{eltype(col)}() # FIXME nullify!() is not supported by CategoricalArrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. Doesn't df[2, :] = Nullable()
work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does, thanks for the hint. Not used to the capabilities of NullableArrays yet.
test/RDA.jl
Outdated
append!(df, df[2, :]) | ||
df[3, :num] = NaN | ||
df[:, :cplx] = @data [NA, @compat(Complex128(1,NaN)), NaN] | ||
df[:, :cplx] = NullableArray{Complex128}(Nullable{Complex128}[Nullable{Complex128}(), Complex128(1,NaN), NaN]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you drop support for 0.4, NullableArray([Nullable(), Complex128(1,NaN), NaN])
should work.
test/RDA.jl
Outdated
@test isequal(sexp2julia(load("$testdir/data/NAs.rda",convert=false)["df"]), df) | ||
# ASCII format saves NaN as NA | ||
df[3, :num] = NA | ||
df[:, :cplx] = @data [NA, NA, NA] | ||
df[3, :num] = Nullable{Complex128}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nullable()
is enough here and below.
61a9ea7
to
5f14a8b
Compare
@nalimilan Thanks for the review! I've updated the PR. |
src/convert.jl
Outdated
hasna = any(nas) | ||
# TODO dimnames? | ||
# FIXME forceNullable option to always convert to NullableArray | ||
jv = nullable_vector(rv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think you could rename nullable_vector
to something like vector
, and have it return a non-nullable vector if no nulls are present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nullable_vector()
just extracts the data and NA mask and packages it into appropriate type, the real conversion logic is implemented by sexp2julia()
that takes into account the current context and user-specified options. I was thinking about returning the tuple of data and NA mask and constructing the appropriate vector/nullable vector in sexp2julia()
, but it would not work for categorical arrays, because the pool of categories also has to be returned somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not pass hasna
to _nullable_vector
and make return type choices there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a second thought... yes, that way it is definitely better. Thanks for your persistence ;)
src/convert.jl
Outdated
# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0 | ||
na2zero{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v] | ||
|
||
# convert R factor into NullableCategoricalArray{String} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also converts to a NullableArrays
in some cases apparently (but in when can that happen ?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've clarified the description, thanks.
0a71e9f
to
f03c9ce
Compare
@ararslan This PR should remove v0.4 support already (from REQUIRE and CI configs). I'm also not sure whether we should patch Travis and AppVeyor scripts just for the sake of seeing the green icon. The tests are passing for me locally. But thanks for the info. |
3b81838
to
12f071b
Compare
Could be worth it while you make updates to this branch, then you can just remove the commit before it's merged. |
c74d185
to
6882872
Compare
6882872
to
dc0bbab
Compare
Updated the PR to use |
# - julia --check-bounds=yes -e 'Pkg.clone(pwd()); Pkg.build("RData"); Pkg.test("RData"; coverage=true)' | ||
script: | ||
- if [[ -a .git/shallow ]]; then git fetch --unshallow; fi | ||
- julia --check-bounds=yes -e 'Pkg.clone(pwd()); Pkg.build("RData"); Pkg.checkout("DataTables", "master"); Pkg.test("RData"; coverage=true)' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you still need the checkout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't be necessary since there's a tag for DataTables.
I'm not sure DataTables is ready yet. Maybe we should wait until more query frameworks work with it. |
This is another situation where I'm sad to see DataFrames support go. While it would be awesome to be able to use a table abstraction package here, so that one can create either a |
I guess both could be supported (without reexporting the packages' APIs), but we would still need to choose a default. |
Would If |
5252fd1
to
fb9bf08
Compare
I think at some point only one package should remain (probably DataTables, though maybe not based on |
That's true. I'm just a little bit worried that the community efforts would be diffused by maintaining the interchangeability of the two package families that provide essentially the same functionality. |
src/convert.jl
Outdated
end | ||
|
||
# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0 | ||
na2zero{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ifelse
could be faster than a branch here.
src/convert.jl
Outdated
# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0 | ||
na2zero{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v] | ||
|
||
# convert to [Nullable]CategoricalArray{String} if `ri`is a factor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space before "is".
src/convert.jl
Outdated
# convert to [Nullable]CategoricalArray{String} if `ri`is a factor, | ||
# or to [Nullable]Array{Int32} otherwise | ||
function julia_vector(ri::RIntegerVector, force_nullable::Bool) | ||
!isfactor(ri) && return _julia_vector(eltype(ri.data), ri, force_nullable) # not a factor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# not a factor
is kind of redundant with the check.
src/convert.jl
Outdated
# FIXME set ordered flag | ||
refs = na2zero(REFTYPE, ri.data) | ||
pool = CategoricalPool{String, REFTYPE}(rlevels) | ||
(force_nullable || (findfirst(refs, zero(REFTYPE)) > 0)) ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be more readable as an if.. else
block.
src/convert.jl
Outdated
namask(ri::RIntegerVector) = BitArray(ri.data .== R_NA_INT32) | ||
namask(rn::RNumericVector) = BitArray(map(isna_float64, reinterpret(UInt64, rn.data))) | ||
namask(ri::RVector{Int32}) = [i == R_NA_INT32 for i in ri.data] | ||
namask(rn::RNumericVector) = map(isna_float64, reinterpret(UInt64, rn.data)) | ||
# if re or im is NA, the whole complex number is NA | ||
# FIXME avoid temporary Vector{Bool} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment no longer applies since NullableArray
currently uses a Vector{Bool}
. Though the current approach is wasteful when there are no nulls, since it allocates the bit mask even though it's not used.
It appears the best solution would be to add DataStreams support to RData so that the result can be converted to any type, including @quinnj said he's going to work on DataStreams soon, which should improve the abstraction and make this possible/easier. |
fb9bf08
to
64da6be
Compare
* drop Julia 0.4 support (since DataTables require Julia 0.5) * convert from using DataArrays to NullableArrays/CategoricalArrays
- use == instead of isequal() - explicitly make the columns nullable
64da6be
to
57aef24
Compare
Convert from using DataArrays to NullableArrays and NullableCategoricalArrays.
Fixes #19, closes #17.