Port to NullableArrays and CategoricalArrays #1008

nalimilan · 2016-07-03T16:33:44Z

Completely remove support for DataArrays.

Still rough, but passes the tests on Julia 0.5.

datnamer · 2016-07-03T17:35:12Z

Does this improve speed and type stability?

nalimilan · 2016-07-03T17:58:23Z

Type-stability: yes. Speed: haven't measured anything yet, but it should.

ViralBShah · 2016-07-03T18:00:00Z

I did write a very simple sum benchmark using NullableArrays and DataArrays yesterday. NullableArrays was within 2x of Arrays, and 10x faster than DataArrays. I would expect this to carry over, but of course, much better and systematic benchmarking is necessary.

quinnj · 2016-07-04T00:41:33Z

Word. I'm pumped for this. I'll try to find some time in the next few days to pull the branch and play around.

quinnj · 2016-07-06T14:09:15Z

src/abstractdataframe/join.jl

-function DataArrays.PooledDataVecs(df1::AbstractDataFrame,
-                                   df2::AbstractDataFrame)
+# FIXME: rename this, and possibly move to CategoricalArrays.jl
+function PooledDataVecs{S,N}(v1::NullableCategoricalArray{S,N},


Yeah, I think it would be good to rename this as it's clearly a reference to PooledDataArrays. The operation here is a join!/union!/merge!, right?

Yeah, I'm not sure about the best name. Maybe join!. Ideally, we would also avoid accessing the refs field from the PooledDataVecs(::AbstractDataFrame, ::AbstractDataFrame) method below. But I'm even less sure what the API should look like in CategoricalArrays.jl. Could still improve this in steps.

dmbates · 2016-07-15T20:30:17Z

With the RDA stuff removed you can probably avoid using GZip in src/DataFrames.jl. I included that specifically for reading compressed .rda files.

ViralBShah · 2016-07-17T14:09:17Z

What holds up the merging here? If a few things need to be moved out, that can be handled in a different PR. I say we merge this if basically this does the right things.

ararslan · 2016-07-17T16:57:43Z

Milan is on vacation, which I assume is at least part of the holdup.

ViralBShah · 2016-07-17T16:58:30Z

I missed that. Thanks.

dmbates · 2016-07-28T19:37:45Z

I find myself stumbling over DataArrays in doing some analyses and was wondering how this branch is coming along. Are there things that others can do to help?

nalimilan · 2016-07-28T21:15:43Z

This branch needs some cleanup before merging, but I think we should also wait until we provide convenience APIs for lifting semantics. Without these, the port will be a major regression in terms of usability. I would certainly be useful if people could experiment in that direction in particular in DataFramesMeta, jplyr, NullableArrays and NullableUtils, as @davidagold is already doing.

dmbates · 2016-07-29T14:44:04Z

@nalimilan I have been working on an analysis in Julia of some benchmarking results, which is why I found myself tripping over DataArrays. The analysis is in TimingResults As you can see I used DataFramesMeta in addition to DataFrames. I started using Gadfly for plots but still find it difficult to implement the plot I am thinking of. I ended up reverting to lattice graphics accessed through RCall, because I know lattice graphics quite well.

My conclusion from this exercise is that the single most effective change we can make regarding DataFrames is not to create a NullableArray or DataArray unless it is necessary. It is fine to consider lifting semantics, etc. but, when there are no missing data in a column, creating a missing-data-enabled type then sweating over how to handle missing data values that are never present is what J. Edwards Deming called "burning the toast and then scraping it". Why go to all that trouble and, in the process, slow down the computation if you don't need to? Functions such as readtable should allow for missing data, of course, but should return Arrays when there are no missing data values in a column. Reading a feather file using the Feather package should do the same thing. RCall conversions of R data.frames should do this too. I would go one step further and suggest adding a function (I haven't decided on a good name for it) that takes a DataFrame and converts any NullableArray{T} columns to Array{T} if there are no nulls. Same for NullableNominalArray, etc. The completecases and completecases! functions should also convert nullable structures to plain arrays.

By the way, @nalimilan, I applaud your decision to create separate types NominalArray and NullableNominalArray, etc. in the CategoricalArrays package. This is exactly the distinction we should make.

As an exercise, consider the proportion of columns with missing data in the data sets you encounter. For me, at least, columns with missing data are the exception rather than the rule.

Regarding DataFramesMeta I took a while to learn it and now find it quite useful. Is there a reason to also mention jplyr? Anyone is free to create their own package - this is open source software - but the JuliaStats group should try to settle on one approach to a given problem, not multiple incompatible approaches. (The selection of verbs in DataFramesMeta follows the LINQ/SQL tradition whereas jplyr follows dplyr). Unless there is something wonderful in jplyr that cannot be implemented in DataFramesMeta the approach should be to improve DataFramesMeta, not create another incompatible package. As an aside, @davidagold , a name like jplyr, while informative to those who know dplyr, removes flexibility. You are bound by the name to emulate the approach of dplyr, whether it fits in well with Julia or not, and you are always playing catch up in the implementation.

So, what am I missing? Should I be creating my plots using the Plots package or ...?

johnmyleswhite · 2016-07-29T15:04:35Z

@dmbates: I think the problem is that DataFramesMeta doesn't fully solve the problems you're describing as the moment, but somewhat exacerbates them: its existing commitment to non-lazy DAG accumulation means that we can't use it to generate SQL -- which is the main example we have of backend-independence. We could change those semantics in that package, but it would completely break all existing code using that package.

David and I have begun preparing a document we'll share soon about the semantics we hope all tables will agree upon. Assuming we can actually get everyone on board, we could fix the situation by ensuring that every downstream package works only with a standardized abstract interface that multiple backends could implement. Part of the purpose of this document is to point out that we can implement the dplyr semantics with very high performance and fully automatic lifting regardless of historical Julia backends: we could make the current DataFrames fast or make the new version with NullableArrays fast. The only thing we really need is to maintain the tradition of always keeping the values array and is-null mask array separate. In cases where we intend to keep a non-nullable column in a table, I think we're better off not using Array objects, but instead adding boolean flags to every NullableArray that specify allownulls and anynull. That would simplify implementations, which will need to have access to the internal representation of columns. We can always work around that, but I don't think it will buy us much.

As for Gadfly, I think the best thing we could do there is remove the canonical interface and focus on the heretical one, which would assume that inputs are specified as AbstractArray objects. Then the tabular data interface issue just becomes one of ensuring that a table can always provide access to its columns as AbstractArray objects.

tshort · 2016-07-29T15:11:04Z

I agree with @dmbates on preferring Arrays where possible. Towards that end, I included a PassThrough type with abbreviation P in DataFramesMeta to try to avoid creation of DataArrays where they aren't needed.

With regards to jplyr, it's quite new and promising. I wouldn't want to discourage development or exploration. It's approach has advantages over DataFramesMeta. DataFramesMeta was always meant to be a playground.

dmbates · 2016-07-29T15:35:46Z

@nalimilan Thanks for the clarification on how DataFramesMeta can be improved and is being improved in jplyr. I appreciate the comments by @tshort too.

My comments on the name jplyr still hold. It is a name that will be recognized by those with dplyr experience but it is not informative to others. (plyr is not meaningful to those outside the HadleyVerse.) Furthermore, it seems to claim that jplyr is dplyr in Julia - nothing more, nothing less. That's a difficult expectation to fulfill and also an unnecessary constraint. Please consider renaming it.

I still think it is important to prefer Arrays when possible. I will read your document with interest when it is available but also, I'm sorry to say, with a bit of skepticism. There is no free lunch. Saying that all columns of a DataFrame must be of a missing-value-enabled type but perhaps with an extra tag that indicates they don't really have any missing data means that they are no longer arrays and you are back at the lifting semantics. To me that still seems like burning the toast and scraping it. And, it seems to me, it is unnecessary. Why take an Array and make it into something that is no longer an Array and must be massaged to get it back to being an Array so you can operate on it?

I am not infallible by any means and the whole Julia project has already pulled a very big rabbit out of the hat by creating a language with dynamic types where methods can be compiled to machine code. So I could be pleasantly surprised and I hope I will be. But if your objective is to create fully automatic lifting that is as fast as not doing any lifting at all, that seems to be a big ask.

davidagold · 2016-07-29T16:09:53Z

@dmbates The name alludes to Hadley's notion of dplyr as a "grammar of data manipulation", rather than a specific design spec. It's just a working project title -- I intend to change it. Does anybody have any suggestions? One thought I've had is Select.jl.

dmbates · 2016-07-29T16:21:29Z

@davidagold Before there was dplyr there was plyr, which was a generalization of the various *apply functions in R and a play on the word "plyer". But functions like lapply and friends are called map in other languages, including Julia.

I regret that I don't have a good suggestion for a name at this point but I will certainly speak up if something comes to mind.

dmbates · 2016-07-29T18:42:22Z

I just noticed that I replied to a comments by @johnmyleswhite as if they were by @nalimilan. Sorry for the confusion.

I must admit to being discouraged that, after over 4 years of development, DataFrames still has some pretty glaring problems. I will be giving a presentation at the Joint Statistical Meetings on Sunday and, for all the good aspects of Julia, I can't yet tell the audience "yes, you really should be using Julia for data analysis". Maybe the grand vision thing will work out. Maybe it will be another case of "the best is the enemy of the good". I don't think that potential users will wait forever to find out.

davidanthoff · 2016-10-04T05:03:43Z

I now have a branch in Query.jl that works with the current DataFrames master, i.e. with the NullableArrays version. See queryverse/Query.jl#62.

nalimilan · 2016-10-04T09:47:10Z

@davidanthoff How does Query.jl work as regards Nullable lifting? Can you e.g. compute a mean of a NullableArray conveniently? Select observations based on conditions on NullableArrays?

nalimilan · 2016-10-04T09:51:46Z

BTW, there's now a blog post on StructuredQueries.jlg, but it's more about the design than a simple introduction for users. We also need to address the issue of lifting to offer a decent replacement for the current DataFrames features.

davidagold · 2016-10-04T17:28:29Z

StructuredQueries (SQ) occupies a strange position in that it provides a user-facing interface but no backend support. The former is intended to be extended by packages that define a particular semantics. I don't intend for users to do using StructuredQueries. I expect for them to do, say using X, where X reexports SQ and implements a backend support ("collection machinery" in my lingo). I'm still working out how this will look in practice. I think if users want a ready-to-go data manipulation framework right now, they're better off using DataFramesMeta or Query.

johnmyleswhite · 2016-10-04T19:53:28Z

Can we plan out a timeline for tagging this release? Given the roughness of our API offering, can we hold off for, say, 2-3 months? I think this is clearly the future, but I think we want to make sure that we're ready to tell everyone to query Query.jl or DataFramesMeta.jl or StructuredQueries.jl.

davidanthoff · 2016-10-05T22:38:01Z

@nalimilan I don't want Query to provide any special lifting support, it is meant to just pick up whatever there is in base. Right now I define a whole bunch of methods that provide lifted implementations for things like + etc., but I want to remove those in the future once we have that in base.

tkelman · 2016-10-06T00:37:10Z

Which is known as type piracy and changes the behavior of unrelated code. Packages should not be doing that.

davidanthoff · 2016-10-06T00:46:31Z

@tkelman Totally agreed, that stuff should all be moved to base. Both NullableArrays and Query are guilty of this right now, as far as I can tell.

nalimilan · 2016-10-06T09:24:58Z

Yeah, we need to move these into Base, and have a package for 0.4 and 0.5 which only provides lifted operators, so that it's clear to the user what happens.

But without automatic lifting for other functions, I'm afraid working with Query.jl is going to be quite painful. Time will tell...

davidanthoff · 2016-10-07T00:44:22Z

Yeah, we need to move these into Base, and have a package for 0.4 and 0.5 which only provides lifted operators, so that it's clear to the user what happens.

Could we just back port these things to 0.5, and just not support this stuff on 0.4?

tkelman · 2016-10-07T01:01:47Z

We don't backport behavior changes, only bug fixes.

johnmyleswhite · 2016-10-07T01:08:45Z

I entirely agree that we should not bother supporting 0.4.

ararslan · 2016-10-07T01:17:36Z

DataFrames master doesn't, so we can't anyway. ¯_(ツ)_/¯

davidanthoff · 2016-10-07T03:56:49Z

Who is going to create the package that hosts the nullable methods for julia 0.5? Is it going to be hosted under JuliaStats?

Probably easier if we start with adding these methods in the package, and then once that is done we can move them all into julia base master in one go, right?

Would be great if someone could create the package so that we can start contributing, I'd be eager to get this moving.

ararslan · 2016-10-07T04:34:18Z

As Tony said, packages shouldn't be engaging in type piracy, so perhaps we should just starting making base PRs?

davidanthoff · 2016-10-07T04:53:36Z

@ararslan I believe there was agreement (including from @tkelman) that in this case, for julia 0.5 compat only, there would be one package that would have lifted methods and engage in type piracy. That package would only be used on julia 0.5, and the same methods would then be included in julia base for 0.6.

Essentially that package would just get all the type piracy that is currently going on in NullableArrays and in Query.

If we don't do that, I really don't see how we can possibly tag the new DataFrames before julia 0.6 is released.

ararslan · 2016-10-07T05:00:20Z

Oh sorry, I didn't realize that such an agreement had been reached.

Looks like this is what we're after, then: https://github.com/davidagold/NullableOps.jl

nalimilan · 2016-10-07T08:54:31Z

Probably easier if we start with adding these methods in the package, and then once that is done we can move them all into julia base master in one go, right?

AFAIK everything is done since I filed JuliaLang/julia#16988. I just need to open a new PR with non-boolean operators and see how the discussion goes.

davidagold · 2016-10-07T14:14:21Z

@nalimilan I don't think that's the PR you mean. Where's your PR that introduces lifted operators?

nalimilan · 2016-10-07T14:36:35Z

Indeed, I've fixed the link!

davidanthoff · 2016-10-07T17:36:13Z

I'm currently working on a macro that might make adding these lifted methods much easier. The general idea is this:

@lift function log(x)

would for example add a lifted method. The whole thing would work much more general:

@lift function +(x::Number, y::Number)
@lift function +{T<:Number}(x::T,y::T)

etc. should all work.

This macro could be used to then implement another macro called @lifted that could be used like the @generated macro to easily add lifted versions of a function at the function def site:

@lifted foo(x::Float64)
    return blabla
end

would automatically add lifted methods to foo.

It is not done, I'm just mentioning this here in case someone else is working on a similar thing, then we should merge effort.

tkelman · 2016-10-12T10:24:53Z

Doing lifting through selective manual method extension seems like you're repeating the dot-operator mistakes (and @vectorize_1arg as well). It's also fragile due to the global side-effects of modifying method tables. The approach taken in JuliaLang/julia#18758 seems safer, more general, and will have more localized effects.

Port to NullableArrays and CategoricalArrays

Wilfred · 2017-10-19T13:46:20Z

This was merged in September 2016, but as far as I can see version 0.10.1 (August 2017) doesn't have this. Is there a plan for releasing?

nalimilan · 2017-10-19T16:18:02Z

This PR has actually been merged, then moved to DataTables, and then replaced with a port to Nulls.jl before being "backported" to DataFrames. See this Discourse post for the full story. We're currently preparing a release.

nalimilan mentioned this pull request Jul 5, 2016

Cannot use NullableArray in a data frame #933

Closed

quinnj reviewed Jul 6, 2016
View reviewed changes

johnmyleswhite mentioned this pull request Oct 4, 2016

NullableArrays + New API Release Roadmap #1092

Closed

4 tasks

alyst mentioned this pull request Oct 4, 2016

Enhance joining and grouping #850

Closed

davidagold mentioned this pull request Oct 6, 2016

Status of this package JuliaStats/NullableStats.jl#2

Open

nalimilan mentioned this pull request Nov 1, 2016

column types depend on how dataframes are declared #1091

Closed

davidanthoff mentioned this pull request Apr 3, 2017

Unable to build docs #1180

Closed

nalimilan pushed a commit that referenced this pull request Jul 8, 2017

Merge pull request #1008 from JuliaStats/nl/nullable

c03d516

Port to NullableArrays and CategoricalArrays

Port to NullableArrays and CategoricalArrays #1008

Port to NullableArrays and CategoricalArrays #1008

Conversation

nalimilan commented Jul 3, 2016 • edited Loading

datnamer commented Jul 3, 2016

nalimilan commented Jul 3, 2016

ViralBShah commented Jul 3, 2016

quinnj commented Jul 4, 2016

quinnj Jul 6, 2016

Choose a reason for hiding this comment

nalimilan Jul 6, 2016

Choose a reason for hiding this comment

dmbates commented Jul 15, 2016

ViralBShah commented Jul 17, 2016

ararslan commented Jul 17, 2016

ViralBShah commented Jul 17, 2016

dmbates commented Jul 28, 2016

nalimilan commented Jul 28, 2016

dmbates commented Jul 29, 2016

johnmyleswhite commented Jul 29, 2016

tshort commented Jul 29, 2016

dmbates commented Jul 29, 2016

davidagold commented Jul 29, 2016

dmbates commented Jul 29, 2016

dmbates commented Jul 29, 2016

davidanthoff commented Oct 4, 2016

nalimilan commented Oct 4, 2016

nalimilan commented Oct 4, 2016

davidagold commented Oct 4, 2016

johnmyleswhite commented Oct 4, 2016

davidanthoff commented Oct 5, 2016

tkelman commented Oct 6, 2016

davidanthoff commented Oct 6, 2016

nalimilan commented Oct 6, 2016

davidanthoff commented Oct 7, 2016

tkelman commented Oct 7, 2016

johnmyleswhite commented Oct 7, 2016

ararslan commented Oct 7, 2016

davidanthoff commented Oct 7, 2016

ararslan commented Oct 7, 2016

davidanthoff commented Oct 7, 2016

ararslan commented Oct 7, 2016

nalimilan commented Oct 7, 2016 • edited Loading

davidagold commented Oct 7, 2016

nalimilan commented Oct 7, 2016

davidanthoff commented Oct 7, 2016

tkelman commented Oct 12, 2016 • edited Loading

Wilfred commented Oct 19, 2017

nalimilan commented Oct 19, 2017

nalimilan commented Jul 3, 2016 •

edited

Loading

nalimilan commented Oct 7, 2016 •

edited

Loading

tkelman commented Oct 12, 2016 •

edited

Loading