-
Notifications
You must be signed in to change notification settings - Fork 371
Port to NullableArrays and CategoricalArrays #1008
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Does this improve speed and type stability? |
Type-stability: yes. Speed: haven't measured anything yet, but it should. |
I did write a very simple sum benchmark using NullableArrays and DataArrays yesterday. NullableArrays was within 2x of Arrays, and 10x faster than DataArrays. I would expect this to carry over, but of course, much better and systematic benchmarking is necessary. |
Word. I'm pumped for this. I'll try to find some time in the next few days to pull the branch and play around. |
function DataArrays.PooledDataVecs(df1::AbstractDataFrame, | ||
df2::AbstractDataFrame) | ||
# FIXME: rename this, and possibly move to CategoricalArrays.jl | ||
function PooledDataVecs{S,N}(v1::NullableCategoricalArray{S,N}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it would be good to rename this as it's clearly a reference to PooledDataArrays. The operation here is a join!/union!/merge!, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not sure about the best name. Maybe join!
. Ideally, we would also avoid accessing the refs
field from the PooledDataVecs(::AbstractDataFrame, ::AbstractDataFrame)
method below. But I'm even less sure what the API should look like in CategoricalArrays.jl. Could still improve this in steps.
With the RDA stuff removed you can probably avoid |
What holds up the merging here? If a few things need to be moved out, that can be handled in a different PR. I say we merge this if basically this does the right things. |
Milan is on vacation, which I assume is at least part of the holdup. |
I missed that. Thanks. |
I find myself stumbling over DataArrays in doing some analyses and was wondering how this branch is coming along. Are there things that others can do to help? |
This branch needs some cleanup before merging, but I think we should also wait until we provide convenience APIs for lifting semantics. Without these, the port will be a major regression in terms of usability. I would certainly be useful if people could experiment in that direction in particular in DataFramesMeta, jplyr, NullableArrays and NullableUtils, as @davidagold is already doing. |
@nalimilan I have been working on an analysis in Julia of some benchmarking results, which is why I found myself tripping over DataArrays. The analysis is in TimingResults As you can see I used My conclusion from this exercise is that the single most effective change we can make regarding DataFrames is not to create a NullableArray or DataArray unless it is necessary. It is fine to consider lifting semantics, etc. but, when there are no missing data in a column, creating a missing-data-enabled type then sweating over how to handle missing data values that are never present is what J. Edwards Deming called "burning the toast and then scraping it". Why go to all that trouble and, in the process, slow down the computation if you don't need to? Functions such as By the way, @nalimilan, I applaud your decision to create separate types As an exercise, consider the proportion of columns with missing data in the data sets you encounter. For me, at least, columns with missing data are the exception rather than the rule. Regarding So, what am I missing? Should I be creating my plots using the |
@dmbates: I think the problem is that DataFramesMeta doesn't fully solve the problems you're describing as the moment, but somewhat exacerbates them: its existing commitment to non-lazy DAG accumulation means that we can't use it to generate SQL -- which is the main example we have of backend-independence. We could change those semantics in that package, but it would completely break all existing code using that package. David and I have begun preparing a document we'll share soon about the semantics we hope all tables will agree upon. Assuming we can actually get everyone on board, we could fix the situation by ensuring that every downstream package works only with a standardized abstract interface that multiple backends could implement. Part of the purpose of this document is to point out that we can implement the dplyr semantics with very high performance and fully automatic lifting regardless of historical Julia backends: we could make the current DataFrames fast or make the new version with NullableArrays fast. The only thing we really need is to maintain the tradition of always keeping the values array and is-null mask array separate. In cases where we intend to keep a non-nullable column in a table, I think we're better off not using As for Gadfly, I think the best thing we could do there is remove the canonical interface and focus on the heretical one, which would assume that inputs are specified as |
I agree with @dmbates on preferring With regards to jplyr, it's quite new and promising. I wouldn't want to discourage development or exploration. It's approach has advantages over DataFramesMeta. DataFramesMeta was always meant to be a playground. |
@nalimilan Thanks for the clarification on how My comments on the name I still think it is important to prefer I am not infallible by any means and the whole Julia project has already pulled a very big rabbit out of the hat by creating a language with dynamic types where methods can be compiled to machine code. So I could be pleasantly surprised and I hope I will be. But if your objective is to create fully automatic lifting that is as fast as not doing any lifting at all, that seems to be a big ask. |
@dmbates The name alludes to Hadley's notion of dplyr as a "grammar of data manipulation", rather than a specific design spec. It's just a working project title -- I intend to change it. Does anybody have any suggestions? One thought I've had is Select.jl. |
@davidagold Before there was I regret that I don't have a good suggestion for a name at this point but I will certainly speak up if something comes to mind. |
I just noticed that I replied to a comments by @johnmyleswhite as if they were by @nalimilan. Sorry for the confusion. I must admit to being discouraged that, after over 4 years of development, DataFrames still has some pretty glaring problems. I will be giving a presentation at the Joint Statistical Meetings on Sunday and, for all the good aspects of Julia, I can't yet tell the audience "yes, you really should be using Julia for data analysis". Maybe the grand vision thing will work out. Maybe it will be another case of "the best is the enemy of the good". I don't think that potential users will wait forever to find out. |
I now have a branch in Query.jl that works with the current |
@davidanthoff How does Query.jl work as regards |
BTW, there's now a blog post on StructuredQueries.jlg, but it's more about the design than a simple introduction for users. We also need to address the issue of lifting to offer a decent replacement for the current DataFrames features. |
StructuredQueries (SQ) occupies a strange position in that it provides a user-facing interface but no backend support. The former is intended to be extended by packages that define a particular semantics. I don't intend for users to do |
Can we plan out a timeline for tagging this release? Given the roughness of our API offering, can we hold off for, say, 2-3 months? I think this is clearly the future, but I think we want to make sure that we're ready to tell everyone to query Query.jl or DataFramesMeta.jl or StructuredQueries.jl. |
@nalimilan I don't want Query to provide any special lifting support, it is meant to just pick up whatever there is in base. Right now I define a whole bunch of methods that provide lifted implementations for things like |
Which is known as type piracy and changes the behavior of unrelated code. Packages should not be doing that. |
@tkelman Totally agreed, that stuff should all be moved to base. Both NullableArrays and Query are guilty of this right now, as far as I can tell. |
Yeah, we need to move these into Base, and have a package for 0.4 and 0.5 which only provides lifted operators, so that it's clear to the user what happens. But without automatic lifting for other functions, I'm afraid working with Query.jl is going to be quite painful. Time will tell... |
Could we just back port these things to 0.5, and just not support this stuff on 0.4? |
We don't backport behavior changes, only bug fixes. |
I entirely agree that we should not bother supporting 0.4. |
DataFrames master doesn't, so we can't anyway. ¯_(ツ)_/¯ |
Who is going to create the package that hosts the nullable methods for julia 0.5? Is it going to be hosted under JuliaStats? Probably easier if we start with adding these methods in the package, and then once that is done we can move them all into julia base master in one go, right? Would be great if someone could create the package so that we can start contributing, I'd be eager to get this moving. |
As Tony said, packages shouldn't be engaging in type piracy, so perhaps we should just starting making base PRs? |
@ararslan I believe there was agreement (including from @tkelman) that in this case, for julia 0.5 compat only, there would be one package that would have lifted methods and engage in type piracy. That package would only be used on julia 0.5, and the same methods would then be included in julia base for 0.6. Essentially that package would just get all the type piracy that is currently going on in NullableArrays and in Query. If we don't do that, I really don't see how we can possibly tag the new DataFrames before julia 0.6 is released. |
Oh sorry, I didn't realize that such an agreement had been reached. Looks like this is what we're after, then: https://github.com/davidagold/NullableOps.jl |
AFAIK everything is done since I filed JuliaLang/julia#16988. I just need to open a new PR with non-boolean operators and see how the discussion goes. |
@nalimilan I don't think that's the PR you mean. Where's your PR that introduces lifted operators? |
Indeed, I've fixed the link! |
I'm currently working on a macro that might make adding these lifted methods much easier. The general idea is this: @lift function log(x) would for example add a lifted method. The whole thing would work much more general: @lift function +(x::Number, y::Number)
@lift function +{T<:Number}(x::T,y::T) etc. should all work. This macro could be used to then implement another macro called @lifted foo(x::Float64)
return blabla
end would automatically add lifted methods to It is not done, I'm just mentioning this here in case someone else is working on a similar thing, then we should merge effort. |
Doing lifting through selective manual method extension seems like you're repeating the dot-operator mistakes (and |
Port to NullableArrays and CategoricalArrays
This was merged in September 2016, but as far as I can see version 0.10.1 (August 2017) doesn't have this. Is there a plan for releasing? |
This PR has actually been merged, then moved to DataTables, and then replaced with a port to Nulls.jl before being "backported" to DataFrames. See this Discourse post for the full story. We're currently preparing a release. |
Completely remove support for DataArrays.
Still rough, but passes the tests on Julia 0.5.