column types depend on how dataframes are declared #1091
Comments
Note that DataArrays are no longer used by DataFrames. Do you still see this behavior with the current master?
Sorry, should have specified, this was with the latest release 0.8.3. It didn't occur to me that this was something that changed (as I didn't realize DataArray is no longer used).
No problem. Yeah, the switch away from DataArrays is a pretty major change, so we're waiting a bit to tag it as a release.
I'm curious, could you point me to the issue or discussion where you discuss this change? I'm assuming this is for efficiency purposes and I'm wondering what the exact motivations are. Thanks.
I've preserved this behavior when porting to @ExpandingMan It's not done only for efficiency purposes, but mainly to allow you to store missing values in any column. Without this, you need to convert the column first to be able to add an
Hmm... I'm probably not understanding this because that seems like a really huge change to fix what seems to me like a very small problem. In most real cases all of the columns are
Not all people agree with this appreciation. If your data doesn't contain any missing values (not my case unfortunately), it's pointless to create a
There are two distinct issues:
It definitely seems to me that it would be rather difficult to accommodate dataframes that store columns as one of several different types in the general case. Now every program that uses dataframes needs different functions for handling dataframes depending on the types of the columns. This wouldn't be especially clean because the super type of
It alleviates the unnecessary efficiency issues, but there will always be an overhead to allowing for missing values.
This has to do with type-inferability of EDIT: Well, nothing directly, anyway.
So do you guys have any timeline in mind for the release of the NullableArrays version? I've started re-writing some code to work with master and I was wondering how far out the official release might be. (And sorry, the issue thread probably isn't the appropriate place to be asking this sort of question.)
See #1092 and linked discussions.
Has there been any thought to adding a new subtype of This way data could be stored in
For now we have enough work to do to get DataFrames working well in the new
I think @ExpandingMan is onto something, though I'm wondering if in practice it could get messy. Note that SQL (or at least some variants?) have a concept of
The specific problem here is that Feather doesn't support non-nullable columns. So there's no way of storing that type information which exists in Julia (and e.g. in SQL). It's just like what happens when saving to CSV and reloading data (though less severe).
Yes but with the interface in its current state, it actually does make a difference, as if you get values from a
I'm not opposed to some sort of
That would certainly be the best possible outcome. I think it would be imperative to ensure that it be very rare for code to have to check the types of the columns, and in order to do this, it might require making some changes to the interface that may not be popular. For example, for
See also previous discussion at #1008 (comment) and following comments.
And new discussion at #1119.
Closing, as DataFrames (current master) no longer auto-promotes columns: whatever you put in is what is stored and what you get back out.
If this is a duplicate issue I apologize, as it seems likely this has been discussed before, but I couldn't find it.
If a `DataFrame` is declared with an array as its argument, `DataFrame(A)`, its columns are of type `Vector`.
If a `DataFrame` is declared with separate arrays for each column, `DataFrame(A=A, B=B)` or `df = DataFrame(); df[:A] = A; df[:B] = B`, then its columns are of type `DataArray`.
This seems like bad behavior. Shouldn't columns always be of type `DataArray` regardless of how the `DataFrame` was constructed?
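A minimal reproduction sketch of the report above, assuming DataFrames 0.8.3 (the release named in this thread) on a Julia version it supports. The names `M`, `df1`, `df2`, `df3` and the sample values are illustrative and not from the original report. The comments restate the behavior described in this issue; they are not verified output.

```julia
using DataFrames  # assumption: DataFrames 0.8.3, which still depends on DataArrays

# Illustrative data (not from the original report)
M = [1 2; 3 4; 5 6]        # a 3x2 matrix
A = [1, 3, 5]
B = [2, 4, 6]

# Case 1: construct from a single array
df1 = DataFrame(M)
println(typeof(df1[1]))    # reported in this issue: a plain Vector

# Case 2: construct from keyword arguments
df2 = DataFrame(A=A, B=B)
println(typeof(df2[:A]))   # reported in this issue: a DataArray

# Case 3: assign columns one at a time
df3 = DataFrame()
df3[:A] = A
df3[:B] = B
println(typeof(df3[:A]))   # reported in this issue: a DataArray
```

Per the closing comment, master at the time stopped auto-promoting columns, so on later versions all three cases should keep whatever column type was passed in.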