Add data frame RFC #3

Closed
wants to merge 3 commits into from

sophiajt commented Aug 5, 2020

This RFC adds a new "data frame" concept to Nu's internal data representation with a corresponding syntactic representation. Data frames replace the existing table/row/inner table model with one that is more compact and in line with industry practice, while still maintaining some of the Nu-specific features like streaming.

```rust
}.into_value());

output.send(UntaggedValue::EndFrame(frame_id).into_value());
```
Contributor

It might be good to have some code sample in the above section showing what it would look like on the receiving end. I'm particularly interested in the more complex cases, like frames being nested in frames.

Some drawbacks come to mind:

- This is a large, non-trivial amount of work. Getting this landed, updating the commands to use the new model, and thoroughly testing will take time.
- This will break most, if not all, third-party plugins
Contributor

Hopefully we don't have to do this again anytime soon, but I'm thinking this would be a good opportunity for us to think about deprecations in our protocol. How do we version the plugins, and know what version of the protocol they want? How long do we keep old Value types around before fully deprecating them?

Author

We'll definitely want to add that to the plugin protocol. I don't think it's currently part of it.
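
One way to approach the versioning question raised above is a handshake in which a plugin declares the protocol version it speaks and the host classifies it as current, deprecated, or unsupported. The sketch below is illustrative only: `ProtocolVersion`, `Compat`, and `check_plugin` are hypothetical names, and per the reply above nothing like this appears to be part of the plugin protocol yet.

```rust
/// Hypothetical protocol version number declared by a plugin.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ProtocolVersion(u32);

/// Hypothetical outcome of the host's compatibility check.
#[derive(Debug, PartialEq)]
enum Compat {
    /// The plugin speaks the host's current protocol.
    Supported,
    /// The plugin speaks an older protocol the host still accepts.
    Deprecated,
    /// The plugin is too old or too new for this host.
    Unsupported,
}

/// Classify a plugin given the oldest and current versions the host accepts.
fn check_plugin(
    requested: ProtocolVersion,
    oldest: ProtocolVersion,
    current: ProtocolVersion,
) -> Compat {
    if requested > current || requested < oldest {
        Compat::Unsupported
    } else if requested < current {
        Compat::Deprecated
    } else {
        Compat::Supported
    }
}
```

Under this assumption, keeping old `Value` types around for a while amounts to choosing how far `oldest` lags behind `current`.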


Some drawbacks come to mind:

- This is a large, non-trivial amount of work. Getting this landed, updating the commands to use the new model, and thoroughly testing will take time.
Contributor

Any plan for transitioning slowly, or does this work have to be done as a single unit? No need to document the plan here, but the ability to iterate on this is not immediately obvious to me, so it may be worth describing if (and perhaps how) that would be possible.

Author

One thing we could do is to document how to transition code from one style to another. We could also support both Row and Frame for a time, allowing people to transition off the old protocol while we roll onto the new.

Contributor

We could also support both Row and Frame for a time, allowing people to transition off the old protocol while we roll onto the new.

I think that'd be the way to go in the future. For backwards-incompatible protocol changes, we would need an RFC process with a clear timeline. We'll also need to figure out how to communicate that to the nushell community 🙂
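
If Row and Frame were supported side by side for a while, a conversion from the old shape to the new one could carry much of the transition. The following is a minimal sketch under heavy assumptions: `Row` and `Frame` here are simplified, hypothetical stand-ins with string cells, not Nu's actual types.

```rust
/// Hypothetical, simplified stand-in for the existing row type:
/// an ordered list of (column name, cell) pairs.
struct Row {
    entries: Vec<(String, String)>,
}

/// Hypothetical, simplified stand-in for the proposed frame type.
#[derive(Debug, PartialEq)]
struct Frame {
    headers: Vec<String>,
    rows: Vec<Vec<String>>,
}

impl From<Row> for Frame {
    /// An old-style row becomes a frame with exactly one row.
    fn from(row: Row) -> Self {
        // Split the pairs into a header list and a single row of cells.
        let (headers, cells): (Vec<String>, Vec<String>) =
            row.entries.into_iter().unzip();
        Frame { headers, rows: vec![cells] }
    }
}
```

A command still producing rows could then be adapted at the boundary with `Frame::from(row)` until it is ported.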


[unresolved-questions]: #unresolved-questions

- Are there syntactic ambiguities with the proposed syntax? This will require that we support parsing data frames, which includes colons and commas at the end of bare words.
Contributor

Arguably that could break things for some users, but the idea of not being 1.0 yet is that we're still trying to figure things out. I doubt it will break much.

That being said, I've been wanting to put together a description of our grammar. Calling out what you're adding and what would break in a grammar would make this super clear 🙂

Author

I'll add a section about this.


- Are there syntactic ambiguities with the proposed syntax? This will require that we support parsing data frames, which includes colons and commas at the end of bare words.
- How do we want to handle partial inner data frames? That is, a data frame that is inside of another data frame.
- How do we handle non-data frames in between data frames? Do all partial data frames have to stream out until complete?
Contributor

Ideally, no, but we'd probably have to relay information back through the stream to allow that. Probably an RFC on its own 🙂

- The top-level rows represent a table of rows, but it's unclear how to represent a top-level list of strings vs a stream of strings.
- A similar ambiguity exists between an "object" (a data structure denoted by key/value pairs) and a table of one row
- Inner-tables are modelled differently than top-level tables, leading to confusion
- There is no way to currently represent a matrix
Contributor

Arguably, this could be a matrix:

echo [[1 2] [3 4]]

Not saying it'd be easy to work with, but I think it's representable 🙂

Author

lol, true. I guess a real matrix vs a list of lists. I could call that out


[motivation]: #motivation

The current system has a few unexpected limitations:
Contributor

The inlining of nested tables is a limitation right now too, correct? If the nested table is incredibly large, we could easily run out of memory since it doesn't get streamed.

Arguably we could solve this without data frames, but it seems like what's being proposed here will potentially solve that problem?

Author

I'll add that to the list.

Yes, this protocol lets us stream inner tables also, so you could get the initial structure, and remember where the inner tables are, then read the contents of those inner tables from the stream.

length: vec![
vec![ Value::from("head"), Value::from(1024)]
],
partial_frame_id: Some(frame_id),
Contributor

If we stream nested frames, will different ids be interleaved? Will the onus be on commands to track that? Will there be helper methods/structs for dealing with that? Maybe a light discussion on that.

Author

Yeah, we'd probably want some helper methods. Will have to think about that more.
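
As a rough idea of what such a helper might look like, the sketch below collects interleaved partial frames by id and hands each frame back once its end marker arrives. Everything here is an assumption for illustration: `Cell`, `Message`, and `FrameAssembler` are hypothetical names, and the message shapes are only loosely modelled on the `partial_frame_id` and `EndFrame` fragments quoted from the RFC.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for Nu's `Value`; only what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Int(i64),
    Text(String),
}

/// Hypothetical stream messages for partial frames.
#[derive(Debug)]
enum Message {
    /// A chunk of rows belonging to the frame with the given id.
    Partial { frame_id: u64, rows: Vec<Vec<Cell>> },
    /// Marks the frame with the given id as complete.
    EndFrame(u64),
}

/// Collects interleaved partial frames, keyed by frame id.
#[derive(Default)]
struct FrameAssembler {
    pending: HashMap<u64, Vec<Vec<Cell>>>,
}

impl FrameAssembler {
    /// Feed one message; returns `Some((id, rows))` when a frame is finished.
    fn push(&mut self, msg: Message) -> Option<(u64, Vec<Vec<Cell>>)> {
        match msg {
            Message::Partial { frame_id, rows } => {
                // Append the chunk to whatever has arrived for this id so far.
                self.pending.entry(frame_id).or_default().extend(rows);
                None
            }
            Message::EndFrame(frame_id) => {
                // An end marker for an unseen id yields an empty frame.
                let rows = self.pending.remove(&frame_id).unwrap_or_default();
                Some((frame_id, rows))
            }
        }
    }
}
```

With a helper along these lines, a command would not need to track ids itself; it would feed every message to `push` and act only on completed frames.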

The above code could be created using this Nu syntax:

```sh
[name: [Bob, Sally], age: [30, 43]]
```

How would you express this in nu syntax, and/or the above in rust syntax, if you were to state the shape without data? For instance, to say that Windows ls will always have 4 columns and n rows and Windows ls -l will always have 8 columns and n rows.

Alternatively, is there a constructor that says this df is 5 rows and 6 columns?

At some point, I expect we'll have variables that can hold a dataframe. It's hard for me to visualize how this will work in a streaming environment where things are built up and torn down in a pipeline.

Author

@fdncred - for the first question, I think you're asking "how do you write types in Nu?" We'll probably need a separate RFC for that, as types will be their own topic.

Or maybe you're asking how we handle matrices and how this differs from a list of lists?


I was asking about initializing a dataframe with a predetermined shape as ls may have. ls -l on Windows will have a predetermined number of columns.

One could think of making dataframes with 2 columns and 3 rows as an empty dataframe except with column names, and then, as the pipeline progresses, update the information in those rows. In order to do this, some type of initialization of the df would have to take place. Maybe the term is dataframe literal. I think this is what you've created here [name: [Bob, Sally], age: [30, 43]] but this one is fully populated. Can I do [name: [], age:[]] and then populate it later in the pipeline?

Author

@fdncred - ah, I think I got it.

There isn't a way to fill in a dataframe, though we could think of creating some API around that like we do for TaggedDictBuilder and related.

Not sure what you mean by populate it later in the pipeline. Since we're passing values through, you'd create a new value. But maybe these helpers would be able to take in a shape and let you fill it in? Seems doable.


Yes, take in a shape and fill it. This may not be exactly functional but once we get to scripts I can easily see initializing a dataframe variable (assuming we have variables) and populating it with various pipelines.

  1. Define a shape with just columns df --define columns 3 name size sum
  2. Now populate it | update sum { ls | get size + accumulator } (bad syntax but hopefully you get the point)
  3. @andrasio probably has examples because he's frequently doing | default wassup 0 | blah | blah | update wassup wassam

The above code could be created using this Nu syntax:

```sh
[name: Thomas, level: 12]
```

If I have 100 rows of data, do I have to repeat the column names for each row? It may be nice to consider something like [columns: [name, level], rows:[[Thomas, 12], [Fred, 15], [Mark, 3]]]. Maybe not that exact syntax but you get my meaning.

Author

I give an example above for how to write a dataframe. This example is about "objects", or hash tables, so we only have one value per column.


I think we must be talking past each other because I really understand what you're saying and I think you didn't understand what I was saying. I'm just showing a possible way of creating a dataframe literal without repeating the column names for every row. I define the column names one time with [columns: [name, level] and then add the rows with [rows:[[Thomas, 12], [Fred, 15], [Mark, 3]]]. [Thomas, 12] is one row, [Fred, 15] is another row, and [Mark, 3] is the last row.

Author

Sorry, you're right. I totally missed what the example was saying. Yeah, we could do some kind of tagging like that to differentiate the headers from the rows.

If we go this route, how would it look when there aren't header values?


With no header values, I think we'd just use indexes like 0 and 1 for a two-column table and be able to do df | get 0 to get the first column's data.

If we want to leave a column blank and did not previously define the columns, using example above, I'd do something like this [rows:[[,12], [,15], [,3]]]. That creates a 2 column 3 row table. The columns are named 0 and 1, indexes, and the first column is blank but the second column is filled in with 12, 15, 3.
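
The rows-plus-optional-columns idea discussed above could be modelled roughly as follows. This is a sketch, not a proposal for the actual API: `Frame` and `frame_from_rows` are hypothetical names, cells are plain strings, and `None` stands in for a blank cell such as the first entry of `[,12]`.

```rust
/// Hypothetical frame type: resolved column names plus row-major data.
#[derive(Debug, PartialEq)]
struct Frame {
    headers: Vec<String>,
    rows: Vec<Vec<Option<String>>>,
}

/// Build a frame from rows. When `columns` is `None`, the columns are
/// named by index ("0", "1", ...), as suggested in the comment above.
fn frame_from_rows(
    columns: Option<Vec<String>>,
    rows: Vec<Vec<Option<String>>>,
) -> Result<Frame, String> {
    // The width comes from the declared columns, else from the first row.
    let width = match &columns {
        Some(names) => names.len(),
        None => rows.first().map_or(0, |r| r.len()),
    };
    // Every row must have the same number of cells.
    if let Some(bad) = rows.iter().position(|r| r.len() != width) {
        return Err(format!(
            "row {} has {} cells, expected {}",
            bad,
            rows[bad].len(),
            width
        ));
    }
    let headers =
        columns.unwrap_or_else(|| (0..width).map(|i| i.to_string()).collect());
    Ok(Frame { headers, rows })
}
```

Passing `None` for `columns` with the three rows `[,12]`, `[,15]`, `[,3]` would give a 2-column, 3-row frame whose headers are `0` and `1`.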

## Pandas data frame

Below is an example of the pandas data frame:


If anyone is interested, this is where pandas defines the DataFrame class. Lots of code here but interesting. https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py

Contributor

@jzaefferer left a comment

Generally looks very interesting to me!


[summary]: #summary

This RFC merges the Row and Table Value types into a single new value type: Frame. Data frames take inspiration from data processing systems like R and Pandas. Data frames will play the fundamental role of modelling data in Nu and will have enough descriptive power to describe all forms of structure, including streaming tables, lists, and objects.
Contributor

I'm completely unfamiliar with R or pandas, and I've never heard the term 'data frame'. Maybe a little more detail or an example could make this summary more accessible?

Author

Yeah, will definitely fill that out. The way I'm using the term here is that it's a 2-dimensional block of data. There are some columns, and these are uniform across all the rows in the block. I think technically data frames are a bit more configurable than that, but I wanted to start with a slightly more restricted definition and adjust from there.

```sh
[1, 2, 3]
```

## Objects (aka hash tables)
Contributor

aka dictionary, map? 'hash table' sounds rather implementation specific to me

```
}
```

**Note:** we use the boolean in the table rather than enumeration because all processing on the frame remains uniform regardless of if the frame is a single row with headers vs an object. This simplifies algorithms to only have to work with the data directly, and we can later represent this data and/or serialize this data in a way that maintains the user's model.
Contributor

we use the boolean in the table

What 'table' is this referring to?

Author

Should be 'data frame'. I'm trying to say here that using a boolean rather than making an enum of the contents allows commands to ignore the object vs row distinction and focus on the data. It's a pretty minor point, admittedly.
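
To illustrate the point, here is a minimal sketch in which the flag only matters for presentation. The field name `is_object`, the integer cells, and both functions are assumptions made for the example; the RFC excerpt quoted in this thread does not show the actual field.

```rust
/// Hypothetical frame: `is_object` is an assumed name for the boolean
/// discussed above. It affects presentation only, not the stored data.
struct Frame {
    headers: Vec<String>,
    rows: Vec<Vec<i64>>,
    is_object: bool,
}

/// A data-processing command: it never looks at `is_object`.
fn sum_column(frame: &Frame, name: &str) -> Option<i64> {
    // Find the column index, or return None if the column is unknown.
    let idx = frame.headers.iter().position(|h| h == name)?;
    // Rows that are too short simply contribute nothing.
    Some(frame.rows.iter().filter_map(|row| row.get(idx)).sum())
}

/// Only presentation or serialization code consults the flag.
fn describe(frame: &Frame) -> &'static str {
    if frame.is_object {
        "object"
    } else {
        "table"
    }
}
```

With an enum, `sum_column` would need a `match` arm per variant even though both arms would do the same work.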

The above code could be created using this Nu syntax:

```sh
[name: [Bob, Sally], age: [30, 43]]
```
Contributor

I think @fdncred's question from below is a better fit here:

If I have 100 rows of data, do I have to repeat the column names for each row? It may be nice to consider something like [columns: [name, level], rows:[[Thomas, 12], [Fred, 15], [Mark, 3]]]. Maybe not that exact syntax but you get my meaning.

The data frame keeps each row separate, but the proposed table syntax groups by column. That's surprising and maybe not enough.

Maybe the column names can use the argument syntax from alias? With new lines:

[
 {name, age},
 [Thomas, 35],
 [Fred, 15]
]

single line: [{name, age},[Thomas, 35],[Fred, 15]]

Author

Potentially, yeah. Something I'm not sure of is whether we should be row-major or column-major inside of the data frame. In practice, we probably filter by column more than row, so grouping column values together internally might make the most sense.

If so, perhaps we reflect that in the syntax.

This feels like something we'll need to actually experiment with to see how it feels in Nu.
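
For a concrete picture of the two layouts, the sketch below regroups row-major data by column. It is only an illustration with string cells; `to_column_major` is a hypothetical helper and says nothing about how Nu would store frames.

```rust
/// Convert row-major data (one inner Vec per row) into column-major
/// form: one (name, values) pair per column.
fn to_column_major(
    headers: &[&str],
    rows: &[Vec<&str>],
) -> Vec<(String, Vec<String>)> {
    headers
        .iter()
        .enumerate()
        .map(|(i, name)| {
            // Gather the i-th cell of every row; short rows are skipped.
            let column: Vec<String> = rows
                .iter()
                .filter_map(|row| row.get(i))
                .map(|cell| cell.to_string())
                .collect();
            (name.to_string(), column)
        })
        .collect()
}
```

Given the headers `name` and `age` and the rows `[Bob, 30]` and `[Sally, 43]`, this produces the same grouping as the `[name: [Bob, Sally], age: [30, 43]]` syntax, so selecting a column becomes a single lookup instead of a pass over every row.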

Contributor

I can imagine that there are going to be two ways (syntaxes) to specify tables in columns, both row-major and column-major, while the internal representation should be more predictable. But yeah, some experiments make sense. Since I want to learn Rust, I might try to build a tiny "table" parser myself. Nothing to wait for 😅

Data frame representation:

```rust
struct DataFrame {
```
Contributor

Since I'm not familiar with nu's current representations, repeating that here for a quick comparison could help.


## Everything is a frame

One alternative is to require everything to live inside of a frame. There are some advantages here: this is seemingly more uniform, but at the risk of overloading the data frame concept.
Contributor

I don't understand this paragraph. What "everything" isn't included in the proposal, for this to be an alternative?

Author

Here "everything" would mean all of the data primitives. In practice, this largely changes what data type would be streamed between commands. Commands would interact with each other firstly with a data frame, so that each step would start with a frame first.

I'm not sure if, in practice, this buys us much simplification, but I wanted to at least mention it.


One alternative is to require everything to live inside of a frame. There are some advantages here: this is seemingly more uniform, but at the risk of overloading the data frame concept.

# Prior art
Contributor

Since streaming seems to be a big motivator for this, I wonder if there's other prior art regarding streams.


As a total outsider, I'd take a look at Apache Arrow here. A lot of their messaging/docs are focused on efficient columnar storage (which I assume is not relevant here), but they have two features that are probably interesting for Nu to learn from:

Author

@alanhdu - thanks for the tip, I'll definitely check these out.


High Level API Docs on Apache Arrow for Rust...

https://docs.rs/arrow/1.0.1/arrow/

Seems to have most of the relevant stuff needed
for generating ideas on how to move forward...


https://github.com/nevi-me/rust-dataframe/blob/master/notes/update-01__04-04-2020.md

Some more thoughts on dataframes in rust using arrow and a dataframe package


- Are there syntactic ambiguities with the proposed syntax? This will require that we support parsing data frames, which includes colons and commas at the end of bare words.
- How do we want to handle partial inner data frames? That is, a data frame that is inside of another data frame.
- How do we handle non-data frames in between data frames? Do all partial data frames have to stream out until complete?
Contributor

What do you consider a non-data frame? As far as I can tell, this proposal doesn't define it.

Author

I'll use a better term here. I meant "data types that aren't data frames", like strings, numbers, etc.


We would like to be able to extend Data Frames further to be able to handle sending snapshots of data at the current time. This allows us to stream updates to existing tables, allowing viewers to animate as data is updated.

We may also elect to add type information to the columns, so that we can maintain a more rigorous internal representation.
Contributor

In that case, maybe the headers should be more than an optional list of strings, so that further information can be added there later.

In JavaScript/JSON the solution is to start with a list of objects, instead of a list of strings, so that more properties can be added to the object later. I guess that can be applied here, too.

Maybe in Rust that would be a HashMap, starting with only a name property?

Author

Agreed. I was hoping we could evolve in that direction rather than trying to figure it out with this RFC. One thing we could do (which I proposed recently) is to create an experimental implementation for data frames and try to add support to a few commands. See how it works in practice, and if it turns out we almost always have the type information there because the source knows it, we can just add it. For example, ls knows all the types of its columns ahead of time, so just do it.
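
A header that starts as just a name but leaves room for more could look something like the sketch below. `Column`, `ColumnType`, `with_type`, and `ls_columns` are hypothetical names chosen for illustration, and the two columns shown are only an example of a source that knows its types up front.

```rust
/// Hypothetical column types; a real list would be longer.
#[derive(Debug, Clone, PartialEq)]
enum ColumnType {
    Int,
    Text,
}

/// A header entry that can grow extra properties later.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    /// `None` until the producing command declares a type.
    ty: Option<ColumnType>,
}

impl From<&str> for Column {
    /// Start with only a name, like today's list of strings.
    fn from(name: &str) -> Self {
        Column { name: name.to_string(), ty: None }
    }
}

impl Column {
    /// Attach type information when the source knows it.
    fn with_type(mut self, ty: ColumnType) -> Self {
        self.ty = Some(ty);
        self
    }
}

/// Illustrative only: a command that knows its column types in advance.
fn ls_columns() -> Vec<Column> {
    vec![
        Column::from("name").with_type(ColumnType::Text),
        Column::from("size").with_type(ColumnType::Int),
    ]
}
```

Because `ty` is optional, commands that cannot know their types keep working with `Column::from("name")` alone.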


Frames also allow us to store values in an unboxed way if we can ensure all the values in a column match, and that this holds for all columns in the frame.

Commands that collect a stream into a list could potentially have the optional to merge together all partial data frames into self-contained data frames for further processing.
Contributor

typo? optional => option

merge together all partial data frames into self-contained data frames for further processing.

Merging partial frames (when the end frame is received) into a single data frame makes sense to me. Though I don't understand the distinction with "self-contained data frames" - how are those different to partial frames? Why would it still be multiple frames, not a single one?

Author

This is a way of saying "a data frame that isn't partial", so all of its data is in that one frame. It would only be the single one, yeah.

Author

sophiajt commented Aug 21, 2020

In chatting with some folks outside of this thread, it sounds like it might be easier to go ahead and implement data frames inside of Nu so we can get more experience with them in practice.

This isn't to say that this will "lock in" data frames as part of the core model, just that it will have time to prove itself out. It also gives us time to experiment with syntax to find the one that feels correct when used in combination with other syntactic forms.

I move that we conditionally accept this RFC enough that we can experimentally implement the feature to learn more. We can opt to remove if it's not a good fit. The hope is that we can get enough information with the experiment that we can revise this RFC with the complete design plus our experience.

If this sounds good, I'll go ahead and move that we implement the proposal and set aside some time to explore it in practice and experiment with syntax as well.

```rust
struct DataFrame {
    headers: Option<Vec<String>>,
    rows: Vec<Vec<Value>>,
```
@obust commented Aug 12, 2020

Pandas dataframes are stored as lists of columns, each of which is an array for column-based arithmetic efficiency.

Maybe more insightful is a document of a hypothetical pandas 2.0 design if pandas was rewritten.
https://dev.pandas.io/pandas2/

Author

Cool, thanks for the heads up! Will definitely check it out

@sophiajt
Author

Just to clarify since some folks were wondering. The vote above is not for landing the RFC. It's for accepting it enough that we can land an experimental implementation into Nu itself and get some experience with different designs before we settle on one.

Once we have, we'll come back to the RFC, report on what we found, and from there we can decide to accept/reject.

@elferherrera

If you are planning to create this dataframe structure, could it be useful to use Arrow as the holding structure for the data? This would allow nu to share data easily with other systems via Arrow IPC. It could also help to create queries on data (parquet or CSV files) using datafusion. I was thinking that querying data from a file could be a nice plugin, but it would be nice if it were a main feature of nu.

@stormasm

@elferherrera yes I agree that doing further research on how Arrow would incorporate into Nushell would be the way to go if we move forward with this approach... We would like to have more people on the team who have expertise or experience using Arrow; so thanks for providing feedback... Also we have a design-discussion channel on discord for further discussion as well...

elferherrera commented Apr 1, 2021

@stormasm I think I can help with this. Unfortunately nu's structure is quite complex and I am trying to get familiar with the whole code and how it works.

stormasm commented Apr 1, 2021

@elferherrera cool ! glad to have you looking at the source code and coming up to speed... best place to reach out would be on discord --- you can find me there --- or here for more details on this particular RFC. Thank you...

@elferherrera

@stormasm would you be able to also consider using polars as the base for this dataframe structure? Or is it overkill for the type of implementation you want?

stormasm commented Apr 2, 2021

https://github.com/ritchie46/polars

@elferherrera is this what you are referring to ?

@elferherrera

@stormasm That's the one. It is a pandas-like implementation of a dataframe using arrow

stormasm commented Apr 2, 2021

@elferherrera you might want to check in with @jonathandturner as well to see how "RFC: DataFrame" would fit into "RFC: Proposal for shipping 1.0"

@sophiajt
Author

Closing as dataframe is now part of Nushell. While we need to explore a bit to find its 1.0 design, it's probably a better place for design than this RFC.

sophiajt closed this Apr 29, 2022
8 participants