Add data frame RFC #3

sophiajt · 2020-08-05T02:41:31Z

This RFC adds a new "data frame" concept to Nu's internal data representation, with a corresponding syntactic representation. Data frames replace the existing table/row/inner table model with one that is more compact and in line with industry practice, while still maintaining some of the Nu-specific features like streaming.

text/0000-data-frames.md

thegedge · 2020-08-05T11:19:55Z

text/0003-data-frames.md

+}.into_value());
+
+output.send(UntaggedValue::EndFrame(frame_id).into_value());
+```


It might be good to have some code sample in the above section showing what it would look like on the receiving end. I'm particularly interested in the more complex cases, like frames being nested in frames.

thegedge · 2020-08-05T11:21:25Z

text/0003-data-frames.md

+Some drawbacks come to mind:
+
+- This is a large, non-trivial amount of work. Getting this landed, updating the commands to use the new model, and thoroughly testing will take time.
+- This will break most, if not all, third-party plugins


Hopefully we don't have to do this again anytime soon, but I'm thinking this would be a good opportunity for us to think about deprecations in our protocol. How do we version the plugins, and know what version of the protocol they want? How long do we keep old Value types around before fully deprecating them?

We'll definitely want to add that to the plugin protocol. I don't think it's currently part of it.

thegedge · 2020-08-05T11:22:46Z

text/0003-data-frames.md

+
+Some drawbacks come to mind:
+
+- This is a large, non-trivial amount of work. Getting this landed, updating the commands to use the new model, and thoroughly testing will take time.


Any plan for transitioning slowly, or does this work have to be done as a single unit? No need to document the plan here, but the ability to iterate on this is not immediately obvious to me, so it may be worth describing if (and perhaps how) that would be possible.

One thing we could do is to document how to transition code from one style to another. We could also support both Row and Frame for a time, allowing people to transition off the old protocol while we roll onto the new.

We could also support both Row and Frame for a time, allowing people to transition off the old protocol while we roll onto the new.

I think that'd be the way to go in the future. We would need backwards-incompatible protocol changes to go through an RFC process with a clear timeline. We'll also need to figure out how to communicate that to the nushell community 🙂

thegedge · 2020-08-05T11:58:40Z

text/0003-data-frames.md

+
+[unresolved-questions]: #unresolved-questions
+
+- Are there syntactic ambiguities with the proposed syntax? This will require that we support parsing data frames, which includes colons and commas at the end of bare words.


Arguably that could break things for some users, but the idea of not being 1.0 yet is that we're still trying to figure things out. I doubt it will break much.

That being said, I've been wanting to put together a description of our grammar. Calling out what you're adding and what would break in a grammar would make this super clear 🙂

I'll add a section about this.

thegedge · 2020-08-05T11:59:25Z

text/0003-data-frames.md

+
+- Are there syntactic ambiguities with the proposed syntax? This will require that we support parsing data frames, which includes colons and commas at the end of bare words.
+- How do we want to handle partial inner data frames? That is, a data frame that is inside of another data frame.
+- How do we handle non-data frames in between data frames? Do all partial data frames have to stream out until complete?


Ideally, no, but we'd probably have to relay information back through the stream to allow that. Probably an RFC on its own 🙂

thegedge · 2020-08-05T12:05:26Z

text/0003-data-frames.md

+- The top-level rows represent a table of rows, but it's unclear how to represent a top-level list of strings vs a stream of strings.
+- A similar ambiguity exists between an "object" (a data structure denoted by key/value pairs) and a table of one row
+- Inner-tables are modelled differently than top-level tables, leading to confusion
+- There is no way to currently represent a matrix


Arguably, this could be a matrix:

echo [[1 2] [3 4]]

Not saying it'd be easy to work with, but I think it's representable 🙂

lol, true. I guess a real matrix vs a list of lists. I could call that out

thegedge · 2020-08-05T12:09:25Z

text/0003-data-frames.md

+
+[motivation]: #motivation
+
+The current system has a few unexpected limitations:


The inlining of nested tables is a limitation right now too, correct? If the nested table is incredibly large, we could easily run out of memory since it doesn't get streamed.

Arguably we could solve this without data frames, but it seems like what's being proposed here will potentially solve that problem?

I'll add that to the list.

Yes, this protocol lets us stream inner tables also, so you could get the initial structure, and remember where the inner tables are, then read the contents of those inner tables from the stream.

thegedge · 2020-08-05T12:11:10Z

text/0003-data-frames.md

+    length: vec![
+        vec![ Value::from("head"), Value::from(1024)]
+    ],
+    partial_frame_id: Some(frame_id),


If we stream nested frames, will different ids be interleaved? Will the onus be on commands to track that? Will there be helper methods/structs for dealing with that? Maybe a light discussion on that.

Yeah, we'd probably want some helper methods. Will have to think about that more.
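To make the interleaving question concrete, here is a minimal sketch of what such a helper might look like on the receiving end. `FrameEvent`, `FrameAssembler`, and the `String` cells are hypothetical stand-ins, not part of the RFC or of Nu's `Value` type. The sketch only assumes that partial frames carry a frame id and that an end marker, like the `EndFrame(frame_id)` in the diff above, closes them.

```rust
use std::collections::HashMap;

/// Hypothetical stream events: a stand-in for partial frames and the end marker.
#[derive(Debug, Clone, PartialEq)]
enum FrameEvent {
    /// A chunk of rows belonging to the frame with this id.
    Partial { frame_id: u64, rows: Vec<Vec<String>> },
    /// No more chunks will arrive for this id.
    End { frame_id: u64 },
}

/// Buffers rows per frame id, so chunks of different frames may arrive interleaved.
#[derive(Default)]
struct FrameAssembler {
    pending: HashMap<u64, Vec<Vec<String>>>,
}

impl FrameAssembler {
    /// Feed one event; returns the id and collected rows once a frame ends.
    fn push(&mut self, event: FrameEvent) -> Option<(u64, Vec<Vec<String>>)> {
        match event {
            FrameEvent::Partial { frame_id, rows } => {
                self.pending.entry(frame_id).or_default().extend(rows);
                None
            }
            FrameEvent::End { frame_id } => {
                // An end marker for an id we never saw yields an empty frame.
                let rows = self.pending.remove(&frame_id).unwrap_or_default();
                Some((frame_id, rows))
            }
        }
    }
}
```

With a helper of this shape, a command would not need to track ids itself: it feeds every incoming event to `push` and acts only when a completed frame comes back. Nested frames would need an extra link from a cell to the inner frame's id, which this sketch leaves out.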

fdncred · 2020-08-05T13:05:50Z

text/0003-data-frames.md

+The above code could be created using this Nu syntax:
+
+```sh
+[name: [Bob, Sally], age: [30, 43]]


How would you express this in nu syntax, and/or the above in rust syntax, if you were to state the shape without data? For instance, to say that Windows ls will always have 4 columns and n rows and Windows ls -l will always have 8 columns and n rows.

Alternatively, is there a constructor that says this df is 5 rows and 6 columns?

At some point, I expect we'll have variables that can hold a dataframe. It's hard for me to visualize how this will work in a streaming environment where things are built up and torn down in a pipeline.

@fdncred - for the first question, I think you're asking "how do you write types in Nu?" We'll probably need a separate RFC for that, as types will be their own topic.

Or maybe you're asking how we handle matrices and how this differs from a list of lists?

I was asking about initializing a dataframe with a predetermined shape as ls may have. ls -l on Windows will have a predetermined amount of columns.

One could think of making dataframes with 2 columns and 3 rows as an empty dataframe except with column names, and then, as the pipeline progresses, update the information in those rows. In order to do this, some type of initialization of the df would have to take place. Maybe the term is dataframe literal. I think this is what you've created here [name: [Bob, Sally], age: [30, 43]] but this one is fully populated. Can I do [name: [], age:[]] and then populate it later in the pipeline?

@fdncred - ah, I think I got it.

There isn't a way to fill in a dataframe, though we could think of creating some API around that like we do for TaggedDictBuilder and related.

Not sure what you mean by populate it later in the pipeline. Since we're passing values through, you'd create a new value. But maybe these helpers would be able to take in a shape and let you fill it in? Seems doable.

Yes, take in a shape and fill it. This may not be exactly functional, but once we get to scripts I can easily see initializing a dataframe variable (assuming we have variables) and populating it with various pipelines.

Define a shape with just columns df --define columns 3 name size sum

Now populate it | update sum { ls | get size + accumulator } (bad syntax but hopefully you get the point)

@andrasio probably has examples because he's frequently doing | default wassup 0 | blah | blah | update wassup wassam

fdncred · 2020-08-05T13:11:05Z

text/0003-data-frames.md

+The above code could be created using this Nu syntax:
+
+```sh
+[name: Thomas, level: 12]


If I have 100 rows of data, do I have to repeat the column names for each row? It may be nice to consider something like [columns: [name, level], rows:[[Thomas, 12], [Fred, 15], [Mark, 3]]]. Maybe not that exact syntax but you get my meaning.

I give an example above for how to write a dataframe. This example is about "objects", or hash tables, so we only have one value per column.

I think we must be talking past each other because I really understand what you're saying and I think you didn't understand what I was saying. I'm just showing a possible way of creating a dataframe literal without repeating the column names for every row. I define the column names one time with [columns: [name, level] and then add the rows with [rows:[[Thomas, 12], [Fred, 15], [Mark, 3]]]. [Thomas, 12] is one row, [Fred, 15] is another row, and [Mark, 3] is the last row.

Sorry, you're right. I totally missed what the example was saying. Yeah, we could do some kind of tagging like that to differentiate the headers from the rows.

If we go this route, how would it look when there aren't header values?

With no header values, I think we'd just use indexes like 0 and 1 for a two-column table and be able to do df | get 0 to get the first column's data.

If we want to leave a column blank and did not previously define the columns, using example above, I'd do something like this [rows:[[,12], [,15], [,3]]]. That creates a 2 column 3 row table. The columns are named 0 and 1, indexes, and the first column is blank but the second column is filled in with 12, 15, 3.
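The columns/rows idea in this thread, including index-named columns and blank cells, could be sketched roughly as follows. Everything here is hypothetical: `Frame`, its `new` and `get` methods, and the choice of `Option<String>` for cells are illustration only and are not taken from the RFC.

```rust
/// Hypothetical frame; a cell is `None` when it was left blank, as in `[,12]`.
#[derive(Debug, PartialEq)]
struct Frame {
    headers: Vec<String>,
    rows: Vec<Vec<Option<String>>>,
}

impl Frame {
    /// Build a frame from optional column names plus rows.
    /// Without names, columns are named by index ("0", "1", ...).
    fn new(columns: Option<Vec<String>>, rows: Vec<Vec<Option<String>>>) -> Result<Frame, String> {
        let width = match &columns {
            Some(names) => names.len(),
            None => rows.first().map_or(0, |r| r.len()),
        };
        // Every row must have exactly one cell per column.
        if let Some(bad) = rows.iter().position(|r| r.len() != width) {
            return Err(format!("row {} has {} cells, expected {}", bad, rows[bad].len(), width));
        }
        let headers = columns.unwrap_or_else(|| (0..width).map(|i| i.to_string()).collect());
        Ok(Frame { headers, rows })
    }

    /// Roughly what `df | get 0` would do: return one column's cells by name.
    fn get(&self, name: &str) -> Option<Vec<Option<String>>> {
        let idx = self.headers.iter().position(|h| h == name)?;
        Some(self.rows.iter().map(|r| r[idx].clone()).collect())
    }
}
```

Under these assumptions, `Frame::new(None, rows)` with the three rows `[,12]`, `[,15]`, `[,3]` gives a frame whose headers are `0` and `1`, and `get("0")` returns three blank cells.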


fdncred · 2020-08-05T13:28:45Z

text/0003-data-frames.md

+## Pandas data frame
+
+Below is an example of the pandas data frame:
+


If anyone is interested, this is where pandas defines the DataFrame class. Lots of code here but interesting. https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py


jzaefferer

Generally looks very interesting to me!

jzaefferer · 2020-08-08T08:53:42Z

text/0003-data-frames.md

+
+[summary]: #summary
+
+This RFC merges the Row and Table Value types into a single new value type: Frame. Data frames take inspiration from data processing systems like R and Pandas. Data frames will play the fundamental role of modelling data in Nu and will have enough descriptive power to describe all forms of structure, including streaming tables, lists, and objects.


I'm completely unfamiliar with R or pandas, and I've never heard the term 'data frame'. Maybe a little more detail or an example could make this summary more accessible?

Yeah, will definitely fill that out. The way I'm using the term here is that it's a 2-dimensional block of data. There are some columns, and these are uniform across all the rows in the block. I think technically data frames are a bit more configurable than that, but I wanted to start with a slightly more restricted definition and adjust from there.

jzaefferer · 2020-08-08T09:00:14Z

text/0003-data-frames.md

+[1, 2, 3]
+```
+
+## Objects (aka hash tables)


aka dictionary, map? 'hash table' sounds rather implementation specific to me

jzaefferer · 2020-08-08T09:01:19Z

text/0003-data-frames.md

+}
+```
+
+**Note:** we use the boolean in the table rather than enumeration because all processing on the frame remains uniform regardless of if the frame is a single row with headers vs an object. This simplifies algorithms to only have to work with the data directly, and we can later represent this data and/or serialize this data in a way that maintains the user's model.


we use the boolean in the table

What 'table' is this referring to?

Should be 'data frame'. I'm trying to say here that using a boolean rather than making an enum of the contents allows commands to ignore the object vs row distinction and focus on the data. It's a pretty minor point, admittedly.
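The boolean-versus-enum point can be illustrated with a small sketch. The `Frame` struct, its `is_object` field, and the `select` function below are hypothetical names chosen for illustration; the RFC's actual field names are not shown in this thread.

```rust
/// Hypothetical frame carrying an "object" flag instead of being an enum of shapes.
#[derive(Debug, Clone, PartialEq)]
struct Frame {
    headers: Vec<String>,
    rows: Vec<Vec<String>>,
    /// True when the user wrote an object such as `[name: Thomas, level: 12]`.
    is_object: bool,
}

/// Keeps a single column. The logic never branches on `is_object`,
/// so objects and one-row tables go through the same code path.
fn select(frame: &Frame, column: &str) -> Option<Frame> {
    let idx = frame.headers.iter().position(|h| h == column)?;
    Some(Frame {
        headers: vec![frame.headers[idx].clone()],
        rows: frame.rows.iter().map(|r| vec![r[idx].clone()]).collect(),
        // Carried through untouched, for later display or serialization.
        is_object: frame.is_object,
    })
}
```

Had the object and table cases been separate enum variants, `select` would need a `match` with two nearly identical arms. With the flag, only code that renders or serializes the frame has to look at it.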

jzaefferer · 2020-08-08T09:07:39Z

text/0003-data-frames.md

+The above code could be created using this Nu syntax:
+
+```sh
+[name: [Bob, Sally], age: [30, 43]]


I think @fdncred's question from below is a better fit here:

If I have 100 rows of data, do I have to repeat the column names for each row? It may be nice to consider something like [columns: [name, level], rows:[[Thomas, 12], [Fred, 15], [Mark, 3]]]. Maybe not that exact syntax but you get my meaning.

The data frame keeps each row separate, but the proposed table syntax groups by column. That's surprising and maybe not enough.

Maybe the column names can use the argument syntax from alias? With new lines:

[ {name, age}, [Thomas, 35], [Fred, 15] ]

single line: [{name, age},[Thomas, 35],[Fred, 15]]

Potentially, yeah. Something I'm not sure of is whether we should be row-major or column-major inside of the data frame. In practice, we probably filter by column more than row, so grouping column values together internally might make the most sense.

If so, perhaps we reflect that in the syntax.

This feels like something we'll need to actually experiment with to see how it feels in Nu.

I can imagine that there's going to be two ways (syntax) to specify tables in columns, both row-major and column-major, while the internal representation should be more predictable. But yeah, some experiments make sense. Since I want to learn Rust, I might try to build a tiny "table" parser myself. Nothing to wait for 😅
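As one possible way to experiment with the row-major versus column-major question, the sketch below accepts row-major input, as in `[{name, age}, [Thomas, 35], [Fred, 15]]`, and stores it column-major. `ColumnFrame`, `from_rows`, and `column` are hypothetical names, and cells are plain `String`s to keep the example short.

```rust
/// Hypothetical column-major frame: one `Vec` per column.
#[derive(Debug, PartialEq)]
struct ColumnFrame {
    headers: Vec<String>,
    columns: Vec<Vec<String>>,
}

/// Transpose row-major input into column-major storage.
/// Returns `None` if any row does not have one cell per header.
fn from_rows(headers: Vec<String>, rows: Vec<Vec<String>>) -> Option<ColumnFrame> {
    let mut columns: Vec<Vec<String>> = vec![Vec::new(); headers.len()];
    for row in rows {
        if row.len() != headers.len() {
            return None;
        }
        // Append each cell of the row to its column.
        for (column, cell) in columns.iter_mut().zip(row) {
            column.push(cell);
        }
    }
    Some(ColumnFrame { headers, columns })
}

impl ColumnFrame {
    /// Selecting a column is one lookup and a borrow, with no per-row work.
    fn column(&self, name: &str) -> Option<&[String]> {
        let idx = self.headers.iter().position(|h| h == name)?;
        Some(self.columns[idx].as_slice())
    }
}
```

This keeps the two concerns separate: the literal syntax can be whichever feels best in Nu, while the internal layout is chosen for the operations that are most common.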

jzaefferer · 2020-08-08T09:08:33Z

text/0003-data-frames.md

+Data frame representation:
+
+```rust
+struct DataFrame {


Since I'm not familiar with nu's current representations, repeating that here for a quick comparison could help.

jzaefferer · 2020-08-08T09:15:52Z

text/0003-data-frames.md

+
+## Everything is a frame
+
+One alternative is to require everything to live inside of a frame. There are some advantages here: this is seemingly more uniform, but at the risk of overloading the data frame concept.


I don't understand this paragraph. What "everything" isn't included in the proposal, for this to be an alternative?

Here "everything" would mean all of the data primitives. In practice, this largely changes what data type would be streamed between commands. Commands would interact with each other firstly with a data frame, so that each step would start with a frame first.

I'm not sure if, in practice, this buys us much simplification, but I wanted to at least mention it.

jzaefferer · 2020-08-08T09:19:20Z

text/0003-data-frames.md

+
+One alternative is to require everything to live inside of a frame. There are some advantages here: this is seemingly more uniform, but at the risk of overloading the data frame concept.
+
+# Prior art


Since streaming seems to be a big motivator for this, I wonder if there's other prior art regarding streams.

As a total outsider, I'd take a look at Apache Arrow here. A lot of their messaging/docs are focused on efficient columnar storage (which I assume is not relevant here), but they have two features that are probably interesting for Nu to learn from:

Good support for "nested" types (e.g. arbitrary JSON, or ragged arrays): see https://arrow.apache.org/docs/python/data.html#type-metadata and https://arrow.apache.org/docs/format/Columnar.html

A streaming format (although it's tuned towards RPC): see https://arrow.apache.org/docs/python/ipc.html and https://arrow.apache.org/docs/format/Flight.html

@alanhdu - thanks for the tip, I'll definitely check these out.

High Level API Docs on Apache Arrow for Rust...

https://docs.rs/arrow/1.0.1/arrow/

Seems to have most of the relevant stuff needed for generating ideas on how to move forward...

https://github.com/nevi-me/rust-dataframe/blob/master/notes/update-01__04-04-2020.md

Some more thoughts on dataframes in rust using arrow and a dataframe package

jzaefferer · 2020-08-08T09:21:40Z

text/0003-data-frames.md

+
+- Are there syntactic ambiguities with the proposed syntax? This will require that we support parsing data frames, which includes colons and commas at the end of bare words.
+- How do we want to handle partial inner data frames? That is, a data frame that is inside of another data frame.
+- How do we handle non-data frames in between data frames? Do all partial data frames have to stream out until complete?


What do you consider a non-data frame? As far as I can tell, this proposal doesn't define it.

I'll use a better term here. I meant "data types that aren't data frames", like strings, numbers, etc.

jzaefferer · 2020-08-08T09:25:08Z

text/0003-data-frames.md

+
+We would like to be able to extend Data Frames further to be able to handle sending snapshots of data at the current time. This allows us to stream updates to existing tables, allowing viewers to animate as data is updated.
+
+We may also elect to add type information to the columns, so that we can maintain a more rigorous internal representation.


In that case, maybe the headers should be more than an optional list of strings, so that further information can be added there later.

In JavaScript/JSON the solution is to start with a list of objects, instead of a list of strings, so that more properties can be added to the object later. I guess that can be applied here, too.

Maybe in Rust that would be a HashMap, starting with only a name property?

Agreed. I was hoping we could evolve in that direction rather than trying to figure it out with this RFC. One thing we could do (which I proposed recently) is to create an experimental implementation for data frames and try to add support to a few commands. See how it works in practice, and if it turns out we almost always have the type information there because the source knows it, we can just add it. For example, ls knows all the types of its columns ahead of time, so just do it.
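A rough sketch of the "headers as more than strings" suggestion, assuming a struct rather than a `HashMap` so that new properties are checked at compile time. `ColumnHeader`, `ColumnType`, and its variants are hypothetical; the RFC does not define a list of column types.

```rust
/// Hypothetical column types, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum ColumnType {
    Text,
    Int,
    Filesize,
}

/// A header that starts with only a name and can grow more properties later.
#[derive(Debug, Clone, PartialEq)]
struct ColumnHeader {
    name: String,
    /// `None` when the source does not know the column's type.
    ty: Option<ColumnType>,
}

impl ColumnHeader {
    fn untyped(name: &str) -> Self {
        ColumnHeader { name: name.to_string(), ty: None }
    }

    fn typed(name: &str, ty: ColumnType) -> Self {
        ColumnHeader { name: name.to_string(), ty: Some(ty) }
    }
}

/// Upgrade a plain list of header names without inventing type information.
fn from_names(names: &[&str]) -> Vec<ColumnHeader> {
    names.iter().map(|name| ColumnHeader::untyped(name)).collect()
}
```

A command like `ls`, which knows its column types up front, could build its headers with `ColumnHeader::typed`, while commands that only know names keep using `from_names`.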

jzaefferer · 2020-08-08T09:27:10Z

text/0003-data-frames.md

+
+Frames also allow us to store values in an unboxed way if we can ensure all the values in a column match, and that this holds for all columns in the frame.
+
+Commands that collect a stream into a list could potentially have the optional to merge together all partial data frames into self-contained data frames for further processing.


typo? optional => option

merge together all partial data frames into self-contained data frames for further processing.

Merging partial frames (when the end frame is received) into a single data frame makes sense to me. Though I don't understand the distinction with "self-contained data frames" - how are those different to partial frames? Why would it still be multiple frames, not a single one?

This is a way of saying "a data frame that isn't partial", so all of its data is in that one frame. It would only be the single one, yeah.

sophiajt · 2020-08-21T02:55:27Z

In chatting with some folks outside of this thread, it sounds like it might be easier to go ahead and implement data frames inside of Nu so we can get more experience with them in practice.

This isn't to say that this will "lock in" data frames as part of the core model, just that it will have time to prove itself out. It also gives us time to experiment with syntax to find the one that feels correct when used in combination with other syntactic forms.

I move that we conditionally accept this RFC enough that we can experimentally implement the feature to learn more. We can opt to remove if it's not a good fit. The hope is that we can get enough information with the experiment that we can revise this RFC with the complete design plus our experience.

If this sounds good, I'll go ahead and move that we implement the proposal and set aside some time to explore it in practice and experiment with syntax as well.

obust · 2020-08-12T07:20:06Z

text/0003-data-frames.md

+```rust
+struct DataFrame {
+    headers: Option<Vec<String>>,
+    rows: Vec<Vec<Value>>,


Pandas dataframes are stored as lists of columns, each of which is an array, for column-based arithmetic efficiency.

Perhaps more insightful is the design document for a hypothetical pandas 2.0, describing how pandas might look if it were rewritten:
https://dev.pandas.io/pandas2/

Cool, thanks for the heads up! Will definitely check it out

sophiajt · 2020-08-21T19:24:27Z

Just to clarify since some folks were wondering. The vote above is not for landing the RFC. It's for accepting it enough that we can land an experimental implementation into Nu itself and get some experience with different designs before we settle on one.

Once we have, we'll come back to the RFC, report on what we found, and from there we can decide to accept/reject.

elferherrera · 2021-03-31T08:01:02Z

If you are planning to create this dataframe structure, could it be useful to use Arrow as the holding structure for the data? This would allow nu to share data easily with other systems via Arrow IPC. It could also help to create queries on data (parquet or CSV files) using datafusion. I was thinking that querying data from a file could be a nice plugin, but it would be nice if it were a main feature of nu.

stormasm · 2021-03-31T15:37:11Z

@elferherrera yes I agree that doing further research on how Arrow would incorporate into Nushell would be the way to go if we move forward with this approach... We would like to have more people on the team who have expertise or experience using Arrow; so thanks for providing feedback... Also we have a design-discussion channel on discord for further discussion as well...

elferherrera · 2021-04-01T09:20:14Z

@stormasm I think I can help with this. Unfortunately, nu's structure is quite complex, and I am still trying to get familiar with the whole codebase and how it works.

stormasm · 2021-04-01T16:58:24Z

@elferherrera cool ! glad to have you looking at the source code and coming up to speed... best place to reach out would be on discord --- you can find me there --- or here for more details on this particular RFC. Thank you...

elferherrera · 2021-04-02T09:18:47Z

@stormasm would you be able to also consider using polars as the base for this dataframe structure? Or is it overkill for the type of implementation you want?

stormasm · 2021-04-02T16:22:47Z

https://github.com/ritchie46/polars

@elferherrera is this what you are referring to ?

elferherrera · 2021-04-02T16:25:59Z

@stormasm That's the one. It is a pandas-like implementation of a dataframe using arrow

stormasm · 2021-04-02T18:07:31Z

@elferherrera you might want to check in with @jonathandturner as well to see how "RFC: DataFrame" would fit into "RFC: Proposal for shipping 1.0"

sophiajt · 2022-04-29T01:11:12Z

Closing as dataframe is now part of Nushell. While we need to explore a bit to find its 1.0 design, it's probably a better place for design than this RFC.

sophiajt and others added 3 commits August 5, 2020 14:39

Add data frame RFC

5862e27

Update and rename 0000-data-frames.md to 0003-data-frames.md

855fc79

Update 0003-data-frames.md

6b7d41d

thegedge reviewed Aug 5, 2020


fdncred reviewed Aug 5, 2020


thegedge mentioned this pull request Aug 6, 2020

pipeline does not preserve JSON structure nushell/nushell#2295

Closed

jzaefferer reviewed Aug 8, 2020


obust reviewed Aug 21, 2020


sophiajt mentioned this pull request Sep 11, 2020

Var args in alias nushell/nushell#2486

Closed

thegedge mentioned this pull request Sep 15, 2020

Type deduction RFC #4

Merged

sophiajt closed this Apr 29, 2022

