Relative names #1619

aljazerzen · 2023-01-24T16:30:04Z

Abstract:
I propose to change references to columns from column to .column.

Reasoning:
I'll try to explain how the resolver works and how I think about the semantics of names and variables in PRQL.

During resolving, there is a major distinction between scoped and ephemeral variables:

Scoped variables have a definition and live as long as their scope exists. For example, std.sum and std.select are global so they exist indefinitely, and function parameters exist only within the function body.
Ephemeral variables are just references into some other argument of the current function call. For example, when you call select, all columns of the relation exist as variables during resolution of the first argument.

It is beneficial to distinguish these two mechanisms, because of their subtle differences. For example, take this query:

func my_transform rel -> (
    rel
    select [alb.title, artist_id]
)

from alb = albums
my_transform

Here, the relation is constructed with from, and within the relation the name alb is assigned to all columns from the table albums. Note that alb is not a "real" value; it's just a namespace for the columns. When this relation is passed to my_transform, it is stored in the rel parameter. rel is now a scoped variable, while alb.title is a reference to one of its columns.
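
To make the distinction concrete, here is a minimal Python sketch of the two lookups. It is not the prql-compiler resolver: the `resolve` function, the shape of `scoped` and `frame`, and the choice to try columns before scoped names are all assumptions made for illustration.

```python
def resolve(name, scoped, frame=None):
    """Resolve `name` against two kinds of variables (illustrative only).

    scoped: dict of variables that have a definition (globals, parameters).
    frame:  dict mapping a namespace such as "alb" to its column names, or
            None when no relation argument is currently being resolved.
    """
    parts = name.split(".")
    if frame is not None:
        # `alb.title`: a column qualified by its namespace.
        if len(parts) == 2 and parts[1] in frame.get(parts[0], ()):
            return ("ephemeral", name)
        # `artist_id`: a bare column, searched in every namespace.
        if len(parts) == 1:
            for namespace, columns in frame.items():
                if name in columns:
                    return ("ephemeral", f"{namespace}.{name}")
    if parts[0] in scoped:
        return ("scoped", name)
    raise NameError(f"unknown name: {name}")


# Inside `my_transform`, `rel` is a parameter, so it is scoped.
scoped = {"std": "<module>", "rel": "<relation>"}
# While the first argument of `select` is resolved, the columns of `rel`
# are visible; `alb` is only a namespace for them.
frame = {"alb": ["title", "artist_id"]}

print(resolve("rel", scoped, frame))        # ('scoped', 'rel')
print(resolve("alb.title", scoped, frame))  # ('ephemeral', 'alb.title')
print(resolve("artist_id", scoped, frame))  # ('ephemeral', 'alb.artist_id')
```

Outside of a call that operates on the relation (`frame=None`), `alb.title` fails to resolve in this sketch, which mirrors the point that alb is not a "real" value.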

I'm not sure if I've explained that well, please tell me if I haven't.

If I compare this behavior with, say, Python and a dataframe library, scoped variables are all normal idents, while ephemeral variables would be represented with strings. This is a bit more verbose and cannot provide good errors, typing or autocomplete. (This is a feature of PRQL that dataframe libraries cannot copy. Only a custom language for relations can construct custom rules for name resolution.)
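
A small sketch of that comparison, using a toy `Frame` class written for this example rather than any real dataframe library (the class, its `select` method and the sample data are all hypothetical):

```python
class Frame:
    """Hypothetical stand-in for a dataframe; not a real library API."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def select(self, *names):
        # Column names arrive as plain strings, so a typo can only be
        # detected here, at run time, once the data is available.
        missing = [name for name in names if name not in self.columns]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        return Frame({name: self.columns[name] for name in names})


# `albums` is a normal identifier (a scoped variable).
albums = Frame({"title": ["A", "B"], "artist_id": [1, 2], "year": [1999, 2004]})

# Columns are referenced with strings rather than identifiers.
picked = albums.select("title", "artist_id")
print(list(picked.columns))  # ['title', 'artist_id']
```

A misspelled `albums` is caught by Python itself, while a misspelled `"title"` is just a string until `select` inspects it.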

So because there is a distinction in resolving, I suggest we add a distinction in syntax:

func my_transform rel -> (
    rel
    select [.alb.title, .artist_id]
)

from alb = albums
my_transform
sort .title

Pros:

distinction in syntax hints to the distinction in resolving
for newcomers, the rule is simple: columns start with a dot

Cons:

additional syntax we could do without

snth · 2023-01-24T21:48:36Z

I quite like the idea of a leading . for columns. I don't really know why yet but it feels like it would bring additional consistency. It also reminds me of JDOT (https://github.com/saulpw/jdot).

TBH, I did not understand the name resolution explanation yet but I will try again in the morning (it's close to midnight now). For example why is it alb.title in my_transform initially and not rel.title? And with the new syntax, why is it still .alb.title and not just .title (or if the alb is required then alb.title)?

Another possible benefit could be that it might disambiguate a column named "from" from the keyword from since the column would be referred to as .from. (IDK if this is currently a problem for the parser/compiler.)

eitsupi · 2023-01-28T14:41:39Z

It also reminds me of JDOT (https://github.com/saulpw/jdot).

Perhaps the origin is jq?
https://stedolan.github.io/jq/

I think jq is a very popular language for writing queries to json.

max-sixty · 2023-02-03T12:27:40Z

Sorry to take a while to respond.

I think I'm understanding 85% of this, so forgive me if I'm slow.

I can see two points here:

discriminate between scoped and ephemeral variables
use .foo for some variables

Re the discriminating: how easy do you think it is to explain when to use a period vs. not to? I worry it's not easy! (But possibly we could make it easier).

Re the periods — I don't have a strong secular objection to it. It would be a big change, and I'm not sure it gets us that much apart from the discrimination. But it is an effective way of allowing columns to be clearly different from functions.

To what extent do you think it's accurate to describe ephemeral variables as just having a scope that's limited to that line?

aljazerzen · 2023-02-03T13:10:19Z

It's just one point here: use .foo for ephemeral variables.

The rule for when to use the dot is simple: columns start with a dot.

describe ephemeral variables as just having a scope that's limited to that line?

That's pretty accurate. But it may be confusing because even though the scope is limited to the current function, an almost identical scope could be created for the next function in the pipeline.

max-sixty · 2023-02-03T20:46:34Z

It's just one point here: use .foo for ephemeral variables.

Totally, but is there an easy way to define ephemeral variables to beginners?

aljazerzen · 2023-02-04T09:35:07Z

I'm saying that for beginners, ephemeral variables can be equivalent to columns. So the whole rule is columns start with a dot. And we don't even mention ephemeral variables.

That's because we don't have anything other than relations that we'd want to have references into. Maybe in the future, we could add support for referencing properties of JSON objects or structs.

max-sixty · 2023-02-04T09:55:14Z

Yes OK, that is complete in the examples above.

How about when it's a variable; for example:

func add a b -> a + b
# or
func add a b -> .a + .b
# or
func add .a .b -> .a + .b

Thanks for bearing with me...

aljazerzen · 2023-02-04T15:34:35Z

Oh, params are scoped variables so they don't need a leading dot. So like this:

func add a b -> a + b

func latest n rel -> (rel | sort [-.changed_at] | take n)

# rel and n are params -> scoped -> no dot
# .changed_at is a column (reference "into" rel) -> ephemeral variable -> dot

max-sixty · 2023-02-04T22:05:59Z

OK great, I see, thanks.

I think it's tractable. I don't think it's that friendly, and it's much more alien for those who are used to SQL.

Do others share a concern that this represents hierarchies inconsistently? For example alb is a relation. But to go into that hierarchy involves adding a period at its start; i.e. .alb.title. Generally to move down a hierarchy we'd only add things onto the end like alb.title or alb["title"]

I think this is insightful, and maybe we should discuss it more in our docs...

If I compare this behavior with, say, Python and a dataframe library, scoped variables are all normal idents, while ephemeral variables would be represented with strings. This is a bit more verbose and cannot provide good errors, typing or autocomplete. (This is a feature of PRQL that dataframe libraries cannot copy. Only a custom language for relations can construct custom rules for name resolution.)

....I've heard this referred to as "bare words". I find it a great advantage of PRQL over something like python. It makes sense that we promote columns to not require quotes, since columns are so important in tabular data; they're almost like variables to us.

As @eitsupi points out, jq uses the .foo syntax, and that's worked well, though they use it all the way down the hierarchy; i.e. .alb, never just alb.

So my current view is:

Has some nice properties
Concern about friendliness / alien-ness (but shouldn't be weighed highly unless this is a consensus view)
Concern about hierarchies

How important do you think it is for the development of the lang? Can we instead have a hierarchy of scopes (like many langs do), and resolve ephemeral variables first, and scoped variable after that?
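
One way to picture that alternative is a chain of scopes searched in a fixed order. The sketch below uses Python's `collections.ChainMap`; the three scopes and their contents are made up for illustration and say nothing about how the compiler is actually structured.

```python
from collections import ChainMap

# Hypothetical scopes, from innermost to outermost.
columns = {"changed_at": "column changed_at", "count": "column count"}
params = {"rel": "parameter rel", "n": "parameter n"}
global_scope = {"count": "std.count (function)", "sum": "std.sum (function)"}

# ChainMap searches its maps left to right: ephemeral variables (columns)
# first, then scoped variables (parameters, then globals).
scope = ChainMap(columns, params, global_scope)

print(scope["changed_at"])  # column changed_at
print(scope["n"])           # parameter n
print(scope["sum"])         # std.sum (function)
print(scope["count"])       # column count
```

With this ordering a column named count shadows the global function of the same name, so the function is only reachable through a qualified name.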

eitsupi · 2023-02-05T16:37:08Z

I recall that in dplyr, it is sometimes difficult to distinguish between variables outside the data frame and column names in the data frame, making the behavior confusing.

cyl <- 10

mtcars |>
  dplyr::mutate(new = cyl * 10)

It can be specified explicitly by .data or .env (but many people rarely do this because it increases the amount of writing).
https://rlang.r-lib.org/reference/dot-data.html

cyl <- 10

mtcars |>
  dplyr::mutate(new = .data$cyl * 10)

I think it is a good balance of clarity and ease of writing to always start column names with a dot.

aljazerzen · 2023-03-01T15:17:00Z

I've implemented the proposal and converted the tests in prql-compiler.

Here are a few examples:

from daily_orders
sort .day
group .month (sort .num_orders | window expanding:true (derive rank))
derive [num_orders_last_week = lag 7 .num_orders]

from employees
derive rn = row_number
filter .rn > 2

from employees
derive age = .year_born - s'now()'
select [
    f"Hello my name is {.prefix}{.first_name} {.last_name}",
    f"and I am {.age} years old."
]

from employees
derive count = 12
select [
    twelve = .count,
    aggregated = count,
    aggregated_verbose = std.count,
]

Here are my findings:

this syntax is more verbose and less beginner-friendly than what we had before,
it simplifies the implementation a bit,
in some cases it is less ambiguous (see last example),
it would be nice for auto-complete, since typing . would bring up just columns for current relation,
there is a bit of inconsistency where we derive new names without the dot, but reference them with the dot,
we can now use .* to refer to all columns of the relation, where before we could not use * (since that would be parsed as multiplication).
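
The "less ambiguous" finding can be sketched in Python as a lookup that picks the namespace from the leading dot alone. The function `resolve_dotted` and the two tables are hypothetical and stand in for the state after `derive count = 12`; this is not the code from the implementation.

```python
def resolve_dotted(name, columns, scoped):
    """Pick the namespace from the leading dot (illustrative only)."""
    if name.startswith("."):
        column = name[1:]
        if column not in columns:
            raise NameError(f"unknown column: {column}")
        return ("column", column)
    if name not in scoped:
        raise NameError(f"unknown name: {name}")
    return ("scoped", scoped[name])


# Columns of the current relation after `derive count = 12`.
columns = {"count"}
# Scoped names: `count` is the short form of `std.count`.
scoped = {"count": "std.count", "std.count": "std.count"}

print(resolve_dotted(".count", columns, scoped))     # ('column', 'count')
print(resolve_dotted("count", columns, scoped))      # ('scoped', 'std.count')
print(resolve_dotted("std.count", columns, scoped))  # ('scoped', 'std.count')
```

Because the dot decides which table is consulted, the column and the function never compete for the same name.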

Possible alternatives:

the leading dot is not required, but just encouraged,
the special leading dot syntax is replaced with a full name, like rel.first_name instead of just .first_name. In this case, the rel. prefix would also be optional.

max-sixty · 2023-03-01T20:13:40Z

Thanks for the list of findings, that's v helpful to anchor around.

Do others share a concern that this represents hierarchies inconsistently? For example alb is a relation. But to go into that hierarchy involves adding a period at its start; i.e. .alb.title. Generally to move down a hierarchy we'd only add things onto the end like alb.title or alb["title"]

Is this still the same for the full path of columns? Or does alb.title work?

I think the .col syntax is fine from a blank slate, but — overall, in the current state I'm fairly strongly -1.

It's a very large change
The benefits don't seem that high. I do weigh compiler simplicity highly, since it lets us move faster with a wider group of contributors. But how great a simplification is it / do we think it would let us do much more much faster? (I might be underweighing the extent of the simplification)
There are some quite sharp corners IMO: the violation of the hierarchy as above, and the lack of coherence between lvalues and rvalues ("there is a bit of inconsistency where we derive new names without the dot, but reference them with the dot,"). I think these could be confusing for newcomers.
- Contrast this with jq, which uses dots but is consistent across these

One lens to view this is what we'd write in the Changelog — I'm not sure what we'd write that I'd feel great about...

aljazerzen · 2023-03-02T11:05:58Z

Do others share a concern that this represents hierarchies inconsistently? For example alb is a relation. But to go into that hierarchy involves adding a period at its start; i.e. .alb.title. Generally to move down a hierarchy we'd only add things onto the end like alb.title or alb["title"]

Is this still the same for the full path of columns? Or does alb.title work?

Actually, this is the confusion that this issue is trying to avoid.

It separates these two cases:

References to things in global scope don't have a leading dot:

let albums = (...)

from albums.title
# `from column` does not make sense, focus on name resolution

References into subject of the current pipeline have a leading dot:

from albums
select .albums.title

So if you are able to refer to albums, you are still able to refer to albums.title.

aljazerzen · 2023-03-02T11:14:03Z

The implementation complexity hasn't changed enough to weigh into the decision here.

And sharp corners that you mention are intentional - a syntactical spotlight of semantics. So they are actually the main benefit. Think of it as the borrow checker in Rust.

But all that said, this change goes strongly against the concise nature of the language we've been able to maintain.

So my vote is -0.5.

snth · 2023-03-05T15:45:12Z

Thanks for trying this out @aljazerzen . Reading through your examples in #1619 (comment) I'm also struck by how there is this inconsistency between rvalues and the lvalues in derive and aggregate. Would it be possible to add the leading . for lvalues as well? (Not saying we should do this as we seemed to be converging on not going ahead with this proposal, just curious if it would be possible in theory since then we could restore consistency?)

Overall, I'm still unclear on the ephemeral vs scoped variables. I was seeing the .col as a shortcut for _frame.col and as such I thought it made some sense. It is quite different to what we/most people know from other SQL/database type systems but I think one could get used to it. The . is a relatively unobtrusive piece of punctuation so I personally don't feel that it gets in the way that much. I would still be open to it if we wanted to explore it more.

aljazerzen · 2023-03-05T18:38:24Z

Would it be possible to add the leading . for lvalues as well?

Yes, and it would be quite easy to do actually.

I'll take the liberty to interpret @snth's comment as a vote of +0. Total tally is -1.5, which means that we will not be adding this feature.

We can revisit it when there are new features that would work well with this.

max-sixty · 2023-03-06T00:03:40Z

Great, thanks for the productive discussion and exploration effort.

max-sixty · 2023-05-01T19:16:12Z

I've been working with jq recently. They have a take on this, but I think with much easier semantics:

All data references use a leading period
The "root" namespace is just .
Then a column would be .date, or a reference into a struct would be .orders.address

So for example, the case above would be:

-from albums
+from .albums
select .albums.title

I think the from X is almost the only thing that changes from the full examples above — since the discriminant is whether it's referring to data, not the exact scope of the data.

max-sixty · 2024-03-03T20:16:22Z

As discussed on the call, I'm not sure my example was correct — instead .albums.title is already within the .albums scope, and so should be:

.title
...or $.albums.title
...or you could allow something like .. to go up a level — ..albums.title

-from albums
+from .albums
-select .albums.title
+select .title
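
The three spellings listed in this comment can be sketched as a small path resolver in Python. Everything here is hypothetical: the `resolve_path` function, the nested-dict model of the data, and the reading of `..` as "one level up" follow the comment's suggestion and are not an existing PRQL or jq feature.

```python
def resolve_path(path, root, current):
    """Resolve a dotted path (illustrative only).

    root:    nested dicts modelling all data, e.g. {"albums": {...}}.
    current: keys leading from the root to the current scope.
    """
    if path.startswith("$."):
        # `$.albums.title`: absolute path from the root.
        keys = path[2:].split(".")
    elif path.startswith(".."):
        # `..albums.title`: go up one level, then descend.
        keys = current[:-1] + path[2:].split(".")
    elif path.startswith("."):
        # `.title`: relative to the current scope.
        keys = current + path[1:].split(".")
    else:
        raise ValueError(f"data references must start with a dot: {path}")
    node = root
    for key in keys:
        node = node[key]
    return node


root = {"albums": {"title": "<column title>", "artist_id": "<column artist_id>"}}
current = ["albums"]  # scope after `from .albums`

print(resolve_path(".title", root, current))          # <column title>
print(resolve_path("$.albums.title", root, current))  # <column title>
print(resolve_path("..albums.title", root, current))  # <column title>
```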

bayareaunicorn · 2024-03-03T22:01:11Z

Great work

max-sixty · 2024-03-04T20:53:19Z

Reopening as this is under consideration again. Possibly we start a new issue synthesizing where we're at, given the amount of history though.

aljazerzen added the language-design (Changes to PRQL-the-language) and needs-discussion (Undecided dilemma) labels Jan 24, 2023

aljazerzen removed the needs-discussion (Undecided dilemma) label Feb 13, 2023

aljazerzen mentioned this issue Mar 1, 2023

feat: relative names #1996

Closed

aljazerzen closed this as completed Mar 5, 2023

aljazerzen mentioned this issue Mar 14, 2023

Column named timestamp escapes backticks #2155

Closed

max-sixty mentioned this issue Aug 7, 2023

Table name and transform name collisions #3271

Open

2 tasks

max-sixty reopened this Mar 4, 2024

max-sixty pinned this issue Mar 4, 2024

aljazerzen mentioned this issue Mar 9, 2024

feat: allow module in ident #4324

Merged
