Relative names #1619

Open
aljazerzen opened this issue Jan 24, 2023 · 21 comments
Labels
language-design Changes to PRQL-the-language

Comments

@aljazerzen
Member

Abstract:
I propose to change references to columns from column to .column.

Reasoning:
I'll try to explain how the resolver works and how I think about the semantics of names and variables in PRQL.

During resolving, there is a major distinction between scoped and ephemeral variables:

  • Scoped variables have a definition and live for as long as their scope exists. For example, std.sum and std.select are global so they exist indefinitely, while function parameters exist only within the function body.
  • Ephemeral variables are just references into some other argument of a current function call. For example, when you call select, all columns of the relation exist as variables during resolution of the first argument.
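
To make the two lifetimes concrete, here is a minimal Python sketch of a toy resolver. It is not the prql-compiler implementation: the `Resolver` class, its method names and the column names are all hypothetical, chosen only to show that scoped names live on a stack of scopes while columns are visible only while one transform's arguments are being resolved.

```python
from contextlib import contextmanager


class Resolver:
    """Toy name resolver: a stack of scopes plus an optional relation frame."""

    def __init__(self):
        # Scoped variables: the bottom scope is global and is never popped.
        self.scopes = [{"std.sum": "function", "std.select": "function"}]
        # Ephemeral variables: columns of the relation currently being transformed.
        self.frame = None

    @contextmanager
    def function_body(self, params):
        """Parameters exist only while the function body is being resolved."""
        self.scopes.append({p: "param" for p in params})
        try:
            yield
        finally:
            self.scopes.pop()

    @contextmanager
    def transform_args(self, columns):
        """Columns exist only while the arguments of one transform are resolved."""
        previous, self.frame = self.frame, set(columns)
        try:
            yield
        finally:
            self.frame = previous

    def resolve(self, name):
        if self.frame is not None and name in self.frame:
            return "column"
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(name)


resolver = Resolver()
with resolver.function_body(["rel"]):
    with resolver.transform_args(["alb.title", "artist_id"]):
        inside = resolver.resolve("alb.title"), resolver.resolve("rel")
    # The transform call has ended: its columns are gone, but rel is still in scope.
    after_call = resolver.resolve("rel")
```

Once both blocks have exited, only the global names such as `std.sum` still resolve; asking for `alb.title` or `rel` raises `NameError`.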

It is beneficial to distinguish these two mechanisms because of their subtle differences. For example, take this query:

func my_transform rel -> (
    rel
    select [alb.title, artist_id]
)

from alb = albums
my_transform

Here, a relation is constructed with from, and within that relation the name alb is assigned to all columns from the table albums. Note that alb is not a "real" value, it's just a namespace for the columns. When this relation is passed to my_transform, it is stored in the rel parameter. rel is now a scoped variable, while alb.title is a reference to one of its columns.

I'm not sure if I've explained that well, please tell me if I haven't.

If I compare this behavior with, say, Python and a dataframe library, scoped variables are all normal idents, while ephemeral variables would be represented with strings. This is a bit more verbose and cannot provide good errors, typing or autocomplete. (This is a feature of PRQL that dataframe libraries cannot copy. Only a custom language for relations can construct custom rules for name resolution.)
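
The comparison with string-based column references can be illustrated with a small, library-free Python sketch. Everything in it is made up for illustration: `select` stands in for a dataframe-style API, `check_columns` stands in for a resolver that knows the schema, and the row data is invented.

```python
import difflib

SCHEMA = ["title", "artist_id"]  # hypothetical columns of `albums`


def select(rows, names):
    """Dataframe-style select: columns are plain strings, checked only at run time."""
    return [{name: row[name] for name in names} for row in rows]


def check_columns(names, schema):
    """Resolver-style check: validate names against the schema before running."""
    errors = []
    for name in names:
        if name not in schema:
            hint = difflib.get_close_matches(name, schema, n=1)
            errors.append((name, hint[0] if hint else None))
    return errors


rows = [{"title": "Kind of Blue", "artist_id": 1}]

# The misspelled string is valid Python, so the mistake only surfaces as a
# KeyError once select() actually touches the data.
try:
    select(rows, ["title", "artist_idd"])
    runtime_error = None
except KeyError as exc:
    runtime_error = exc.args[0]

# With the schema known up front, the same mistake can be reported with a hint.
static_errors = check_columns(["title", "artist_idd"], SCHEMA)
```

In this sketch the string version fails only when a row is read, while the schema-aware check can name the likely intended column (`artist_id`) without touching any data.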

So because there is a distinction in resolving, I suggest we add a distinction in syntax:

func my_transform rel -> (
    rel
    select [.alb.title, .artist_id]
)

from alb = albums
my_transform
sort .title

Pros:

  • distinction in syntax hints to the distinction in resolving
  • for newcomers, the rule is simple: columns start with a dot

Cons:

  • additional syntax we could do without
@aljazerzen aljazerzen added language-design Changes to PRQL-the-language needs-discussion Undecided dilemma labels Jan 24, 2023
@snth
Member

snth commented Jan 24, 2023

I quite like the idea of a leading . for columns. I don't really know why yet but it feels like it would bring additional consistency. It also reminds me of JDOT (https://github.com/saulpw/jdot).

TBH, I did not understand the name resolution explanation yet but I will try again in the morning (it's close to midnight now). For example why is it alb.title in my_transform initially and not rel.title? And with the new syntax, why is it still .alb.title and not just .title (or if the alb is required then alb.title)?

Another possible benefit could be that it might disambiguate a column named "from" from the keyword from since the column would be referred to as .from. (IDK if this is currently a problem for the parser/compiler.)

@eitsupi
Member

eitsupi commented Jan 28, 2023

It also reminds me of JDOT (https://github.com/saulpw/jdot).

Perhaps the origin is jq?
https://stedolan.github.io/jq/

I think jq is a very popular language for writing queries to json.

@max-sixty
Member

Sorry to take a while to respond.

I think I'm understanding 85% of this, so forgive me if I'm slow.

I can see two points here:

  • discriminate between scoped and ephemeral variables
  • use .foo for some variables

Re the discriminating: how easy do you think it is to explain when to use a period vs. not to? I worry it's not easy! (But possibly we could make it easier).

Re the periods: I don't have a strong secular objection to it. It would be a big change, and I'm not sure it gets us that much apart from the discrimination. But it is an effective way of allowing columns to be clearly different from functions.


To what extent do you think it's accurate to describe ephemeral variables as just having a scope that's limited to that line?

@aljazerzen
Member Author

It's just one point here: use .foo for ephemeral variables.

The rule for when to use the dot is simple: columns start with a dot.


describe ephemeral variables as just having a scope that's limited to that line?

That's pretty accurate. But it may be confusing because, even though the scope is limited to the current function, an almost identical scope could be created for the next function in the pipeline.

@max-sixty
Member

It's just one point here: use .foo for ephemeral variables.

Totally, but is there an easy way to define ephemeral variables for beginners?

@aljazerzen
Member Author

I'm saying that for beginners, ephemeral variables can be equivalent to columns. So the whole rule is columns start with a dot. And we don't even mention ephemeral variables.

That's because we don't have anything other than relations that we'd want to have references into. Maybe in the future, we could add support for referencing properties of JSON objects or structs.

@max-sixty
Member

Yes OK, that is complete in the examples above.

How about when it's a variable; for example:

func add a b -> a + b
# or
func add a b -> .a + .b
# or
func add .a .b -> .a + .b

Thanks for bearing with me...

@aljazerzen
Member Author

Oh, params are scoped variables so they don't need a leading dot. So like this:

func add a b -> a + b

func latest n rel -> (rel | sort [-.changed_at] | take n)

# rel and n are params -> scoped -> no dot
# .changed_at is a column (reference "into" rel) -> ephemeral variable -> dot 
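
The rule in the comments above can be sketched as a tiny Python resolver. This is only an illustration of the proposed rule, not compiler code: the `resolve` function, the contents of `scoped` and the column names are assumptions.

```python
def resolve(ident, scoped, columns):
    """Resolve one identifier under the proposed rule: a leading dot means column."""
    if ident.startswith("."):
        name = ident[1:]
        if name not in columns:
            raise NameError(f"unknown column: {ident}")
        return ("column", name)
    if ident not in scoped:
        raise NameError(f"unknown variable: {ident}")
    return ("scoped", ident)


# Resolving the body of `func latest n rel -> (rel | sort [-.changed_at] | take n)`.
scoped = {"n", "rel", "sort", "take"}  # params plus functions assumed to be in scope
columns = {"id", "changed_at"}         # hypothetical columns of `rel`

resolved = [resolve(ident, scoped, columns) for ident in ["rel", ".changed_at", "n"]]
```

Under this rule the two namespaces never overlap: writing `changed_at` without the dot is looked up only among scoped names and fails, and `.n` is looked up only among columns and fails as well.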

@max-sixty
Member

max-sixty commented Feb 4, 2023

OK great, I see, thanks.

I think it's tractable. I don't think it's that friendly, and it's much more alien for those who are used to SQL.

Do others share a concern that this represents hierarchies inconsistently? For example alb is a relation. But to go into that hierarchy involves adding a period at its start; i.e. .alb.title. Generally to move down a hierarchy we'd only add things onto the end like alb.title or alb["title"]


I think this is insightful, and maybe we should discuss it more in our docs...

If I compare this behavior with, say, Python and a dataframe library, scoped variables are all normal idents, while ephemeral variables would be represented with strings. This is a bit more verbose and cannot provide good errors, typing or autocomplete. (This is a feature of PRQL that dataframe libraries cannot copy. Only a custom language for relations can construct custom rules for name resolution.)

....I've heard this referred to as "bare words". I find it a great advantage of PRQL over something like python. It makes sense that we promote columns to not require quotes, since columns are so important in tabular data; they're almost like variables to us.

As @eitsupi points out, jq uses the .foo syntax, and that's worked well, though they use it all the way down the hierarchy; i.e. .alb, never just alb.


So my current view is:

  • Has some nice properties
  • Concern about friendliness / alien-ness (but shouldn't be weighed highly unless this is a consensus view)
  • Concern about hierarchies

How important do you think it is for the development of the lang? Can we instead have a hierarchy of scopes (like many langs do), and resolve ephemeral variables first, and scoped variable after that?
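
As a rough illustration of the "hierarchy of scopes" alternative raised in the question above, Python's `collections.ChainMap` can model a lookup that tries columns first and scoped names second. The dictionaries below are hypothetical stand-ins, not the compiler's data structures.

```python
from collections import ChainMap

# Hypothetical scoped names: a std function reachable by a short and a full name.
scoped = {"count": "function std.count", "std.count": "function std.count"}
# Hypothetical columns of the current relation, one of which is also named count.
columns = {"count": "column count", "first_name": "column first_name"}

# ChainMap searches its maps left to right, so columns are tried before scoped names.
names = ChainMap(columns, scoped)

bare = names["count"]           # the column wins and shadows the function
qualified = names["std.count"]  # the function stays reachable by its full name
```

The sketch shows the trade-off of this approach: no extra syntax is needed, but a column can silently shadow a function with the same short name.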

@eitsupi
Member

eitsupi commented Feb 5, 2023

I recall that in dplyr, it is sometimes difficult to distinguish between variables outside the data frame and column names in the data frame, making the behavior confusing.

cyl <- 10

mtcars |>
  dplyr::mutate(new = cyl * 10)

It can be specified explicitly by .data or .env (but many people rarely do this because it increases the amount of writing).
https://rlang.r-lib.org/reference/dot-data.html

cyl <- 10

mtcars |>
  dplyr::mutate(new = .data$cyl * 10)

I think it is a good balance of clarity and ease of writing to always start column names with a dot.

@aljazerzen aljazerzen removed the needs-discussion Undecided dilemma label Feb 13, 2023
@aljazerzen
Member Author

I've implemented the proposal and converted the tests in prql-compiler.

Here are a few examples:

from daily_orders
sort .day
group .month (sort .num_orders | window expanding:true (derive rank))
derive [num_orders_last_week = lag 7 .num_orders]

from employees
derive rn = row_number
filter .rn > 2

from employees
derive age = .year_born - s'now()'
select [
    f"Hello my name is {.prefix}{.first_name} {.last_name}",
    f"and I am {.age} years old."
]

from employees
derive count = 12
select [
    twelve = .count,
    aggregated = count,
    aggregated_verbose = std.count,
]

Here are my findings:

  • this syntax is more verbose and less beginner-friendly than what we had before,
  • it simplifies the implementation a bit,
  • in some cases it is less ambiguous (see last example),
  • it would be nice for auto-complete, since typing . would bring up just columns for current relation,
  • there is a bit of inconsistency where we derive new names without the dot, but reference them with the dot,
  • we can now use .* to refer to all columns of the relation, where before we could not use * (since that would be parsed as multiplication).

Possible alternatives:

  • the leading dot is not required, but just encouraged,
  • the special leading dot syntax is replaced with a full name, like rel.first_name instead of just .first_name. In this case, the rel. prefix would also be optional.

@max-sixty
Member

Thanks for the list of findings, that's v helpful to anchor around.

Do others share a concern that this represents hierarchies inconsistently? For example alb is a relation. But to go into that hierarchy involves adding a period at its start; i.e. .alb.title. Generally to move down a hierarchy we'd only add things onto the end like alb.title or alb["title"]

Is this still the same for the full path of columns? Or does alb.title work?


I think the .col syntax is fine from a blank slate, but overall, in the current state, I'm fairly strongly -1.

  • It's a very large change
  • The benefits don't seem that high. I do weigh compiler simplicity highly, since it lets us move faster with a wider group of contributors. But how great a simplification is it / do we think it would let us do much more much faster? (I might be underweighing the extent of the simplification)
  • There are some quite sharp corners IMO: the violation of the hierarchy as above, and the lack of coherence between lvalues and rvalues ("there is a bit of inconsistency where we derive new names without the dot, but reference them with the dot"). I think these could be confusing for newcomers.
    • Compare jq, which uses dots but is consistent across these

One lens to view this is what we'd write in the Changelog — I'm not sure what we'd write that I'd feel great about...

@aljazerzen
Member Author

Do others share a concern that this represents hierarchies inconsistently? For example alb is a relation. But to go into that hierarchy involves adding a period at its start; i.e. .alb.title. Generally to move down a hierarchy we'd only add things onto the end like alb.title or alb["title"]

Is this still the same for the full path of columns? Or does alb.title work?

Actually, this is the confusion that this issue is trying to avoid.

It separates these two cases:

References to things in global scope don't have a leading dot:

let albums = (...)

from albums.title
# `from column` does not make sense, focus on name resolution

References into subject of the current pipeline have a leading dot:

from albums
select .albums.title

So if you are able to refer to albums, you are still able to refer to albums.title.

@aljazerzen
Member Author

The implementation complexity hasn't changed enough to weigh into the decision here.

And the sharp corners that you mention are intentional - a syntactical spotlight on the semantics. So they are actually the main benefit. Think of it as the borrow checker in Rust.

But all that said, this change goes strongly against the concise nature of the language we've been able to maintain.

So my vote is -0.5.

@snth
Member

snth commented Mar 5, 2023

Thanks for trying this out @aljazerzen . Reading through your examples in #1619 (comment) I'm also struck by how there is this inconsistency between rvalues and the lvalues in derive and aggregate. Would it be possible to add the leading . for lvalues as well? (Not saying we should do this as we seemed to be converging on not going ahead with this proposal, just curious if it would be possible in theory since then we could restore consistency?)

Overall, I'm still unclear on the ephemeral vs scoped variables. I was seeing the .col as a shortcut for _frame.col and as such I thought it made some sense. It is quite different to what we/most people know from other SQL/database type systems but I think one could get used to it. The . is a relatively unobtrusive piece of punctuation so I personally don't feel that it gets in the way that much. I would still be open to it if we wanted to explore it more.

@aljazerzen
Member Author

Would it be possible to add the leading . for lvalues as well?

Yes, and it would be quite easy to do actually.


I'll take the liberty to interpret @snth's comment as a vote of +0. Total tally is -1.5, which means that we will not be adding this feature.

We can revisit it when there are new features that would work well with this.

@max-sixty
Member

Great, thanks for the productive discussion and exploration effort.

@max-sixty
Member

I've been working with jq recently. They have a take on this, but I think with much easier semantics:

  • All data references use a leading period
  • The "root" namespace is just .
  • Then a column would be .date, or a reference into a struct would be .orders.address

So for example, the case above would be:

-from albums
+from .albums
select .albums.title

I think the from X is almost the only thing that changes from the full examples above — since the discriminant is whether it's referring to data, not the exact scope of the data.

@max-sixty
Member

max-sixty commented Mar 3, 2024

As discussed on the call, I'm not sure my example was correct — instead .albums.title is already within the .albums scope, and so should be:

  • .title
  • ...or $.albums.title
  • ...or you could allow something like .. to go up a level: ..albums.title

-from albums
+from .albums
-select .albums.title
+select .title
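
The difference between the relative `.title` form and the rooted `$.albums.title` form can be sketched in Python over nested dictionaries. This is a hypothetical model of the semantics discussed in this comment, not jq or PRQL behavior: the `lookup` function and the data are invented, and the `..` form is left out.

```python
def lookup(path, root, current):
    """Resolve `.a.b` against the current scope and `$.a.b` against the root."""
    if path.startswith("$."):
        node, rest = root, path[2:]
    elif path.startswith("."):
        node, rest = current, path[1:]
    else:
        raise ValueError(f"data references must start with '.' or '$.': {path}")
    for key in filter(None, rest.split(".")):
        node = node[key]
    return node


# Hypothetical data: the root namespace holds tables that map column names to values.
root = {"albums": {"title": ["Kind of Blue"], "artist_id": [1]}}

table = lookup(".albums", root, current=root)             # from .albums
relative = lookup(".title", root, current=table)          # select .title
absolute = lookup("$.albums.title", root, current=table)  # select $.albums.title
```

In this model, once `from .albums` has moved the current scope into the table, `.albums.title` no longer resolves (there is no `albums` key inside the table), which matches the correction made in this comment.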

@bayareaunicorn

Great work

@max-sixty
Member

Reopening as this is under consideration again. Possibly we start a new issue synthesizing where we're at, given the amount of history though.

@max-sixty max-sixty reopened this Mar 4, 2024
@max-sixty max-sixty pinned this issue Mar 4, 2024
5 participants