Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Definitions.merge and don't immediately validate definitions #21746

Closed
wants to merge 1 commit into from

Conversation

sryza
Copy link
Contributor

@sryza sryza commented May 9, 2024

Summary & Motivation

What

This PR proposes turning Definitions into a "dumb" data class that primarily holds collections of definitions. And it adds a static merge method for merging multiple Definitions objects into a single Definitions object.

A big change here is that we no longer validate that a Definitions object is loadable as a code location at construction time. Instead, a validate_loadable method is offered. This enables:

  • Putting resource definitions in one Definitions object and asset definitions in another Definitions object and later merging them together.
  • Putting unresolved asset jobs in one Definitions object and asset definitions in another Definitions object and later merging them together.

A beneficial side effect of this is that less work will happen at module import time. This gives users more flexibility to determine where they want to incur the cost of this validation. It also likely helps move towards a future world where run/step workers only need to load/validate the subset of definitions that they need.

An alternative could be introducing something like an UnresolvedDefinitions class. Though I'm having trouble coming up with a name that makes this feel ergonomic.

Why

There are diverse situations where it's useful to be able to pass around collections of definitions. Doing this in Dagster right now is quite awkward.

Relevant situations:

How I Tested These Changes

@sryza sryza changed the title Definitions.merge and don't immediately validate definitions RFC: Definitions.merge and don't immediately validate definitions May 9, 2024
@@ -326,7 +328,7 @@ class BindResourcesToJobs(list):
"""


class Definitions:
class Definitions(DagsterModel):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I punted on thinking hard about whether this should be a DagsterModel or a NamedTuple. Don't have a strong opinion here.

self.get_inner_repository()

@staticmethod
def merge(*def_sets: "Definitions") -> "Definitions":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I punted on thinking hard about whether this should be * args or a list. I don't have a strong opinion here.


def get_asset_graph(self) -> AssetGraph:
"""Get the AssetGraph for this set of definitions."""
return self.get_repository_def().asset_graph

def validate_loadable(self) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially this should return self so that users can do:

defs = Definitions(...).validate_loadable()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential thing we can do here is make a trivial subclass of Definitions which is ValidatedDefinitions, and make this function def validated(self) -> ValidatedDefinitions:.

That way, we can use the type system to enforce that we're working with a guaranteed-to-be-valid definitions object at certain points in time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewType!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much like this idea. But couldn't find places in the codebase where we would use it yet. So planning to punt on this in this PR. Let me know if you think we should do otherwise.

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is long past due and I am very excited that you are pressing on this 👍🏻

I think the thing you need to press on here is the resource system. It is going to be the hardest part of this in my estimation. Namely what is the behavior if you merge Definitions objects that have overlapping resource keys. If the user want to merge them, and they do not have control over the definitions they are importing, what is the resolution here? Do we somehow scope them?

Namely imagine two assets that have the same resource key, but expect different types.

from dagster import asset

# imagine importing the "clients" from third party libraries you cannot change
class ClientFromLibOne:
    def execute(...) -> None

class ClientFromLibTwo:
    def invoke(...) -> None:

@asset
def asset_one(client: ResourceParam[ClientFromLibOne]):
    client.execute()

@asset
def asset_two(client: ResourceParam[ClientFromLibTwo]):
     client.invoke()

Definitions(asset_one, resources={"client": ClientFromLibOne()}).merge(Definitions(asset_two, resources={"client_two": ClientFromLibTwo()})

Imagine those assets came from factories that the user doesn't have control either. An internal module owned by another team or something. What is the behavior? How can they get this to work.

@sryza
Copy link
Contributor Author

sryza commented May 13, 2024

@schrockn I see a couple options here:

  • Introduce some sort of resource scoping
  • Introduce some sort of resource mapping

After writing them out, I’m going to make the argument that we should expect users to deal with avoiding collisions on their own, without framework help, for the time being.

Options

Resource scoping

Possible directions:

  • Resources are scoped to asset groups
  • Definitions objects are more than just bags of definitions, but rather correspond to some sort of new definition-grouping abstraction. Resources are scoped to this abstraction.

Risks:

  • Scoping resources to asset groups would be a breaking change, and it wouldn't help with sensor resources
  • Dagster already has a lot of grouping abstractions
  • In many of the use cases I’ve been imagining for Definitions.merge, I think it’s important for definitions to be able to get resources from outside their enclosing Definitions object. E.g.
    • Module 1 defines some assets, checks, and jobs → packages them into object Definitions
    • Module 2 defines some assets, checks, and jobs → packages them into a Definitions object
    • These are merged at the code location level, along with some resources

Resource mapping

Objects that rely on resources can have both an "inner" resource key and an "outer" resource key:

  • The inner resource key would be used to access the resource within the compute function
  • The outer resource key would be used to fetch the resource from the code location

These objects each have a map_resource_keys(inner_resource_keys_by_outer: Mapping[str, str]) method that returns a new instance with its resources remapped. Definitions has a similar map_resource_keys method that applies to all the definitions inside.

Risks:

  • Extra complexity inside Dagster - need to implement this across assets, ops, jobs, sensors, etc. Requires making copies of ops, which can pose challenge with op name uniqueness constraints.
  • A user looking at a compute function can no longer easily determine which resource in the UI it corresponds to, because it might have been remapped in between.

Expecting users to deal with collisions on their own

Because of their risks, I don’t think we should do either of the above options right now. Rather, I believe that we should expect libraries that expose definition factories to allow users to determine the resource keys that those definitions depend on. Our dbt and dlt integrations already do this.

This direction is a two way door that gives us the option to introduce resource remapping or off-by-default scoping in the future.

Definitions-merging is a way to make it easier for users to build the kinds of code locations that they’re already building. It doesn’t add any fundamentally new capability. It’s not an encouragement for users to starting merging code locations that they would have otherwise kept separate.

When we initially allowed and introduced with_resources for assets, we suspected that introducing resource remapping was around the corner. I.e. we worried that the "one value per resource key per repository" constraint would prove too restrictive and that users would need some way of getting out of it. But I don’t think this panned out empirically. If you search for “Conflicting versions of resource with key”, you can find instances of the problem. It’s not non-existent – e.g. it comes up with our Airbyte integration one or two times, but my read is that it doesn’t appear pressing enough to introduce a layer of scoping or indirection that users need to navigate.

@sryza sryza force-pushed the definitions-merge branch 2 times, most recently from 1edb52b to 872416e Compare May 13, 2024 22:33
@schrockn
Copy link
Member

schrockn commented May 14, 2024

Awesome thanks for writing this up @sryza. Given the options presented definitely agree with your assessment.

I'd like to throw out another possibility: Immediately binding resources to whatever definitions are passed into Definitions, and then letting the internal framework just handle collisions.

defs_1 = Definitions([[some_def], resources={"a": AInstance(some_value=1)})

defs_2 = Definitions([[another_def], resources={"a": AInstance(some_value=2)})

def = defs_1.merge(defs_2) 

At runtime, some_def gets AInstance(some_value=1) and another_def gets AInstance(some_value=2).

I think this would be intuitive for users, but impose some framework complexity as we would need to bookkeep the scoping. We would also need to rethink the Resource UI, which I think is since it isn't that great or central right now.

@sryza
Copy link
Contributor Author

sryza commented May 14, 2024

@schrockn fwiw I'd slot that into the "resource scoping" family of solutions I discussed above – it scopes resources to the Definitions object that they're supplied to, rather than applying them everywhere. Definitely a lighter-weight framing though than either of the bullets that I put there.

The main challenges I see here:

  • To give users observability into where their resources are coming from, I believe we'd need to introduce some sort of organizational abstraction. E.g. imagine the URL that gets you to the page for AInstance(some_value=2). And imagine the text in the link to that page, when you're looking at the page for another_def. There would need to be some string that disambiguates the different instance of "a".
  • We'd no longer be able to describe a Definitions object as a simple "bag of definitions". Instead it would be "A bag of definitions except for resources. The resources are melted in at construction time.".
  • Definitions.merge would probably need to accept resources (this is not necessarily bad - this bullet is more of an implication than a challenge).

More of an internal issue than a user-facing issue, but I think this could also be very heavy lift, because assets from different Definitions objects could be executed in the same run, so we would need to carry the scoping deep down.

@schrockn
Copy link
Member

To give users observability into where their resources are coming from, I believe we'd need to introduce some sort of organizational abstraction. E.g. imagine the URL that gets you to the page for AInstance(some_value=2). And imagine the text in the link to that page, when you're looking at the page for another_def. There would need to be some string that disambiguates the different instance of "a".

100%. This would require frontend/product/tooling work

We'd no longer be able to describe a Definitions object as a simple "bag of definitions". Instead it would be "A bag of definitions except for resources. The resources are melted in at construction time.".

I actually don't think this would be that bag. I think we would frame it was immediately binding resources to any definition, and subsequent merges can override.

Definitions.merge would probably need to accept resources (this is not necessarily bad - this bullet is more of an implication than a challenge).

Yup 100%

@schrockn
Copy link
Member

More of an internal issue than a user-facing issue, but I think this could also be very heavy lift, because assets from different Definitions objects could be executed in the same run, so we would need to carry the scoping deep down.

Yes exactly. This what I mean by "impose some framework complexity"

@schrockn
Copy link
Member

schrockn commented May 14, 2024

Tl;dr I think we should move forward with this proposal with the understanding that we might need to add support for resources merging, which I believe we can layer on additively.

Let me know when PR is ready for review and will take a look!

@sryza sryza force-pushed the definitions-merge branch 2 times, most recently from 700b1fa to befbb17 Compare May 20, 2024 15:59
@sryza sryza changed the title RFC: Definitions.merge and don't immediately validate definitions Deinitions.merge and don't immediately validate definitions May 20, 2024

def get_asset_graph(self) -> AssetGraph:
"""Get the AssetGraph for this set of definitions."""
return self.get_repository_def().asset_graph

@public
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this PR proposes making this method public and non-experimental. This is to avoid a regression in users' ability to validate their definitions.

So somewhat high stakes.

@sryza
Copy link
Contributor Author

sryza commented May 20, 2024

@schrockn @OwenKephart – taking this out of RFC state and should be ready for another round of review.

branch-name: definitions-merge
Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this is a big mental model change, I think it is important to work through how we would explain this in docs, but now and in a future world of blueprints/blocks etc.

Before we commit this, I think we should write prose that accompanies an example where a user constructs more than one Definitions objects, merges them, and then finally they get "bound" (or whatever) to produce the usable definition objects that live in repositories.

Right now I don't think we could write that prose with the code as it exists within this PR.

Comment on lines +644 to +657
def validate_loadable(self) -> "Definitions":
"""Validates that the enclosed definitions will be loadable by Dagster:
- No assets have conflicting keys.
- No jobs, sensors, or schedules have conflicting names.
- All asset jobs can be resolved.
- All resource requirements are satisfied.

Raises an error if any of the above are not true.

Returns:
Definitions: The definitions object, unmodified.
"""
self.get_inner_repository()
return self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not a fan of this pattern where a user is just magically expected to call this and it is unencoded in the type system. I'd prefer a top level utility function instead of a method. The existence of this sort of validate method implies that a user is supposed to call before using it or something, but that it is totally unenforced here.

A top-level staticmethod would be preferable in my view.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A top-level staticmethod would be preferable in my view.

Sounds good - I'll give that a whirl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw I'm imagining the main use case for this method is in unit tests. I.e. I'm not expecting that everyone would call it in their definitions.py.

@sryza sryza changed the title Deinitions.merge and don't immediately validate definitions Definitions.merge and don't immediately validate definitions May 20, 2024
@sryza
Copy link
Contributor Author

sryza commented May 21, 2024

I think we should write prose that accompanies an example where a user constructs more than one Definitions objects, merges them, and then finally they get "bound" (or whatever) to produce the usable definition objects that live in repositories.

@schrockn here's a take at this: https://www.notion.so/dagster/Definitions-merge-prose-e014320e59004417a07e11cf60c8ffea?showMoveTo=true&saveParent=true

@sryza
Copy link
Contributor Author

sryza commented May 21, 2024

And here's a start at using this in DOP: https://github.com/dagster-io/internal/pull/9809/files.

It's only half-done, but it makes the definitions.py a lot more calm IMO.

@schrockn
Copy link
Member

It's only half-done, but it makes the definitions.py a lot more calm IMO.

100% that diff looks awesome.

@schrockn
Copy link
Member

@schrockn here's a take at this: https://www.notion.so/dagster/Definitions-merge-prose-e014320e59004417a07e11cf60c8ffea?showMoveTo=true&saveParent=true

I commented in the Notion but will do it here for posterity etc.

Two thoughts.

  • I wonder if there should be either a) differnet top-level verb the act of merging + definitively binding resources or 2) a different static ctor for Definitions for the use cases where we don't want to pre-emptively validate it. I like 2) because it is opt-in and changes no behavior for existing code, whereas the current PR is quite the behavior change. Definitions.bundle or something.

  • The other place this could shine in a callback- or class-based API where the user defines a function that returns Definitions, which we validate on the user's behalf.

@schrockn
Copy link
Member

And it's a little new terminology but this is an advanced, opt-in use case so I think that is ok.

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to your queue. Pretty firmly convinced myself that we shouldn't change existing behavior and instead add a new static ctor that defers resource binding.

@sryza
Copy link
Contributor Author

sryza commented May 21, 2024

instead add a new static ctor that defers resource binding.

How are you envisioning that we'd implement this? A _validate_loadable: bool = True property on the real constructor that gets set to false by the static constructor?

I have some resistance to this approach because I think it violates user expectations about the relationship between static constructors and the underlying constructor. I.e. I think the expectation is that static constructors will call the real constructor internally. And any non-optional validation within the real constructor is expressing fundamental constraints that instances of the class are expected to abide by.

Stepping back a bit, my understanding of the main reason to frontload this kind of validation is to be able to catch errors earlier. In the standard case where someone is pointing a code location at a Definitions object and starting the web server or the CLI or whatever, I don't think validating at object construction time helps errors get caught earlier in a meaningful way: the validation happens anyway immediately after object construction, when Dagster tries to load it.

I think the main place where deferring validation makes a difference is in unit tests. For that case, there will likely be some users who experience negative effects from this change until they update their tests to explicitly call the validation method. However, whatever choices we make here will be our APIs forever, so I think that some temporary user pain is worth it for an saner API.

@schrockn
Copy link
Member

Just spitballing here, I would do some sort of yucky hack, but a super localized one, if we don't want to introduce another class.

class Definitions:

   @staticmethod
   # would actually spell out args for type checking
   def bundle(self, **kwargs):
      with disable_eager_construction():
         return Definitions(**kwargs)
     

   # `__init__` identical signature as before
   # can disable eager construction with disable_eager_construction
   def __init__(self, ...): ...

I think there are tons of reasons to want to do validate immediately in test cases, scripting, and other contexts outside of the webserver and other tools. One of the nice things about having Definitions be an instance is how easy it is to create in test cases etc. I think we should maintain that property.

@sryza
Copy link
Contributor Author

sryza commented May 21, 2024

From an internal perspective, I don't have any problem with maintaining the kind of hack you're suggesting. However, I think that the presence of / need for that hack is indicative of the way that the API would be subverting user expectations.

IMO the mental model we expose to users will be a lot simpler if we can say "A Definitions object is just a bag of definitions. Full stop.". And then layer any validation behavior on top of that.

@schrockn
Copy link
Member

IMO the mental model we expose to users will be a lot simpler if we can say "A Definitions object is just a bag of definitions. Full stop.". And then layer any validation behavior on top of that.

Yeah understood I'm just concerned about shipping such a big behavior change.

@sryza
Copy link
Contributor Author

sryza commented May 21, 2024

If we're not comfortable shipping that behavior change, brainstorming directions we could go in:

  • Introduce a new type that represents a bag of definitions, without any assurances that those definitions are loadable. Potential names:
    • UnvalidatedDefinitions
    • UnresolvedDefinitions
    • UnboundDefinitions
    • DefinitionsWithoutResources
  • Introduce a public validate_loadable: bool = True argument to the Definitions constructor.
  • Expect Definitions objects to be fully bound before merging. I.e. expect people to import their resources etc. in sub-modules. E.g.:
# submodule1.py
from .resources import snowflake_resource

submodule1_defs = Definitions(assets=[asset1], resources={"snowflake": SnowflakeResource()})

# submodule2.py
from .resources import snowflake_resource

submodule2_defs = Definitions(assets=[asset2], resources={"snowflake": SnowflakeResource()})

# definitions.py
from .submodule1 import submodule1_defs
from .submodule2 import submodule2_defs

defs = Definitions.merge(submodule1_defs, submodule2_defs)

The last one is actually pretty tempting to me. One drawback is that it would make it harder to create something like this:

class DbtJobBlueprint(Blueprint):
    name: str
    select: str

    def build_defs(self) -> Definitions:
        job = define_asset_job(name=self.name, selection=DbtManifestAssetSelection(self.select))
        return Definitions(jobs=[job])

@schrockn
Copy link
Member

The last one is actually pretty tempting to me. One drawback is that it would make it harder to create something like this:

For me as well. I think it might provided some good "guardrails" that nudges people to not have massive dictionaries of resources, but instead bind resources as close to definitions that consume them as possible, which might be good. At a minimum it is an interesting place to start, and it is purely an additive capability.

For the Blueprint use case, I recommend that we introduce a blueprint-specific class (for now) to unblock progress there. We can easily re-examine later. I also suspect there will unique constraints there especially as we make an effort to make those definitions dynamically reloadable. Separating that from mainline Definitions is a at this point is feature not a bug at this point.

@sryza
Copy link
Contributor Author

sryza commented May 22, 2024

@schrockn definitely agree that to unblock the blueprints work, need a separate class. I've already added that here: https://github.com/dagster-io/dagster/pull/21980/files#diff-ccaf56ec57423d804ed68d114ca1c21c033a919326296f40f7ac0f215d4343e2R28.

My hope is still that we can eliminate that class entirely before going big with blueprints. A simple bag of definitions feels like such a basic and widely applicable object.

@schrockn
Copy link
Member

@schrockn definitely agree that to unblock the blueprints work, need a separate class. I've already added that here: https://github.com/dagster-io/dagster/pull/21980/files#diff-ccaf56ec57423d804ed68d114ca1c21c033a919326296f40f7ac0f215d4343e2R28.

My hope is still that we can eliminate that class entirely before going big with blueprints. A simple bag of definitions feels like such a basic and widely applicable object.

100%. I just don't want work on Blueprints to be blocked for even one second based on this PR. Agree that eliminating the special blueprint-only should be done before launching Blueprints to more than a few design partners.

@sryza sryza mentioned this pull request May 22, 2024
@sryza
Copy link
Contributor Author

sryza commented May 22, 2024

To preserve the original description here for historical purposes, spinning up a new PR for the version where we still do immediately validate definitions: #22035.

@sryza sryza closed this May 22, 2024
sryza added a commit that referenced this pull request May 23, 2024
## Summary & Motivation

There are diverse situations where it's useful to be able to pass around
collections of definitions. Doing this in Dagster right now is quite
awkward.

Relevant situations:
- Writing a factory that returns a few assets, a few checks, and a job
that executes them together.
- A submodule defines a few assets, a job, and a schedule. A different
submodule does the same. They should all be in the same code location.
-
#21387 (comment)
- dagster-io/internal#8216

Unlike the [prior version of this
PR](#21746), this PR does
_not_ turn `Definitions` into a "dumb" data class. I.e. it still
validates that all assets and jobs can be bound, at construction time.

## How I Tested These Changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants