Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve API for .execute() all nodes #836

Closed
zilto opened this issue Apr 23, 2024 · 10 comments
Closed

Improve API for .execute() all nodes #836

zilto opened this issue Apr 23, 2024 · 10 comments
Labels

Comments

@zilto
Copy link
Collaborator

zilto commented Apr 23, 2024

Problem

Currently, the primary way to use Driver.execute() is to pass a list of node names (str). It's useful to get all existing nodes for various reasons:

  • iterative development
  • debugging
  • create data quality check
  • write tests
    But managing a list of strings is likely to be incomplete, is typo-provide, and can't provide IDE support

Current solution

Use Driver.list_available_variables() to get node names programmatically, filter them to exclude config nodes, and pass them to Driver.execute().

Given "getting all dataflow nodes" is likely the most common usage, it's odd to receive a list and systematically have to do list comprehension over it. I always forget the API to filter out config nodes.

Issues:

  • it doesn't tell you that you need to filter out config nodes, and it's API could make it easier
  • it's hard to find
  • the "variable" terminology is outdated and the documentation should be updated

Potential solution

  • Update docstring to use "node" terminology
  • create a new method .available_nodes() and issue deprecation warning for direct .list_available_variables() calls
  • have a separate getter for .config()
@zilto zilto added the refactor label Apr 23, 2024
@elijahbenizzy
Copy link
Collaborator

Hmm, config nodes shouldn't be in list_available_variables... The fact that they show up is a bit of a hack, no?

@zilto
Copy link
Collaborator Author

zilto commented Apr 24, 2024

Looking at git blame and going off memory, .list_available_variables() used the Variable construct. Now, the driver module defines Variable = graph_types.HamiltonNode to standardize the API and deprecate Variable. One would expect to receive nodes from HamiltonGraph.nodes, but this wasn't associated with changes to .list_available_variables() which uses the FunctionGraph through self.graph.

The current

def list_available_variables(self, *, tag_filter=None) -> List[Variable]:
        all_nodes = self.graph.get_nodes()
        if tag_filter:
            valid_filter_values = all(
                map(
                    lambda x: isinstance(x, str)
                    or (isinstance(x, list) and len(x) != 0)
                    or x is None,
                    tag_filter.values(),
                )
            )
            if not valid_filter_values:
                raise ValueError("All tag query values must be a string or list of strings")
            results = []
            for n in all_nodes:
                if node.matches_query(n.tags, tag_filter):
                    results.append(Variable.from_node(n))
        else:
            results = [Variable.from_node(n) for n in all_nodes]
        return results

should become

def list_available_nodes(self, *, tag_filter=None) -> List[HamiltonNode]:
     g = graph_types.HamiltonGraph.from_graph(self.graph)
     all_nodes = g.nodes
     
     # reimplement node filter
     
     filtered_nodes = ...
     return filtered_nodes

@elijahbenizzy
Copy link
Collaborator

Looking at git blame and going off memory, .list_available_variables() used the Variable construct. Now, the driver module defines Variable = graph_types.HamiltonNode to standardize the API and deprecate Variable. One would expect to receive nodes from HamiltonGraph.nodes, but this wasn't associated with changes to .list_available_variables() which uses the FunctionGraph through self.graph.

The current

def list_available_variables(self, *, tag_filter=None) -> List[Variable]:
        all_nodes = self.graph.get_nodes()
        if tag_filter:
            valid_filter_values = all(
                map(
                    lambda x: isinstance(x, str)
                    or (isinstance(x, list) and len(x) != 0)
                    or x is None,
                    tag_filter.values(),
                )
            )
            if not valid_filter_values:
                raise ValueError("All tag query values must be a string or list of strings")
            results = []
            for n in all_nodes:
                if node.matches_query(n.tags, tag_filter):
                    results.append(Variable.from_node(n))
        else:
            results = [Variable.from_node(n) for n in all_nodes]
        return results

should become

def list_available_nodes(self, *, tag_filter=None) -> List[HamiltonNode]:
     g = graph_types.HamiltonGraph.from_graph(self.graph)
     all_nodes = g.nodes
     
     # reimplement node filter
     
     filtered_nodes = ...
     return filtered_nodes

Sure, happy with the new name. What I'm saying is that having config nodes even show up at all is weird, we should filter by default (and figure out why they're included at all...).

@skrawcz
Copy link
Collaborator

skrawcz commented Apr 25, 2024

Config becomes nodes in the graph -- and you have always been able to request config/inputs as outputs. So not weird to me. They should have the appropriate metadata to distinguish them (which I think they do).

@elijahbenizzy
Copy link
Collaborator

elijahbenizzy commented Apr 25, 2024

Config becomes nodes in the graph -- and you have always been able to request config/inputs as outputs. So not weird to me. They should have the appropriate metadata to distinguish them (which I think they do).

IMO that doesn't make sense. And IIRC this was a hack. The way this should work is:

  1. Any transforms become nodes (the default)
  2. Any "external inputs" become nodes (specified by .is_external_input or something)
  3. Any config that is used as an input gets passed in as an external input through the way the graph processes (which is why you can request them...)

Config is used at a different stage -- compile time. Why would it be a node? It's a property of the graph, not of execution. It should be a field in the HamiltonGraph object and displayed separately.

@skrawcz
Copy link
Collaborator

skrawcz commented Apr 26, 2024

config is a place to put invariant inputs as well; i.e. "config", the dictionary passed in for it, can be a runtime input. It's always been that way.

@elijahbenizzy
Copy link
Collaborator

config is a place to put invariant inputs as well; i.e. "config", the dictionary passed in for it, can be a runtime input. It's always been that way.

Yes, but that doesn't mean we should represent it as a node, and definitely shouldn't include jt with list_available_vars

@skrawcz
Copy link
Collaborator

skrawcz commented Apr 26, 2024

config is a place to put invariant inputs as well; i.e. "config", the dictionary passed in for it, can be a runtime input. It's always been that way.

Yes, but that doesn't mean we should represent it as a node, and definitely shouldn't include jt with list_available_vars

It is and has always been a node in the graph. It's clearly marked that it doesn't come from a function. You can request it as an output.

@elijahbenizzy
Copy link
Collaborator

config is a place to put invariant inputs as well; i.e. "config", the dictionary passed in for it, can be a runtime input. It's always been that way.

Yes, but that doesn't mean we should represent it as a node, and definitely shouldn't include jt with list_available_vars

It is and has always been a node in the graph. It's clearly marked that it doesn't come from a function. You can request it as an output.

Blech I think this was a hack from the days when everything was passed in statically. See 813e098#diff-bfd5ce89c947f4ba81d03b804ba27d0fdc8686974b21bbbf9e37401080a67817R94.

There's a clear demarcation between (a) nodes that are explicitly inputs in the graph and (b) nodes that just exist because they happen to have been shoved into config. Don't see the value of (b) existing in the graph when we have (a) specified as function inputs, but we're adding them all because of a commit 4 years ago?

Best to add config here:

nodes: typing.List[HamiltonNode]
. Then anything config that is requested will show up, otherwise you can get the config with which it was constructed.

@zilto
Copy link
Collaborator Author

zilto commented Apr 26, 2024

Sure, happy with the new name. What I'm saying is that having config nodes even show up at all is weird, we should filter by default (and figure out why they're included at all...).

I agree with you. I was pointing out that it wasn't a "hack" but rather how it always behaved (i.e., need to manually filter out with .is_external). What got even more confusing is the recent overwrite of Variable by HamiltonNode type because config values are never converted to HamiltonNode elsewhere in the code (i.e., they're not in HamiltonGraph)

It is and has always been a node in the graph. It's clearly marked that it doesn't come from a function. You can request it as an output.

I know config can be used as "input values that are constant for the lifetime of a Driver", but I think it makes more confusing. It's much easier to explain config as "the thing that define the shape of the DAG but doesn't interact with values"

Driver/Builder defines the DAG, execute/materialize define execution values

I'm always surprised when reading it in examples or documentation, such as the numpy one

    dr = driver.Driver(
        {
            "input_file_name": "air-quality-data.csv",
            "after_lock_date": "2020-03-24T00",
            "before_lock_date": "2020-03-21T00",
        },
        analysis_flow,
        adapter=adapter,
    )

    output = ["t_value", "p_value", "before_sample", "after_sample"]
    # dr.visualize_execution(output, './my_file.dot', {})
    result = dr.execute(output)

For file paths and dates, I can intuitively make sense of them being "used as input", but for an int or string, I would just have to guess if it triggers a @config.when

Config becomes nodes in the graph -- and you have always been able to request config/inputs as outputs.

Nonetheless, this is related to the recent visualization PR I merged. Config nodes were only displayed in .display_all_functions() because they wouldn't be part of the execution path otherwise (even if they influenced the path of which node was executed).

With this viz PR, the above numpy example now shows input_file_name, after_lock_date, and before_lock_date on the display whether or not their values are read.

@zilto zilto closed this as not planned Won't fix, can't repro, duplicate, stale May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants