Improve API for `.execute()` all nodes #836

zilto · 2024-04-23T14:46:34Z

Problem

Currently, the primary way to use Driver.execute() is to pass a list of node names (str). It's useful to get all existing nodes for various reasons:

iterative development
debugging
create data quality check
write tests
But managing a list of strings is likely to be incomplete, is typo-provide, and can't provide IDE support

Current solution

Use Driver.list_available_variables() to get node names programmatically, filter them to exclude config nodes, and pass them to Driver.execute().

Given "getting all dataflow nodes" is likely the most common usage, it's odd to receive a list and systematically have to do list comprehension over it. I always forget the API to filter out config nodes.

Issues:

it doesn't tell you that you need to filter out config nodes, and it's API could make it easier
it's hard to find
the "variable" terminology is outdated and the documentation should be updated

Potential solution

Update docstring to use "node" terminology
create a new method .available_nodes() and issue deprecation warning for direct .list_available_variables() calls
have a separate getter for .config()

The text was updated successfully, but these errors were encountered:

elijahbenizzy · 2024-04-23T16:55:45Z

Hmm, config nodes shouldn't be in list_available_variables... The fact that they show up is a bit of a hack, no?

zilto · 2024-04-24T18:15:49Z

Looking at git blame and going off memory, .list_available_variables() used the Variable construct. Now, the driver module defines Variable = graph_types.HamiltonNode to standardize the API and deprecate Variable. One would expect to receive nodes from HamiltonGraph.nodes, but this wasn't associated with changes to .list_available_variables() which uses the FunctionGraph through self.graph.

The current

def list_available_variables(self, *, tag_filter=None) -> List[Variable]:
        all_nodes = self.graph.get_nodes()
        if tag_filter:
            valid_filter_values = all(
                map(
                    lambda x: isinstance(x, str)
                    or (isinstance(x, list) and len(x) != 0)
                    or x is None,
                    tag_filter.values(),
                )
            )
            if not valid_filter_values:
                raise ValueError("All tag query values must be a string or list of strings")
            results = []
            for n in all_nodes:
                if node.matches_query(n.tags, tag_filter):
                    results.append(Variable.from_node(n))
        else:
            results = [Variable.from_node(n) for n in all_nodes]
        return results

should become

def list_available_nodes(self, *, tag_filter=None) -> List[HamiltonNode]:
     g = graph_types.HamiltonGraph.from_graph(self.graph)
     all_nodes = g.nodes
     
     # reimplement node filter
     
     filtered_nodes = ...
     return filtered_nodes

elijahbenizzy · 2024-04-25T16:52:35Z

Looking at git blame and going off memory, .list_available_variables() used the Variable construct. Now, the driver module defines Variable = graph_types.HamiltonNode to standardize the API and deprecate Variable. One would expect to receive nodes from HamiltonGraph.nodes, but this wasn't associated with changes to .list_available_variables() which uses the FunctionGraph through self.graph.

The current

def list_available_variables(self, *, tag_filter=None) -> List[Variable]:
        all_nodes = self.graph.get_nodes()
        if tag_filter:
            valid_filter_values = all(
                map(
                    lambda x: isinstance(x, str)
                    or (isinstance(x, list) and len(x) != 0)
                    or x is None,
                    tag_filter.values(),
                )
            )
            if not valid_filter_values:
                raise ValueError("All tag query values must be a string or list of strings")
            results = []
            for n in all_nodes:
                if node.matches_query(n.tags, tag_filter):
                    results.append(Variable.from_node(n))
        else:
            results = [Variable.from_node(n) for n in all_nodes]
        return results

should become

def list_available_nodes(self, *, tag_filter=None) -> List[HamiltonNode]:
     g = graph_types.HamiltonGraph.from_graph(self.graph)
     all_nodes = g.nodes
     
     # reimplement node filter
     
     filtered_nodes = ...
     return filtered_nodes

Sure, happy with the new name. What I'm saying is that having config nodes even show up at all is weird, we should filter by default (and figure out why they're included at all...).

skrawcz · 2024-04-25T16:57:31Z

Config becomes nodes in the graph -- and you have always been able to request config/inputs as outputs. So not weird to me. They should have the appropriate metadata to distinguish them (which I think they do).

elijahbenizzy · 2024-04-25T17:03:08Z

Config becomes nodes in the graph -- and you have always been able to request config/inputs as outputs. So not weird to me. They should have the appropriate metadata to distinguish them (which I think they do).

IMO that doesn't make sense. And IIRC this was a hack. The way this should work is:

Any transforms become nodes (the default)
Any "external inputs" become nodes (specified by .is_external_input or something)
Any config that is used as an input gets passed in as an external input through the way the graph processes (which is why you can request them...)

Config is used at a different stage -- compile time. Why would it be a node? It's a property of the graph, not of execution. It should be a field in the HamiltonGraph object and displayed separately.

skrawcz · 2024-04-26T01:46:24Z

config is a place to put invariant inputs as well; i.e. "config", the dictionary passed in for it, can be a runtime input. It's always been that way.

elijahbenizzy · 2024-04-26T03:22:39Z

config is a place to put invariant inputs as well; i.e. "config", the dictionary passed in for it, can be a runtime input. It's always been that way.

Yes, but that doesn't mean we should represent it as a node, and definitely shouldn't include jt with list_available_vars

skrawcz · 2024-04-26T05:29:29Z

config is a place to put invariant inputs as well; i.e. "config", the dictionary passed in for it, can be a runtime input. It's always been that way.

Yes, but that doesn't mean we should represent it as a node, and definitely shouldn't include jt with list_available_vars

It is and has always been a node in the graph. It's clearly marked that it doesn't come from a function. You can request it as an output.

elijahbenizzy · 2024-04-26T18:10:42Z

config is a place to put invariant inputs as well; i.e. "config", the dictionary passed in for it, can be a runtime input. It's always been that way.

Yes, but that doesn't mean we should represent it as a node, and definitely shouldn't include jt with list_available_vars

It is and has always been a node in the graph. It's clearly marked that it doesn't come from a function. You can request it as an output.

Blech I think this was a hack from the days when everything was passed in statically. See 813e098#diff-bfd5ce89c947f4ba81d03b804ba27d0fdc8686974b21bbbf9e37401080a67817R94.

There's a clear demarcation between (a) nodes that are explicitly inputs in the graph and (b) nodes that just exist because they happen to have been shoved into config. Don't see the value of (b) existing in the graph when we have (a) specified as function inputs, but we're adding them all because of a commit 4 years ago?

Best to add config here:

hamilton/hamilton/graph_types.py

Line 189 in 3e7626f

nodes: typing.List[HamiltonNode]

. Then anything config that is requested will show up, otherwise you can get the config with which it was constructed.

zilto · 2024-04-26T19:53:46Z

Sure, happy with the new name. What I'm saying is that having config nodes even show up at all is weird, we should filter by default (and figure out why they're included at all...).

I agree with you. I was pointing out that it wasn't a "hack" but rather how it always behaved (i.e., need to manually filter out with .is_external). What got even more confusing is the recent overwrite of Variable by HamiltonNode type because config values are never converted to HamiltonNode elsewhere in the code (i.e., they're not in HamiltonGraph)

It is and has always been a node in the graph. It's clearly marked that it doesn't come from a function. You can request it as an output.

I know config can be used as "input values that are constant for the lifetime of a Driver", but I think it makes more confusing. It's much easier to explain config as "the thing that define the shape of the DAG but doesn't interact with values"

Driver/Builder defines the DAG, execute/materialize define execution values

I'm always surprised when reading it in examples or documentation, such as the numpy one

    dr = driver.Driver(
        {
            "input_file_name": "air-quality-data.csv",
            "after_lock_date": "2020-03-24T00",
            "before_lock_date": "2020-03-21T00",
        },
        analysis_flow,
        adapter=adapter,
    )

    output = ["t_value", "p_value", "before_sample", "after_sample"]
    # dr.visualize_execution(output, './my_file.dot', {})
    result = dr.execute(output)

For file paths and dates, I can intuitively make sense of them being "used as input", but for an int or string, I would just have to guess if it triggers a @config.when

Config becomes nodes in the graph -- and you have always been able to request config/inputs as outputs.

Nonetheless, this is related to the recent visualization PR I merged. Config nodes were only displayed in .display_all_functions() because they wouldn't be part of the execution path otherwise (even if they influenced the path of which node was executed).

With this viz PR, the above numpy example now shows input_file_name, after_lock_date, and before_lock_date on the display whether or not their values are read.

zilto added the refactor label Apr 23, 2024

zilto closed this as not planned Won't fix, can't repro, duplicate, stale May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve API for `.execute()` all nodes #836

Improve API for `.execute()` all nodes #836

zilto commented Apr 23, 2024

elijahbenizzy commented Apr 23, 2024

zilto commented Apr 24, 2024 •

edited

elijahbenizzy commented Apr 25, 2024

skrawcz commented Apr 25, 2024

elijahbenizzy commented Apr 25, 2024 •

edited

skrawcz commented Apr 26, 2024 •

edited

elijahbenizzy commented Apr 26, 2024

skrawcz commented Apr 26, 2024

elijahbenizzy commented Apr 26, 2024

zilto commented Apr 26, 2024 •

edited

Improve API for .execute() all nodes #836

Improve API for .execute() all nodes #836

Comments

zilto commented Apr 23, 2024

Problem

Current solution

Potential solution

elijahbenizzy commented Apr 23, 2024

zilto commented Apr 24, 2024 • edited

elijahbenizzy commented Apr 25, 2024

skrawcz commented Apr 25, 2024

elijahbenizzy commented Apr 25, 2024 • edited

skrawcz commented Apr 26, 2024 • edited

elijahbenizzy commented Apr 26, 2024

skrawcz commented Apr 26, 2024

elijahbenizzy commented Apr 26, 2024

zilto commented Apr 26, 2024 • edited

Improve API for `.execute()` all nodes #836

Improve API for `.execute()` all nodes #836

zilto commented Apr 24, 2024 •

edited

elijahbenizzy commented Apr 25, 2024 •

edited

skrawcz commented Apr 26, 2024 •

edited

zilto commented Apr 26, 2024 •

edited