Handle db isolation for mapped operators and task groups #39259

dstandish · 2024-04-25T14:59:34Z

No description provided.

airflow/models/taskinstance.py

dstandish · 2024-04-25T18:52:28Z

airflow/models/taskinstance.py

+        if isinstance(self.task, MappedOperator):
+            self.task = context["ti"].task


So, this is an interesting one @potiuk. Mapped operators get "expanded" (or "unmapped") inside MappedOperator.render_template_fields, which replaces the task attr on the ti in the context dictionary. In the non-db-isolation case that also mutates what is here self.task, because the ti in the context is the same object as the one being run. But in the db isolation case, the context dict is created via RPC, so the pydantic TI in the context dict is not the same object as the PydanticTI that is running. It's quite complicated. But anyway, this is one way to ensure that the task gets properly unmapped without relying on mutating the TI in the context dict.
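
A minimal sketch of the identity problem described above, using stand-in classes rather than the real Airflow ones. FakeTask, FakeTI and this render_template_fields are hypothetical and exist only for illustration; the deepcopy is an assumption standing in for a TI that was rebuilt from an RPC payload.

```python
import copy
from dataclasses import dataclass


@dataclass
class FakeTask:
    """Stand-in for an operator; `mapped` marks a not-yet-expanded task."""
    task_id: str
    mapped: bool = True


@dataclass
class FakeTI:
    """Stand-in for a task instance that holds a reference to its task."""
    task: FakeTask


def render_template_fields(context: dict) -> None:
    """Mimic the unmapping side effect: swap the task on context['ti']."""
    old = context["ti"].task
    context["ti"].task = FakeTask(task_id=old.task_id, mapped=False)


# Non-isolated case: the context holds the very same TI object that is running.
local_ti = FakeTI(task=FakeTask("t1"))
render_template_fields({"ti": local_ti})
same_object_unmapped = not local_ti.task.mapped  # mutation is visible locally

# Isolated case: the context TI is a separate copy of the running TI.
isolated_ti = FakeTI(task=FakeTask("t1"))
rpc_context = {"ti": copy.deepcopy(isolated_ti)}
render_template_fields(rpc_context)
copy_left_mapped = isolated_ti.task.mapped  # the running TI never saw the change

# The workaround from the diff: copy the unmapped task back explicitly.
if isolated_ti.task.mapped:
    isolated_ti.task = rpc_context["ti"].task
```

In the second case the running TI stays mapped until the task is copied back, which is the situation the two added lines in the diff are meant to handle.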

dstandish · 2024-04-25T19:25:32Z

and here @uranusjr ?

uranusjr · 2024-04-26T03:16:56Z

airflow/models/taskinstance.py

+    # when taking task over RPC, we need to add the dag back
+    if isinstance(task, MappedOperator):
+        if not task.dag:
+            task.dag = dag
+    elif not task._dag:
+        task._dag = dag


Not a fan for this… can we do this earlier in the stack, say when the task is created instead?

the issue @uranusjr is that this is already early in the stack when it's an RPC call. the only earlier place we could do it is in the decorator. WDYT? we could stick it in a private function though, to get it out of the way and reuse it within the module.

when it's not an RPC call, this has no effect and is not needed.

fyi @uranusjr this is resolved in "Use sentinel to elide the dag object on reserialization", but i can't open that PR yet because it depends on too many other PRs getting merged first
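
A rough sketch of the "private function" idea mentioned above. The helper name _reattach_dag and both stand-in classes are hypothetical; only the branching mirrors the diff, where a mapped operator exposes dag and a regular operator keeps it in _dag.

```python
from typing import Any, Optional


class FakeMappedOperator:
    """Stand-in: exposes `dag` as a plain attribute."""

    def __init__(self, dag: Optional[Any] = None) -> None:
        self.dag = dag


class FakeBaseOperator:
    """Stand-in: keeps the DAG in a private `_dag` attribute."""

    def __init__(self, dag: Optional[Any] = None) -> None:
        self._dag = dag


def _reattach_dag(task: Any, dag: Any) -> None:
    """Hypothetical helper: restore the DAG reference if it was lost in transit.

    Does nothing when the task already has a DAG, so it is a no-op outside RPC.
    """
    if isinstance(task, FakeMappedOperator):
        if not task.dag:
            task.dag = dag
    elif not task._dag:
        task._dag = dag


dag = object()  # placeholder for a DAG object

mapped = FakeMappedOperator()           # arrived without its DAG
_reattach_dag(mapped, dag)

plain = FakeBaseOperator(dag="existing")  # already has a DAG
_reattach_dag(plain, dag)
```

Keeping the check in one helper would at least confine the special-casing to a single place if it has to stay at this level of the stack.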

uranusjr · 2024-04-26T03:17:47Z

airflow/models/taskinstance.py

+        if isinstance(self.task, MappedOperator):
+            self.task = context["ti"].task


Does this not work with BaseOperator? The conditional makes this a lot weirder.

That’s right, because ti.task is only mutated when it is a MappedOperator. Otherwise ti.task is the result of an RPC call and, long story short, it can’t be used.

does that make sense @uranusjr ?

so with a normal task, self.task is the task that is created locally, and there is no need to override it with the one in the context dict. and if you did that, you'd take a task object that isn't quite complete, essentially because we don't have proper serialization of Task, since there's no real Task entity and no TaskPydantic. But generally it's not a problem because most of the time we don't need to serialize a task object.

in the mappedoperator case though, as we saw last night, "unmapping" is achieved by mutating the ti in the context dict, and it relies on the assumption that the TI in the context dict is the same object as the one that is created locally and being run, which isn't true when the context comes from RPC.

if searching for alternatives, we could look at not relying on the context dict for this "unmapping". e.g. we could forward the "original" ti object to the thing doing the unmapping so we don't need to mutate what's in context.

another option would be, upon receiving a fresh context dict over RPC, we could replace the TIs in the context with the local TIPydantic object -- or something to this effect. then perhaps we could keep the context["ti"] mutation approach for unmapping.

we could also look at changing the way we handle context over RPC. currently it's just a "working" approach but not optimal because there's no laziness. we could optimize by making each context object an accessor that is an RPC call (and we should do something like this). something like that could help here too.
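
One way the lazy-accessor idea could look, as a sketch only. LazyContext and fake_rpc are hypothetical names; the fetch callable stands in for one RPC call per context key and is not an existing Airflow API.

```python
from collections.abc import Mapping
from typing import Any, Callable, Iterator


class LazyContext(Mapping):
    """Hypothetical context whose values are fetched on first access."""

    def __init__(self, keys: list[str], fetch: Callable[[str], Any]) -> None:
        self._keys = list(keys)
        self._fetch = fetch  # stand-in for one RPC call per key
        self._cache: dict[str, Any] = {}

    def __getitem__(self, key: str) -> Any:
        if key not in self._keys:
            raise KeyError(key)
        if key not in self._cache:
            self._cache[key] = self._fetch(key)
        return self._cache[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._keys)

    def __len__(self) -> int:
        return len(self._keys)


calls: list[str] = []


def fake_rpc(key: str) -> str:
    """Record each fetch so laziness can be observed."""
    calls.append(key)
    return f"value-for-{key}"


context = LazyContext(["ti", "dag_run", "params"], fake_rpc)
first = context["ti"]
second = context["ti"]  # served from the cache, no second fetch
```

Only the keys that are actually read trigger a fetch, so a task that never touches most of the context would not pay for building it.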

It makes sense, but "if isinstance(self.task, MappedOperator)" is an awkward condition to check for this case.

> upon receiving a fresh context dict over RPC, we could replace the TIs in the context with the local TIPydantic object

This sounds somewhat promising. Instead of just the ti, we could probably try to replace the entire relationship (including e.g. dag) so we can get rid of needing to pass in dag separately into _record_task_map_for_downstreams.
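
A small sketch of the "replace the TIs in the context" option, under stated assumptions: FakeTI and localize_context are hypothetical, and the sketch assumes the TI is reachable under both the "ti" and "task_instance" keys.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeTI:
    """Stand-in for a task instance."""
    task: Any


def localize_context(context: dict, local_ti: FakeTI) -> dict:
    """Hypothetical helper: point TI entries in an RPC-built context at the local TI."""
    patched = dict(context)
    for key in ("ti", "task_instance"):
        if key in patched:
            patched[key] = local_ti
    return patched


local_ti = FakeTI(task="mapped")
rpc_context = {
    "ti": FakeTI(task="mapped"),
    "task_instance": FakeTI(task="mapped"),
    "ds": "2024-04-26",
}
context = localize_context(rpc_context, local_ti)

# An in-place "unmap" through the context now reaches the running TI again.
context["ti"].task = "unmapped"
```

Extending the same substitution to related objects such as the dag, as suggested above, would follow the same pattern with more keys.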

> It makes sense, but "if isinstance(self.task, MappedOperator)" is an awkward condition to check for this case.

yeah, i see what you're saying. e.g. better would be for the code to "tell us" when an unmap has happened.

like when we call

original_task.render_template_fields(context, jinja_env)

that could like... return a new task when it creates one. that would certainly make it more obvious what is going on too.
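
A sketch of what "return a new task when it creates one" might look like. PlainTask, MappedTask and render are hypothetical stand-ins; the real render_template_fields signature also takes jinja_env, which is left out here for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlainTask:
    task_id: str

    def render_template_fields(self, context: dict) -> Optional["PlainTask"]:
        """Render in place; nothing was replaced, so return None."""
        return None


@dataclass
class MappedTask:
    task_id: str

    def render_template_fields(self, context: dict) -> Optional[PlainTask]:
        """Unmap and hand the new task back instead of mutating the context."""
        return PlainTask(task_id=self.task_id)


def render(task, context: dict):
    """Caller keeps whichever task it should continue with."""
    unmapped = task.render_template_fields(context)
    return task if unmapped is None else unmapped


plain = PlainTask("p")
kept = render(plain, {})             # same object comes back
swapped = render(MappedTask("m"), {})  # a new, unmapped task comes back
```

With a return value like this the caller no longer needs an isinstance check, since the call itself reports whether an unmap happened.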

uranusjr · 2024-05-20T16:58:56Z

airflow/models/taskinstance.py

+    #  currently possible for a downstream to depend on one individual mapped
+    #  task instance. This will change when we implement task mapping inside
+    #  a mapped task group, and we'll need to further analyze the case.


Accidental?

no -- adding the indent makes all of this text "part of" the todo (i.e. it all shows yellow in the IDE). if we don't do this, it looks like a separate comment. just a drive-by "fix", but i can remove it if you like

Handle db isolation for mapped operators and task groups

50fb03b

dstandish requested a review from potiuk April 25, 2024 14:59

dstandish requested review from uranusjr, kaxil, XD-DENG and ashb as code owners April 25, 2024 14:59

potiuk reviewed Apr 25, 2024

dstandish commented Apr 25, 2024

Update airflow/models/taskinstance.py

32c7e1f

dstandish commented Apr 25, 2024

uranusjr reviewed Apr 26, 2024

uranusjr reviewed May 20, 2024

dstandish mentioned this pull request May 24, 2024

Use sentinel to elide the dag object on reserialization #39825

Draft
