This repository has been archived by the owner on Dec 29, 2022. It is now read-only.

Advice - handling deployments whilst a DAG is running #13

Open
darrenhaken opened this issue Aug 16, 2018 · 5 comments

Comments

@darrenhaken

First of all, I am thrilled you're working on this operator! Also, great work on Composer.

I was wondering if anyone was prepared to discuss how to achieve DAG reliability while a component is being deployed on Airflow. Since Kubernetes can routinely reschedule pods, I imagine this demands higher reliability from DAGs.

When using Airflow to, say, run a Spark job on Dataproc, what would happen to a DAG run if a restart occurred? Do you have any advice on improving reliability?

Please feel free to reply here, or we can talk offline if that's more useful. Hopefully you can provide some input.

@barney-s
Contributor

By DAG reliability, do you mean what happens if pods (each mapping to an Airflow task) are restarted often?
My understanding is that Airflow tasks are meant to be designed to be idempotent (no side effects from re-running them).
That would take care of unreliable pod scheduling even with the Celery executor and its workers.
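
As a rough illustration only (the bucket path, schedule and names below are placeholders, and the import paths are 1.10.x-era), an idempotent, retryable task keyed on the logical date could look something like this:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # airflow.operators.python in 2.x


def load_partition(ds, **kwargs):
    """Recompute and overwrite the whole partition for the logical date `ds`.

    Because the task overwrites its output instead of appending to it,
    re-running it (e.g. after its pod is rescheduled) gives the same result.
    """
    target = "gs://example-bucket/output/dt={}/".format(ds)  # hypothetical path
    print("rewriting " + target)


default_args = {
    "retries": 3,                         # rerun the task if its pod is killed
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="idempotent_example",
    default_args=default_args,
    start_date=datetime(2018, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_partition",
        python_callable=load_partition,
        provide_context=True,  # needed on 1.10.x; context is passed automatically in 2.x
    )
```

Because each run rewrites the whole partition for its execution date, the scheduler can retry the task after a pod restart without corrupting anything.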

But for the cases you mentioned, I would like to loop in a few more folks to advise.
@liyinan926
@dimberman

@darrenhaken
Author

@barney-s thanks for the answer. I wasn't sure if the entire DAG failed or simply the task; it looks like it's the task.

I'm interested in hearing any other info you can share.

@darrenhaken
Author

Hi, I quickly want to pick this back up to get a bit more detail from you.

So let's say I have a task running during a DAG run, for example one that runs a remote Spark job. Would you expect the task to fail, given that a deployment would cause the Pod to be replaced with a new instance?

The task would then retry and the Spark job be resubmitted?

@dimberman

@darrenhaken if the pod running the Spark job fails and you're using the k8s executor, it will report as a failure of the task. It would be difficult to have a pod come back up and somehow recreate task-level state, since we don't have access to that task-level information.
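
For what it's worth, the usual way to get the "retry and resubmit" behaviour described above is task-level retries. A rough sketch (the cluster, region, jar path and the surrounding `dag` object are placeholders, not anything specific to this operator):

```python
from datetime import timedelta

from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in 2.x

# If the worker pod dies mid-run, the task instance is marked failed and a
# retry submits the Spark job again; the job itself must tolerate resubmission.
submit_spark = BashOperator(
    task_id="submit_spark_job",
    bash_command=(
        "gcloud dataproc jobs submit spark "
        "--cluster=example-cluster --region=us-central1 "   # placeholder cluster/region
        "--class=com.example.Job --jars=gs://example-bucket/job.jar"
    ),
    retries=2,
    retry_delay=timedelta(minutes=10),
    dag=dag,  # assumes a DAG object defined elsewhere
)
```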

@gegnew

gegnew commented Feb 17, 2020

Hi, sorry to necropost a bit here, but I recently had an issue where a redeployment of our Airflow service caused a DAG to hang. A redeployment (via terraform on ECS) occurred during a DAG run, and for some reason that DAG run was never marked as "failed", but was marked as "running" after the redeployment, even though nothing was, in fact, running. Since our DAG was configured without allowing parallel runs, this stopped any more DAG runs from starting.

Any thoughts about how to prevent this?
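
One possible mitigation (just a sketch, with 1.10.x-era imports and placeholder names) is to put timeouts on the DAG run and its tasks, so that a run orphaned by a redeployment is eventually marked failed instead of blocking new runs under max_active_runs:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in 2.x

with DAG(
    dag_id="guarded_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_runs=1,                   # still only one run at a time...
    dagrun_timeout=timedelta(hours=2),   # ...but a run stuck in "running" is failed after 2h
) as dag:
    BashOperator(
        task_id="do_work",
        bash_command="echo work",                 # placeholder command
        execution_timeout=timedelta(hours=1),     # individual tasks time out as well
        retries=1,
    )
```

As far as I know, dagrun_timeout is only enforced for scheduled runs, so a hung manually triggered run may still need to be cleared by hand.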
