This repository has been archived by the owner on Dec 29, 2022. It is now read-only.

Advice - handling deployments whilst a DAG is running #13

Open
darrenhaken opened this issue Aug 16, 2018 · 5 comments

Comments

@darrenhaken

First of all, I am thrilled you're working on this operator! Also, great work on Composer.

I was wondering if anyone was prepared to discuss how to achieve DAG reliability while a component is being deployed on Airflow. Since Kubernetes can routinely reschedule pods, I imagine this demands higher reliability from DAGs.

When using Airflow to, say, run a Spark job on Dataproc, what would happen to a DAG run if a restart occurred? Do you have any advice on improving reliability?

Please feel free to reply here, or we can talk offline if that's more useful. Hopefully you can provide some input.

@barney-s
Contributor

By DAG reliability, do you mean what happens if pods (each mapping to an Airflow task) are restarted often?
My understanding is that Airflow tasks are meant to be designed to be idempotent (no side effects from re-running them).
That would take care of unreliable pod scheduling even with the Celery executor and its workers.
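
As a rough illustration only (the bucket path, schedule and names below are placeholders, and the import paths are 1.10.x-era), an idempotent, retryable task keyed on the logical date could look something like this:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # airflow.operators.python in 2.x


def load_partition(ds, **kwargs):
    """Recompute and overwrite the whole partition for the logical date `ds`.

    Because the task overwrites its output instead of appending to it,
    re-running it (e.g. after its pod is rescheduled) gives the same result.
    """
    target = "gs://example-bucket/output/dt={}/".format(ds)  # hypothetical path
    print("rewriting " + target)


default_args = {
    "retries": 3,                         # rerun the task if its pod is killed
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="idempotent_example",
    default_args=default_args,
    start_date=datetime(2018, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_partition",
        python_callable=load_partition,
        provide_context=True,  # needed on 1.10.x; context is passed automatically in 2.x
    )
```

Because each run rewrites the whole partition for its execution date, the scheduler can retry the task after a pod restart without corrupting anything.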

But for the cases you mentioned, I would like to loop in a few more folks to advise.
@liyinan926
@dimberman

@darrenhaken
Author

@barney-s thanks for the answer. I wasn't sure if the entire DAG failed or simply the task; it looks like it's the task.

I'm interested in hearing any other info you can share.

@darrenhaken
Author

Hi, I quickly want to pick this back up to get a bit more detail from you.

So let's say I have a task running during a DAG run, for example one that runs a remote Spark job. Would you expect the task to fail, given that a deployment would cause the Pod to be replaced with a new instance?

The task would then retry and the Spark job be resubmitted?

@dimberman

@darrenhaken if the pod running the Spark job fails and you're using the k8s executor, it will report as a failure of the task. It would be difficult to have a pod come back up and somehow recreate task-level state, since we don't have access to that task-level information.
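
For what it's worth, the usual way to get the "retry and resubmit" behaviour described above is task-level retries. A rough sketch (the cluster, region, jar path and the surrounding `dag` object are placeholders, not anything specific to this operator):

```python
from datetime import timedelta

from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in 2.x

# If the worker pod dies mid-run, the task instance is marked failed and a
# retry submits the Spark job again; the job itself must tolerate resubmission.
submit_spark = BashOperator(
    task_id="submit_spark_job",
    bash_command=(
        "gcloud dataproc jobs submit spark "
        "--cluster=example-cluster --region=us-central1 "   # placeholder cluster/region
        "--class=com.example.Job --jars=gs://example-bucket/job.jar"
    ),
    retries=2,
    retry_delay=timedelta(minutes=10),
    dag=dag,  # assumes a DAG object defined elsewhere
)
```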

@gegnew

gegnew commented Feb 17, 2020

Hi, sorry to necropost a bit here, but I recently had an issue where a redeployment of our Airflow service caused a DAG to hang. A redeployment (via terraform on ECS) occurred during a DAG run, and for some reason that DAG run was never marked as "failed", but was marked as "running" after the redeployment, even though nothing was, in fact, running. Since our DAG was configured without allowing parallel runs, this stopped any more DAG runs from starting.

Any thoughts about how to prevent this?
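
One possible mitigation (just a sketch, with 1.10.x-era imports and placeholder names) is to put timeouts on the DAG run and its tasks, so that a run orphaned by a redeployment is eventually marked failed instead of blocking new runs under max_active_runs:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in 2.x

with DAG(
    dag_id="guarded_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_runs=1,                   # still only one run at a time...
    dagrun_timeout=timedelta(hours=2),   # ...but a run stuck in "running" is failed after 2h
) as dag:
    BashOperator(
        task_id="do_work",
        bash_command="echo work",                 # placeholder command
        execution_timeout=timedelta(hours=1),     # individual tasks time out as well
        retries=1,
    )
```

As far as I know, dagrun_timeout is only enforced for scheduled runs, so a hung manually triggered run may still need to be cleared by hand.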
