-
Hello, I am currently trying to use Airflow on an HPC cluster that uses Slurm to distribute jobs across different nodes. We don't wish to replace Slurm, but to use Airflow to launch jobs on it. I have tried to integrate both systems over the last two weeks but haven't found a proper way to do it, and I haven't found information about it anywhere either.

Approach 1: create a custom Executor. In this case, the custom executor generates the Slurm command. I found some problems:

Approach 2: create a custom Operator (based on the BashOperator).
a) Using srun. However:
b) Using sbatch. However:

Does anyone have experience with this issue? I would appreciate your input! Thanks :)
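For reference, the simplest form of approach 2 is just wrapping Slurm's CLI in a BashOperator. A minimal sketch, assuming the Airflow worker runs on a node where srun/sbatch are available; the DAG id, script path, and Slurm options are hypothetical placeholders:

```python
# Minimal sketch of approach 2: wrapping Slurm's CLI in BashOperator.
# Assumes the worker can reach srun/sbatch; all names/paths are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="slurm_bash_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # Airflow 2.4+; use schedule_interval=None on older versions
) as dag:
    # a) srun blocks until the Slurm job finishes, so the task's exit code
    #    mirrors the job's exit code -- but the Airflow worker slot stays
    #    occupied for the job's entire runtime.
    run_blocking = BashOperator(
        task_id="srun_job",
        bash_command="srun --ntasks=1 --time=00:10:00 /path/to/job.sh",
    )

    # b) sbatch returns right after queueing, so the task succeeds as soon
    #    as submission succeeds -- Airflow then knows nothing about the
    #    job's actual outcome unless something else polls it.
    submit_and_forget = BashOperator(
        task_id="sbatch_job",
        bash_command="sbatch --ntasks=1 --time=00:10:00 /path/to/job.sh",
    )
```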
-
No idea about SLURM, but:

Approach 1: You need to implement your executor so that it also monitors and reports the task status back. Since Airflow is a distributed system, a simple "task error code" is not enough to keep the state. A task might fail for various reasons, it can then be retried when, for example, a Celery worker is restarted, and there are various edge cases. There are two ways state might be updated: tasks can modify their own state in the database when they succeed, or the monitoring side (the executor) determines that a task failed or did not have time to update its state, and updates it on the task's behalf. This is due to the distributed nature of Airflow, potential failover scenarios and the like, so a simple "exit code" is not a good indicator and you should not base task state on it. Writing your own executor means implementing a lot of failover and failure edge cases, and it's quite a complex task.

Approach 2:
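To give a feeling for the surface involved, here is a minimal sketch of what a custom Slurm executor would have to cover, assuming the Airflow 2.x BaseExecutor API. The SlurmExecutor name, the sbatch/sacct parsing, and the state mapping are hypothetical, and all the failover edge cases described above are omitted:

```python
# Hypothetical sketch only: a custom executor must submit AND keep polling,
# reporting state back to Airflow itself rather than trusting an exit code.
import subprocess

from airflow.executors.base_executor import BaseExecutor


class SlurmExecutor(BaseExecutor):
    def __init__(self):
        super().__init__()
        self.jobs = {}  # TaskInstanceKey -> Slurm job id

    def execute_async(self, key, command, queue=None, executor_config=None):
        # 'command' is the ["airflow", "tasks", "run", ...] invocation that
        # must run inside the Slurm allocation.
        out = subprocess.check_output(
            ["sbatch", "--parsable", "--wrap", " ".join(command)], text=True
        )
        self.jobs[key] = out.strip().split(";")[0]  # --parsable prints the job id

    def sync(self):
        # Called periodically by the scheduler: the executor, not the task's
        # exit code alone, is responsible for reporting state to the DB.
        for key, job_id in list(self.jobs.items()):
            state = subprocess.check_output(
                ["sacct", "-j", job_id, "--format=State", "--noheader", "-X", "-P"],
                text=True,
            ).strip()
            if state.startswith("COMPLETED"):
                self.success(key)  # updates the metadata DB via change_state
                del self.jobs[key]
            elif state.startswith(("FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL")):
                self.fail(key)
                del self.jobs[key]

    def end(self):
        self.sync()
```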
-
Hi @potiuk, thank you for your answer!

We've ended up developing a deferrable operator and a trigger. The operator first submits the job to Slurm and then defers itself until the trigger detects a state change or new output in the Slurm job's log file. Depending on the Slurm state, we defer the operator again, finish OK, or raise an AirflowException.

Since triggers are able to run in a highly available fashion, we will be able to restart Airflow for any reason without losing track of the already submitted Slurm jobs. As for idempotency, in our case rerunning the task overwrites the data.
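For anyone landing here later, the rough shape of that solution is sketched below, assuming the Airflow 2.2+ deferrable API. The class names, module path, and the sacct polling are hypothetical stand-ins; the real trigger watches the job's log file as described above:

```python
# Hypothetical sketch of a deferrable operator + trigger for Slurm jobs.
import asyncio
import subprocess

from airflow.exceptions import AirflowException
from airflow.models.baseoperator import BaseOperator
from airflow.triggers.base import BaseTrigger, TriggerEvent


class SlurmJobTrigger(BaseTrigger):
    """Waits in the triggerer process until the Slurm job reaches a final state."""

    def __init__(self, job_id: str, poll_interval: float = 30.0):
        super().__init__()
        self.job_id = job_id
        self.poll_interval = poll_interval

    def serialize(self):
        # Serialization lets any triggerer pick the trigger up again, which
        # is what makes this survive Airflow restarts.
        return (
            "my_plugin.triggers.SlurmJobTrigger",  # hypothetical module path
            {"job_id": self.job_id, "poll_interval": self.poll_interval},
        )

    async def run(self):
        while True:
            # Poll sacct without blocking the triggerer's event loop.
            proc = await asyncio.create_subprocess_exec(
                "sacct", "-j", self.job_id,
                "--format=State", "--noheader", "-X", "-P",
                stdout=asyncio.subprocess.PIPE,
            )
            stdout, _ = await proc.communicate()
            state = stdout.decode().strip()
            if state.startswith(("COMPLETED", "FAILED", "CANCELLED", "TIMEOUT")):
                yield TriggerEvent({"job_id": self.job_id, "state": state})
                return
            await asyncio.sleep(self.poll_interval)


class SlurmSubmitOperator(BaseOperator):
    def __init__(self, script: str, **kwargs):
        super().__init__(**kwargs)
        self.script = script

    def execute(self, context):
        out = subprocess.check_output(["sbatch", "--parsable", self.script], text=True)
        job_id = out.strip().split(";")[0]  # --parsable prints "jobid[;cluster]"
        # Release the worker slot; the triggerer watches the job from here on.
        self.defer(trigger=SlurmJobTrigger(job_id=job_id), method_name="execute_complete")

    def execute_complete(self, context, event):
        if not event["state"].startswith("COMPLETED"):
            raise AirflowException(f"Slurm job {event['job_id']} ended in {event['state']}")
        return event["job_id"]
```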
-
I found a DaskExecutor. Dask can use Slurm to launch a cluster. I have never tried this idea myself, though.
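If someone wants to try it, the idea would roughly be: start a Dask cluster on Slurm via dask-jobqueue, then point Airflow's DaskExecutor at its scheduler. A minimal, untested sketch, assuming dask-jobqueue is installed; the queue and resource values are site-specific placeholders:

```python
# Sketch of the DaskExecutor idea: dask-jobqueue starts Dask workers as
# Slurm jobs, and Airflow's DaskExecutor sends tasks to that scheduler.
# Untested with Airflow (as noted above); values are placeholders.
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="normal",        # Slurm partition
    cores=4,               # cores per worker job
    memory="8GB",          # memory per worker job
    walltime="01:00:00",
)
cluster.scale(jobs=2)      # submits 2 sbatch jobs, each running a Dask worker

# Point Airflow at this scheduler, e.g. in airflow.cfg:
#   [core]
#   executor = DaskExecutor
#   [dask]
#   cluster_address = <value of cluster.scheduler_address>
print(cluster.scheduler_address)
```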