Airflow with Spark 2.4.8, including a sample DAG and a spark-submit job.
Docker and docker-compose must be installed.
Clone the repo and run the build.sh file. This creates an image airflow:latest and starts the container. [Note: the image can also be downloaded directly with 'docker pull senchandra/airflow'. If you use this, change the image in docker-compose.yml from airflow to senchandra/airflow.]
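If you switch to the prebuilt image, the relevant part of docker-compose.yml would look roughly like the fragment below (the service name and surrounding keys are illustrative; keep the rest of your file as-is):

```yaml
services:
  airflow:
    # default: locally built image
    # image: airflow:latest
    # alternative: prebuilt image from Docker Hub
    image: senchandra/airflow
```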
- Spark standalone cluster manager (master and worker). Spark UI at port 8080; Spark master at spark://127.0.1.1:7077
- Airflow webserver (Running on port 8089)
- Airflow scheduler
The sample DAG runs a simple wordcount Spark job.
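For illustration, the wordcount logic the Spark job computes can be sketched in plain Python (the actual job runs on the Spark cluster; this is only the equivalent counting step):

```python
from collections import Counter

def wordcount(text: str) -> Counter:
    """Count occurrences of each whitespace-separated word, case-insensitively."""
    return Counter(text.lower().split())

counts = wordcount("to be or not to be")
print(counts.most_common(2))  # → [('to', 2), ('be', 2)]
```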
- Open the Airflow webserver (localhost:8089). Log in with username admin and password admin.
- Wait for all DAGs to load; DAG refreshing takes some time. The sample DAG is named 'sparkoperator_demo' and requires a Spark connection with the connection id 'spark_local'.
- Open Connections under the Admin tab. Click 'Create a new Record' (+ button), enter the details below in the connection configuration, and click Save:
  - Connection Id: spark_local
  - Connection Type: Spark
  - Host: spark://127.0.1.1
  - Port: 7077
- Go to the DAGs tab and click on sparkoperator_demo. Toggle the Pause switch to unpause it. Click the Play button on the right and select 'Trigger DAG' from the dropdown.
- Toggle the 'Auto Refresh' button to watch progress. Click 'Graph' -> 'spark_submit_task' -> 'Log' to view the Spark logs.
- While the DAG and Spark job are running, the Spark application is registered with the cluster manager, which can be observed in the Spark UI at localhost:8080.
- To add further DAGs, place your Python DAG file in the /root/airflow/dags/ folder inside the container. The DAG auto-registers and, after refreshing the Airflow UI at localhost:8089, appears under the DAGs tab. It will not be visible if there is an error in the DAG file; such errors can be observed in the Airflow UI or through the command 'airflow dags list' inside the container.
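A DAG file you drop into that folder might look roughly like the sketch below. This is only a sketch: it assumes the container's Airflow environment with the Spark provider installed (the SparkSubmitOperator import path differs between Airflow versions), and the DAG id and application path are placeholders; 'spark_local' is the connection created above.

```python
from datetime import datetime

from airflow import DAG
# In older Airflow versions this lives under airflow.contrib.operators instead.
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="my_spark_demo",             # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,             # manual trigger only
    catchup=False,
) as dag:
    spark_submit_task = SparkSubmitOperator(
        task_id="spark_submit_task",
        conn_id="spark_local",          # connection id configured in Admin -> Connections
        application="/opt/spark_apps/wordcount.py",  # hypothetical path to your job
    )
```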