support loading DAG definitions from S3 buckets #249

Open
thesuperzapper opened this issue Jun 30, 2021 · 5 comments · May be fixed by #782 or #828
Labels
kind/enhancement kind - new features or changes

Comments

@thesuperzapper
Member

Currently we support git-sync with the dags.gitSync.* values, but we could do something similar for S3 buckets; that is, let people store their DAGs in a folder on an S3 bucket.

We could possibly generalise this to include GCS and Azure Blob Storage (ABS), but those likely need different tooling to do the sync (so they might need to be separate features/containers). S3 is clearly the best place to start, as it's the most widely used.
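
For illustration, a hypothetical values layout mirroring the existing dags.gitSync.* structure might look like this (none of these s3Sync keys exist in the chart today; all names below are invented):

dags:
  s3Sync:
    enabled: true
    bucket: my-dags-bucket   # hypothetical key: source bucket
    prefix: dags/            # hypothetical key: folder within the bucket
    syncWait: 60             # hypothetical key: seconds between syncs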

@thesuperzapper thesuperzapper added the kind/enhancement kind - new features or changes label Jun 30, 2021
@thesuperzapper thesuperzapper added this to Unsorted in Issue Triage and PR Tracking via automation Jun 30, 2021
@thesuperzapper thesuperzapper moved this from Unsorted to To Do - Enhancements in Issue Triage and PR Tracking Jun 30, 2021
@thesuperzapper thesuperzapper moved this from To Do - Enhancements to To Do - P2 in Issue Triage and PR Tracking Jul 8, 2021
@thesuperzapper thesuperzapper moved this from TODO/P2 to TODO/P3 in Issue Triage and PR Tracking Jul 8, 2021
@thesuperzapper thesuperzapper moved this from To Do | priority-p3 to To Do | priority-p2 in Issue Triage and PR Tracking Aug 13, 2021
@stale stale bot added lifecycle/stale lifecycle - this is stale and removed lifecycle/stale lifecycle - this is stale labels Aug 29, 2021
@stale stale bot added lifecycle/stale lifecycle - this is stale and removed lifecycle/stale lifecycle - this is stale labels Oct 29, 2021
@stale stale bot added the lifecycle/stale lifecycle - this is stale label Dec 30, 2021
@stale stale bot closed this as completed Jan 7, 2022
Issue Triage and PR Tracking automation moved this from Backlog | Medium Priority to Done Jan 7, 2022
@thesuperzapper thesuperzapper reopened this Jan 9, 2022
Issue Triage and PR Tracking automation moved this from Done to Unsorted Jan 9, 2022
@stale stale bot removed the lifecycle/stale lifecycle - this is stale label Jan 9, 2022
@thesuperzapper thesuperzapper moved this from Unsorted to Backlog | Medium Priority in Issue Triage and PR Tracking Jan 9, 2022
@thesuperzapper thesuperzapper added the lifecycle/frozen lifecycle - this can't become stale label Jan 9, 2022
@thesuperzapper thesuperzapper removed the lifecycle/frozen lifecycle - this can't become stale label Mar 22, 2022
@thesuperzapper thesuperzapper added this to the airflow-8.8.0 milestone Mar 22, 2022
@thesuperzapper thesuperzapper moved this from Backlog | Medium Priority to Triage | Needs PR in Issue Triage and PR Tracking Mar 22, 2022
@airflow-helm airflow-helm deleted a comment from stale bot Apr 20, 2022
@airflow-helm airflow-helm deleted a comment from stale bot Apr 20, 2022
@airflow-helm airflow-helm deleted a comment from stale bot Apr 20, 2022
@thesuperzapper thesuperzapper changed the title from "Implement s3 DAGs sync feature" to "support loading DAG definitions from s3" Apr 20, 2022
@yossisht9876

Hey guys,

Until we have this as a native solution, I created a sidecar container for syncing DAGs from AWS S3. Take a look :)

https://github.com/yossisht9876/airflow-s3-dag-sync

@thesuperzapper thesuperzapper changed the title from "support loading DAG definitions from s3" to "support loading DAG definitions from S3 buckets" May 4, 2022
@tarekabouzeid

Hi @thesuperzapper,

I started working on this, implementing it similarly to the git DAG sync you mentioned. My approach is to run rclone sync as a Kubernetes Job that fetches the DAGs from the S3 bucket and stores them in a mounted volume, which is also mounted into the Airflow scheduler Pod (see the sketch below). Should I continue implementing that?

Best Regards,
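
For concreteness, a minimal sketch of what such an rclone Job could look like, assuming an existing airflow-dags PVC shared with the scheduler, in-cluster AWS credentials (e.g. an instance profile or IRSA), and a placeholder bucket name my-dags-bucket:

apiVersion: batch/v1
kind: Job
metadata:
  name: dags-rclone-sync
  namespace: airflow
spec:
  template:
    spec:
      containers:
        - name: rclone
          image: rclone/rclone
          # the image's entrypoint is `rclone`, so these args run:
          # rclone sync s3remote:my-dags-bucket/dags /opt/airflow/dags/
          args:
            - sync
            - s3remote:my-dags-bucket/dags
            - /opt/airflow/dags/
          env:
            # define the "s3remote" remote via rclone's
            # RCLONE_CONFIG_<REMOTE>_* environment variables
            - name: RCLONE_CONFIG_S3REMOTE_TYPE
              value: s3
            - name: RCLONE_CONFIG_S3REMOTE_PROVIDER
              value: AWS
            - name: RCLONE_CONFIG_S3REMOTE_ENV_AUTH
              value: "true"
          volumeMounts:
            - name: dags-data
              mountPath: /opt/airflow/dags/
      volumes:
        - name: dags-data
          persistentVolumeClaim:
            claimName: airflow-dags
      restartPolicy: OnFailure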

@yossisht9876

yossisht9876 commented Jul 27, 2022

I have a better solution, but you have to configure a PVC for the DAG-bag folder /opt/airflow/dags.

Once the PVC is ready, you just need to create a CronJob that runs every X minutes and syncs the DAGs from S3 into it:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3-sync
  namespace: airflow
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: aws-cli
              image: amazon/aws-cli
              env:
                - name: AWS_REGION
                  value: us-east-1
              # the image's entrypoint is `aws`, so these args run:
              # aws s3 sync s3://bucket-name /opt/airflow/dags/ --no-progress --delete
              args:
                - s3
                - sync
                - s3://bucket-name
                - /opt/airflow/dags/
                - --no-progress
                - --delete
              volumeMounts:
                - name: dags-data
                  mountPath: /opt/airflow/dags/
          volumes:
            - name: dags-data
              persistentVolumeClaim:
                claimName: airflow-dags
          restartPolicy: OnFailure
      ttlSecondsAfterFinished: 172800

@benchoncy benchoncy linked a pull request Sep 5, 2023 that will close this issue
@darren-recentive

darren-recentive commented Oct 21, 2023

(quoting @yossisht9876's S3-sync CronJob comment above)

Not a bad idea. I'd also add that if you want a GitOps approach, you can disable the schedule via suspend: true, then create an ad-hoc s3-sync Job from the CronJob template in your CI/CD via kubectl create job <name> --from=cronjob/s3-sync (see the snippet below).
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#-em-job-em-
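
A minimal sketch of that change, showing only the relevant CronJob fields (the Job name passed to kubectl is arbitrary):

spec:
  # never run on the schedule; the CronJob only serves as a template
  suspend: true
  schedule: "* * * * *"
  # trigger an ad-hoc sync from CI/CD with:
  #   kubectl create job s3-sync-manual --from=cronjob/s3-sync --namespace airflow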

@chirichidi chirichidi linked a pull request Feb 16, 2024 that will close this issue
@thesuperzapper
Member Author

thesuperzapper commented May 1, 2024

I just want to say that while baked-in support for s3-sync did NOT make it into version 8.9.0 of the chart, you can use the extraInitContainers and extraContainers values that were added in #856.

Now you can effectively do what was proposed in #828 by using the following values (a minimal example is sketched below this list):

  • For Scheduler/Webserver/Workers (but not KubernetesExecutor):
    • airflow.extraContainers (looping sidecar to sync into dags folder)
    • airflow.extraInitContainers (initial clone of S3 bucket into dags folder)
    • airflow.extraVolumeMounts (mount the emptyDir)
    • airflow.extraVolumes (define an emptyDir volume)
  • For KubernetesExecutor Pod template:
    • airflow.kubernetesPodTemplate.extraContainers (you don't need the sidecar for transient Pods)
    • airflow.kubernetesPodTemplate.extraInitContainers
    • airflow.kubernetesPodTemplate.extraVolumeMounts
    • airflow.kubernetesPodTemplate.extraVolumes
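
For example, a minimal sketch of such values, assuming an emptyDir volume, in-cluster AWS credentials (e.g. IRSA), and a placeholder bucket name my-dags-bucket:

airflow:
  extraVolumes:
    - name: dags-data
      emptyDir: {}
  extraVolumeMounts:
    - name: dags-data
      mountPath: /opt/airflow/dags/
  extraInitContainers:
    # one-shot seed of the dags folder before Airflow starts
    - name: dags-s3-seed
      image: amazon/aws-cli
      args: ["s3", "sync", "s3://my-dags-bucket/dags", "/opt/airflow/dags/"]
      volumeMounts:
        - name: dags-data
          mountPath: /opt/airflow/dags/
  extraContainers:
    # looping sidecar that re-syncs every 60 seconds
    - name: dags-s3-sync
      image: amazon/aws-cli
      command: ["/bin/sh", "-c"]
      args:
        - while true; do aws s3 sync --delete s3://my-dags-bucket/dags /opt/airflow/dags/; sleep 60; done
      volumeMounts:
        - name: dags-data
          mountPath: /opt/airflow/dags/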

If someone wants to share their values and report how well it works, I am sure that would help others.

PS: You can still use a PVC-based approach, where you have a Deployment (or CronJob) that syncs your S3 bucket into that PVC as described in #249 (comment)
