
feat: add s3-sync sidecar #828

Open · wants to merge 2 commits into main
Conversation

@chirichidi commented Feb 16, 2024

What issues does your PR fix?

What does your PR do?

Overview

This Pull Request introduces a new feature, s3Sync, designed to enhance our application's ability to synchronize data with AWS S3. This addition aims to provide a more robust and flexible solution for managing cloud storage synchronization tasks.

Details

  • New Feature: Implemented the s3Sync functionality using the official aws-cli tooling and straightforward logic, providing stable syncing and automatic detection of changes.
  • Ensured that the new s3Sync feature is fully compatible with our existing infrastructure and does not introduce any breaking changes or dependencies.

No Changes to Existing gitSync Functionality

  • It's crucial to note that while developing the s3Sync feature, special care was taken not to modify or affect the existing gitSync functionality. Our commitment was to add value without disrupting current operations or workflows.
  • Comprehensive testing has been conducted to confirm that gitSync remains unaffected and operates as expected.

Testing and Validation

  • Conducted basic tests such as ct lint and helm template, and performed operational tests on an actual Kubernetes cluster to verify that existing features, especially gitSync, continue to operate without any issues.
[Screenshot: operational test on a Kubernetes cluster, 2024-02-17]

Conclusion

This enhancement is a step forward in our ongoing efforts to provide a seamless and powerful toolset for our users. By introducing s3Sync, we are expanding our capabilities while ensuring the integrity and performance of our existing features remain intact.

I look forward to your feedback and any discussions regarding this PR. Thank you for considering these enhancements.


Signed-off-by: chirichidi <tolessnoise@gmail.com>
@chirichidi chirichidi marked this pull request as ready for review February 16, 2024 17:16
@rsotogar

I am curious to know how syncing DAGs from S3 works. Do we need to create a Kubernetes Secret with our AWS access key and secret key, and how often will it poll for new DAG files/folders?

@wikitops commented Apr 9, 2024

This would be useful to have in the Helm chart. Git sync is sometimes not the best option.

@pedorro commented Apr 29, 2024

Just another bump on this PR. This seems like the best option for Airflow deployments in EKS.

And just to clarify about AWS credentials, in general we would be using IAM roles rather than user credentials, so there should be no need for additional k8s secrets, or anything like that.
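To expand on the IAM-role point: on EKS this is typically done with IAM Roles for Service Accounts (IRSA), where the pod's ServiceAccount carries a role annotation and containers pick up temporary credentials automatically. A minimal sketch, with a placeholder account ID and role name (not part of this PR):

```yaml
# ServiceAccount annotated for IRSA: pods using it (including an s3-sync
# sidecar) receive temporary AWS credentials via the EKS pod identity
# webhook, so no long-lived Secrets are needed.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow
  annotations:
    # Placeholder ARN: substitute a role granting s3:GetObject/s3:ListBucket
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/airflow-dag-reader
```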

@thesuperzapper (Member)
@chirichidi thanks for the very interesting PR, I would love to get "s3-sync" as a concept into the chart (as it will help users migrate from MWAA).

The main thing we need to finalize is the "reconciliation loop", everything else is secondary and can be updated later.

If I understand your PR correctly, you have done the following:

  • You have implemented a "sidecar" pattern similar to our gitsync sidecar
  • An init-container which runs the following command (to populate the dags folder as the pod starts)
    • aws s3 cp --recursive s3://<BUCKET>/<PATH> ./path/to/dags
  • A sidecar container which runs the following command on loop (to keep the dags folder up to date):
    • aws s3 sync --delete s3://<BUCKET>/<PATH> ./path/to/dags
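The sidecar loop described above can be sketched roughly as follows (the function name, arguments, and finite iteration count are illustrative assumptions for demonstration, not the PR's actual code; the real sidecar would loop forever):

```shell
# Sketch of the sidecar reconciliation loop: repeatedly mirror an S3
# prefix into the local dags folder, pausing between cycles.
sync_loop() {
  bucket_path=$1   # e.g. s3://<BUCKET>/<PATH>
  dags_dir=$2      # e.g. ./path/to/dags
  interval=$3      # seconds to sleep between syncs
  iterations=$4    # cycle count (illustrative; the real sidecar never exits)
  i=0
  while [ "$i" -lt "$iterations" ]; do
    # --delete keeps the local folder an exact mirror of the S3 prefix,
    # removing files that no longer exist in the bucket.
    aws s3 sync --delete "$bucket_path" "$dags_dir" \
      || echo "sync failed; retrying next cycle" >&2
    i=$((i + 1))
    sleep "$interval"
  done
}
```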

My main concerns are:

  1. What happens when a sync is halfway done, but Airflow starts refreshing the DAGs (so we have some old and some new)?
    • This is avoided in git-sync by using symbolic link switching.
    • It's likely rare for something seriously bad to happen when this occurs, so it might not matter.
  2. Wouldn't it make more sense to also use aws s3 sync for the init-container?
    • Init-containers can sometimes run again (when the pod restarts), so we can save a bit of time by not re-downloading everything.
  3. I wonder if we might want to use some or all of the following aws sync parameters:
    • --quiet
    • --only-show-errors
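For context, the symbolic-link switching mentioned in concern 1 can be sketched like this: each sync lands in a fresh revision directory, and a symlink that Airflow reads is switched via an atomic rename, so a half-finished sync is never visible. Paths, names, and the GNU-specific `mv -T` / `date +%s%N` flags are illustrative assumptions, not the PR's code:

```shell
# Sketch of git-sync-style symbolic link switching applied to an S3 sync
# loop. Airflow would be pointed at the symlink $DAGS_LINK rather than a
# plain directory.
DAGS_ROOT="${DAGS_ROOT:-/opt/airflow/dag-store}"   # holds revision dirs
DAGS_LINK="${DAGS_LINK:-/opt/airflow/dags}"        # what Airflow reads

# publish_revision <dir>: atomically repoint $DAGS_LINK at <dir>.
publish_revision() {
  tmp_link="$DAGS_LINK.tmp.$$"
  ln -s "$1" "$tmp_link"
  # rename(2) replaces the old symlink in one step, so readers always see
  # either the complete old tree or the complete new one, never a mix.
  mv -T "$tmp_link" "$DAGS_LINK"
}

# One sync cycle: download into a fresh revision dir, then switch.
sync_once() {
  rev="$DAGS_ROOT/rev-$(date +%s%N)"
  mkdir -p "$rev"
  aws s3 sync --delete "s3://<BUCKET>/<PATH>" "$rev"
  publish_revision "$rev"
}
```

A cleanup step that prunes old revision directories after the switch would be needed in practice, as in git-sync.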

Are there any other things I have missed?

PS: if/when we merge this, I will update the values/docs in your PR to match the style of the chart.

Development

Successfully merging this pull request may close these issues:

  • support loading DAG definitions from S3 buckets

5 participants