S3 connector #615

Open
NeQuissimus opened this issue Jul 17, 2019 · 0 comments

Is your feature request related to a problem? Please describe.
Similar to the file connector, ingesting data from S3 would be fantastic.
S3 can emit notifications about new files to SQS, Kinesis, etc., so it may be beneficial to hook in there.

Essentially, it would be great if Brooklin could be notified of new S3 files and then ingest the actual files, so we can output them onto Kafka.
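
As a rough sketch of the notification side, assuming S3 publishes its event notifications directly to an SQS queue (the queue URL is a placeholder and the client setup uses AWS SDK v1 defaults; a real connector would take these from its datastream config):

```scala
import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import com.fasterxml.jackson.databind.ObjectMapper
import scala.jdk.CollectionConverters._

object S3NotificationPoller {
  private val sqs    = AmazonSQSClientBuilder.defaultClient()
  private val mapper = new ObjectMapper()

  // Hypothetical queue URL; in practice this would come from the connector/datastream config.
  private val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"

  /** Poll SQS once and return (bucket, key) pairs for the new S3 objects it announced. */
  def pollNewObjects(): Seq[(String, String)] = {
    val request = new ReceiveMessageRequest(queueUrl)
      .withMaxNumberOfMessages(10)
      .withWaitTimeSeconds(20)

    sqs.receiveMessage(request).getMessages.asScala.toSeq.flatMap { msg =>
      // An S3 event notification body carries a "Records" array with bucket name and object key.
      val records = mapper.readTree(msg.getBody).path("Records").elements().asScala.toSeq
      val objects = records.map { r =>
        (r.path("s3").path("bucket").path("name").asText(),
         r.path("s3").path("object").path("key").asText())
      }
      sqs.deleteMessage(queueUrl, msg.getReceiptHandle) // ack the notification
      objects
    }
  }
}
```

Each (bucket, key) pair returned here would then be fed into the ingestion path described below.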

It may be necessary to differentiate between file types, for example:

  • Plain-text line-by-line
  • Single-line JSON objects
  • Pretty-printed JSON

Finally, files could be decompressed or unarchived on the fly using java.util.zip.GZIPInputStream and java.util.zip.ZipInputStream.
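
For illustration, a small sketch of choosing the decompression wrapper from the object key; dispatching purely on the file extension is an assumption, and a real connector might also look at the object's Content-Type / Content-Encoding metadata:

```scala
import java.io.InputStream
import java.util.zip.{ GZIPInputStream, ZipInputStream }

object ContentStreams {
  /** Wrap the raw S3 object stream based on the object key's extension. */
  def openContentStream(key: String, raw: InputStream): InputStream =
    key match {
      case k if k.endsWith(".gz") || k.endsWith(".tgz") =>
        new GZIPInputStream(raw) // a .tar.gz still needs a tar reader on top (see the tar.gz sketch below)
      case k if k.endsWith(".zip") =>
        val zip = new ZipInputStream(raw)
        zip.getNextEntry() // position at the first archive entry; callers iterate further entries themselves
        zip
      case _ =>
        raw // plain text / JSON: read as-is
    }
}
```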

Describe the solution you'd like
Provide the system with an S3 bucket and credentials.
New S3 files will be streamed into the data sink (the AWS REST API allows actual streaming of file contents). Depending on the type of file, apply different logic to unarchive/read it (see above).
I'd like the file to be split into separate Kafka messages according to the above logic.
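
A minimal sketch of the plain-text, line-by-line case, assuming the AWS SDK v1 S3 client and a plain Kafka producer; the producer/topic wiring is a placeholder, and an actual Brooklin connector would hand records to Brooklin's transport layer rather than produce to Kafka directly:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerRecord }
import scala.io.Source

object S3ToKafkaLines {
  private val s3 = AmazonS3ClientBuilder.defaultClient()

  /** Stream an S3 object line by line, producing one Kafka record per line. */
  def stream(bucket: String, key: String, producer: KafkaProducer[String, String], topic: String): Unit = {
    val obj    = s3.getObject(bucket, key) // the object content is exposed as an InputStream, not a full download
    val source = Source.fromInputStream(obj.getObjectContent, "UTF-8")
    try {
      source.getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String](topic, key, line)) // S3 key as the message key
      }
    } finally {
      source.close()
      producer.flush()
    }
  }
}
```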

For example:

  • New file foo.tar.gz is written to S3
  • Notification is emitted by AWS
  • File is streamed into Brooklin
  • File is automatically unarchived using GZIPInputStream
  • File contains 1.json and 2.json, which have pretty-printed JSON objects inside
  • Send each JSON object from each of the files as a separate message to Kafka (see the sketch below)
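
A rough sketch of that tar.gz scenario, assuming Apache Commons Compress for the tar layer and Jackson for splitting the (possibly pretty-printed) JSON objects; the producer/topic wiring is again a placeholder:

```scala
import java.io.{ ByteArrayInputStream, InputStream }
import java.util.zip.GZIPInputStream
import com.fasterxml.jackson.databind.{ JsonNode, ObjectMapper }
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerRecord }

object TarGzJsonToKafka {
  private val mapper = new ObjectMapper()

  /** Unpack a .tar.gz on the fly and send every JSON object in every entry as its own Kafka record. */
  def process(raw: InputStream, producer: KafkaProducer[String, String], topic: String): Unit = {
    val tar = new TarArchiveInputStream(new GZIPInputStream(raw))
    try {
      var entry = tar.getNextTarEntry()
      while (entry != null) {
        if (!entry.isDirectory) {
          // Read the current entry (e.g. 1.json, 2.json); reads stop at the current entry's boundary.
          val bytes = tar.readAllBytes()
          // readValues handles a sequence of concatenated JSON values, pretty-printed or not.
          val it = mapper.readerFor(classOf[JsonNode]).readValues[JsonNode](new ByteArrayInputStream(bytes))
          while (it.hasNext) {
            val json = it.next()
            producer.send(new ProducerRecord[String, String](topic, entry.getName, json.toString))
          }
        }
        entry = tar.getNextTarEntry()
      }
    } finally {
      tar.close()
      producer.flush()
    }
  }
}
```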

Describe alternatives you've considered

  • Custom implementation of the above logic using an SQS client and Kafka Streams
  • Kafka Connect has an S3 connector, but the officially supported one only allows Kafka -> S3, not S3 as a source

Additional context
This would be an extremely valuable connector when working with systems that can export their data feeds to S3.
