Limit intermediary files size by default #1328

Closed
akelad opened this issue May 7, 2024 · 1 comment · Fixed by #1368

Comments


akelad commented May 7, 2024

Feature description

We should consider limiting the size of intermediary files by default, since many destinations (e.g. BigQuery) have a maximum file size they can handle. Otherwise a pipeline might run for 2 hours and then fail with an error like this:

[ERROR ]|17898|8637379136|dlt|load.py|complete_jobs:311|Job for analytics_events.d35853bf5f.jsonl failed terminally in load 1715003335.526644 with message {"error_result":{"reason":"invalid","message":"Error while reading data, error message: Input JSON files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 5706284890. Max allowed size is: 4294967296."},"errors":[{"reason":"invalid","message":"Error while reading data, error message: Input JSON files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 5706284890. Max allowed size is: 4294967296."}],"job_start":"2024-05-06T15:37:59.326000Z","job_end":"2024-05-06T15:37:59.405000Z","job_id":"analytics_events_d35853bf5f_0_jsonl"}

See the Slack thread for context: https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1715010931231539
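As a workaround until a default exists, the intermediary file size can already be capped explicitly through the data writer configuration. Below is a minimal sketch, assuming a `file_max_bytes` data writer setting exposed via the `NORMALIZE__DATA_WRITER__FILE_MAX_BYTES` environment variable; the exact setting name and path are assumptions and should be verified against the dlt docs:

```python
import os

import dlt

# Assumed workaround: cap each intermediary file at roughly 1 GiB so a single
# jsonl file stays well below BigQuery's 4 GiB limit for non-splittable files.
# The config path below is an assumption and should be checked against the docs.
os.environ["NORMALIZE__DATA_WRITER__FILE_MAX_BYTES"] = str(1024 * 1024 * 1024)

pipeline = dlt.pipeline(
    pipeline_name="analytics",
    destination="bigquery",
    dataset_name="analytics_events",
)
```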

Are you a dlt user?

None

Use case

No response

Proposed solution

No response

Related issues

No response

akelad added the "enhancement: New feature or request" and "community: This issue came from slack community workspace" labels on May 7, 2024
rudolfix (Collaborator) commented

Implementation idea:

  1. Add a new destination capability: recommended file size.
  2. In the buffered writer, when capabilities are present and no explicit limit is set, use that recommendation (see the sketch below).
  3. Set it for BigQuery; 1 GB looks like a safe option. Check whether Snowflake and Databricks publish similar recommendations and follow them if they do; otherwise leave it as None.
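A minimal sketch of the capability fallback described above; the names (`DestinationCapabilities`, `recommended_file_size`, `effective_file_max_bytes`) are illustrative placeholders, not the actual dlt internals:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DestinationCapabilities:
    # Hypothetical capability field: a soft cap on intermediary file size in bytes.
    # None means the destination publishes no recommendation.
    recommended_file_size: Optional[int] = None


# BigQuery: ~1 GiB keeps a single jsonl file well below the 4 GiB hard limit
# for non-splittable input files.
bigquery_caps = DestinationCapabilities(recommended_file_size=1024 * 1024 * 1024)


def effective_file_max_bytes(
    explicit_limit: Optional[int],
    caps: Optional[DestinationCapabilities],
) -> Optional[int]:
    """Resolve the file size limit used by the buffered writer.

    An explicitly configured limit always wins; otherwise fall back to the
    destination's recommendation, if any.
    """
    if explicit_limit is not None:
        return explicit_limit
    if caps is not None:
        return caps.recommended_file_size
    return None


# No explicit limit configured -> the BigQuery recommendation kicks in.
assert effective_file_max_bytes(None, bigquery_caps) == 1024 * 1024 * 1024
# An explicit user setting always takes precedence.
assert effective_file_max_bytes(512 * 1024 * 1024, bigquery_caps) == 512 * 1024 * 1024
```

Keeping the explicit limit authoritative means existing configurations keep their current behavior; only pipelines without a configured limit pick up the destination default.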
