Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(pipeline): add an ability to auto truncate #1292

Merged
merged 26 commits into from May 16, 2024
Merged

Conversation

IlyaFaer
Copy link
Collaborator

Closes #1104

staging destination after load
Copy link

netlify bot commented Apr 26, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit eada0dd
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6644ff6a8d2eb10007ec523e

dlt/pipeline/pipeline.py Outdated Show resolved Hide resolved
@sh-rp
Copy link
Collaborator

sh-rp commented Apr 26, 2024

@rudolfix since we have two different concepts of "staging", a staging destination as well as a staging dataset, maybe we should use those terms in config vars more clearly and not just "staging" which could mean both really

@sh-rp
Copy link
Collaborator

sh-rp commented Apr 26, 2024

@IlyaFaer in the ticket it says we want to truncate the staging dataset, not the staging destination. I also think that the setting should probably be part of the LoaderConfiguration and not be passed down from the pipeline. @rudolfix should we always truncate or only if all loadjobs where successfull? If a mergejob fails because of a connection error or something like that I don't think we should truncate the staging tables, because you can retry the load.

@rudolfix
Copy link
Collaborator

@sh-rp we should truncate only when we complete the load id successfully. so after the complete method is executed.

@IlyaFaer
Copy link
Collaborator Author

IlyaFaer commented May 2, 2024

One thing that I didn't manage to solve is how to patch the LoaderConfiguration, to run tests for both cases. I tried mocking, direct replacing the class in the module, but none worked 🤔

@IlyaFaer IlyaFaer marked this pull request as ready for review May 3, 2024 08:05
@IlyaFaer IlyaFaer requested a review from rudolfix May 3, 2024 08:05
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK almost there, see review and

  1. add info on new config ie to ## Data left behind of running.md
  2. we need to enable this behavior for one of the merge tests ie. test_pipeline_upfront_tables_two_loads and make sure that all tables in staging dataset are truncated - this is to test it on all destinations

dlt/load/configuration.py Outdated Show resolved Hide resolved
dlt/load/load.py Outdated Show resolved Hide resolved
dlt/pipeline/pipeline.py Outdated Show resolved Hide resolved
tests/pipeline/test_pipeline.py Show resolved Hide resolved
@IlyaFaer
Copy link
Collaborator Author

IlyaFaer commented May 6, 2024

@rudolfix, okay, what do we do, if the client doesn't have should_load_data_to_staging_dataset()? Qdrant supports both merge and replace, but doesn't have the method.

tests/load/pipeline/test_pipelines.py Outdated Show resolved Hide resolved
dlt/load/load.py Outdated Show resolved Hide resolved
dlt/load/load.py Show resolved Hide resolved
@rudolfix
Copy link
Collaborator

rudolfix commented May 6, 2024

@rudolfix, okay, what do we do, if the client doesn't have should_load_data_to_staging_dataset()? Qdrant supports both merge and replace, but doesn't have the method.

yeah because qdrand does not use staging dataset for merge!

look at my comments: client must implement WithStagingDataset trait which has should_load_data_to_staging_dataset so just move stuff below instance type check!

@IlyaFaer
Copy link
Collaborator Author

IlyaFaer commented May 7, 2024

@rudolfix, hm, some tests fail for Dremio. It seems to me, there staging is always used. Does it mean that staging dataset is not used? I don't see credentials for Dremio in the Secret Manager, so I can't test locally what's going on there.

dlt/load/load.py Outdated Show resolved Hide resolved
dlt/load/load.py Outdated Show resolved Hide resolved
tests/load/pipeline/test_pipelines.py Outdated Show resolved Hide resolved
@rudolfix
Copy link
Collaborator

rudolfix commented May 7, 2024

@rudolfix, hm, some tests fail for Dremio. It seems to me, there staging is always used. Does it mean that staging dataset is not used? I don't see credentials for Dremio in the Secret Manager, so I can't test locally what's going on there.

we only have local container and there are instructions. take a look in dremio folder in destinations. but I think you won't need it if you fix the pr. see review

rudolfix
rudolfix previously approved these changes May 8, 2024
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@rudolfix rudolfix merged commit ca15401 into devel May 16, 2024
48 of 50 checks passed
@rudolfix rudolfix deleted the truncate_staging branch May 16, 2024 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

truncate the staging dataset tables after load
3 participants