pyarrow/pandas: add load id and dlt id in the extract phase and unify the behavior #1317

Open
rudolfix opened this issue May 3, 2024 · 0 comments
Labels
tech-debt Leftovers from previous sprint that should be fixed over time

Comments

@rudolfix
Collaborator

rudolfix commented May 3, 2024

Background
By default we do not add load_id and dlt_id to arrow tables. This must be configured explicitly, and the columns are then added in the normalizer.
As a consequence, we need to decompress and rewrite parquet files, which takes a lot of resources.
In this ticket we move this behavior to the extract phase. This goes against the general architecture, but I do not see any other way to do it without rewriting files.
We also unify the behavior by making the relational normalizer follow ItemsNormalizerConfiguration.

Tasks

    • move the logic that adds _dlt_id and load_id from ArrowItemsNormalizer to the extract phase. It looks like the best place to add them is _write_item (just before normalization)
    • we (probably) do not need the logic that adds the columns when writing a file; we can just add them to the existing table
    • when adding _dlt_id we must follow the table settings and generate _dlt_id according to hints (e.g. for SCD2, see add_row_hash_to_table)
    • ItemsNormalizerConfiguration must be taken into account. This is probably a breaking change: we need to move it from normalize to extract, so old settings will stop working
    • in _compute_table, when we add the new columns we should also infer hints, as for any other new column. Currently schema settings are ignored for them

Unify behavior

    • in relational.py do not generate _load_id and _dlt_id unless they are switched on (both are ON by default)
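
The gating described here could be sketched as follows. `DltColumnsConfig` and `decorate_row` are hypothetical stand-ins, not the real ItemsNormalizerConfiguration or the code in relational.py; the flag names and the `_dlt_load_id` column name are assumptions. Only the defaults (both ON) come from the task above.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DltColumnsConfig:
    """Hypothetical stand-in for the normalizer configuration."""

    add_dlt_id: bool = True  # ON by default, as stated in the task
    add_dlt_load_id: bool = True  # ON by default, as stated in the task


def decorate_row(row: Dict[str, Any], load_id: str, config: DltColumnsConfig) -> Dict[str, Any]:
    """Return a copy of the row with only the columns that are switched on."""
    out = dict(row)
    if config.add_dlt_load_id:
        out["_dlt_load_id"] = load_id  # assumed column name
    if config.add_dlt_id:
        out["_dlt_id"] = uuid.uuid4().hex[:16]  # assumed id format
    return out


if __name__ == "__main__":
    print(decorate_row({"id": 1}, "example-load", DltColumnsConfig()))
    print(decorate_row({"id": 1}, "example-load", DltColumnsConfig(add_dlt_id=False, add_dlt_load_id=False)))
```

With one configuration object checked in both code paths, the arrow and relational normalizers would differ only in their defaults.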

@rudolfix rudolfix added the tech-debt Leftovers from previous sprint that should be fixed over time label May 3, 2024
Projects
Status: Todo
Development

No branches or pull requests

1 participant