Background
By default we do not add load_id and dlt_id to arrow tables. This must be configured explicitly and happens in the normalizer.
As a consequence, we need to decompress and rewrite parquet files, which takes a lot of resources.
In this ticket we move this behavior to the extract phase. This goes against the general architecture, but I do not see any other way to do it without rewriting files.
We also unify the behavior by making the relational normalizer follow ItemsNormalizerConfiguration.
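To make the intent concrete, here is a minimal sketch of adding the two columns to an in-memory table during extract, so the file that gets written never has to be rewritten. This is not dlt's implementation: the table is a plain dict of column lists standing in for an Arrow table, and `ItemsConfig`, `add_dlt_columns` and the column names `_dlt_load_id` / `_dlt_id` are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, List

# column name -> column values; a stand-in for an Arrow table (assumption)
Table = Dict[str, List[Any]]


@dataclass
class ItemsConfig:
    """Hypothetical stand-in for ItemsNormalizerConfiguration."""

    add_dlt_id: bool = False
    add_dlt_load_id: bool = False


def add_dlt_columns(table: Table, load_id: str, config: ItemsConfig) -> Table:
    """Return a new table with the configured columns appended.

    Existing columns are reused as-is; only the new columns are built.
    """
    num_rows = len(next(iter(table.values()))) if table else 0
    result = dict(table)  # shallow copy, the input table is not modified
    if config.add_dlt_load_id:
        # every row in one load shares the same load id
        result["_dlt_load_id"] = [load_id] * num_rows
    if config.add_dlt_id:
        # one random id per row
        result["_dlt_id"] = [uuid.uuid4().hex for _ in range(num_rows)]
    return result
```

With both flags left at their defaults in this sketch, the table passes through unchanged, which mirrors the "must be configured explicitly" behavior described above.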
Tasks
1. Move the logic that adds _dlt_id and load_id from ArrowItemsNormalizer to the extract phase. It looks like the best place to add them is _write_item (just before normalization).
2. We (probably) do not need the logic that adds the columns when writing a file. We can just add them to the existing table.
3. When adding _dlt_id we must follow the table settings and generate _dlt_id according to hints (e.g. for SCD2, see add_row_hash_to_table).
4. ItemsNormalizerConfiguration must be taken into account. This is probably a breaking change, because we need to move it from normalize to extract, so old settings will stop working.
5. In _compute_table, when we add new columns we should also infer hints, as for any other new column. Currently schema settings will be ignored.

Unify behavior
6. In relational.py, do not generate _load_id and _dlt_id if not switched on (both are ON by default).
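The hint-dependent generation of _dlt_id (the SCD2 case mentioned in the tasks) could look roughly like the sketch below. It rests on assumptions: `row_hash` and `make_dlt_ids` are hypothetical helpers, not dlt functions, and the hashing scheme (SHA-256 over JSON-serialized values) is chosen only for illustration. The point is that a table keyed by row hash needs a deterministic id derived from row content, while other tables can take a random one.

```python
import hashlib
import json
import uuid
from typing import Any, Dict, List


def row_hash(row: Dict[str, Any]) -> str:
    """Deterministic id: identical row content always yields the same value."""
    # sort_keys makes the result independent of key order
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:20]


def make_dlt_ids(rows: List[Dict[str, Any]], use_row_hash: bool) -> List[str]:
    """Pick the id strategy from a (hypothetical) table-level hint."""
    if use_row_hash:
        return [row_hash(row) for row in rows]
    # no content-based hint: a random id per row is enough
    return [uuid.uuid4().hex for _ in rows]
```

A real implementation would work on Arrow columns rather than Python dicts to avoid per-row overhead; the row-wise form here is only meant to show the two strategies side by side.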