pyarrow/pandas: add load id and dlt id in the extract phase and unify the behavior #1317

rudolfix · 2024-05-03T11:07:34Z

Background
By default we do not add load_id and dlt_id to arrow tables. This must be configured explicitly, and the columns are then added in the normalizer.
As a consequence, we need to decompress and rewrite parquet files, which takes a lot of resources.
In this ticket we move this behavior to the extract phase. This goes against the general architecture, but I do not see any other way to do it without rewriting files.
We also unify the behavior by making the relational normalizer follow ItemsNormalizerConfiguration.

Tasks

- move the logic that adds _dlt_id and load_id from ArrowItemsNormalizer to the extract phase. It looks like the best place to add them is _write_item (just before normalization)
- we (probably) do not need the logic that adds the columns when writing a file. We can just add them to the existing table
- when adding _dlt_id we must follow table settings and generate _dlt_id according to hints (e.g. for SCD2, look at add_row_hash_to_table)
- ItemsNormalizerConfiguration must be taken into account. This is probably a breaking change: we need to move it from normalize to extract, so old settings will stop working
- in _compute_table, when we add the new columns we should also infer hints as for any other new column. Currently schema settings will be ignored

Unify behavior

- in relational.py, do not generate _load_id and _dlt_id if they are not switched on (both are ON by default)
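
A rough sketch of what the unified behavior could look like for plain dict rows, assuming a pair of flags that default to ON as stated above. `RelationalIdFlags` and `normalize_row` are hypothetical names for illustration, not functions from relational.py.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class RelationalIdFlags:
    # Hypothetical flags; per this issue both are ON by default
    # for the relational normalizer.
    add_dlt_id: bool = True
    add_load_id: bool = True


def normalize_row(
    row: Dict[str, Any], load_id: str, flags: RelationalIdFlags
) -> Dict[str, Any]:
    """Copy the row and add only the id columns that are switched on."""
    out = dict(row)
    if flags.add_load_id:
        out["_load_id"] = load_id
    if flags.add_dlt_id:
        out["_dlt_id"] = uuid.uuid4().hex
    return out


default_row = normalize_row({"a": 1}, "1714734454.0", RelationalIdFlags())
bare_row = normalize_row(
    {"a": 1},
    "1714734454.0",
    RelationalIdFlags(add_dlt_id=False, add_load_id=False),
)
```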

rudolfix added the tech-debt Leftovers from previous sprint that should be fixed over time label May 3, 2024
