Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scd2 merge strategy extensions #1252

Open
jorritsandbrink opened this issue Apr 20, 2024 · 0 comments
Open

scd2 merge strategy extensions #1252

jorritsandbrink opened this issue Apr 20, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@jorritsandbrink
Copy link
Collaborator

jorritsandbrink commented Apr 20, 2024

Feature description

This ticket has two components:

  1. primary_key interpretation
  2. Active record literal

1. primary_key interpretation

primary_key is currently ignored when scd2 is used as merge strategy. This ticket suggests a way to start interpreting the primary_key (or, alternatively, the natural_key) hint, such that users can use a natural key and simple row version column (i.e. "updated_at") as surrogate key.

Cases—currently supported:
(1) no primary_key, no row_version_column_name ➜ combine all user columns in hash, store in _dlt_id and mark with x-row-version hint
(2) no primary_key, yes row_version_column_name ➜ mark column row_version_column_name with x-row-version—values for _dlt_id are random as usual

Cases—to implement in this ticket:
(3) yes primary_key, no row_version_column_name ➜ raise error—we need to be able to make a surrogate key, which is not possible without row_version_column_name

  • alternative: ignore primary_key and silently resolve to case (1)—not my preference, I'd rather make it explicit and not do things behind the user's back

(4) yes primary_key, yes row_version_column_name ➜ mark column row_version_column_name with x-row-version, combine primary_key column(s) and row_version_column_name column in hash on-the-fly (values are not persisted)—values for _dlt_id are random as usual

natural_key hint?
The primary_key will not be a technical primary key in the destination table using this approach, as we allow multiple records with the same primary_key. Kimball uses the term natural key instead. We could introduce natural_key as a new hint and use that in all cases described above instead of primary_key (and ignore primary_key completeley if provided).

2. Active record literal

Currently we use a high timestamp literal (i.e. 9999-12-31 00:00...) in the "valid to" column to indicate an active record. This is suboptimal in two cases: (1) the destination has a lower max value for timestamp columns (e.g. Clickhouse) and (2) the user prefers a NULL value instead.

We can provide an active_record_literal configuration option as follows:

@dlt.resource(
    ...,
    write_disposition={
        "disposition": "merge",
        "strategy": "scd2",
        "active_record_literal": "high_timestamp"
    },
    ...,
)

Possible values for active_record_literal would be null (default) and high_timestamp. The literal used for high_timestamp defaults to 9999-12-31 00:00... for destinations that support it, but can be lower for destinations that have a lower max timestamp (e.g. Clickhouse).

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

No response

Proposed solution

No response

Related issues

#828

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Todo
Development

No branches or pull requests

1 participant