[BUG] Duplicates in eff_sat for initial load #179

Open · FrenkLoonen opened this issue Jan 4, 2023 · 7 comments

Labels: bug (Something isn't working)

@FrenkLoonen commented Jan 4, 2023

Describe the bug
When I load an eff_sat with a driving key that is not the PK of the source table, I end up with duplicates. This only happens during an initial load.

Environment

dbt version: 1.2.0
dbtvault version: 0.9.2
Database/Platform: Databricks, using the snowflake macro

To Reproduce
Say I have a source model with three columns, where PK1 and PK2 make up the PK of the model, and FK is directly dependent only on PK2.

| PK1 | PK2 | FK  |
|-----|-----|-----|
| 123 | AAA | ZZZ |
| 123 | BBB | YYY |
| 456 | AAA | ZZZ |

Then model a link and an eff_sat on PK2 + FK.

Expected behavior
In the link macro used to load from PK2 & FK, a DISTINCT is applied, so AAA+ZZZ is only loaded once. The eff_sat macro contains DISTINCTs as well, but not in the right place, in my opinion, so AAA+ZZZ is loaded twice. When I add a DISTINCT to the source_data CTE on line 32, the duplicates are gone. I also think the other DISTINCTs (lines 74, 97, 123 and 144) can then be removed. (And maybe the unions in the records_to_insert CTE could even be changed to UNION ALL?)

AB#5368
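
A minimal sketch of the proposed change, assuming the eff_sat macro's source_data CTE resembles its compiled output; the stage model and column names below are illustrative, not the macro's actual identifiers:

```sql
WITH source_data AS (
    SELECT DISTINCT            -- proposed: deduplicate once, at the source
        s.LINK_HK,             -- src_pk
        s.PK2_HK,              -- src_dfk (the driving key)
        s.FK_HK,               -- src_sfk
        s.START_DATE,
        s.END_DATE,
        s.EFFECTIVE_FROM,
        s.LOAD_DATETIME,
        s.RECORD_SOURCE
    FROM stg_link_pk2_fk AS s  -- hypothetical stage view
    WHERE s.LINK_HK IS NOT NULL
)

SELECT * FROM source_data;
```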

FrenkLoonen added the bug (Something isn't working) label Jan 4, 2023
@DVAlexHiggs (Member) commented Feb 14, 2023

Thanks for this report! Just to let you know it has not gone unnoticed 😄

@DVAlexHiggs (Member) commented Feb 23, 2023

Hi, and thanks for reporting this. We've had some internal discussions and, based on the information you've provided here, we believe it is working as intended.

A change in a PK/HK should mean a change in the record, so even though nothing depends on PK1, the fact that it is changing does mean something.

This is our understanding of what you have provided:

Raw/primed stage data

| PK1 | PK2 | FK  |
|-----|-----|-----|
| 123 | AAA | ZZZ |
| 123 | BBB | YYY |
| 456 | AAA | ZZZ |

Link

src_pk: [PK1, PK2]
src_fk: [FK1, FK2]
...

Eff Sat

src_pk: [PK1, PK2]
src_dfk: PK2
src_sfk: FK1, FK2
...

I would suggest that your data model might need re-thinking.

That said, after working through the example above, I am not sure how you would configure your models, or what the metadata would look like, to achieve what you have described. My guess above has a few issues.

Let's continue to discuss this and see if we can get to the bottom of it.

@FrenkLoonen (Author) commented

I added a payload column to the example and specified what my dbtvault config looks like. The Attr column depends on the PK of the source file ([PK1, PK2]). The source file is actually already a join between several tables from the source, so I had to do a proper analysis to see which concepts are in there, which columns are keys, which are attributes, and to which key each attribute belongs. That is how I found out that FK depends only on PK2, which is why I modelled [PK2, FK] as a separate link with an eff_sat on top of it. If that is completely wrong, it could be my mistake. I would have preferred to receive separate source files with the individual concepts, but unfortunately going back to the source is not an option.

| PK1 | PK2 | FK  | Attr |
|-----|-----|-----|------|
| 123 | AAA | ZZZ | 1A   |
| 123 | BBB | YYY | 1B   |
| 456 | AAA | ZZZ | 4A   |

Link1

src_pk: [PK1, PK2]
src_fk: [PK1, PK2]

Sat1

src_pk: [PK1, PK2]
src_payload: [Attr]

Link2

src_pk: [PK2, FK]
src_fk: [PK2, FK]

Eff Sat

src_pk: [PK2, FK]
src_dfk: [PK2]
src_sfk: [FK]
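
For reference, a sketch of what the Eff Sat model file might look like with this metadata, using the dbtvault 0.9 eff_sat macro signature; the stage model name and the date/source column names are assumptions:

```sql
-- models/vault/eff_sat_link2.sql (hypothetical file name)
-- src_pk is the hashed [PK2, FK] key; src_dfk/src_sfk are the hashed
-- driving and secondary keys. All names here are assumptions.
{{ config(materialized='incremental') }}

{{ dbtvault.eff_sat(src_pk='LINK2_HK', src_dfk='PK2_HK', src_sfk='FK_HK',
                    src_start_date='START_DATE', src_end_date='END_DATE',
                    src_eff='EFFECTIVE_FROM', src_ldts='LOAD_DATETIME',
                    src_source='RECORD_SOURCE', source_model='stg_link2') }}
```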

@EliasAthey commented

Hi 👋 - I've run into a very similar (if not the same) issue with regular satellites (the sat macro) sourcing from a table at a different grain.

Describe the bug
Consider one source table: Projects. A project has a Client attribute, and the same Client value may be associated with many Projects. I have a Client satellite (and a related Client hub) sourced from the Projects table, with src_pk defined as the hash of the Client attribute and src_hashdiff defined as the hash of the client-specific attributes. Important note: src_eff is defined as the project-level modified date, because no client-level modified date is available. The satellite is materialized as an incremental model with on_schema_change='append_new_columns'.
When I run the satellite model for the first time (or with a full refresh), the table gets duplicate rows with the same hash key + change key values; the duplicates correlate with Project rows. For example, if there are two Projects in the source table with the same Client value, the satellite is populated with two rows that have the same Client HK and the same Client CK.
This goes against my expectation that only rows unique by HK+CK would be loaded.

Environment
dbt version: 1.5.2
dbtvault version: 0.9.2
Database/Platform: Snowflake

To Reproduce
See bug description

Expected behavior
On first run (or full-refresh) of Satellite model, the resulting table has unique HK+CK combinations as defined for that satellite.

Why I think this is happening
I dug into the compiled SQL, and I think this issue comes from the way dbt's incremental materialization works.
Here is a slimmed-down version of the compiled SQL:

WITH source_data AS (
    SELECT a.CLIENT_HK, a.CLIENT_CK, a.<client_attribute_columns>, a.EFFECTIVE_FROM, a.LOAD_DATETIME, a.SOURCE
    FROM <prime_stage_for_projects_view> AS a
    WHERE a.CLIENT_HK IS NOT NULL
),

latest_records AS (
    SELECT a.CLIENT_HK, a.CLIENT_CK, a.LOAD_DATETIME
    FROM (
        SELECT current_records.CLIENT_HK, current_records.CLIENT_CK, current_records.LOAD_DATETIME,
            RANK() OVER (
               PARTITION BY current_records.CLIENT_HK
               ORDER BY current_records.LOAD_DATETIME DESC
            ) AS rank
        FROM <sat_for_clients_table> AS current_records
            JOIN (
                SELECT DISTINCT source_data.CLIENT_HK
                FROM source_data
            ) AS source_records
                ON current_records.CLIENT_HK = source_records.CLIENT_HK
    ) AS a
    WHERE a.rank = 1
),

records_to_insert AS (
    SELECT DISTINCT stage.CLIENT_HK, stage.CLIENT_CK, stage.<client_attribute_columns>, stage.EFFECTIVE_FROM, stage.LOAD_DATETIME, stage.SOURCE
    FROM source_data AS stage
    LEFT JOIN latest_records
    ON latest_records.CLIENT_HK = stage.CLIENT_HK
        AND latest_records.CLIENT_CK = stage.CLIENT_CK
    WHERE latest_records.CLIENT_CK IS NULL
)

SELECT * FROM records_to_insert;

On the first run (or a full refresh), the records_to_insert CTE returns all records from source_data (the prime stage view built on the Projects table), because every row meets the criterion latest_records.CLIENT_CK IS NULL; after all, there are no current records in the table on the first run. For a DV satellite, I would expect the records_to_insert CTE to select distinct only on stage.CLIENT_HK and stage.CLIENT_CK.
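
A hedged sketch of that expectation as a standalone query (Snowflake syntax; the view name is hypothetical): keep exactly one row per HK+CK pair, for example the earliest-loaded one:

```sql
-- Illustrative only: one row per (CLIENT_HK, CLIENT_CK), keeping the
-- earliest LOAD_DATETIME, using Snowflake's QUALIFY clause.
SELECT *
FROM stg_clients   -- hypothetical prime stage view over Projects
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY CLIENT_HK, CLIENT_CK
    ORDER BY LOAD_DATETIME ASC
) = 1;
```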

Workarounds

  1. (What I've implemented) Separate the distinct Clients into their own table before the data vault prime stage; see the sketch after this list. This avoids the problem altogether.
  2. Add a unique_key config (https://docs.getdbt.com/docs/build/incremental-models#defining-a-unique-key-optional) to the satellite model, the unique key being HK and CK. However, according to the dbt docs, this would result in updates to rows if the same HK+CK were loaded, which breaks the DV protocol of append-only satellites.
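
A minimal sketch of workaround 1, assuming a dbt staging model that collapses the project-grain source to one row per client before the dbtvault prime stage (the model, column, and ref names are all hypothetical):

```sql
-- models/staging/stg_clients_distinct.sql (hypothetical model)
-- One row per client: client-level attributes plus a derived
-- client-level effective date (here, the earliest project modified date).
SELECT
    CLIENT,
    MAX(CLIENT_INDUSTRY) AS CLIENT_INDUSTRY,  -- assumed functionally
    MAX(CLIENT_REGION)   AS CLIENT_REGION,    -- dependent on CLIENT
    MIN(MODIFIED_DATE)   AS CLIENT_MODIFIED_DATE
FROM {{ ref('raw_projects') }}
GROUP BY CLIENT
```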

Questions
Is there a recommended workaround for this?
Is there a custom materialization available (or in the works) for satellites that solves this scenario?

Hopefully this is enough detail to understand the issue. Thank you!

@DVAlexHiggs (Member) commented

Hi! We believe this is related to the improvements we recently made to standard satellites, which removed a long-standing and fundamental restriction: the expectation of exactly one instance of a key per batch.

If that expectation is not met, unpredictable behavior will occur.

The recent improvements we've made to standard Satellites are something we're planning to roll out to the other satellite types, including Effectivity Satellites.

I cannot 100% confirm that this is the root cause, but from everything contributed here I can make a fair assumption. All of our tests for Effectivity Satellites are currently passing; however, they assume the standard Data Vault 2.0 loading expectations (true delta feeds, i.e. only changes).

We appreciate the ongoing discussion and the information provided. We'll make sure to test the existing effectivity satellites using the information here to reproduce the issues being encountered, and of course add regression tests for these edge cases once a fix is in place.

@DVAlexHiggs (Member) commented

An additional thought, @EliasAthey: are you including the collision key in your hash key? It looks like you're treating it as part of a composite PK instead.

I'm not at my desk right now, so I can't look into this further, but I suspect this would solve your issue. Whether or not the composite PK should be treated as unique (I think it should be, and you shouldn't be seeing this issue regardless) will also need to be looked at.

@philmaddocks commented

I've noticed the same thing on initial loads when building the effectivity sat on a stage view over the source table: it simply loads everything for the link hash key, even if the relationship hasn't changed. Our workaround has been to develop a dedicated raw stage model just to expose distinct relationship changes over our source data and feed that to the eff_sat macro (not ideal).

I'd have thought that bringing in some features of the standard sat macro, such as using a hashdiff to bring in only changes to the relationship, and only adding new records when the hashdiff changes for the driving key, would be the way to go. That would remove the need to create a dedicated raw stage model just for housing data on a relationship change. A rough sketch of the idea follows.
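
A rough, portable sketch of that idea (not dbtvault's implementation; the table and column names are hypothetical): keep a staged row only when the relationship for its driving key actually changes:

```sql
WITH ordered AS (
    SELECT
        s.*,
        LAG(s.LINK_HK) OVER (
            PARTITION BY s.DRIVING_KEY_HK   -- driving key of the eff sat
            ORDER BY s.LOAD_DATETIME
        ) AS prev_link_hk
    FROM stg_relationship AS s              -- hypothetical stage model
)

SELECT *
FROM ordered
WHERE prev_link_hk IS NULL                  -- first sighting of driving key
   OR prev_link_hk <> LINK_HK;              -- the relationship changed
```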

DVAlexHiggs added and removed the bug (Something isn't working) label May 15, 2024