Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Warn if duplicate columns are found in check Snapshot strategy #9656

Open
3 tasks done
christineberger opened this issue Feb 23, 2024 · 3 comments · May be fixed by #9698
Open
3 tasks done

[Feature] Warn if duplicate columns are found in check Snapshot strategy #9656

christineberger opened this issue Feb 23, 2024 · 3 comments · May be fixed by #9698
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors snapshots Issues related to dbt's snapshot functionality

Comments

@christineberger
Copy link
Contributor

christineberger commented Feb 23, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Error if there are duplicate columns defined in the check_col configuration of a snapshot
Currently, the check strategy for Snapshots does not throw an error if there are duplicate column names specified in the check_cols config, and accidentally specifying a column twice results in a record being inserted even if none of the column values changed.

Recreation Steps (Tested on Snowflake and BigQuery)

  1. Create a model for mocked data called test_staging.sql:
with

source as (
    select
        '123abc' as surrogate_key,
        '2' as id,
        'N' as status,
        coalesce(cast('2024-01-01' as date), current_date) as birth_date,
        '456 Stripe Ln' as address,
        'christine@dbtlabs.com' as email
)

select * from source
  1. Create a model in the snapshots folder called dupes_issue.sql:
{% snapshot dupes_issue %}
    {{
        config(
            unique_key='id',
            strategy='check',
            check_cols=[
                'id',
                'id',
                'status',
                'address',
                'birth_date',
                'email'
            ],
            target_schema=target.schema,
            invalidate_hard_deletes=false
        )
    }}

    select * from {{ ref('test_staging') }}
 {% endsnapshot %}
  1. Run the snapshot with dbt build -s +dupes_issue. Inspect the results (you should only see one record).
  2. Run the snapshot again and inspect the results - you should see duplicate records:
    Screenshot 2024-02-23 at 4 07 08 PM

Describe alternatives you've considered

This is easily fixed by removing the duplicate column, but I believe we could make the user experience better in terms of troubleshooting.

Who will this benefit?

Anyone using snapshots!

Speaking from my own experience, this came from troubleshooting a snapshot with ~135 columns in the check_col config (which was looked at many times over about two weeks by me and another team). Having an error thrown at us would have saved all of us some time.

Are you interested in contributing this feature?

No response

Anything else?

No response

@christineberger christineberger added enhancement New feature or request triage labels Feb 23, 2024
@dbeatty10 dbeatty10 added the snapshots Issues related to dbt's snapshot functionality label Feb 24, 2024
@dbeatty10 dbeatty10 self-assigned this Feb 24, 2024
@dbeatty10
Copy link
Contributor

this came from troubleshooting a snapshot with ~135 columns

👀 🤯

Currently, the check strategy for Snapshots does not throw an error if there are duplicate column names specified in the check_cols config, and accidentally specifying a column twice results in a record being inserted even if none of the column values changed.

Okay, we buy it @christineberger !

Thanks for writing up a nice reproducible example and proposing a solution. I'm labeling this as "help wanted" for a community member to pick up.

Acceptance criteria

  • Raise an error if there are duplicate columns defined in the check_col configuration of a snapshot

@dbeatty10 dbeatty10 added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors and removed triage labels Feb 24, 2024
@dbeatty10 dbeatty10 removed their assignment Feb 24, 2024
@ariosramirez
Copy link

Hi @dbeatty10.
I'm going to take this issue to work on!

@ariosramirez
Copy link

Hello guys, can you check if it is what you expected? or if something else is needed?

PR link

cc: @dbeatty10 @christineberger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors snapshots Issues related to dbt's snapshot functionality
Projects
None yet
3 participants