Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Unit test infers the wrong data type when the first record has some NULL values #821

Open
2 tasks done
leesiongchan opened this issue May 8, 2024 · 12 comments
Open
2 tasks done
Labels
bug Something isn't working unit tests Issues related to built-in dbt unit testing functionality

Comments

@leesiongchan
Copy link

leesiongchan commented May 8, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: null, # <--- if we change both of these null values to true/false, the error goes away.
              is_validated: null, # <--- if we change both of these null values to true/false, the error goes away.
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 1,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
      - input: ref('stg_mycfg__user_snapshot')
        rows: []
    expect:
      rows:
        - { user_id: 1, is_closed: true, is_validated: true }
        - { user_id: 2, is_closed: null, is_validated: true }
        - { user_id: 3, is_closed: true, is_validated: null }
        - { user_id: 4, is_closed: null, is_validated: null }
        - { user_id: 5, is_closed: false, is_validated: true }
        - { user_id: 6, is_closed: true, is_validated: true }

When you run the test you will see error like this.

image

And here is the logs:

image

Input build

image

Expected result build

Interestingly, the incorrect casting happen in the dbt_internal_unit_test_expected cte but not the input cte.


Our current workaround is to use sql to cast the correct type, eg.

select
    '2024-01-01 10:01:00' as created_at,
    1 as user_id,
    null::boolean as is_closed,
    null::boolean as is_validated,
    'created' as user_action
union all
select '2024-01-01 10:02:00', 1, null, true, 'validated'

Expected Behavior

The test runner should be able to guess the data type correctly.

Steps To Reproduce

  1. Provide unit test data with some null values in the first record.
  2. Run the test
  3. ERROR!

Relevant log output

image

Input build

image

Expected result build

Interestingly, the incorrect casting happen in the dbt_internal_unit_test_expected cte but not the input cte.

Environment

- OS: Ubuntu 24.04 WSL
- Python: 3.12.2
- dbt: 1.8.0-rc1

Which database adapter are you using with dbt?

redshift

Additional Context

No response

@leesiongchan leesiongchan added bug Something isn't working triage labels May 8, 2024
@leesiongchan leesiongchan changed the title [Bug] Unit test inferred the wrong data type when first record has NULL value [Bug] Unit test infers the wrong data type when the first record has some NULL values May 8, 2024
@dbeatty10 dbeatty10 added the unit tests Issues related to built-in dbt unit testing functionality label May 9, 2024
@dbeatty10
Copy link
Contributor

Thanks for reporting this @leesiongchan !

I wasn't able to replicate this on my first try. Could you try out the files below and make any tweaks needed to trigger the error you saw?

Project files and commands

models/stg_mycfg__user_events.sql

select
    cast('2024-01-01 10:01:00' as timestamp) as created_at,
    1 as user_id,
    true as is_closed,
    false as is_validated,
    'created' as user_action

models/stg_mycfg__user_snapshot.sql

select 1 as id

models/int_mycfg__user_events_combined.sql

-- {{ ref("stg_mycfg__user_snapshot") }}
select * from {{ ref("stg_mycfg__user_events") }}

models/_unit.yml

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 2,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 3,
              is_closed: true,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 4,
              is_closed: null,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 5,
              is_closed: false,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 6,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
      - input: ref('stg_mycfg__user_snapshot')
        rows: []
    expect:
      rows:
        - { user_id: 1, is_closed: true, is_validated: true }
        - { user_id: 2, is_closed: null, is_validated: true }
        - { user_id: 3, is_closed: true, is_validated: null }
        - { user_id: 4, is_closed: null, is_validated: null }
        - { user_id: 5, is_closed: false, is_validated: true }
        - { user_id: 6, is_closed: true, is_validated: true }

Run this command:

dbt build -s +stg_mycfg__user_events+1

@leesiongchan
Copy link
Author

You have to change your first record with some NULL values: eg.

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: null,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 2,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
        ...

@dbeatty10
Copy link
Contributor

@leesiongchan This worked for me -- is it not working for you?

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: null,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 2,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 3,
              is_closed: true,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 4,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 5,
              is_closed: false,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 6,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
      - input: ref('stg_mycfg__user_snapshot')
        rows: []
    expect:
      rows:
        - { user_id: 1, is_closed: null, is_validated: null }
        - { user_id: 2, is_closed: null, is_validated: true }
        - { user_id: 3, is_closed: true, is_validated: null }
        - { user_id: 4, is_closed: true, is_validated: true }
        - { user_id: 5, is_closed: false, is_validated: true }
        - { user_id: 6, is_closed: true, is_validated: true }

@leesiongchan
Copy link
Author

leesiongchan commented May 10, 2024

image

As long as the first record doesn't have any NULL values, the test runs as expected.

image

Oh... could this be dbt-redshift problem?

@dbeatty10
Copy link
Contributor

Ah! I can get the same error as you now.

It runs fine in dbt-postgres, but gives an error in dbt-redshift. See details below for a simplified example.

Reprex

models/some_input.sql

select
    1 as user_id,
    cast(true as boolean) as is_closed

models/my_model.sql

select * from {{ ref("some_input") }}

models/_unit.yml

unit_tests:

  - name: test__my_model
    model: my_model
    given:
      - input: ref('some_input')
        rows:
          - { user_id: 1, is_closed: null }
          - { user_id: 4, is_closed: true }
    expect:
      rows:
        - { user_id: 1, is_closed: null }
        - { user_id: 4, is_closed: true }

Run this command to see the error:

dbt build -s +my_model

Output:

17:09:03    Runtime Error in unit_test test__my_model (models/_unit.yml)
  An error occurred during execution of unit test 'test__my_model'. There may be an error in the unit test definition: check the data types.
   Database Error
    cannot cast type boolean to character varying

But if the order of the given inputs is swapped, then it runs successfully:

models/_unit.yml

unit_tests:

  - name: test__my_model
    model: my_model
    given:
      - input: ref('some_input')
        rows:
          - { user_id: 4, is_closed: true }
          - { user_id: 1, is_closed: null }
    expect:
      rows:
        - { user_id: 1, is_closed: null }
        - { user_id: 4, is_closed: true }

@dbeatty10 dbeatty10 removed the triage label May 10, 2024
@jnapoleon
Copy link

I'm having a similar error where using redshift using csv's as the input and output

image

Looking at the expected output
It's casting something that should be numeric as a character
cast('0' as character varying(128)) as conversion_value,

When it should be casting it as whatever the output model schema is

@dbeatty10
Copy link
Contributor

Thanks for reporting an additional error scenario @jnapoleon 👍

Could you share the relevant example code as well so we can make sure it gets covered by any potential solution?

@dbeatty10
Copy link
Contributor

I tested the dbt project described in #821 against a variety of data platforms, and only Redshift failed. So depending on the nature of the solution, we may end up transferring this issue to dbt-redshift.

These all worked:

  • dbt-postgres
  • dbt-bigquery
  • dbt-snowflake
  • dbt-databricks

@graciegoheen graciegoheen transferred this issue from dbt-labs/dbt-core May 13, 2024
@jnapoleon
Copy link

Hey @dbeatty10 I've attached the logs

Let me know if you would rather me create a new issue?
log.txt

Thanks

John

@dbeatty10
Copy link
Contributor

Thanks for attaching your logs @jnapoleon 👍

Could you also provide a minimal code example like the "reprex" within #821 (comment)? That would allow us to determine if you are having the same issue or a different issue.

@jnapoleon
Copy link

@dbeatty10 I managed to resolve the issue, I don't think there was a bug with dbt.

There was some strange behaviour getting the data types using information schema from the created temporary table. It thought the date type was varchar for the temporary table but in the real final output model it was a numeric

The way I solved it was explicitly casting the fields as numeric in the final output model

@joellabes
Copy link

I am also seeing this. Because I'm doing cross-platform unit tests for the audit helper package, I have a diff of what snowflake generated vs redshift:
image

When nulls are present alongside numeric values, dbt is casting everything as a varchar(1), which makes unions between expected and actual impossible.

Here's the full file that ran in this CI run
reworked_compare_identical_tables_multiple_null_pk.txt

In my case, the input rows lead with the null but the expected output has a non-null value first.

unit_tests:
  - name: reworked_compare_identical_tables_multiple_null_pk
    model: unit_reworked_compare
    
    given:
      - input: ref('unit_test_model_a')
        rows:
          - { "id": , "col1": "abc", "col2": "def" }
          - { "id": , "col1": "hij", "col2": "klm" }
          - { "id": 3, "col1": "nop", "col2": "qrs" }
      - input: ref('unit_test_model_b')
        rows:
          - { "id": , "col1": "abc", "col2": "def" }
          - { "id": , "col1": "hij", "col2": "klm" }
          - { "id": 3, "col1": "nop", "col2": "qrs" }
        
    expect:
      rows:
        - {"dbt_audit_row_status": 'identical', 'id': 3, dbt_audit_num_rows_in_status: 1}
        - {"dbt_audit_row_status": 'nonunique_pk', 'id': , dbt_audit_num_rows_in_status: 2}
        - {"dbt_audit_row_status": 'nonunique_pk', 'id': , dbt_audit_num_rows_in_status: 2}

    overrides:
      vars:
        reworked_compare__columns: ['id', 'col1', 'col2']
        reworked_compare__event_time:
        reworked_compare__primary_key_columns: ['id']

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working unit tests Issues related to built-in dbt unit testing functionality
Projects
None yet
Development

No branches or pull requests

4 participants