
Data quality next plans POC #149

Draft
wants to merge 24 commits into main from data-quality-pandera

Conversation

elijahbenizzy
Collaborator

@elijahbenizzy elijahbenizzy commented Jul 4, 2022

OK, so this is a pure proof of concept -- not necessarily the right way to do things, and not tested. That said, I wanted to prove the following:

  1. That we could build a two-step data quality pass (e.g. with a profiler and a validator). Without this, whylogs support will quickly be blocked.
  2. That we can use config to enable/disable checks at run/compile time.
  3. That we can add an applies_to keyword to narrow the focus of data quality checks.

(1) is useful for integrations with complex tooling -- e.g. an expensive profiling step with lots of validations.
(2) is useful for disabling checks -- this will probably be the first feature we release.
(3) is useful for extract_columns -- it makes clear which columns a check applies to.
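To make these concrete, here's a minimal sketch of how the three ideas might surface in user code. The keyword names (range=, applies_to=) are illustrative of this POC, not a committed API:

```python
import pandas as pd

from hamilton.function_modifiers import check_output  # the decorator this POC builds on


@check_output(
    range=(0.0, 1.0),                # (1)/(2): a default check, disableable via config
    applies_to=["conversion_rate"],  # (3): narrow the check to one output node
)
def conversion_rate(signups: pd.Series, visits: pd.Series) -> pd.Series:
    return signups / visits
```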

While some of this code still has placeholders and isn't tested, it demonstrates feasible solutions, and de-risks the release of data quality enough to make me comfortable.

Look through commits for more explanations.

Changes

Testing

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code passes the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python - local testing

  • python 3.6
  • python 3.7

elijahbenizzy and others added 21 commits July 4, 2022 14:09
This is the first take at the initial data quality decorator.

A few components:

1. The check_outputs decorator -- this enables us to run a few default validators.
2. The DataValidator base class -- this allows us to have extensible data validators.
3. The DefaultDataValidator base class -- this allows us to have a few default validators that map to args of check_outputs.
4. Some basic default data validators.

All of this is tested so far.

Upcoming:

1. Round out the list of default data validators.
2. Add documentation for check_output.
3. Add end-to-end tests.
4. Configure log/warn levels.
5. Add documentation for extending validators.
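For illustration, a rough sketch of what this abstraction could look like -- class names mirror the commit message, but the signatures and the example argument names are invented:

```python
import abc
import dataclasses
from typing import Any, Type


@dataclasses.dataclass
class ValidationResult:
    passes: bool  # whether the data passed the check
    message: str  # human-readable explanation


class DataValidator(abc.ABC):
    """Base class for extensible data validators."""

    @abc.abstractmethod
    def applies_to(self, datatype: Type[Any]) -> bool:
        """Whether this validator can handle the given datatype."""

    @abc.abstractmethod
    def validate(self, data: Any) -> ValidationResult:
        """Runs the check and returns a pass/fail result."""


class DefaultDataValidator(DataValidator):
    """Base class for the default validators -- each maps to one keyword
    argument of check_outputs (e.g. range=, max_fraction_nans=)."""
```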
We now have:
1. DataInRangeValidatorPandas
2. DataInRangeValidatorPrimitives
3. MaxFractionNansValidatorPandas
4. PandasSeriesDataTypeValidator
5. PandasMaxStandardDevValidator
6. PandasMeanInRangeValidator
The naming is suboptimal and will change soon. But now we have two avenues:

1. check_output (using the default validators)
2. check_output_custom (using specified, custom validators)
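Illustrative usage of the two avenues -- the keyword names and the OutlierValidator class below are made up for the example:

```python
import pandas as pd

from hamilton.function_modifiers import check_output, check_output_custom


class OutlierValidator:  # hypothetical custom DataValidator, stubbed for illustration
    def __init__(self, z_threshold: float):
        self.z_threshold = z_threshold


# Avenue 1: default validators, driven by keyword arguments:
@check_output(range=(0, 100), max_fraction_nans=0.0)
def spend_percentile(spend: pd.Series) -> pd.Series:
    return spend.rank(pct=True) * 100


# Avenue 2: explicitly specified custom validators:
@check_output_custom(OutlierValidator(z_threshold=3.0))
def spend_zscore(spend: pd.Series) -> pd.Series:
    return (spend - spend.mean()) / spend.std()
```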
Note that this is not perfect -- the issues are:

1. The node names collide
2. The DAG structure is weird -- ideally we'd be able to combine the dq
   decorators into one

Next commits should address (1) and (2)
This just delegates to MaxFractionNansValidator's (its superclass's) method.

This also makes name a classmethod on BaseDefaultValidator.
Currently these actions are hard-coded, but we might want to configure them soon. That'll come later.
Adds a test that default validators and their args have the same name.

For default validators, the name and the arg should be isomorphically
related. The different classes are multiple implementations of the same
mapping. This tests that property.
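A sketch of what that property test could look like. ALL_DEFAULT_VALIDATORS is a hypothetical registry of the default validator classes; each class reports its corresponding check_output argument via the name() classmethod mentioned above:

```python
import collections
import inspect


def test_name_arg_isomorphism():
    # group the implementations by the argument name they claim to implement
    by_name = collections.defaultdict(list)
    for validator_cls in ALL_DEFAULT_VALIDATORS:  # hypothetical registry
        by_name[validator_cls.name()].append(validator_cls)
    # every implementation of a given name must accept an arg of that name
    for name, classes in by_name.items():
        for cls in classes:
            assert name in inspect.signature(cls.__init__).parameters
```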
This allows one to query for DQ nodes and capture the results.
This will be useful later but for now it confuses things...
This is a pretty simple approach but I think it works nicely.

We use the schema= decorator to specify data quality checks
that validate a pandera schema. Note that this will only be registered
if pandera is installed (it's an optional extra in setup.py).

The other option we were thinking of is to compile the default checks to
pandera, but I think that's a little overkill, and can be done later.

Furthermore, this proves out the abstraction for data validation quite
nicely -- this was straightforward to implement.
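A sketch of the integration described above. The schema= keyword comes from this commit; the schema contents and function are invented for the example:

```python
import pandas as pd
import pandera as pa

from hamilton.function_modifiers import check_output  # registered only if pandera is installed

# an example pandera schema to validate a node's output against
output_schema = pa.DataFrameSchema(
    {
        "spend": pa.Column(float, pa.Check.ge(0)),
        "signups": pa.Column(int, pa.Check.ge(0)),
    }
)


@check_output(schema=output_schema)
def marketing_data(raw_df: pd.DataFrame) -> pd.DataFrame:
    return raw_df[["spend", "signups"]]
```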
We now have a test-integrations section in config.yml. I've
decided to group the integration tests together to avoid a proliferation. Should
we arrive at conflicting requirements, we can solve that later.
It was causing circular imports otherwise.
elijahbenizzy changed the title Data quality next plans poc → Data quality next plans POC on Jul 4, 2022
elijahbenizzy mentioned this pull request on Jul 4, 2022
elijahbenizzy force-pushed the data-quality-pandera branch 2 times, most recently from c61ad65 to f00b799 on July 5, 2022 17:09
What this does...

1. Adds a new profiler argument to the data validation decorator
2. Adds scaffolding for a whylogs class
3. Messes with DAG validation to allow for a profiler. Note that if we
   don't have a profiler, the DAG just runs validators on the original
data. If we do, it runs the validators on the profiler's output. The
validators then have to accept the type the profiler outputs, rather
than the original data type.

What this is missing:

1. Testing -- it's just a POC with pieces left out. Run at your own risk
   :)
2. Configuration-wiring -- hopefully that'll come soon.
3. Tag-exposure -- that will be useful for giving metadata to the
   decorator. The data is there, but it's not available yet.

Note there are other approaches -- I think this one is nice as it's flexible
yet opinionated, and we can always drop down to the lower-level API if we have
something more complex.
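A sketch of the two-step pass this enables: the profiler runs first, and the validators consume its output type rather than the raw data. The class names, the profile shape, and the profiler= wiring are illustrative of this POC -- whylogs' real API differs:

```python
import pandas as pd


class SeriesProfiler:
    def profile(self, data: pd.Series) -> dict:
        # stand-in for an expensive whylogs-style profiling step
        return {"mean": float(data.mean()), "nan_frac": float(data.isna().mean())}


class MeanInRangeValidator:
    def __init__(self, lower: float, upper: float):
        self.lower, self.upper = lower, upper

    def validate(self, profile: dict) -> bool:
        # validates the profiler's output, not the original series
        return self.lower <= profile["mean"] <= self.upper


# hypothetical wiring of the new profiler= argument
@check_output_custom(profiler=SeriesProfiler(), validators=[MeanInRangeValidator(0.0, 1.0)])
def conversion_rate(signups: pd.Series, visits: pd.Series) -> pd.Series:
    return signups / visits
```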
Again, not the final version, but this shows what we can do:

Say one has a node called `foo`. We might want the following:

0. Disable all data validation globally
1. Disable all data validation for foo
2. Disable a few checks for foo but not all

This would translate to config:
0. "data_quality.disable = True"
1. "data_quality.foo.disable = True"
2. "data_quality.foo.disable = ['check_1, 'check_2']
This is *very* rough.

The idea is that we should be able to choose one of the following modes:

1. Apply a validator to every final node in a subdag of a decorated function
2. Apply a validator to a specific node within the subdag of a decorated
   function

This basically allows (1), which is the default, but also (2) when using
the applies_to keyword. Note that this only works if the target is in the final
subdag (i.e. a sink), and not in the middle. We should add that, but
it'll be a little bit of a change. Nothing we can't make backwards
compatible; we just might need to crawl back a little further in our
layered API -- e.g. use a subdag transformer rather than a node transformer.

Either way, this shows that we can do what we want without too many
modifications.

Note that this is not tested, just a proof of concept.
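A sketch of mode (2): applies_to narrows the check to a single node in the decorated function's subdag -- here, one extracted column, which is a sink of the subdag. The keyword names are illustrative and, like the POC itself, untested:

```python
import pandas as pd

from hamilton.function_modifiers import check_output, extract_columns


@check_output(range=(0.0, 1.0), applies_to=["conversion_rate"])
@extract_columns("conversion_rate", "spend")
def metrics(raw_df: pd.DataFrame) -> pd.DataFrame:
    # the check applies only to the conversion_rate column, not spend
    return raw_df.assign(conversion_rate=raw_df["signups"] / raw_df["visits"])
```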