expect_column_pair_values_to_be_in_set throws exception when row has both column values to be paired missing #9577

MaxTh0ma1s · 2024-03-05T18:01:48Z

Describe the bug
A simple expect_column_pair_values_to_be_in_set expectation throws an exception when a row is missing both of the column values to be paired.

To Reproduce
Basic setup, with the following expectation configured ...

{
"expectation_type": "expect_column_pair_values_to_be_in_set",
"kwargs": {
"column_A": "mycolA",
"column_B": "mycolB",
"value_pairs_set": [["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]
}
}

Sample data to reproduce

id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,,,fail
8,melon,melon,fail
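
To see which rows in the sample have both paired values missing, a short pandas check can help. This is a minimal sketch that assumes the sample is loaded with pandas.read_csv using default settings; the variable names are illustrative and not part of the original setup.

```python
import io

import pandas as pd

# The sample data from this report, inlined so the snippet is self-contained.
SAMPLE_CSV = """id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,,,fail
8,melon,melon,fail
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# With default settings, empty CSV fields are parsed as NaN.
both_missing = df["mycolA"].isna() & df["mycolB"].isna()

# Collect the ids of rows where both paired columns are missing.
rows_with_missing_pair = df.loc[both_missing, "id"].tolist()
print(rows_with_missing_pair)
```

In this sample, only the row with id 7 has both mycolA and mycolB missing.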

An exception is raised:

    "exception_info": {
      "exception_traceback": "Traceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 548, in _process_direct_and_bundled_metric_computation_configurations\n    ] = metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/expectations/metrics/map_metric_provider/map_condition_auxilliary_methods.py\", line 201, in _pandas_map_condition_query\n    domain_values_df_filtered = domain_records_df[boolean_mapped_unexpected_values]\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3884, in __getitem__\n    return self._getitem_bool_array(key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3940, in _getitem_bool_array\n    key = check_bool_indexer(self.index, key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/indexing.py\", line 2575, in check_bool_indexer\n    raise IndexingError(\npandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/validator/validation_graph.py\", line 285, in _resolve\n    self._execution_engine.resolve_metrics(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 283, in resolve_metrics\n    return self._process_direct_and_bundled_metric_computation_configurations(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 552, in _process_direct_and_bundled_metric_computation_configurations\n    raise gx_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n",
      "exception_message": "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).",
      "raised_exception": true
    }
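
The underlying pandas error can be reproduced without Great Expectations. The sketch below is not the Great Expectations code path; it only illustrates, on a hypothetical three-row frame, how a boolean mask whose index does not cover the frame's labels triggers the same IndexingError. It assumes pandas 1.5 or later, where the exception is importable from pandas.errors.

```python
import pandas as pd
from pandas.errors import IndexingError

# Hypothetical three-row frame; the middle row is missing both values.
df = pd.DataFrame(
    {"fruit": ["apple", None, "melon"], "color": ["red", None, "melon"]}
)

# Dropping the all-null row keeps the original labels, so the index is [0, 2].
non_null = df.dropna(subset=["fruit", "color"], how="all")

# A mask with a fresh default index [0, 1] has no entry for label 2.
misaligned_mask = pd.Series([False, True])

try:
    non_null[misaligned_mask]
    error_message = None
except IndexingError as exc:
    error_message = str(exc)

print(error_message)

# A mask that shares the frame's index filters without error.
aligned_mask = pd.Series([False, True], index=non_null.index)
unexpected = non_null[aligned_mask]
print(unexpected)
```

If the all-null row were the last row instead, dropping it would leave labels [0, 1], which match the default index of the mask, so the lookup would succeed. That would be consistent with the error appearing only when the null row is not the final row, though whether this is the actual cause inside Great Expectations is not confirmed here.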

Expected behavior
I do not expect basic use of this expectation on simple data to throw an exception.

Environment (please complete the following information):

Operating System: macOS
Great Expectations Version: 0.18.8
Data Source: .csv sample data listed above
Pandas Version: 2.1.4

Additional context
Note: this issue was also reproduced by Rachel House in the following Discourse thread:
https://discourse.greatexpectations.io/t/how-to-specify-expect-column-pair-values-to-be-in-set-value-pairs-set-input-arg-via-json/1621/5


rachhouse · 2024-03-06T21:45:02Z

Hi @MaxTh0ma1s, thanks for creating this issue. I discussed the error and behavior internally with Engineering today - they're now aware of it, but I don't know when a fix will be prioritized. I added my code to reproduce the issue to aid investigation.

As a workaround for now, I suggest using .fillna() on your source dataframe to replace the NaN values in the paired columns with a suitable non-null value (as mentioned in the associated Discourse thread).
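
A minimal sketch of that workaround, assuming the paired columns are named fruit and color as in the reproduction code. The placeholder string is a hypothetical choice; any non-null value that does not appear in value_pairs_set should serve the same purpose.

```python
import pandas as pd

PAIR_COLUMNS = ["fruit", "color"]
PLACEHOLDER = "__missing__"  # hypothetical sentinel, not a Great Expectations value

df = pd.DataFrame(
    [
        {"idx": 1, "fruit": "apple", "color": "red"},
        {"idx": 2, "fruit": "banana", "color": "black"},
        {"idx": 3},  # both paired values missing, so they become NaN
        {"idx": 4, "fruit": "melon", "color": "melon"},
    ]
)

# Work on a copy so the source dataframe is left unchanged.
df_filled = df.copy()

# Replace NaN only in the paired columns; other columns are not touched.
df_filled[PAIR_COLUMNS] = df_filled[PAIR_COLUMNS].fillna(PLACEHOLDER)

print(df_filled)
```

Pass df_filled instead of the original dataframe when building the batch request. Because the placeholder pair is not in value_pairs_set, the affected row should then be evaluated as an ordinary unexpected pair.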

Reproduced using:

great-expectations==0.18.8
pandas==2.1.3

Code to reproduce:

import pandas as pd
import great_expectations as gx

context = gx.get_context()

# Dataset containing row with NaNs as final row.
data_1 = [
    { "idx" : 1, "fruit" : "apple", "color" : "red" },
    { "idx" : 2, "fruit" : "apple", "color" : "green" },
    { "idx" : 3, "fruit" : "apple", "color" : "yellow" },
    { "idx" : 4, "fruit" : "peach", "color" : "red" },
    { "idx" : 5, "fruit" : "banana", "color" : "yellow" },
    { "idx" : 6, "fruit" : "banana", "color" : "black" },
    { "idx" : 7},
]

df_1 = pd.DataFrame(data=data_1)

# Dataset containing row with NaNs, but not as final row.
data_2 = [
    { "idx" : 1, "fruit" : "apple", "color" : "red" },
    { "idx" : 2, "fruit" : "apple", "color" : "green" },
    { "idx" : 3, "fruit" : "apple", "color" : "yellow" },
    { "idx" : 4, "fruit" : "peach", "color" : "peach" },
    { "idx" : 5, "fruit" : "banana", "color" : "yellow" },
    { "idx" : 6, "fruit" : "banana", "color" : "black" },
    { "idx" : 7},
    { "idx" : 8, "fruit" : "melon", "color" : "melon" },
]

df_2 = pd.DataFrame(data=data_2)

DATA_SOURCE_NAME = "pandas-datasource"
DATA_ASSET_NAME = "pandas-dataframe"
EXPECTATION_SUITE_NAME = "expectations"
CHECKPOINT_NAME = "checkpoint"

data_source = context.sources.add_pandas(name=DATA_SOURCE_NAME)
data_asset = data_source.add_dataframe_asset(name=DATA_ASSET_NAME)

# When using df_1, Checkpoint runs successfully.
# Using df_2 causes the Checkpoint to error when running the Expectation.
batch_request = data_asset.build_batch_request(dataframe=df_2)

suite = context.add_or_update_expectation_suite(EXPECTATION_SUITE_NAME)

set_expectation = gx.core.ExpectationConfiguration(
    expectation_type="expect_column_pair_values_to_be_in_set",
    kwargs={
        "column_A": "fruit",
        "column_B": "color",
        "value_pairs_set": [["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]
    }
)

suite.add_expectation_configurations([set_expectation])

context.update_expectation_suite(expectation_suite=suite)

checkpoint_config = {
    "name": CHECKPOINT_NAME,
    "action_list": [],
    "validations": [{
        "expectation_suite_name": suite.expectation_suite_name,
        "batch_request": {
            "datasource_name": data_source.name,
            "data_asset_name": data_asset.name,
        },
    }],
    "config_version": 1,
    "class_name": "Checkpoint"
}

checkpoint = context.add_or_update_checkpoint(**checkpoint_config)

checkpoint_result = checkpoint.run()

validation_result_name = list(checkpoint_result["run_results"].keys())[0]
checkpoint_result["run_results"][validation_result_name]["validation_result"]
