expect_column_pair_values_to_be_in_set throws exception when row has both column values to be paired missing #9577

Open
MaxTh0ma1s opened this issue Mar 5, 2024 · 1 comment

@MaxTh0ma1s

Describe the bug
A simple expect_column_pair_values_to_be_in_set expectation throws an exception when a row is missing both of the column values to be paired.

To Reproduce
Basic setup, with the following expectation configured:

{
"expectation_type": "expect_column_pair_values_to_be_in_set",
"kwargs": {
"column_A": "mycolA",
"column_B": "mycolB",
"value_pairs_set": [["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]
}
}

Sample data to reproduce

id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,,,fail
8,melon,melon,fail
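
For reference, here is a small sketch of how pandas represents row 7 once the sample above is parsed. It assumes the CSV is loaded with `pd.read_csv` (the issue does not say how the file was loaded); the variable names are illustrative only.

```python
import io

import pandas as pd

# Sample data copied from the report above.
SAMPLE_CSV = """id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,,,fail
8,melon,melon,fail
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Empty CSV fields are parsed as NaN by default, so row 7 is null in both
# of the columns that the expectation pairs.
missing_both = df[df["mycolA"].isna() & df["mycolB"].isna()]
print(missing_both)
```

Only the row with id 7 is selected, which is the row that triggers the failure described in this issue.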

The following exception is raised:

    "exception_info": {
      "exception_traceback": "Traceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 548, in _process_direct_and_bundled_metric_computation_configurations\n    ] = metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/expectations/metrics/map_metric_provider/map_condition_auxilliary_methods.py\", line 201, in _pandas_map_condition_query\n    domain_values_df_filtered = domain_records_df[boolean_mapped_unexpected_values]\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3884, in __getitem__\n    return self._getitem_bool_array(key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3940, in _getitem_bool_array\n    key = check_bool_indexer(self.index, key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/indexing.py\", line 2575, in check_bool_indexer\n    raise IndexingError(\npandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/validator/validation_graph.py\", line 285, in _resolve\n    self._execution_engine.resolve_metrics(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 283, in resolve_metrics\n    return self._process_direct_and_bundled_metric_computation_configurations(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 552, in _process_direct_and_bundled_metric_computation_configurations\n    raise gx_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n",
      "exception_message": "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).",
      "raised_exception": true
    }
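
To illustrate what the underlying pandas error means, independent of Great Expectations, here is a minimal sketch that triggers the same `IndexingError`. It is not a diagnosis of the Great Expectations internals: it only assumes a boolean mask whose index lacks a label present in the DataFrame. The data and names are made up for the example.

```python
import pandas as pd
from pandas.errors import IndexingError

# A frame with three rows, labelled 0, 1 and 2.
records = pd.DataFrame({"fruit": ["apple", None, "melon"]}, index=[0, 1, 2])

# A boolean mask that has no entry for label 1 (hypothetically, because the
# row with missing values was dropped before the mask was computed).
mask = pd.Series([True, True], index=[0, 2])

error_message = None
try:
    # pandas cannot align the mask with the frame, so it refuses to index.
    records[mask]
except IndexingError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed here begins with the same "Unalignable boolean Series provided as indexer" text that appears in the traceback above.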

Expected behavior
I do not expect basic use of this expectation on simple data to throw an exception.

Environment (please complete the following information):

  • Operating System: macOS
  • Great Expectations Version: 0.18.8
  • Data Source: .csv sample data listed above
  • Pandas Version: 2.1.4

Additional context
Note: this issue was reproduced by Rachel House in the following Discourse thread:
https://discourse.greatexpectations.io/t/how-to-specify-expect-column-pair-values-to-be-in-set-value-pairs-set-input-arg-via-json/1621/5

@rachhouse
Contributor

Hi @MaxTh0ma1s, thanks for creating this issue. I discussed the error and behavior internally with Engineering today. They're now aware of it, but I don't know when a fix will be prioritized. I have added my code to reproduce the issue to aid investigation.

As a workaround for now, I suggest modifying your source dataframe using .fillna() to replace the NaN values with another suitable non-null value (as mentioned in the associated Discourse thread).
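
A minimal sketch of that workaround, using the column names from the reproduction code in this comment. The placeholder string `"MISSING"` is a hypothetical choice; any non-null value that is not part of `value_pairs_set` should behave the same way.

```python
import pandas as pd

# A small frame with one row that is null in both paired columns.
df = pd.DataFrame(
    [
        {"idx": 6, "fruit": "banana", "color": "black"},
        {"idx": 7},
        {"idx": 8, "fruit": "melon", "color": "melon"},
    ]
)

# Hypothetical placeholder; pick a value that cannot occur in real data.
PLACEHOLDER = "MISSING"

# Fill only the two columns used by the expectation, leaving others untouched.
df_filled = df.fillna({"fruit": PLACEHOLDER, "color": PLACEHOLDER})
print(df_filled)
```

Because the pair `("MISSING", "MISSING")` is not in `value_pairs_set`, the filled row would be expected to be reported as an unexpected value rather than being treated as null. Pass `df_filled` instead of the original dataframe when building the batch request.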

Reproduced using:

great-expectations==0.18.8
pandas==2.1.3

Code to reproduce:

import pandas as pd
import great_expectations as gx

context = gx.get_context()

# Dataset containing row with NaNs as final row.
data_1 = [
    { "idx" : 1, "fruit" : "apple", "color" : "red" },
    { "idx" : 2, "fruit" : "apple", "color" : "green" },
    { "idx" : 3, "fruit" : "apple", "color" : "yellow" },
    { "idx" : 4, "fruit" : "peach", "color" : "red" },
    { "idx" : 5, "fruit" : "banana", "color" : "yellow" },
    { "idx" : 6, "fruit" : "banana", "color" : "black" },
    { "idx" : 7},
]

df_1 = pd.DataFrame(data=data_1)

# Dataset containing row with NaNs, but not as final row.
data_2 = [
    { "idx" : 1, "fruit" : "apple", "color" : "red" },
    { "idx" : 2, "fruit" : "apple", "color" : "green" },
    { "idx" : 3, "fruit" : "apple", "color" : "yellow" },
    { "idx" : 4, "fruit" : "peach", "color" : "peach" },
    { "idx" : 5, "fruit" : "banana", "color" : "yellow" },
    { "idx" : 6, "fruit" : "banana", "color" : "black" },
    { "idx" : 7},
    { "idx" : 8, "fruit" : "melon", "color" : "melon" },
]

df_2 = pd.DataFrame(data=data_2)

DATA_SOURCE_NAME = "pandas-datasource"
DATA_ASSET_NAME = "pandas-dataframe"
EXPECTATION_SUITE_NAME = "expectations"
CHECKPOINT_NAME = "checkpoint"

data_source = context.sources.add_pandas(name=DATA_SOURCE_NAME)
data_asset = data_source.add_dataframe_asset(name=DATA_ASSET_NAME)

# When using df_1, Checkpoint runs successfully.
# Using df_2 causes the Checkpoint to error when running the Expectation.
batch_request = data_asset.build_batch_request(dataframe=df_2)

suite = context.add_or_update_expectation_suite(EXPECTATION_SUITE_NAME)

set_expectation = gx.core.ExpectationConfiguration(
    expectation_type="expect_column_pair_values_to_be_in_set",
    kwargs={
        "column_A": "fruit",
        "column_B": "color",
        "value_pairs_set": [["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]
    }
)

suite.add_expectation_configurations([set_expectation])

context.update_expectation_suite(expectation_suite=suite)

checkpoint_config = {
    "name": CHECKPOINT_NAME,
    "action_list": [],
    "validations": [{
        "expectation_suite_name": suite.expectation_suite_name,
        "batch_request": {
            "datasource_name": data_source.name,
            "data_asset_name": data_asset.name,
        },
    }],
    "config_version": 1,
    "class_name": "Checkpoint",
}

checkpoint = context.add_or_update_checkpoint(**checkpoint_config)

checkpoint_result = checkpoint.run()

validation_result_name = list(checkpoint_result["run_results"].keys())[0]
checkpoint_result["run_results"][validation_result_name]["validation_result"]
