[MAINTENANCE] Remove batching regex from FilePathDataConnector #9898
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##           develop    #9898     +/-  ##
==========================================
- Coverage    77.85%   77.83%   -0.02%
==========================================
  Files          454      454
  Lines        39517    39533      +16
==========================================
+ Hits         30766    30771       +5
- Misses        8751     8762      +11
```

Flags with carried forward coverage won't be shown.
```python
@pytest.mark.big
@pytest.mark.xfail(
    reason="Accessing objects on azure.storage.blob using Pandas is not working, due to local credentials issues (this test is conducted using Jupyter notebook manually)."  # noqa: E501
)
def test_get_batch_list_from_fully_specified_batch_request(
    monkeypatch: pytest.MonkeyPatch,
    pandas_abs_datasource: PandasAzureBlobStorageDatasource,
):
    azure_client: azure.BlobServiceClient = cast(azure.BlobServiceClient, MockBlobServiceClient())

    def instantiate_azure_client_spy(self) -> None:
        self._azure_client = azure_client

    monkeypatch.setattr(
        great_expectations.execution_engine.pandas_execution_engine.PandasExecutionEngine,
        "_instantiate_s3_client",
        instantiate_azure_client_spy,
        raising=True,
    )
    asset = pandas_abs_datasource.add_csv_asset(
        name="csv_asset",
        batching_regex=r"(?P<name>.+)_(?P<timestamp>.+)_(?P<price>\d{4})\.csv",
        abs_container="my_container",
    )

    request = asset.build_batch_request({"name": "alex", "timestamp": "20200819", "price": "1300"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 1
    batch = batches[0]
    assert batch.batch_request.datasource_name == pandas_abs_datasource.name
    assert batch.batch_request.data_asset_name == asset.name
    assert batch.batch_request.options == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
    }
    assert batch.metadata == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
    }
    assert batch.id == "pandas_abs_datasource-csv_asset-name_alex-timestamp_20200819-price_1300"

    request = asset.build_batch_request({"name": "alex"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 2
```
it looks like this test was never completed; I don't see any issue with "local credentials" as mentioned in the `pytest.mark.xfail` reason, it just looks like the mock ABS backend was never finished/implemented correctly.
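If it helps, here is a minimal sketch of what a finished mock might look like. The method names mirror the real `azure.storage.blob.BlobServiceClient` API; this is illustrative only, not the existing `MockBlobServiceClient`:

```python
# Illustrative sketch only: a mock that actually returns blob listings,
# using method names that mirror azure.storage.blob.BlobServiceClient.
class _FakeBlob:
    def __init__(self, name: str) -> None:
        self.name = name


class _FakeContainerClient:
    def list_blobs(self, name_starts_with: str = "", **kwargs):
        # Return names the batching regex can match, so the data connector
        # finds data references during test_connection().
        return iter(
            [
                _FakeBlob("alex_20200819_1300.csv"),
                _FakeBlob("alex_20200820_1300.csv"),
            ]
        )


class _FakeBlobServiceClient:
    def get_container_client(self, container: str) -> _FakeContainerClient:
        return _FakeContainerClient()
```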
it raises a `TestConnectionError` because the mock ABS client isn't returning anything:

```python
    def test_connection(self) -> None:
        """Test the connection for the DataAsset.

        Raises:
            TestConnectionError: If the connection test fails.
        """
        try:
            if self._data_connector.test_connection():
                return None
        except Exception as e:
            raise TestConnectionError(  # noqa: TRY003
                f"Could not connect to asset using {type(self._data_connector).__name__}: Got {type(e).__name__}"  # noqa: E501
            ) from e
>       raise TestConnectionError(self._test_connection_error_message)
E       great_expectations.datasource.fluent.interfaces.TestConnectionError: No file in bucket "test_bucket" with prefix "" and recursive file discovery set to "False" matched regular expressions pattern "(?P<name>.+)_(?P<timestamp>.+)_(?P<price>\d{4})\.csv" using delimiter "/" for DataAsset "csv_asset".
```
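As a quick sanity check (illustrative, not part of this change): the batching regex itself does match the file name the test asserts on, so the failure is in the mock listing rather than the pattern:

```python
import re

# The pattern and file name below are copied from the test itself.
pattern = re.compile(r"(?P<name>.+)_(?P<timestamp>.+)_(?P<price>\d{4})\.csv")
match = pattern.match("alex_20200819_1300.csv")
assert match is not None
assert match.groupdict() == {"name": "alex", "timestamp": "20200819", "price": "1300"}
```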
```python
@pytest.mark.big
@pytest.mark.xfail(
    reason="Accessing objects on google.cloud.storage using Pandas is not working, due to local credentials issues (this test is conducted using Jupyter notebook manually)."  # noqa: E501
)
def test_get_batch_list_from_fully_specified_batch_request(
    monkeypatch: pytest.MonkeyPatch,
    pandas_gcs_datasource: PandasGoogleCloudStorageDatasource,
):
    gcs_client: google.Client = cast(google.Client, MockGCSClient())

    def instantiate_gcs_client_spy(self) -> None:
        self._gcs = gcs_client

    monkeypatch.setattr(
        great_expectations.execution_engine.pandas_execution_engine.PandasExecutionEngine,
        "_instantiate_s3_client",
        instantiate_gcs_client_spy,
        raising=True,
    )
    asset = pandas_gcs_datasource.add_csv_asset(
        name="csv_asset",
        batching_regex=r"(?P<name>.+)_(?P<timestamp>.+)_(?P<price>\d{4})\.csv",
    )

    request = asset.build_batch_request({"name": "alex", "timestamp": "20200819", "price": "1300"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 1
    batch = batches[0]
    assert batch.batch_request.datasource_name == pandas_gcs_datasource.name
    assert batch.batch_request.data_asset_name == asset.name
    assert batch.batch_request.options == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
    }
    assert batch.metadata == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
    }
    assert batch.id == "pandas_gcs_datasource-csv_asset-name_alex-timestamp_20200819-price_1300"

    request = asset.build_batch_request({"name": "alex"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 2
```
I didn't dig into this test as deeply as I did the pandas ABS test above, but the `pytest.mark.xfail` message is exactly the same, the implementation looks almost identical, and it also doesn't pass, so it seems safe to assume that this test was also never completed.
```python
@pytest.mark.big
@pytest.mark.xfail(
    reason="Accessing objects on azure.storage.blob using Spark is not working, due to local credentials issues (this test is conducted using Jupyter notebook manually)."  # noqa: E501
)
def test_get_batch_list_from_fully_specified_batch_request(
    monkeypatch: pytest.MonkeyPatch,
    spark_abs_datasource: SparkAzureBlobStorageDatasource,
):
    azure_client: azure.BlobServiceClient = cast(azure.BlobServiceClient, MockBlobServiceClient())

    def instantiate_azure_client_spy(self) -> None:
        self._azure_client = azure_client

    monkeypatch.setattr(
        great_expectations.execution_engine.sparkdf_execution_engine.SparkDFExecutionEngine,
        "_instantiate_s3_client",
        instantiate_azure_client_spy,
        raising=True,
    )
    asset_specified_metadata = {"asset_level_metadata": "my_metadata"}
    asset = spark_abs_datasource.add_csv_asset(
        name="csv_asset",
        batching_regex=r"(?P<name>.+)_(?P<timestamp>.+)_(?P<price>\d{4})\.csv",
        abs_container="my_container",
        batch_metadata=asset_specified_metadata,
    )

    request = asset.build_batch_request({"name": "alex", "timestamp": "20200819", "price": "1300"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 1
    batch = batches[0]
    assert batch.batch_request.datasource_name == spark_abs_datasource.name
    assert batch.batch_request.data_asset_name == asset.name
    assert batch.batch_request.options == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
    }
    assert batch.metadata == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
        **asset_specified_metadata,
    }
    assert batch.id == "spark_abs_datasource-csv_asset-name_alex-timestamp_20200819-price_1300"

    request = asset.build_batch_request({"name": "alex"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 2
```
this test fails because the `SparkDFExecutionEngine` doesn't have the attribute that we're trying to patch:

```python
>       monkeypatch.setattr(
            great_expectations.execution_engine.sparkdf_execution_engine.SparkDFExecutionEngine,
            "_instantiate_s3_client",
            instantiate_azure_client_spy,
            raising=True,
        )
E       AttributeError: <class 'great_expectations.execution_engine.sparkdf_execution_engine.SparkDFExecutionEngine'> has no attribute '_instantiate_s3_client'

test_spark_azure_blob_storage_datasource.py:354: AttributeError
```
As far as I can tell this test was never completed.
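For what it's worth, `raising=True` is what surfaces this: `monkeypatch.setattr` refuses to create an attribute that doesn't already exist on the target. A standalone sketch reproducing the failure mode (hypothetical class, not GX code):

```python
import pytest


class FakeEngine:
    """Hypothetical stand-in for an engine with no _instantiate_s3_client attribute."""


def test_setattr_on_missing_attribute_raises(monkeypatch: pytest.MonkeyPatch) -> None:
    # raising=True (also the default) makes setattr fail fast instead of
    # silently attaching a new attribute that production code never calls.
    with pytest.raises(AttributeError):
        monkeypatch.setattr(FakeEngine, "_instantiate_s3_client", lambda self: None, raising=True)
```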
```python
@pytest.mark.big
@pytest.mark.xfail(
    reason="Accessing objects on google.cloud.storage using Spark is not working, due to local credentials issues (this test is conducted using Jupyter notebook manually)."  # noqa: E501
)
def test_get_batch_list_from_fully_specified_batch_request(
    monkeypatch: pytest.MonkeyPatch,
    spark_gcs_datasource: SparkGoogleCloudStorageDatasource,
):
    gcs_client: google.Client = cast(google.Client, MockGCSClient())

    def instantiate_gcs_client_spy(self) -> None:
        self._gcs = gcs_client

    monkeypatch.setattr(
        great_expectations.execution_engine.sparkdf_execution_engine.SparkDFExecutionEngine,
        "_instantiate_s3_client",
        instantiate_gcs_client_spy,
        raising=True,
    )
    asset_specified_metadata = {"asset_level_metadata": "my_metadata"}
    asset = spark_gcs_datasource.add_csv_asset(
        name="csv_asset",
        batching_regex=r"(?P<name>.+)_(?P<timestamp>.+)_(?P<price>\d{4})\.csv",
        batch_metadata=asset_specified_metadata,
    )

    request = asset.build_batch_request({"name": "alex", "timestamp": "20200819", "price": "1300"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 1
    batch = batches[0]
    assert batch.batch_request.datasource_name == spark_gcs_datasource.name
    assert batch.batch_request.data_asset_name == asset.name
    assert batch.batch_request.options == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
    }
    assert batch.metadata == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
        **asset_specified_metadata,
    }
    assert batch.id == "spark_gcs_datasource-csv_asset-name_alex-timestamp_20200819-price_1300"

    request = asset.build_batch_request({"name": "alex"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 2
```
this test fails because the `SparkDFExecutionEngine` doesn't have the attribute that we're trying to patch:

```python
>       monkeypatch.setattr(
            great_expectations.execution_engine.sparkdf_execution_engine.SparkDFExecutionEngine,
            "_instantiate_s3_client",
            instantiate_gcs_client_spy,
            raising=True,
        )
E       AttributeError: <class 'great_expectations.execution_engine.sparkdf_execution_engine.SparkDFExecutionEngine'> has no attribute '_instantiate_s3_client'
```
As far as I can tell this test was never completed.
the execution engine also doesn't have `_instantiate_gcs_client`, which would make more sense than `_instantiate_s3_client` as written.
```python
@pytest.mark.big
@pytest.mark.xfail(
    reason="Accessing objects on moto.mock_s3 using Spark is not working (this test is conducted using Jupyter notebook manually)."  # noqa: E501
)
def test_get_batch_list_from_fully_specified_batch_request(
    spark_s3_datasource: SparkS3Datasource,
):
    asset_specified_metadata = {"asset_level_metadata": "my_metadata"}
    asset = spark_s3_datasource.add_csv_asset(
        name="csv_asset",
        batching_regex=r"(?P<name>.+)_(?P<timestamp>.+)_(?P<price>\d{4})\.csv",
        header=True,
        infer_schema=True,
        batch_metadata=asset_specified_metadata,
    )
    asset.build_batch_request({"year": "2024", "month": 5})

    request = asset.build_batch_request({"name": "alex", "timestamp": "20200819", "price": "1300"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 1
    batch = batches[0]
    assert batch.batch_request.datasource_name == spark_s3_datasource.name
    assert batch.batch_request.data_asset_name == asset.name
    assert batch.batch_request.options == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
    }
    assert batch.metadata == {
        "path": "alex_20200819_1300.csv",
        "name": "alex",
        "timestamp": "20200819",
        "price": "1300",
        **asset_specified_metadata,
    }
    assert batch.id == "spark_s3_datasource-csv_asset-name_alex-timestamp_20200819-price_1300"

    request = asset.build_batch_request({"name": "alex"})
    batches = asset.get_batch_list_from_batch_request(request)
    assert len(batches) == 2
```
this test fails with an opaque `Py4JJavaError`, which makes me wonder if this datasource is actually functional:

```python
        request = asset.build_batch_request({"name": "alex", "timestamp": "20200819", "price": "1300"})
>       batches = asset.get_batch_list_from_batch_request(request)

test_spark_s3_datasource.py:214:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../great_expectations/datasource/fluent/data_asset/path/path_data_asset.py:144: in get_batch_list_from_batch_request
    data, markers = execution_engine.get_batch_data_and_markers(batch_spec=batch_spec)
../../../great_expectations/execution_engine/sparkdf_execution_engine.py:515: in get_batch_data_and_markers
    batch_data = reader_fn(path)
../../../env/lib/python3.10/site-packages/pyspark/sql/readwriter.py:727: in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
../../../env/lib/python3.10/site-packages/py4j/java_gateway.py:1322: in __call__
    return_value = get_return_value(
../../../env/lib/python3.10/site-packages/pyspark/errors/exceptions/captured.py:169: in deco
    return f(*a, **kw)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

answer = 'xro29'
gateway_client = <py4j.clientserver.JavaClient object at 0x2898baef0>
target_id = 'o26', name = 'csv'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.

        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.

        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
>                   raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
                        format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
E                   : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
E                       at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
E                       at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
E                       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
E                       at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
E                       at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
E                       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
E                       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
E                       at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
E                       at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:724)
E                       at scala.collection.immutable.List.map(List.scala:293)
E                       at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:722)
E                       at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:551)
E
```
I believe one needs to install an AWS-provided JAR (the `hadoop-aws` package, which supplies the missing `org.apache.hadoop.fs.s3a.S3AFileSystem` class) to get Spark to be able to read from S3.
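A minimal sketch of what that configuration might look like; the `hadoop-aws` version shown is an assumed example (it must match the Hadoop build Spark ships with), and the bucket name is illustrative:

```python
from pyspark.sql import SparkSession

# Pull in the S3A filesystem implementation (org.apache.hadoop.fs.s3a.S3AFileSystem)
# via the hadoop-aws package; the version below is an assumed example.
spark = (
    SparkSession.builder.appName("s3-read-example")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .getOrCreate()
)

# With the S3A classes on the classpath, s3a:// paths become readable:
df = spark.read.csv("s3a://my-bucket/alex_20200819_1300.csv", header=True)
```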
I left a comment and a question but nothing that is blocking. Thanks!
```python
            return []
        regex_parser = RegExParser(
            regex_pattern=partitioner.regex,
            unnamed_regex_group_prefix=self._unnamed_regex_param_prefix,
```
Do we need this `unnamed_regex_param_prefix` anymore?
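For context, a quick illustration of what the prefix does (the prefix value below is an assumption about the default): unnamed regex groups are keyed by prefix plus group index instead of by a group name:

```python
import re

pattern = re.compile(r"(.+)_(\d{4})\.csv")  # two unnamed groups
prefix = "batch_request_param_"  # assumed default for _unnamed_regex_param_prefix

match = pattern.match("alex_2024.csv")
assert match is not None
# Keys are synthesized from the prefix and 1-based group index:
params = {f"{prefix}{i}": group for i, group in enumerate(match.groups(), start=1)}
assert params == {"batch_request_param_1": "alex", "batch_request_param_2": "2024"}
```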
```diff
@@ -143,7 +136,10 @@ def build_batch_spec(self, batch_definition: LegacyBatchDefinition) -> PathBatchSpec:
     # Interface Method
     @override
     def get_data_reference_count(self) -> int:
-        data_references = self._get_data_references_cache(batching_regex=self._batching_regex)
+        # todo: in the world of BatchDefinition, this method must accept a BatchRequest.
```
Can we file a ticket for this?
This PR updates `FileDataAsset`s and `DirectoryDataAsset`s to not require a `batching_regex` parameter on instantiation.

Changes

Other things to note

- `FilePathDataConnector` has been slightly refactored to make it clearer that its behavior differs significantly based on what sort of Asset it is serving.
- `FileDataAsset`s did preprocessing of their batching regex inside their init method; that has now been moved to the `_preprocess_batching_regex` method.
- `test_connection` logic for `FileDataAsset` and `DirectoryDataAsset` is not that useful in V1, since the specific `batching_regex` used to identify files is no longer owned by the asset. The majority of those tests have been removed, and future work will add `test_connection` functionality to the `BatchDefinition`.
- `test_get_batch_list_from_fully_specified_batch_request` tests for specific datasources/assets were marked as xfail for reasons that proved not to be accurate. They've been removed, and we should verify that we have adequate test coverage for those code paths. They've all been annotated with comments in this PR.
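For context, a hypothetical sketch of the direction this enables; the accessor and method names below are illustrative assumptions about the post-migration API, not its finalized form:

```python
import great_expectations as gx

context = gx.get_context()
datasource = context.sources.add_pandas_filesystem(  # accessor name assumed
    name="my_datasource", base_directory="./data"
)

# The asset no longer needs a batching_regex at creation time:
asset = datasource.add_csv_asset(name="csv_asset")

# The pattern that identifies files travels with the batch definition instead
# (method name is an assumption):
batch_definition = asset.add_batch_definition_path(name="single file", path="alex_20200819_1300.csv")
```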