Data Docs in Azure ADLS $web subdirectory not working #9632

Open
damczyk opened this issue Mar 18, 2024 · 0 comments

Describe the bug
Please see also https://discourse.greatexpectations.io/t/data-docs-in-azure-adls-web-subdirectory-not-working/1627

Hello everyone,

Environment
Operating system: Databricks + Azure ADLS Gen2
Version numbers:
Programming language: Python
Problem Statement:
I was trying to do the following:
I configured GX so that my GX artifacts (Expectations, Checkpoints, etc.) are stored in an ADLS Gen2 project container. In addition, I want to publish the Data Docs from executed checkpoints as a static website in the $web container. Because several other projects use the same Azure storage account, I want to put my Data Docs in the $web/ct10/ subdirectory (ct10 is my project), and I expect the main page index.html to show a list of executed checkpoint results. This works when I use the root of $web instead of $web/ct10/.

Instead, this is happening:
After executing some checkpoints, index.html shows no list of results.

Reproducing the issue:
My code:


import great_expectations as gx

# Root directory for the GX artifacts on the mounted project container
context_root_dir = f"/dbfs{project.MNT_PATH}/DataQuality/GX/"

project_config = gx.data_context.types.base.DataContextConfig(
    # Local storage backend
    store_backend_defaults=gx.data_context.types.base.FilesystemStoreBackendDefaults(
        root_directory=context_root_dir
    ),
    # Data Docs site storage
    data_docs_sites={
        "az_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleAzureBlobStoreBackend",
                "container": "$web",
                "prefix": "ct10/",
                "connection_string": "DefaultEndpointsProtocol=https;AccountName=<storage_account>;AccountKey=<key>;EndpointSuffix=core.windows.net",
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": True,
            },
        }
    },
)

context = gx.get_context(project_config=project_config)

# Build a batch request for the Spark DataFrame to validate
data_source_name = f"{get_hive_table_db_name()}".lower()
data_asset_name = f"{get_product_name()}".lower()
batch_request = (
    context.sources
    .add_or_update_spark(name=data_source_name)
    .add_dataframe_asset(name=data_asset_name, dataframe=df_Staging)
    .build_batch_request()
)

# Run the default onboarding profiler on the batch request
onboarding_data_assistant_result = context.assistants.onboarding.run(
    batch_request=batch_request,
    exclude_column_names=[],
    estimation="flag_outliers",  # default: "exact"
)

# Get the suite with a specific name
onboarding_suite_name = "onboarding_" + data_source_name + "_" + data_asset_name
onboarding_suite = onboarding_data_assistant_result.get_expectation_suite(
    expectation_suite_name=onboarding_suite_name
)

# Persist the expectation suite under the suite name from above
context.add_or_update_expectation_suite(expectation_suite=onboarding_suite)

onboarding_checkpoint_name = "onboarding_" + data_source_name + "_" + data_asset_name

# Create and persist the checkpoint to reuse it for multiple batches
context.add_or_update_checkpoint(
    name=onboarding_checkpoint_name,
    config_version=1,
    class_name="SimpleCheckpoint",
    validations=[
        {"expectation_suite_name": onboarding_suite_name}
    ],
)

# Run the onboarding checkpoint
checkpoint_result = context.run_checkpoint(
    checkpoint_name=onboarding_checkpoint_name,
    batch_request=batch_request,
)

Additional information:
The documentation is at least created, but it is only reachable via a deep link.

The newest release, 0.18.10, was used.
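
To narrow down whether the index page is missing or merely empty, it can help to list what was actually written under the prefix. The following is a minimal diagnostic sketch and not part of GX: `summarize_data_docs` and `list_blob_names` are hypothetical helper names, the blob names in the example are made up, and `list_blob_names` assumes that the `azure-storage-blob` package is installed and that the connection string from the configuration above is valid.

```python
from typing import Dict, Iterable, List, Union


def summarize_data_docs(
    blob_names: Iterable[str], prefix: str
) -> Dict[str, Union[bool, List[str]]]:
    """Report whether <prefix>index.html exists and which other HTML pages sit under the prefix."""
    names = list(blob_names)
    index_name = f"{prefix}index.html"
    pages = sorted(
        name
        for name in names
        if name.startswith(prefix) and name.endswith(".html") and name != index_name
    )
    return {"has_index": index_name in names, "pages": pages}


def list_blob_names(connection_string: str, container: str, prefix: str) -> List[str]:
    """List blob names under a prefix (needs azure-storage-blob and network access)."""
    # Imported lazily so the pure helper above works without the Azure SDK.
    from azure.storage.blob import ContainerClient

    client = ContainerClient.from_connection_string(
        connection_string, container_name=container
    )
    return [blob.name for blob in client.list_blobs(name_starts_with=prefix)]


# Example with made-up blob names (for illustration only)
example = summarize_data_docs(
    ["ct10/index.html", "ct10/validations/run_1.html", "other/index.html"],
    prefix="ct10/",
)
print(example)
```

Against the real storage account, the call would look like `summarize_data_docs(list_blob_names(conn_str, "$web", "ct10/"), "ct10/")`. If pages are listed but the index shows nothing, the problem is more likely in how the index is built than in the upload itself.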

Expected behavior
I expect to be able to use subdirectories, with .html files written with content type "text/html" to the $web container on ADLS Gen2.

Additional context
When I use TupleFilesystemStoreBackend instead of TupleAzureBlobStoreBackend, the Data Docs are written to the desired subfolder, but with content type application/octet-stream. Such files cannot be opened directly in a browser (they have to be downloaded and opened locally), so that is not an option either.
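
For context, a blob's content type in Azure Blob Storage is set when the blob is uploaded, and application/octet-stream is the usual default when none is given. The sketch below shows one way a file could be uploaded with an explicit content type outside of GX. It is an illustration under assumptions, not a fix for the store backend: `guess_content_type` and `upload_with_content_type` are hypothetical helper names, and the upload function assumes the `azure-storage-blob` package and a valid connection string.

```python
import mimetypes


def guess_content_type(blob_name: str) -> str:
    """Guess a content type from the file extension, falling back to a generic binary type."""
    content_type, _encoding = mimetypes.guess_type(blob_name)
    return content_type or "application/octet-stream"


def upload_with_content_type(
    connection_string: str, container: str, blob_name: str, data: bytes
) -> None:
    """Upload data and set the content type explicitly (needs azure-storage-blob)."""
    # Imported lazily so guess_content_type works without the Azure SDK.
    from azure.storage.blob import BlobClient, ContentSettings

    blob = BlobClient.from_connection_string(
        connection_string, container_name=container, blob_name=blob_name
    )
    blob.upload_blob(
        data,
        overwrite=True,
        content_settings=ContentSettings(content_type=guess_content_type(blob_name)),
    )


print(guess_content_type("ct10/index.html"))
```

This only demonstrates how the content type is attached to a blob; it does not replace the index generation that the SiteBuilder performs.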

Using

data_docs_sites={
    "az_site": {
        "class_name": "SiteBuilder",
        "store_backend": {
            "class_name": "TupleAzureBlobStoreBackend",
            "container": "$web/subfolder",
            "connection_string": "DefaultEndpointsProtocol=https;AccountName=<storage_account>;AccountKey=<key>;EndpointSuffix=core.windows.net",
        },
        "site_index_builder": {
            "class_name": "DefaultSiteIndexBuilder",
        },
    }
},

the files are written to the /subfolder subdirectory, and the .html files have content type text/html.
Unfortunately, I get the following error at the end, and no index.html is generated in $web/subfolder:

HttpResponseError: The requested URI does not represent any resource on the server.
RequestId:e02116b8-601e-0035-66ef-6a7eae000000
Time:2024-02-29T09:11:25.6007194Z
ErrorCode:InvalidUri
Content: <?xml version="1.0" encoding="utf-8"?>
<Error><Code>InvalidUri</Code><Message>The requested URI does not represent any resource on the server.
RequestId:e02116b8-601e-0035-66ef-6a7eae000000
Time:2024-02-29T09:11:25.6007194Z</Message></Error>
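
One possible reading of this error, which is an assumption and not confirmed by the maintainers, is that "$web/subfolder" is not a valid container name. Azure container names cannot contain a slash, so blob-level writes may still resolve to container $web plus a blob path, while container-level requests fail with InvalidUri. The sketch below shows how such a location splits into a container and a blob prefix; `split_container_and_prefix` is a hypothetical helper, not a GX or Azure SDK function.

```python
from typing import Tuple


def split_container_and_prefix(location: str) -> Tuple[str, str]:
    """Split 'container/some/path' into ('container', 'some/path/')."""
    container, _, prefix = location.strip("/").partition("/")
    # Normalize the prefix so that it can be prepended directly to blob names.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return container, prefix


container, prefix = split_container_and_prefix("$web/subfolder")
print(container, prefix)
```

Under this reading, the subdirectory belongs in a prefix setting rather than in the container name, which is what the first configuration in this report attempts with "prefix": "ct10/".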

Using the setting

"filepath_prefix": "subfolder/",

I get errors like those described here:
https://discourse.greatexpectations.io/t/data-docs-in-azure-adls-web-subdirectory-not-working/1627/6?u=hdamczy

Kilo59 added the azure label May 1, 2024