How to set template and min,max value for a nested schema attribute #229

galaxy79 · 2023-08-25T23:23:51Z

Expected Behavior

I have a nested schema for the data set and want to use withColumnSpec to set value templates for the attributes bankAcctId, bankProduct, storeGroup, association, merchantId, and terminalId when generating the synthetic data.

from pyspark.sql.types import StringType, StructField, StructType

my_schema = StructType(
    [
        StructField(
            "bank",
            StructType(
                [
                    StructField("bankAcctId", StringType()),
                    StructField("bankProduct", StringType()),
                ]
            ),
        ),
        StructField(
            "merchDetails",
            StructType(
                [
                    StructField("storeGroup", StringType()),
                    StructField("association", StringType()),
                    StructField("merchantId", StringType()),
                    StructField(
                        "terminal",
                        StructType(
                            [
                                StructField("terminalId", StringType()),
                                StructField("cardholderActivatedTerm", StringType()),
                                StructField(
                                    "posInteractionTerminalEntryMode", StringType()
                                ),
                            ]
                        ),
                    ),
                ]
            ),
        ),
    ]
)

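I tried the code snippet below to build the synthetic data:

import dbldatagen as dg

# `spark` is an active SparkSession; `row_count` is the number of rows to generate.
testDataSpec = (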
    dg.DataGenerator(spark, name="test_data_set1", rows=row_count, partitions=4)
    .withIdOutput()
    .withSchema(my_schema)
)

testDataSpec = (
    testDataSpec.withColumnSpec("bank.bankAcctId", template=r"\\n-\\n")
    .withColumnSpec("merchDetails.storeGroup", template=r"\\n-\\n")
)
dfTestData = testDataSpec.build()

The code failed with the following error:

dbldatagen.utils.DataGenError: DataGenError(msg=' column `bank.bankAcctId` must refer to defined column', baseException=None)

I am looking for some direction or an example of how to use it.

Your Environment

Running it on a Mac M1 Pro (macOS Ventura 13.5)

dbldatagen version used: 0.3.5


ronanstokes-db · 2023-09-08T17:04:37Z

Hi

To generate data for nested structures, create temporary fields, generate the values for them, and then combine the generated fields into the desired structure. At present, you can't refer to a nested field in the data generation rules.

See the following documentation page for more information: https://databrickslabs.github.io/dbldatagen/public_docs/generating_json_data.html#generating-complex-column-data

I'll update the documentation to provide some clearer examples of creating the data from an existing schema.
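
As a rough illustration of the approach described above, here is a minimal sketch for the `bank` struct only. It assumes the pattern shown on the linked documentation page: temporary columns marked `omit=True`, combined with a `named_struct` SQL expression. The `tmp_` column names, the `named_struct_expr` helper, the `build_nested_spec` function, and the `bankProduct` values are all hypothetical and used only for illustration. The template string is copied unchanged from the question.

```python
def named_struct_expr(fields):
    """Build a Spark SQL named_struct(...) expression string.

    `fields` maps each struct field name to the SQL expression (here, a
    temporary column name) that supplies its value.
    """
    args = ", ".join(f"'{name}', {source}" for name, source in fields.items())
    return f"named_struct({args})"


def build_nested_spec(spark, rows=1000):
    # Imported lazily so the helper above can be used without Spark installed.
    import dbldatagen as dg

    bank_expr = named_struct_expr(
        {"bankAcctId": "tmp_bankAcctId", "bankProduct": "tmp_bankProduct"}
    )
    return (
        dg.DataGenerator(spark, name="test_data_set1", rows=rows, partitions=4)
        # Temporary flat columns; omit=True keeps them out of the final output.
        .withColumn("tmp_bankAcctId", "string", template=r"\\n-\\n", omit=True)
        # Hypothetical product values, chosen only for this sketch.
        .withColumn("tmp_bankProduct", "string",
                    values=["checking", "savings"], omit=True)
        # Combine the temporary columns into the nested `bank` struct.
        .withColumn(
            "bank",
            "struct<bankAcctId:string,bankProduct:string>",
            expr=bank_expr,
            baseColumn=["tmp_bankAcctId", "tmp_bankProduct"],
        )
    )
```

Calling `build_nested_spec(spark).build()` should then yield a DataFrame with a single nested `bank` column. Deeper structures such as `merchDetails.terminal` would presumably follow the same pattern, building the inner struct first and listing it as a base column of the outer one.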

ronanstokes-db self-assigned this Sep 8, 2023

ronanstokes-db added the documentation Improvements or additions to documentation label Sep 8, 2023

ronanstokes-db added this to the v0.3.6 milestone Sep 8, 2023
