How to set template and min,max value for a nested schema attribute #229

Open
galaxy79 opened this issue Aug 25, 2023 · 1 comment
@galaxy79

Expected Behavior

I have a nested schema for the data set and want to set value template patterns for the attributes bankAcctId, bankProduct, storeGroup, association, merchantId, and terminalId, using withColumnSpec to generate the synthetic data.

from pyspark.sql.types import StringType, StructField, StructType

my_schema = StructType(
    [
        StructField(
            "bank",
            StructType(
                [
                    StructField("bankAcctId", StringType()),
                    StructField("bankProduct", StringType()),
                ]
            ),
        ),
        StructField(
            "merchDetails",
            StructType(
                [
                    StructField("storeGroup", StringType()),
                    StructField("association", StringType()),
                    StructField("merchantId", StringType()),
                    StructField(
                        "terminal",
                        StructType(
                            [
                                StructField("terminalId", StringType()),
                                StructField("cardholderActivatedTerm", StringType()),
                                StructField(
                                    "posInteractionTerminalEntryMode", StringType()
                                ),
                            ]
                        ),
                    ),
                ]
            ),
        ),
    ]
)

I tried the following snippet to build the synthetic data:

import dbldatagen as dg

testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=row_count, partitions=4)
    .withIdOutput()
    .withSchema(my_schema)
)

testDataSpec = (
    testDataSpec.withColumnSpec("bank.bankAcctId", template=r"\\n-\\n")
    .withColumnSpec("merchDetails.storeGroup", template=r"\\n-\\n")
)
dfTestData = testDataSpec.build()

Execution failed with the following error:

dbldatagen.utils.DataGenError: DataGenError(msg=' column `bank.bankAcctId` must refer to defined column', baseException=None)

I am looking for some direction or an example of how to do this.

Your Environment

Running on a Mac M1 Pro (macOS Ventura 13.5)

  • dbldatagen version used: 0.3.5
@ronanstokes-db ronanstokes-db self-assigned this Sep 8, 2023
@ronanstokes-db ronanstokes-db added the documentation Improvements or additions to documentation label Sep 8, 2023
@ronanstokes-db ronanstokes-db added this to the v0.3.6 milestone Sep 8, 2023
@ronanstokes-db
Contributor

ronanstokes-db commented Sep 8, 2023

Hi

To control how data is generated for nested structures, create temporary fields, generate the values for them, and then combine the generated fields into the desired structure. At present, you can't refer to a nested field in the data generation rules.
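As a rough sketch of that approach, applied to the schema from the question: each leaf attribute becomes a hidden temporary string column (omit=True), and each top-level struct is then assembled with a named_struct expression over those columns. The names TEMPLATE, LAYOUT, flatten and build_spec are hypothetical helpers made up for this illustration and are not part of dbldatagen. The template is simply the one from the question reused for every field, and the sketch assumes a Spark session named spark is available.

```python
# Illustrative only: the template from the question, reused for every leaf field.
TEMPLATE = r"\\n-\\n"

# Hypothetical description of the nested schema: dict = struct, str = template.
LAYOUT = {
    "bank": {"bankAcctId": TEMPLATE, "bankProduct": TEMPLATE},
    "merchDetails": {
        "storeGroup": TEMPLATE,
        "association": TEMPLATE,
        "merchantId": TEMPLATE,
        "terminal": {
            "terminalId": TEMPLATE,
            "cardholderActivatedTerm": TEMPLATE,
            "posInteractionTerminalEntryMode": TEMPLATE,
        },
    },
}


def flatten(layout, prefix):
    """Walk a nested layout and return (temp_cols, sql_expr, ddl_type).

    temp_cols maps flat temporary column names to templates, sql_expr is a
    named_struct(...) expression over those columns, and ddl_type is the
    matching struct<...> type string.
    """
    temp_cols, expr_parts, type_parts = {}, [], []
    for name, value in layout.items():
        if isinstance(value, dict):
            # Nested struct: recurse with a longer prefix.
            sub_cols, sub_expr, sub_type = flatten(value, prefix + "_" + name)
            temp_cols.update(sub_cols)
        else:
            # Leaf: one temporary string column driven by a template.
            sub_expr, sub_type = prefix + "_" + name, "string"
            temp_cols[sub_expr] = value
        expr_parts.append("'" + name + "', " + sub_expr)
        type_parts.append(name + ":" + sub_type)
    sql_expr = "named_struct(" + ", ".join(expr_parts) + ")"
    ddl_type = "struct<" + ",".join(type_parts) + ">"
    return temp_cols, sql_expr, ddl_type


def build_spec(spark, layout, rows, partitions=4):
    """Build a dbldatagen spec with one struct column per top-level key."""
    import dbldatagen as dg  # imported here so flatten() works without Spark

    spec = dg.DataGenerator(spark, name="test_data_set1", rows=rows,
                            partitions=partitions)
    for top_name, sub_layout in layout.items():
        temp_cols, sql_expr, ddl_type = flatten(sub_layout, top_name)
        for temp_name, template in temp_cols.items():
            # omit=True keeps the temporary column out of the final output.
            spec = spec.withColumn(temp_name, "string", template=template,
                                   omit=True)
        # Combine the temporary columns into the nested structure.
        spec = spec.withColumn(top_name, ddl_type, expr=sql_expr,
                               baseColumn=list(temp_cols))
    return spec


# Usage, assuming an active Spark session called `spark`:
# dfTestData = build_spec(spark, LAYOUT, rows=1000).build()
```

Because the temporary columns are ordinary generated columns, other generation options such as minValue and maxValue can be set on them in the same way before they are combined.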

See the following documentation page for more information: https://databrickslabs.github.io/dbldatagen/public_docs/generating_json_data.html#generating-complex-column-data

I'll update the documentation to provide clearer examples of creating data from an existing schema.
