
Generate data based on estimated Delta table size #77

Open
tfayyaz opened this issue Aug 27, 2021 · 2 comments
Labels
enhancement New feature or request


tfayyaz commented Aug 27, 2021

While using the data generator, I have had to rely on trial and error to reach the table size I require. I am not sure whether this is feasible, but it would be great to generate data based on the final table size required instead of the number of rows.

Alternatively, it might be useful to easily retrieve statistics about the generated table's size and use them to iteratively generate more data until the desired table size is reached.

Currently I do the following to get the table size, which works well but needs to be run manually each time:

# Write the generated DataFrame as a Delta table, then read its size from the table details
dfTestData.write.format("delta").mode("overwrite").saveAsTable("tfayyaz_db.test_data")
detail = spark.sql("DESCRIBE DETAIL tfayyaz_db.test_data")
print(detail.first()["sizeInBytes"] / 1024 / 1024, "MB")

Thanks
Tahir

@ronanstokes-db ronanstokes-db self-assigned this Sep 2, 2021

ronanstokes-db commented Sep 2, 2021

I'm thinking that we could do something like the following:

add a method `withTargetSize(sizeMb, dataFormat, tmpPath)`, where `tmpPath` is an optional temp path override

We would then generate 100,000 rows and write them to a tmp location (default `dbfs:/tmp`) in the specified format (default Parquet).

Then we could use the size of the 100k rows to compute how many rows will make up the target size.

The row count used for the calculation could also be configurable.

As the size of the result set will be a guesstimate (due to possibly random data), this will produce an approximately sized output.

Note that if the size of the data per row does not vary much (i.e. limited arbitrary text generation), this should produce a dataset close to the target size.

Key question: Do you need exact target size or approximate sized outputs?
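The scaling step described above could be sketched as follows; the `estimate_target_rows` name and signature are hypothetical, not part of the library:

```python
def estimate_target_rows(sample_rows: int, sample_bytes: int, target_mb: float) -> int:
    """Estimate how many rows are needed to reach roughly target_mb on disk,
    given the measured on-disk size of a sample write (e.g. 100,000 rows
    written to a tmp location in the chosen format)."""
    bytes_per_row = sample_bytes / sample_rows
    target_bytes = target_mb * 1024 * 1024
    # Round to the nearest row count; never return fewer rows than the sample
    return max(sample_rows, round(target_bytes / bytes_per_row))

# Example: a 100k-row sample occupying 50 MB suggests ~1M rows for a 500 MB target
print(estimate_target_rows(100_000, 50 * 1024 * 1024, 500))  # → 1000000
```

Since the sample is written in the same format as the final output, format-specific effects like Parquet compression are folded into the measured bytes-per-row figure.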

@ronanstokes-db

For accurate target sizes, you would need to produce a dataset larger than the target size and sample / write repeatedly until you get close to the target size.

For large datasets, this could be costly in terms of performance.
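The write / measure / resample loop might look like the sketch below; `write_and_measure` is a hypothetical callback that writes a candidate row count and returns the resulting on-disk size in bytes (e.g. via a write plus a `DESCRIBE DETAIL` round trip):

```python
def iterate_to_target(write_and_measure, target_bytes: int, initial_rows: int,
                      tolerance: float = 0.05, max_iters: int = 5) -> int:
    """Repeatedly write a candidate dataset and rescale the row count from its
    measured size until the output lands within `tolerance` of the target."""
    rows = initial_rows
    for _ in range(max_iters):
        size = write_and_measure(rows)
        if abs(size - target_bytes) / target_bytes <= tolerance:
            break  # close enough to the requested size
        # Linear rescale: assumes per-row size stays roughly constant
        rows = max(1, round(rows * target_bytes / size))
    return rows

# Simulated measurement at a fixed 512 bytes/row converges on the exact count
print(iterate_to_target(lambda r: r * 512, 1024 * 1024, initial_rows=100))  # → 2048
```

Each loop iteration costs a full write, which is why this is expensive for large datasets; `max_iters` bounds that cost.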

@ronanstokes-db ronanstokes-db added this to the post-release milestone Sep 20, 2021
@ronanstokes-db ronanstokes-db added the enhancement New feature or request label Sep 20, 2021
@ronanstokes-db ronanstokes-db modified the milestones: post-release, v0.3.0 Nov 4, 2022
@ronanstokes-db ronanstokes-db modified the milestones: v0.3.0, v0.4.0 Dec 1, 2022
@ronanstokes-db ronanstokes-db removed this from the v0.4.0 milestone Mar 11, 2023