
[FEATURE] Enable loading of list of files into dask backend #379

Open
inc0 opened this issue Oct 25, 2022 · 0 comments

Is your feature request related to a problem? Please describe.
Dask's read_parquet (as well as other loaders, like read_csv) can accept a list of files. This allows multiple dask.distributed workers to each load a partition of the data, without the data ever being loaded into client memory in its entirety. This is important for datasets that can't fit into a single machine's memory.
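
For reference, a minimal sketch of the Dask side (bucket and file names are hypothetical):

import dask.dataframe as dd

# read_parquet accepts a list of paths; each file becomes one or more partitions
df = dd.read_parquet([
    "s3://bucketname/partition1.parq",
    "s3://bucketname/partition2.parq",
])

# len() triggers a distributed count; no single worker materializes the full dataset
print(len(df))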

Fugue SQL supports a LOAD statement that passes this through to the Dask backend. Unfortunately, this statement doesn't accept a list of files.
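
For comparison, the single-file form that works today looks roughly like this (a sketch assuming the fugue_sql.fsql entry point and the "dask" engine name):

from fugue_sql import fsql

# LOAD currently takes a single path (or a glob), not a list
fsql("""
df = LOAD "s3://bucketname/partition1.parq"
SELECT count(*) FROM df
PRINT
""").run("dask")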

Describe the solution you'd like
Allow loading of multiple Parquet files by passing a list. For example:

df = LOAD ["s3://bucketname/partition1.parq", "s3://bucketname/partition2.parq"]
SELECT count(*) FROM df;

This would tell dask.distributed to load the two files, one per worker, and to run the SELECT on them in parallel.

Describe alternatives you've considered
The current workaround is to pass a glob pattern through to Dask. This works for some use cases, but not all of them: a glob can only match files by name pattern, so it can't express an arbitrary subset of files.
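
For instance (same hedged fsql sketch as above), the workaround replaces the explicit list with a pattern:

from fugue_sql import fsql

# Glob workaround: selects every file matching the pattern, nothing more specific
fsql("""
df = LOAD "s3://bucketname/*.parq"
SELECT count(*) FROM df
PRINT
""").run("dask")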

Additional context
Slack thread
