
[FEATURE] Enable loading of list of files into dask backend #379

Open
inc0 opened this issue Oct 25, 2022 · 0 comments

Is your feature request related to a problem? Please describe.
Dask's read_parquet (as well as other loaders, like read_csv) can accept a list of files. This allows multiple dask.distributed workers to each load a partition of the data, without the data ever being loaded into client memory in its entirety. This is important for datasets that can't fit into a single machine's memory.
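
For reference, a minimal sketch of the Dask side (bucket and file names are hypothetical):

import dask.dataframe as dd

# read_parquet accepts a list of paths; each file becomes one or more partitions
df = dd.read_parquet([
    "s3://bucketname/partition1.parq",
    "s3://bucketname/partition2.parq",
])

# len() triggers a distributed count; no single worker materializes the full dataset
print(len(df))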

Fugue SQL supports a LOAD statement that passes this through to the Dask backend. Unfortunately, this statement doesn't accept a list of files.
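
For comparison, the single-file form that works today looks roughly like this (a sketch assuming the fugue_sql.fsql entry point and the "dask" engine name):

from fugue_sql import fsql

# LOAD currently takes a single path (or a glob), not a list
fsql("""
df = LOAD "s3://bucketname/partition1.parq"
SELECT count(*) FROM df
PRINT
""").run("dask")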

Describe the solution you'd like
Allow loading of multiple Parquet files by passing a list. For example:

df = LOAD ["s3://bucketname/partition1.parq", "s3://bucketname/partition2.parq"]
SELECT count(*) FROM df;

This would tell dask.distributed to load the two files, one per worker, and to run the SELECT on them in parallel.

Describe alternatives you've considered
The current workaround is to pass a glob pattern through to Dask. This works for some use cases, but not all of them: a glob can only match files by name pattern, so it can't express an arbitrary subset of files.
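
For instance (same hedged fsql sketch as above), the workaround replaces the explicit list with a pattern:

from fugue_sql import fsql

# Glob workaround: selects every file matching the pattern, nothing more specific
fsql("""
df = LOAD "s3://bucketname/*.parq"
SELECT count(*) FROM df
PRINT
""").run("dask")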

Additional context
Slack thread
