Parquet: implement efficient attribute and spatial filtering for datasets opened with ArrowDataset #9921

Merged: 14 commits, May 29, 2024


@rouault rouault commented May 14, 2024

This applies to Parquet datasets made of multiple files opened from a directory name, or to a single Parquet file opened with PARQUET:/path/to/my.parquet. (When a single .parquet file is opened without the PARQUET: prefix, OGR already decides by itself which row groups to select, based on statistics.)

This uses arrow::dataset::ScanBuilder::Filter() to translate OGR spatial and attribute filters down to the Arrow execution engine.
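
As a rough illustration of the idea behind both approaches (skipping row groups by statistics, and pushing a bounding-box predicate down to the scanner), here is a minimal pure-Python sketch. It is not the GDAL or Arrow code: the `BBox` type, the `select_row_groups` helper and the row-group extents are hypothetical, and only the spatial filter values are taken from the single-feature benchmark below.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    """Axis-aligned rectangle (hypothetical helper type for this sketch)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: BBox, b: BBox) -> bool:
    """Return True when the two rectangles overlap or touch."""
    return (a.xmin <= b.xmax and a.xmax >= b.xmin
            and a.ymin <= b.ymax and a.ymax >= b.ymin)


def select_row_groups(extents: List[BBox], spatial_filter: BBox) -> List[int]:
    """Keep only the row groups whose extent can contain matching features."""
    return [i for i, extent in enumerate(extents)
            if intersects(extent, spatial_filter)]


# Hypothetical per-row-group extents, as min/max statistics on bounding box
# columns would provide them. These values are made up for illustration.
ROW_GROUP_EXTENTS = [
    BBox(1700000, 5500000, 1800000, 5600000),
    BBox(1800000, 5500000, 1900000, 5600000),
    BBox(1750000, 5800000, 1920000, 5910000),
]

# Same values as "-spat 1818654 5546189 1818655 5546190" in the benchmarks.
SPATIAL_FILTER = BBox(1818654, 5546189, 1818655, 5546190)

print(select_row_groups(ROW_GROUP_EXTENTS, SPATIAL_FILTER))
```

The same four comparisons are what a filter expression on bounding box columns has to express. Row groups that fail the test never need to be decoded, which is presumably why the spatially sorted file with a bounding box column benefits most.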

  1. On a Parquet 1.0 WKB file, without a geometry bounding box column:
  • Without ArrowDataset, selecting a significant number of features:
$ time ogrinfo nz-building-outlines.parquet  -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real	0m1,905s
user	0m2,128s
sys	0m0,328s
  • With ArrowDataset, selecting a significant number of features:
$ time ogrinfo PARQUET:nz-building-outlines.parquet  -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real	0m1,974s
user	0m2,297s
sys	0m1,033s
  • Without ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch nz-building-outlines.parquet  -spat 1750445 5812014 1912866 5906677

real	0m1,587s
user	0m1,737s
sys	0m0,363s
  • With ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch PARQUET:nz-building-outlines.parquet  -spat 1750445 5812014 1912866 5906677

real	0m1,489s
user	0m1,599s
sys	0m1,019s
  • Without ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo nz-building-outlines.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m1,304s
user	0m1,605s
sys	0m0,337s
  • With ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo PARQUET:nz-building-outlines.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m1,463s
user	0m1,597s
sys	0m0,989s
  • Without ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo nz-building-outlines.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m1,063s
user	0m1,277s
sys	0m0,311s
  • With ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo PARQUET:nz-building-outlines.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m1,508s
user	0m1,289s
sys	0m0,969s
  2. On a Parquet 1.1 WKB file, with a geometry bounding box column, and geometries sorted with an RTree:
  • Without ArrowDataset, selecting a significant number of features:
$ time ogrinfo nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real	0m0,995s
user	0m1,054s
sys	0m0,181s
  • With ArrowDataset, selecting a significant number of features:
$ time ogrinfo PARQUET:nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real	0m0,842s
user	0m1,237s
sys	0m0,298s
  • Without ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677

real	0m0,640s
user	0m0,671s
sys	0m0,225s
  • With ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch PARQUET:nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677

real	0m0,375s
user	0m0,771s
sys	0m0,301s
  • Without ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo nz-building-outlines_with_spi_sorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m0,310s
user	0m0,322s
sys	0m0,147s
  • With ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo PARQUET:nz-building-outlines_with_spi_sorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m0,210s
user	0m0,304s
sys	0m0,145s
  • Without ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo nz-building-outlines_with_spi_sorted.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m0,911s
user	0m1,267s
sys	0m0,321s

  • With ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo PARQUET:nz-building-outlines_with_spi_sorted.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m0,570s
user	0m1,339s
sys	0m0,622s

So the results are mixed: they range from cases where performance is (slightly) worse with ArrowDataset to cases where it is about 40% faster. All of these runs use 4 threads.
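
For reference, the relative change in wall-clock time can be recomputed from the `real` values above. The sketch below does this for two of the pairs. The helper names are hypothetical, and the parser assumes the `0m1,234s` format with a comma decimal separator, as printed by `time` in these transcripts.

```python
def parse_seconds(text: str) -> float:
    """Parse a `time` value such as '0m0,375s' into seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds.replace(",", "."))


def percent_change(without: str, with_dataset: str) -> float:
    """Relative change in wall-clock time; negative means ArrowDataset was faster."""
    base = parse_seconds(without)
    return (parse_seconds(with_dataset) - base) / base * 100.0


# (without ArrowDataset, with ArrowDataset), copied from the `real` lines above.
RUNS = {
    "1.0 file, 1 feature by bbox, ogrinfo": ("0m1,304s", "0m1,463s"),
    "1.1 file, many features, bench_ogr_batch": ("0m0,640s", "0m0,375s"),
}

for label, (without, with_dataset) in RUNS.items():
    print(f"{label}: {percent_change(without, with_dataset):+.0f}%")
```

These are single runs, so small differences may be within measurement noise.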

FYI @jorisvandenbossche @paleolimbot @kylebarron

@rouault rouault force-pushed the parquet_dataset_enhancements branch from 51e8d5b to 341e859 Compare May 14, 2024 20:26
@rouault rouault added this to the 3.10.0 milestone May 14, 2024
@rouault rouault force-pushed the parquet_dataset_enhancements branch 3 times, most recently from d28e1de to 8e63045 Compare May 15, 2024 00:06
coveralls commented May 15, 2024

Coverage Status

coverage: 69.131% (+0.02%) from 69.108%
when pulling 1dc79e3 on rouault:parquet_dataset_enhancements
into 9b9d3e3 on OSGeo:master.

@paleolimbot paleolimbot left a comment

This is amazing. I did my best to go through the Arrow C++ in detail!

rouault commented May 15, 2024

@paleolimbot Thanks for the review

@rouault rouault force-pushed the parquet_dataset_enhancements branch from 8e63045 to 1dc79e3 Compare May 15, 2024 15:04
rouault commented May 16, 2024

Fixes #8263

@rouault rouault merged commit 8ffaeb2 into OSGeo:master May 29, 2024
35 checks passed