-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add kwarg vsi to RasterDataset to support GDAL VSI #1399
base: main
Are you sure you want to change the base?
Add kwarg vsi to RasterDataset to support GDAL VSI #1399
Conversation
@microsoft-github-policy-service agree company="vake" |
Hey, sorry it took us so long to review this! We're normally much more responsive to issues/PRs but we just finished a paper deadline. This looks great! We've been wanting better support for cloud (AWS/GCP/Azure) for a long time. The biggest thing holding that up is my own perfectionism (what about VectorDataset or NonGeoDataset?). There isn't an easy way to add support for cloud for every dataset without refactoring the entire library. I was considering doing this at the same time as porting to TorchData, but we're still on the fence about when or if that will actually happen. So in my opinion, starting with Raster/VectorDataset for now and worrying about the others should be fine. In terms of the current implementation, here are my preliminary thoughts:
@calebrob6 will likely want to check this to make sure this works for him on Azure/Planetary Computer. P.S. Thanks for the PR! This would be a fantastic first contribution. |
As Sean mentioned in my Fiona feature request maybe libcloud is a good option. But we could start with the simple
Looks like the behaviour will be the same. Everything that can be opened using fiona/rasterio/gdal will support vsi.
Looks like rasterio does not have this method. Do you think the above mentioned
This (should) work with the cloud providers listen in the gdal docs. fiona.listdir use method VSIReadDir from gdal/ogr internally, which I assume is thoroughly tested. I have tested my implementation on Azure and GS.
Looks like moto can mock virtual file systems. |
We're trying to limit additional dependencies but I'll keep libcloud and moto in mind. I agree that we should start with a simple implementation and worry about making it "perfect" later. Especially since it wouldn't be an API change if we modify the implementation later.
|
We had previously discussed possibly making a This also has the benefit of making the dependencies optional and only imported inside the method. |
We should definitely have a This is somewhat orthogonal to this PR. We should add @adriantre would you be interested in implementing |
I will give it a shot. I'm initially thinking that, analogously to |
The vsi can be passed in two ways. E.g. for azure:
Looks like 2 is preferred, at least somewhat documented, although internally 2. is converted to 1. before it is passed to the reader. For either we would need to use I realised that we cannot use Lastly, another vsi complicating the matter is zip-archives, which I'm starting to feel like Thoughts? |
It's certainly possible, but as you mention, the scope makes it difficult. I think |
I think |
3237d5f
to
6624dee
Compare
I have now refactored this and rebased on #1442, so all changes from that PR is reflected here until that is merged. |
6624dee
to
7ca3cb7
Compare
Is now outdated wrt. to #1442 but still works as proof of concept. Any feedback? |
from rasterio._path import SCHEMES | ||
|
||
prefix = path.split("://")[0] | ||
schemes = prefix.split("+") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This, because vsi can be chained, e.g. zip+az://...
for zip-files in azure.
I'm hesitant to add too much complexity if #1442 supports something similar. I would say let's keep pushing on #1442 and once that's done we can decide whether this is functionality we need or whether the user should be responsible for calling this function themselves and passing the input as a list of paths to the dataset. |
All right, I have an idea. In #1442 I have an else statement in torchgeo/torchgeo/datasets/geo.py Line 452 in 118ab4f
This at least remove the blockers / need of overriding the whole |
Is that logic needed for remote files or only for remote directories? I was under the impression that if a user implemented a method to find remote files, then GDAL could open those without issue. |
Indeed, gdal can open them if they reach |
We don't need to check |
Fix #1398
Fix #1165
This PR remove blockers for reading raster files directly from Cloud Storage (buckets) and other GDAL Virtual File Systems.