Xarray Dataset #1486

Open
nilsleh opened this issue Jul 19, 2023 · 10 comments · May be fixed by #1490
Labels
datasets Geospatial or benchmark datasets

Comments

@nilsleh
Collaborator

nilsleh commented Jul 19, 2023

Summary

I am working with different climate data sources that come as NetCDF files and xarray objects. Although I am not an expert in that domain, this seems to be the go-to data format. Since there are lots of features in TorchGeo that I would like to use with this data without having to reformat it to TIFF files, for example, I think it could be quite powerful to add support for xarray datasets, even though it would add another couple of dependencies to TorchGeo.

Rationale

This could quite possibly extend TorchGeo's user base to other communities that work with xarray data and would benefit from all the tools TorchGeo already provides. In the majority of cases climate data also comes in the form of time series, so this would go hand in hand with the planned support for time-series models and data loading in TorchGeo.

Implementation

Both in #412 and #509 there was some discussion about this, but nothing was finalized. I am definitely willing to start on this but don't have a detailed plan yet as I first wanted to gather opinions on this.

Alternatives

No response

Additional information

No response

@adamjstewart
Collaborator

Completely agree with adding support for this, even if it means more deps. @RitwikGupta and @cjrd are our climate experts and may also have thoughts on the best way to do this. I'm not sure if/how we could support a 4th (z) dimension that frequently comes with climate data. But let's first focus on how to best handle xarray, especially when it comes to reprojection and geospatial indexing. Making a new subclass of GeoDataset that works similarly to RasterDataset will already be a big enough endeavor.

Can't remember if @isaaccorley ever worked on this before.

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Jul 19, 2023
@isaaccorley
Collaborator

I've used it to load some NetCDF files. It has good support for climate datasets and seems to be widely used by the community.

@calebrob6
Member

@weiji14 is another expert here (driver of #509) and is independently supporting stuff like this in zen3geo https://github.com/weiji14/zen3geo

@nilsleh nilsleh linked a pull request Jul 20, 2023 that will close this issue
@noahgolmant

I would like to continue discussing how to best implement this @nilsleh! I am new to torchgeo, but one challenge here is the current structure of the base GeoDataset abstraction. It assumes a list of file paths to load, in our case likely via rioxarray.open_rasterio. This makes it awkward for a user to supply an arbitrary xarray dataset to __init__. I am not sure how to best support integration with other datasets that may not directly live in a filesystem, like STAC catalogs (via stackstac) or EarthEngine (via xee). We could just ignore the paths variable for now?

I also think for simplicity, it would be easiest to support a single xarray.Dataset or DataArray object with one resolution. This pushes the complexity of merging separate DataArray files back to the user, which doesn't feel too demanding. I don't think this class should do the heavy lifting of reprojecting potentially large datasets. This way, you come in with a nice datacube. In the future, it would be nice to support multiple datasets, or multi-resolution data sources, but there are other changes that would need to happen for that to work-- for example, one reviewer mentioned maintaining the image / mask pattern for the sample output dictionary. Since you can't stack multiple resolutions into one image entry, I think we'd have to deviate from that.

@adamjstewart
Collaborator

one challenge here is the current structure of the base GeoDataset abstraction. It assumes a list of file paths to load

This is only true for RasterDataset, not GeoDataset. The only requirement is we need to create an R-tree index of bounding boxes, how that is done isn't important.
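
To make that requirement concrete, here is a minimal sketch of what such an index has to support: insert a bounding box with some payload, then query everything that intersects a region. It uses a plain list with a linear scan as a stand-in for an R-tree, and `BBox`, `SimpleIndex`, and the string payloads are hypothetical names for illustration, not TorchGeo API.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class BBox:
    """Spatiotemporal bounding box (hypothetical helper, not TorchGeo's class)."""

    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap only if their extents overlap along every axis.
        return (
            self.minx <= other.maxx and self.maxx >= other.minx
            and self.miny <= other.maxy and self.maxy >= other.miny
            and self.mint <= other.maxt and self.maxt >= other.mint
        )


class SimpleIndex:
    """Linear-scan stand-in for an R-tree: same interface idea, O(n) queries."""

    def __init__(self) -> None:
        self._items: List[Tuple[BBox, Any]] = []

    def insert(self, bbox: BBox, payload: Any) -> None:
        self._items.append((bbox, payload))

    def query(self, region: BBox) -> List[Any]:
        return [payload for bbox, payload in self._items if bbox.intersects(region)]


index = SimpleIndex()
# The payload can be anything: a file path, or a key into an in-memory object.
index.insert(BBox(0, 10, 0, 10, 0, 1), "tile_a")
index.insert(BBox(20, 30, 0, 10, 0, 1), "tile_b")
hits = index.query(BBox(5, 8, 5, 8, 0, 1))
```

Because the payload is opaque to the index, nothing in this sketch depends on the data living in files on disk.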

@nilsleh
Collaborator Author

nilsleh commented Mar 4, 2024

I will try to summarize the discussion and pain points encountered so far when we first started looking at this. Generally, one could consider a sort of similar distinction for these Grid based datasets as we have for Raster datasets.

NonGeoGridDataset

This would be something along the lines of the datacube that @noahgolmant is describing, where you have a fixed xarray cube of Time x Height x Width for several data variables (you can specify which variables should be inputs to your model and which ones should be targets). Sampling would consist of retrieving different patches from that data cube, which is pure indexing. Libraries such as xpatcher or xbatcher could serve as the basis for sampling from these datasets.

Pros:

  • can largely ignore different data conventions, file and variable namings, reprojections etc because one just draws patches from a cube

Cons:

  • isolated datacubes that are difficult to combine out of the box
  • the user does all the preprocessing up front and defines respective data variables

Examples:

  • ERA5 data for weather prediction (I think ERA5 data just comes in a datacube)

GeoGridDataset

These datasets would require building an R-tree type index (no restrictions how to achieve that) with the overarching goal of being able to combine GeoGridDatasets with all the other GeoDatasets that could include Raster, Vector data etc.

Pros:

  • if implemented (certainly more complicated, but I don't think impossible), this would unlock a lot of quite powerful applications that combine data from a multitude of sensors and integrate with all the data loading capabilities that TorchGeo already offers
  • this should work with different resolutions and projections, which you will encounter quite often when trying to work with different data sources

Cons:

  • from what I can tell, there are quite a few different conventions for naming, spatial index ordering (lat/lon extent), and included metadata such as the CRS, which overall are not as standardized as in TIF files
  • reprojections and resolutions might be slow to compute on the fly
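
As a rough illustration of the naming problem, the sketch below maps whatever spatial dimension names a file uses onto canonical `x` and `y`. The alias table and the function name `build_rename_map` are assumptions made up for this example; real files may use other names and would need a longer table.

```python
from typing import Dict, Iterable

# Hypothetical alias table; extend it for the conventions you actually encounter.
SPATIAL_ALIASES = {
    "x": ("x", "lon", "longitude"),
    "y": ("y", "lat", "latitude"),
}


def build_rename_map(dim_names: Iterable[str]) -> Dict[str, str]:
    """Map the spatial names found in a file onto canonical 'x' and 'y'."""
    rename: Dict[str, str] = {}
    lowered = {name.lower(): name for name in dim_names}
    for canonical, aliases in SPATIAL_ALIASES.items():
        matches = [lowered[alias] for alias in aliases if alias in lowered]
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one {canonical!r} dimension, found {matches}"
            )
        if matches[0] != canonical:
            rename[matches[0]] = canonical
    return rename


rename_map = build_rename_map(["time", "Latitude", "Longitude"])
```

With xarray, a mapping like this could then be passed to `Dataset.rename` before any spatial indexing happens. Axis ordering and missing CRS metadata would still need separate handling.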

Examples:

  • a collection of individual MODIS tiles (they come in NetCDF, I believe) that cover some region and that I want to pair with some other sensor or other dataset that serves as target, auxiliary data, etc.
  • pairing different sources of climate data (think different variables and maybe resolutions) with Landsat imagery as inputs to a model that aims to predict the Cropland Data Layer (CDL)

What we tried in the linked PR so far would fall under a GeoGridDataset, which has a lot more subtleties and edge cases, but in principle, once solved, it would also cover every NonGeoGridDataset. There are certainly more things to cover here, but maybe this is a starting point for a layout and possible plan of attack, so feel free to criticize or extend any of these points. And as a caveat, I am also not an expert in xarray, so there could be things that I am over- or under-complicating.

@noahgolmant

Thanks for the clarifications and explanations @adamjstewart and @nilsleh! @adamjstewart, here I am referring to the paths field and files property in the base GeoDataset class, which in this instance would be left unused by the subclass assuming we take in a loaded object.

@nilsleh I think it would be helpful to constrain the xarray dataset class to be a GeoDataset rather than NonGeoDataset because spatial metadata and coordinates are already required for rioxarray operations like clip and reproject. rioxarray has conventions like x/y named coordinates and a .rio.crs attribute, and it can compute the transforms/resolution from this quickly.
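
To show why named, regularly spaced x/y coordinates are enough to derive resolution and bounds, here is a small sketch in plain Python. It is not rioxarray's implementation; the function name `resolution_and_bounds` is hypothetical, and it assumes the coordinates mark pixel centers on a regular grid.

```python
from typing import Sequence, Tuple


def resolution_and_bounds(
    xs: Sequence[float], ys: Sequence[float]
) -> Tuple[Tuple[float, float], Tuple[float, float, float, float]]:
    """Derive pixel size and outer bounds from regularly spaced pixel centers."""
    if len(xs) < 2 or len(ys) < 2:
        raise ValueError("need at least two coordinates per axis")
    xres = (xs[-1] - xs[0]) / (len(xs) - 1)
    yres = (ys[-1] - ys[0]) / (len(ys) - 1)  # negative when y runs north to south
    # Coordinates are assumed to be pixel centers, so extend by half a pixel.
    x_edges = (xs[0] - xres / 2, xs[-1] + xres / 2)
    y_edges = (ys[0] - yres / 2, ys[-1] + yres / 2)
    bounds = (min(x_edges), min(y_edges), max(x_edges), max(y_edges))
    return (abs(xres), abs(yres)), bounds


# Three columns and four rows of unit pixels, with y decreasing as in most rasters.
res, bounds = resolution_and_bounds([0.5, 1.5, 2.5], [9.5, 8.5, 7.5, 6.5])
```

The resulting `(minx, miny, maxx, maxy)` tuple is the kind of extent that a spatial index entry would be built from.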

I think it's helpful to defer the work to set metadata and merging arrays to either (1) the user loading datasets from disk or (2) a subclass operating on fixed paths, like the other RasterDataset subclasses in this package. For example, in your original PR, combining xarray DataArrays into a single Dataset object with multiple variables might resolve some of the complexity? I can give it a try. This seems to match more closely the amount of work that RasterDataset does today-- I think the rasters are assumed to have the same set of bands for example.

The GeoGridDataset class does seem like it'd be very powerful! I'd love to see that functionality supported. It would be interesting to see the challenges that come up scaling that to larger spatiotemporal scales as well.

@noahgolmant

noahgolmant commented Mar 4, 2024

@nilsleh here is a draft, not tested yet, curious to hear your thoughts! https://github.com/microsoft/torchgeo/compare/main...noahgolmant:torchgeo:noah/xarray?expand=1

@nilsleh
Collaborator Author

nilsleh commented Mar 8, 2024

@nilsleh here is a draft, not tested yet, curious to hear your thoughts! https://github.com/microsoft/torchgeo/compare/main...noahgolmant:torchgeo:noah/xarray?expand=1

Cool, so of course difficult to say without tests, but from first glance it looks like it could work.

Generally, NonGeoDatasets can also have spatial metadata etc, the distinction between NonGeo and Geo is mainly how does one draw samples from a dataset. So one could approach an xarray dataset as a fixed datacube that is accessed purely through array indexing, without any geo information. For example, the xpatcher library simply defines a list of index locations that one can loop through, and patches are returned from the xarray datacube via array indexing.
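
A minimal sketch of that index-location idea, using a NumPy array as a stand-in for the datacube. The helper `patch_slices` and the patch/stride values are made up for illustration and do not reflect the API of xpatcher or xbatcher.

```python
from itertools import product
from typing import Iterator, Tuple

import numpy as np


def patch_slices(
    shape: Tuple[int, int], patch: Tuple[int, int], stride: Tuple[int, int]
) -> Iterator[Tuple[slice, slice]]:
    """Yield (row, col) slices for every full patch that fits into `shape`."""
    rows = range(0, shape[0] - patch[0] + 1, stride[0])
    cols = range(0, shape[1] - patch[1] + 1, stride[1])
    for r, c in product(rows, cols):
        yield slice(r, r + patch[0]), slice(c, c + patch[1])


# Toy cube with dimensions (time, height, width); the values are arbitrary.
cube = np.arange(2 * 4 * 6).reshape(2, 4, 6)
locations = list(patch_slices(cube.shape[1:], patch=(2, 3), stride=(2, 3)))
# Keep the full time axis and cut out one spatial window per location.
patches = [cube[:, rs, cs] for rs, cs in locations]
```

With an xarray object, the same slices could be passed to `isel`, for example `da.isel(y=rs, x=cs)`, assuming the spatial dimensions are named `y` and `x`.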

However, ideally everything would be a GeoDataset, so I do like your approach for that. For me it helped a lot to find multiple common data sources and then test the dataset against them. The other PR also has some dummy data if you want to start with that. It might be easier to discuss if you open a PR for your approach.

@adamjstewart
Collaborator

the distinction between NonGeo and Geo is mainly how does one draw samples from a dataset

It's also about whether or not two datasets can be combined via intersection/union. So if you have a benchmark dataset where you don't need to combine it, NonGeo is fine. But if it's just a single raster input layer or mask, you'll need it to be GeoDataset so you can combine with other datasets (either another xarray dataset or any other dataset as well).
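
A rough sketch of what combining means at the level of bounding boxes, assuming each dataset is reduced to a list of `(minx, miny, maxx, maxy)` extents. The function names and the example extents are hypothetical; TorchGeo's own intersection and union datasets work on its internal index and also handle CRS, resolution, and time.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the overlapping region of two boxes, or None if they are disjoint."""
    minx, miny = max(a[0], b[0]), max(a[1], b[1])
    maxx, maxy = min(a[2], b[2]), min(a[3], b[3])
    if minx >= maxx or miny >= maxy:
        return None
    return (minx, miny, maxx, maxy)


def intersection_index(left: List[Box], right: List[Box]) -> List[Box]:
    """Regions covered by both datasets: only these can yield paired samples."""
    overlaps = (intersect(a, b) for a in left for b in right)
    return [box for box in overlaps if box is not None]


def union_index(left: List[Box], right: List[Box]) -> List[Box]:
    """Regions covered by either dataset."""
    return left + right


climate = [(0.0, 0.0, 10.0, 10.0)]  # e.g. one coarse xarray grid
imagery = [(5.0, 5.0, 15.0, 15.0), (20.0, 20.0, 30.0, 30.0)]  # e.g. two raster tiles
both = intersection_index(climate, imagery)
either = union_index(climate, imagery)
```

A dataset without geospatial extents has nothing to feed into either operation, which is why it cannot be combined this way.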


5 participants