Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xarray Dataset Support #1490

Open
wants to merge 20 commits into
base: main
Choose a base branch
from
Open

Xarray Dataset Support #1490

wants to merge 20 commits into from

Conversation

nilsleh
Copy link
Collaborator

@nilsleh nilsleh commented Jul 20, 2023

This PR aims to add support for dataloading based on xarray datasets as the discussion began in #1486.

The application I have for my project is taking in CMIP6 data from separate files that each define a climate variable and I need to extract sampled time-series. The sampling procedure can come in form of certain samplers as I started working on in #877 .

The current draft is based on the closed #509 , but instead of a single xr.DataArray takes in a directory to gather files like RasterDataset does.

The climate data for example does not have a CRS but they do come in different resolutions. Additionally, it could be that some data is time-series but others is not, so this needs to be merged correctly. Other climate data could also have a "depth" component.

Since there are a lot of things to consider, I would definitely appreciate input from more experienced users that have a better scope of common required work flows.

Closes #1486

@nilsleh nilsleh marked this pull request as draft July 20, 2023 14:04
@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Jul 20, 2023
@adamjstewart adamjstewart added this to the 0.5.0 milestone Jul 20, 2023
@nilsleh nilsleh changed the title Xarray Dataset Xarray Dataset Support Jul 21, 2023
@adamjstewart
Copy link
Collaborator

We discussed this over zoom, but my basic comments were something along the lines of:

  • This base class should be generic enough to support any NetCDF-based dataset
  • Most NetCDF files won't have a default CRS, we should default to lat/long (EPSG 4326 I believe)
  • Not sure if files have a way of expressing their bounds, we should default to the entire Earth
  • Subclasses can override the defaults to change to a different CRS or extent. This may require a custom __getitem__ that parses the extent from a particular layer

Other than that, it's generally difficult to design a useful base class without lots of examples. I didn't write RasterDataset until we already had several examples of the same pattern in TorchGeo. The base class was an attempt to reduce code duplication and share some of the more complicated data access stuff. I think it's fine to design this base class with the few example datasets you have in mind and change it to add new features later.

@nilsleh
Copy link
Collaborator Author

nilsleh commented Jul 28, 2023

I tested the xarray case I have with time-series indexing from #877 and I think this requires a discussion about assumptions about dateformats which is a bit tricky. Currently, the assumption is that the index and the sampler is working with datetime.timestamps() and that the xarray dataset are then indexed with a datetime object.

@github-actions github-actions bot added the testing Continuous integration testing label Jul 28, 2023
@nilsleh nilsleh marked this pull request as ready for review July 28, 2023 12:02
@github-actions github-actions bot added the dependencies Packaging and dependencies label Jul 28, 2023
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difficult to review without several examples of what these datasets look like in real life. Maybe do a quick literature review and find a few good examples of popular NetCDF datasets from different sources? I want to see the diversity of file metadata. Like, do all datasets encode CRS/res/time/bounds in the same way or are they all different? This greatly influences how much customizability the base class requires.

requirements/datasets.txt Show resolved Hide resolved
@@ -31,6 +31,7 @@ radiant-mlhub==0.3.0
rarfile==4.0
scikit-image==0.18.0
scipy==1.6.2
xarray
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to determine the minimum version that works before merging

torchgeo/datasets/rioxarray.py Outdated Show resolved Hide resolved
tests/data/rioxarray/data.py Outdated Show resolved Hide resolved
from datetime import datetime
from typing import Any, Callable, Optional, cast

import netCDF4 # noqa: F401
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't used at the moment and could probably be removed

Copy link
Collaborator Author

@nilsleh nilsleh Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was running into this issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather ignore that warning in pyproject.toml than add a fake import

f"query: {query} not found in index with bounds: {self.bounds}"
)

data_arrays: list["np.typing.NDArray[np.float32]"] = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will they always be float32?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there could be cases where you also have integers but I would expect most datasets to have float values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best to keep it dynamic if we can't predict 100% of the time.

for item in items:
with xr.open_dataset(item, decode_cf=True) as ds:
if not ds.rio.crs:
ds.rio.write_crs(self._crs, inplace=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does rioxarray automatically reproject to the right CRS if files are in multiple or if the user chooses a different CRS than the default (or if IntersectionDataset changes it)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Point, I think it does not automatically reproject. So far I have only used climate data which doesn't explicitly encode or use a CRS, I should check with some MODIS files.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am just doing this with the climate data, where there is no explicit CRS and ds.rio.reproject() also only works for 2D and 3D arrays, whereas the CMIP data I have has more dimensions so I get rioxarray.exceptions.TooManyDimensions: Only 2D and 3D data arrays supported. Data variable: tos.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, trying it with MODIS files which are .hdf files, I can only open them with rioxarray.open_rasterio() and not xr.open_dataset(engine="rasterio") so maybe one base class is too ambitious and ugly to support climate and satellite data at once.

if hasattr(clipped, variable):
data_arrays.append(clipped[variable].data.squeeze())

sample = {"image": torch.from_numpy(np.stack(data_arrays)), "bbox": query}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should support both images and masks. See the is_image attribute in RasterDataset for how we do this there. You can also copy the dtype property to automatically choose what dtype to cast to.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this works correctly for multiple overlapping files. We shouldn't be stacking, we should be merging.

Copy link
Collaborator Author

@nilsleh nilsleh Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, maybe have to rethink the base class thing and have one for xarray climate type data that is not using crs explicitly (class XarrayDataset(GeoDataset)) and one that is intended for crs depending data sources like MODIS and more similar to RasterDataset and using rioxarray so the current naming class RioXarrayDataset(GeoDataset) but with the required functionality to handle overlapping files etc.

os.makedirs(DIR)
for var_name in VAR_NAMES:
for lats, lons, cf_time in zip(LATS, LONS, CF_TIME):
path = os.path.join(DIR, f"{var_name}_{lats}_{lons}.nc")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never thought about the possibility of the filename containing the bounds/res/crs. Have you seen this in the wild? If so, we could add a check for this in the regex and extract it from the filename like we do for time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this was just a dummy naming scheme for test data. Maybe we should explicitly create test data for some of the different data cases like CMIP, ERA5, MODIS and others and write test cases for those.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that once we have subclasses for each of those datasets

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend renaming this file. Otherwise the following become very different things:

import rioxarray
import .rioxarray

Maybe call it rioxr.py? Or just throw it in geo.py with the other base classes.

@nilsleh
Copy link
Collaborator Author

nilsleh commented Aug 17, 2023

Difficult to review without several examples of what these datasets look like in real life. Maybe do a quick literature review and find a few good examples of popular NetCDF datasets from different sources? I want to see the diversity of file metadata. Like, do all datasets encode CRS/res/time/bounds in the same way or are they all different? This greatly influences how much customizability the base class requires.

This is probably not exhaustive but I tried finding some sources from climate data and satellite optical data to cover some bases.

  • Xarray supports these conventions for weather and climate data as suggested on their docs.
  • For a tutorial notebook that works with the climate data that I am also using maybe this notebook gives insights into the meta data
  • This is a notebook working with ERA reanalysis data which is also very popular
  • this and this are examples of xarray data from MODIS

I suppose it could potentially be useful to have sub classes for these specific modalities, meaning a class CMIPData(), class ERA5Data() etc. MODIS has a PR open but should be refactored. That would be similar to how RasterDataset has Landsat, Sentinel etc. as sub classes.

@nilsleh
Copy link
Collaborator Author

nilsleh commented Aug 29, 2023

I found one approach that merges the data back again, but I think it can be quiet slow potentially. Might want to explore that.

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are failing. Also curious if we can add a dep on only xarray or rioxarray but not both.

torchgeo/datasets/rioxr.py Show resolved Hide resolved
torchgeo/datasets/rioxr.py Outdated Show resolved Hide resolved
@nilsleh
Copy link
Collaborator Author

nilsleh commented Sep 11, 2023

Currently getting a rasterio.errors.WindowError: Bounds and transform are inconsistent error with the synthetic test data and trying to figure out why that is to fix the failing test. There should also be more diverse test cases given the variety of available datasets.

@adamjstewart adamjstewart removed this from the 0.5.0 milestone Sep 28, 2023
@adamjstewart adamjstewart mentioned this pull request Nov 26, 2023
4 tasks
@nilsleh
Copy link
Collaborator Author

nilsleh commented Feb 2, 2024

Potentially interesting utility for data loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets dependencies Packaging and dependencies testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Xarray Dataset
2 participants